Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
5
Why It Matters For Business
If search or recommendation systems prefer LLM-generated content, human creators may lose visibility and ranking can be manipulated; businesses must audit source bias to protect content quality and trust.
Summary TLDR
The paper shows that modern neural retrieval models (dense retrievers and neural re-rankers) systematically prefer documents written or rewritten by LLMs over human-written text, even when both carry the same meaning. The authors build two mixed-source benchmarks (SciFact+AIGC, NQ320K+AIGC) by prompting LLMs to rewrite human documents, diagnose this "source bias" with embedding, SVD, and perplexity analyses, and propose a simple plug-in debiased training constraint that reduces the bias by penalizing cases where an LLM-generated document outscores its human-written counterpart.
Problem Statement
As the web fills with LLM-generated text, search engines now index both human and LLM text. The core question: do retrieval models favor LLM-generated content over human-written content even when semantics match, and why? The paper builds test data and measures this "source bias," investigates causes, and tests a mitigation.
Main Contribution
Constructed two mixed-source retrieval benchmarks (SciFact+AIGC, NQ320K+AIGC) by rewriting human documents with Llama2 and ChatGPT.
Discovered and measured "source bias": PLM-based neural retrievers and neural re-rankers rank LLM-generated documents higher than equivalent human-written ones.
Analyzed causes via embeddings, SVD (topic concentration), and PLM perplexity; showed that LLM-generated text is more topically focused and has lower perplexity under PLMs.
Proposed a simple plug-in debiased constraint for training that penalizes when an LLM-generated document scores higher than its human counterpart.
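The debiased constraint described above can be sketched as a hinge-style penalty added to the usual ranking loss. This is a minimal illustration and not the authors' implementation: the function names `debias_penalty` and `total_loss`, the averaging over pairs, and the default value of `alpha` are all assumptions made here for illustration.

```python
from typing import Sequence


def debias_penalty(scores_human: Sequence[float], scores_llm: Sequence[float]) -> float:
    """Mean hinge penalty over matched (human, LLM) document pairs.

    A pair contributes only when the LLM-generated version scores higher
    than its human-written counterpart for the same query.
    Averaging over pairs is an assumption of this sketch.
    """
    if len(scores_human) != len(scores_llm):
        raise ValueError("scores must come from matched human/LLM pairs")
    if not scores_human:
        return 0.0
    hinge = [max(0.0, g - h) for h, g in zip(scores_human, scores_llm)]
    return sum(hinge) / len(hinge)


def total_loss(ranking_loss: float,
               scores_human: Sequence[float],
               scores_llm: Sequence[float],
               alpha: float = 0.1) -> float:
    """Ranking loss plus alpha times the debias penalty (alpha is hypothetical here)."""
    return ranking_loss + alpha * debias_penalty(scores_human, scores_llm)


if __name__ == "__main__":
    # First pair: human scores higher (no penalty). Second pair: LLM scores higher by 2.0.
    print(total_loss(0.5, [2.0, 1.0], [1.0, 3.0], alpha=0.5))
```

In a real training loop the same penalty would be computed on differentiable model scores; plain floats are used here only to keep the sketch dependency-free.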
Key Findings
Neural retrievers prefer LLM-generated documents over semantically equivalent human text.
Bias is stronger in second-stage neural re-rankers than in first-stage retrievers.
LLM-generated texts are more semantically concentrated and easier for PLMs to model (lower perplexity).
A plug-in debiased penalty reduces source bias while keeping or improving human-only retrieval performance.
Results
Dataset size (SciFact+AIGC)
Dataset size (NQ320K+AIGC)
Neural retriever bias (example)
Re-ranker amplification (example)
Embedding semantic match
Perplexity (PLM) comparison
Who Should Care
What To Try In 7 Days
Rewrite a small sample of your corpus with an LLM (e.g., Llama2 or ChatGPT), then run your retriever on the mixed corpus and compare rankings by source.
Measure Relative Δ (NDCG@1) between human-written and LLM-generated targets and flag large negative values, which indicate that the retriever favors LLM-generated documents.
Train a small re-ranker with the paper's debiased penalty and tune the coefficient α to check whether the bias shrinks without losing relevance.
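The measurement step above can be sketched as follows. The Relative Δ formula used here (human minus LLM, divided by their mean, times 100) is an assumption of this sketch, as are the helper names and the data layout; binary relevance is assumed, with every query having a relevant document from each source, so NDCG@1 reduces to whether the top-ranked document is a relevant one from the given source.

```python
from typing import Dict, Set


def ndcg_at_1_by_source(top1: Dict[str, str],
                        qrels: Dict[str, Set[str]],
                        source_of: Dict[str, str],
                        source: str) -> float:
    """Mean NDCG@1 when only relevant documents from `source` count.

    top1 maps query id -> id of the top-ranked document.
    qrels maps query id -> ids of all relevant documents (both sources).
    source_of maps document id -> "human" or "llm" (hypothetical labels).
    """
    hits = [
        1.0 if doc in qrels.get(q, set()) and source_of.get(doc) == source else 0.0
        for q, doc in top1.items()
    ]
    return sum(hits) / len(hits) if hits else 0.0


def relative_delta(metric_human: float, metric_llm: float) -> float:
    """Percentage gap between sources; negative values favor LLM-generated text."""
    mean = (metric_human + metric_llm) / 2.0
    if mean == 0.0:
        return 0.0
    return (metric_human - metric_llm) / mean * 100.0


if __name__ == "__main__":
    # Toy run: four queries, each with a human and an LLM version of the relevant doc.
    qrels = {f"q{i}": {f"d{i}_human", f"d{i}_llm"} for i in range(1, 5)}
    source_of = {d: d.split("_")[1] for docs in qrels.values() for d in docs}
    source_of["d9_human"] = "human"  # an irrelevant document
    top1 = {"q1": "d1_llm", "q2": "d2_human", "q3": "d3_llm", "q4": "d9_human"}

    human = ndcg_at_1_by_source(top1, qrels, source_of, "human")
    llm = ndcg_at_1_by_source(top1, qrels, source_of, "llm")
    print(human, llm, relative_delta(human, llm))
```

What counts as a "large" negative value is a judgment call for your own corpus; the sketch only computes the number.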
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Synthetic LLM-generated texts come from rewriting prompts, which may differ from organic AIGC on the web.
- Experiments focus on two seed datasets (SciFact and NQ320K); generalization to other domains is untested.
- Debiasing was tested on only a few retrievers; effects on large production stacks or downstream user satisfaction need further validation.
When Not To Use
- When the application intentionally prefers LLM-generated summaries or synthesized content over original sources.
- If you lack paired human-vs-LLM examples for training the debiased constraint.
Failure Modes
- Over-penalizing can lower relevance for LLM content that is legitimately higher quality.
- Adversarial prompt styles may evade detection and maintain bias.
- Debiasing depends on having matched human/LLM pairs; noisy pairs reduce effectiveness.
Core Entities
Models
- TF-IDF
- BM25
- ANCE
- BERM
- TAS-B
- Contriever
- MiniLM
- monoT5
- BERT
- Llama2
- ChatGPT
- OpenAI text-embedding-ada-002
Metrics
- NDCG@K
- MAP@K
- Perplexity
- Cosine similarity (embeddings)
- Jaccard term overlap
- Singular values (SVD)
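Two of the metrics listed above, embedding cosine similarity and Jaccard term overlap, are easy to compute when sanity-checking a human/LLM document pair. The sketch below is only illustrative: the whitespace-and-punctuation tokenizer and the function names are assumptions, and the embedding vectors are expected to come from whatever embedding model you already use.

```python
import math
import re
from typing import Sequence


def jaccard_overlap(text_a: str, text_b: str) -> float:
    """Jaccard overlap of lowercase word sets (simple regex tokenizer, an assumption)."""
    terms_a = set(re.findall(r"\w+", text_a.lower()))
    terms_b = set(re.findall(r"\w+", text_b.lower()))
    if not terms_a and not terms_b:
        return 1.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    print(jaccard_overlap("the cat sat", "the cat ran"))
    print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))
```

A pair with high embedding cosine similarity but low term overlap suggests the rewrite kept the meaning while changing the wording, which is the situation the source-bias measurements are meant to probe.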
Datasets
- SciFact+AIGC
- NQ320K+AIGC
- SciFact
- NQ320K
Benchmarks
- SciFact+AIGC
- NQ320K+AIGC

