Overview
Experiments use two public benchmarks extended with LLM rewrites, multiple retrievers and re-rankers, and human checks; results are clear on these datasets but broader web-scale effects are not proven.
Citations: 5
Evidence Strength: 0.80
Confidence: 0.82
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
If search or recommendation systems prefer LLM-generated content, human creators may lose visibility and ranking can be manipulated; businesses must audit source bias to protect content quality and trust.
Summary TLDR
The paper shows that modern neural retrieval models (dense retrievers and neural re-rankers) systematically prefer documents written or rewritten by LLMs over human-written text, even when both carry the same meaning. The authors build two new mixed-source benchmarks (SciFact+AIGC, NQ320K+AIGC) by prompting LLMs to rewrite human documents, diagnose the bias (called source bias) with embedding, SVD and perplexity analyses, and propose a simple plug-in debiased training constraint that reduces the bias by penalizing higher scores on LLM-generated documents.
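The debiased constraint described above can be sketched as a penalty added to an ordinary ranking loss. This is a minimal illustration, not the authors' implementation. The function names, the hinge form max(0, s_LLM - s_human), and the averaging over pairs are assumptions made here. Only two things come from this summary: the idea of penalizing higher scores on LLM-generated documents, and a weighting coefficient α.

```python
from typing import Sequence, Tuple


def debias_penalty(score_human: float, score_llm: float) -> float:
    """Hinge-style penalty (assumed form): positive only when the
    LLM-generated copy outscores its human-written counterpart."""
    return max(0.0, score_llm - score_human)


def debiased_loss(
    rank_loss: float,
    paired_scores: Sequence[Tuple[float, float]],
    alpha: float,
) -> float:
    """Combine a ranking loss with the averaged penalty.

    paired_scores holds (human_score, llm_score) for the same query and
    semantically equivalent documents. alpha weights the penalty term.
    """
    if not paired_scores:
        return rank_loss
    penalty = sum(debias_penalty(h, l) for h, l in paired_scores) / len(paired_scores)
    return rank_loss + alpha * penalty


if __name__ == "__main__":
    # Hypothetical scores: the second pair ranks the LLM copy higher.
    pairs = [(0.8, 0.6), (0.6, 0.9)]
    print(debiased_loss(rank_loss=1.0, paired_scores=pairs, alpha=0.5))
```

With α = 0 the loss reduces to the plain ranking loss. Larger α pushes the model harder toward not scoring the LLM copy above the human one, which is where the over-penalizing risk arises.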
Problem Statement
As the web fills with LLM-generated text, search engines now index both human and LLM text. The core question: do retrieval models favor LLM-generated content over human-written content even when semantics match, and why? The paper builds test data and measures this "source bias," investigates causes, and tests a mitigation.
Main Contribution
Constructed two mixed-source retrieval benchmarks (SciFact+AIGC, NQ320K+AIGC) by rewriting human documents with Llama2 and ChatGPT.
Discovered and measured "source bias": PLM-based neural retrievers and neural re-rankers rank LLM-generated documents higher than equivalent human-written ones.
Key Findings
Neural retrievers prefer LLM-generated documents over semantically equivalent human text.
Bias is stronger in second-stage neural re-rankers than first-stage retrieval.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset size (SciFact+AIGC) | 5,183 documents per source | — | — | SciFact+AIGC | Table 1 reports 5,183 human-written and 5,183 LLM-generated docs. | Table 1 |
| Dataset size (NQ320K+AIGC) | 109,739 human-written documents | — | — | NQ320K+AIGC | Table 1 shows 109,739 human-written docs; Llama2-generated counterpart also created. | Table 1 |
What To Try In 7 Days
Generate small rewrites of your corpus (use Llama2/ChatGPT) and run your retriever to compare rankings by source.
Compute NDCG@1 separately for each source, then the Relative Δ between the two; flag large negative values, which indicate the retriever favors LLM-generated documents.
Train a small re-ranker with the paper's debiased penalty and tune coefficient α to see if bias reduces without losing relevance.
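The measurement step in the list above can be sketched as follows. This summary does not define Relative Δ, so the formula here is an assumption: the human-minus-LLM metric difference divided by the mean of the two, in percent. That choice matches the sign convention above, where negative values favor LLM-generated documents, but it should be checked against the paper. The function names, the example NDCG@1 values, and the -10% flagging threshold are hypothetical.

```python
from typing import Dict, List, Tuple


def relative_delta(metric_human: float, metric_llm: float) -> float:
    """Percent difference between per-source metrics (assumed definition).

    Negative values mean the LLM-generated source scores higher.
    """
    mean = (metric_human + metric_llm) / 2.0
    if mean == 0.0:
        return 0.0  # both metrics are zero, so there is no measurable preference
    return (metric_human - metric_llm) / mean * 100.0


def flag_biased(
    ndcg_by_model: Dict[str, Tuple[float, float]],
    threshold: float = -10.0,
) -> List[str]:
    """Return model names whose Relative Δ falls below the threshold.

    ndcg_by_model maps a model name to (NDCG@1 on human-written docs,
    NDCG@1 on LLM-generated docs).
    """
    return [
        name
        for name, (human, llm) in ndcg_by_model.items()
        if relative_delta(human, llm) < threshold
    ]


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not results from the paper.
    results = {"retriever_a": (0.30, 0.50), "retriever_b": (0.42, 0.40)}
    for name, (human, llm) in results.items():
        print(name, round(relative_delta(human, llm), 1))
    print("flagged:", flag_biased(results))
```

Normalizing by the mean rather than by one source's score keeps the measure symmetric: swapping the two sources only flips the sign.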
Risks & Boundaries
Limitations
Synthetic LLM-generated texts come from rewriting prompts, which may differ from organic AIGC on the web.
Experiments focus on two seed datasets (SciFact and NQ320K); generalization to other domains is untested.
When Not To Use
When the application intentionally prefers LLM-generated summaries or synthesized content over original sources.
If you lack paired human-vs-LLM examples for training the debiased constraint.
Failure Modes
Over-penalizing can lower relevance for LLM content that is legitimately higher quality.
Adversarial prompt styles may evade detection and maintain bias.

