Neural retrievers prefer LLM-generated text — datasets, causes, and a plug-in fix

October 31, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Experiments use two public benchmarks extended with LLM rewrites, multiple retrievers and re-rankers, and human checks; results are clear on these datasets but broader web-scale effects are not proven.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, Gang Wang, Jun Xu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If search or recommendation systems prefer LLM-generated content, human creators may lose visibility and ranking can be manipulated; businesses must audit source bias to protect content quality and trust.

Who Should Care

Summary TLDR

The paper shows that modern neural retrieval models (dense retrievers and neural re-rankers) systematically prefer documents written or rewritten by LLMs over human-written text, even when both carry the same meaning. The authors build two new mixed-source benchmarks (SciFact+AIGC, NQ320K+AIGC) by prompting LLMs to rewrite human documents, diagnose the bias (called source bias) with embedding, SVD and perplexity analyses, and propose a simple plug-in debiased training constraint that reduces the bias by penalizing higher scores on LLM-generated documents.

Problem Statement

As the web fills with LLM-generated text, search engines now index both human and LLM text. The core question: do retrieval models favor LLM-generated content over human-written content even when semantics match, and why? The paper builds test data and measures this "source bias," investigates causes, and tests a mitigation.
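The "semantics match" condition can be spot-checked with two of the metrics listed under Core Entities: Jaccard term overlap and cosine similarity between embeddings. The following is a minimal sketch, not the paper's code. The whitespace tokenization, the function names, and the example sentences are all assumptions made for illustration, and the embedding vectors would come from whichever embedding model you use.

```python
import math
from typing import Sequence


def jaccard_term_overlap(text_a: str, text_b: str) -> float:
    """Jaccard overlap of term sets (lowercased, whitespace-split: an assumption)."""
    terms_a = set(text_a.lower().split())
    terms_b = set(text_b.lower().split())
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


def cosine_similarity(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up example pair: a human sentence and a hypothetical LLM rewrite.
human_doc = "aspirin reduces the risk of heart attack"
llm_doc = "aspirin lowers the risk of heart attack"
overlap = jaccard_term_overlap(human_doc, llm_doc)
```

A pair whose embedding similarity is high is a plausible candidate for "same meaning", while the term overlap shows how much of the wording the rewrite changed. Any thresholds would need to be chosen per corpus.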

Main Contribution

Constructed two mixed-source retrieval benchmarks (SciFact+AIGC, NQ320K+AIGC) by rewriting human documents with Llama2 and ChatGPT.

Discovered and measured "source bias": PLM-based neural retrievers and neural re-rankers rank LLM-generated documents higher than equivalent human-written ones.

Key Findings

Neural retrievers prefer LLM-generated documents over semantically equivalent human text.

Numbers: ANCE Relative Δ NDCG@1 = -47.0% (SciFact+AIGC), Contriever Relative Δ NDCG@1 = -25.5%

Practical Use: Expect dense retrievers to push LLM-written pages higher in search results; audit rankings if you care about source fairness.

Evidence Ref: Table 4 (SciFact+AIGC mixed-corpus results)

Bias is stronger in second-stage neural re-rankers than first-stage retrieval.

Numbers: monoT5 re-ranker NDCG@1 Relative Δ = -67.3% (favoring LLM-generated)

Practical Use: Re-ranking layers can amplify source bias; apply any fix to the re-rankers as well, not just to first-stage retrievers.

Evidence Ref: Table 6 (re-ranker results on SciFact+AIGC)
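The Relative Δ NDCG@1 figures above can be computed for your own system along the following lines. This is a hedged sketch. It assumes Relative Δ is the human-minus-LLM metric difference divided by the mean of the two metrics, expressed as a percentage. That definition is not spelled out on this page, but it matches the convention that negative values favor LLM-generated text. It also assumes binary relevance and that every query has a relevant document from each source, so NDCG@1 reduces to "is the top result a relevant document from this source". All function names and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def ndcg_at_1_for_source(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    doc_source: Dict[str, str],
    source: str,
) -> float:
    """Mean NDCG@1 where only relevant documents from `source` count as hits."""
    if not rankings:
        return 0.0
    hits = 0
    for query_id, ranked_doc_ids in rankings.items():
        if not ranked_doc_ids:
            continue
        top = ranked_doc_ids[0]
        if top in relevant.get(query_id, set()) and doc_source.get(top) == source:
            hits += 1
    return hits / len(rankings)


def relative_delta(metric_human: float, metric_llm: float) -> float:
    """Percentage gap between the two metrics; negative means LLM text is favored."""
    mean = (metric_human + metric_llm) / 2
    if mean == 0:
        return 0.0
    return (metric_human - metric_llm) / mean * 100


# Toy mixed corpus: each human document "dN_h" has an LLM rewrite "dN_g".
doc_source = {"d1_h": "human", "d1_g": "llm", "d2_h": "human", "d2_g": "llm"}
relevant = {"q1": {"d1_h", "d1_g"}, "q2": {"d2_h", "d2_g"},
            "q3": {"d1_h", "d1_g"}, "q4": {"d2_h", "d2_g"}}
rankings = {"q1": ["d1_g", "d1_h"], "q2": ["d2_g", "d2_h"],
            "q3": ["d1_h", "d1_g"], "q4": ["d1_g", "d2_h"]}

ndcg_human = ndcg_at_1_for_source(rankings, relevant, doc_source, "human")
ndcg_llm = ndcg_at_1_for_source(rankings, relevant, doc_source, "llm")
delta = relative_delta(ndcg_human, ndcg_llm)
```

In this toy run the LLM rewrite wins the top slot more often than the human original, so `delta` comes out negative. Normalizing by the mean keeps the gap comparable between retrievers whose absolute NDCG differs.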

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset size (SciFact+AIGC) | 5,183 documents per source | - | - | SciFact+AIGC | Table 1 reports 5,183 human-written and 5,183 LLM-generated docs. | Table 1 |
| Dataset size (NQ320K+AIGC) | 109,739 human-written documents | - | - | NQ320K+AIGC | Table 1 shows 109,739 human-written docs; Llama2-generated counterpart also created. | Table 1 |

What To Try In 7 Days

Generate small rewrites of your corpus (use Llama2/ChatGPT) and run your retriever to compare rankings by source.

Measure Relative Δ (NDCG@1) per source and flag large negative values (favoring LLM).

Train a small re-ranker with the paper's debiased penalty and tune the coefficient α to check whether the bias shrinks without hurting relevance.
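The debiased penalty in the last step can be sketched as a hinge term added to the usual ranking loss. This is an illustration under stated assumptions, not the authors' implementation. It assumes the penalty is the sum over paired documents of max(0, score of the LLM version minus score of the human version), weighted by α. The function names and scores are hypothetical, and the ranking loss is passed in as a precomputed number.

```python
from typing import Sequence


def debias_penalty(scores_human: Sequence[float], scores_llm: Sequence[float]) -> float:
    """Sum of hinge terms: penalize only pairs where the LLM version outscores the human one."""
    if len(scores_human) != len(scores_llm):
        raise ValueError("scores must be paired one-to-one")
    return sum(
        max(0.0, s_llm - s_human)
        for s_human, s_llm in zip(scores_human, scores_llm)
    )


def total_loss(
    rank_loss: float,
    scores_human: Sequence[float],
    scores_llm: Sequence[float],
    alpha: float,
) -> float:
    """Ranking loss plus alpha times the debias penalty (alpha is the tunable coefficient)."""
    return rank_loss + alpha * debias_penalty(scores_human, scores_llm)


# Hypothetical model scores for three (human, LLM-rewrite) document pairs.
scores_human = [0.8, 0.4, 0.5]
scores_llm = [0.9, 0.1, 0.7]
penalty = debias_penalty(scores_human, scores_llm)

# Sweep alpha to see how strongly the penalty weighs against the ranking loss.
losses = {alpha: total_loss(1.0, scores_human, scores_llm, alpha)
          for alpha in (0.0, 0.5, 1.0)}
```

In an actual training loop the same expression would be written with differentiable tensor operations so that gradients flow through the scores. Because the hinge is zero whenever the human version already scores at least as high, the penalty leaves those pairs untouched, which is one way to limit the over-penalizing risk noted under Failure Modes.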

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Synthetic LLM-generated texts come from rewriting prompts, which may differ from organic AIGC on the web.

Experiments focus on two seed datasets (SciFact and NQ320K); generalization to other domains is untested.

When Not To Use

When the application intentionally prefers LLM-generated summaries or synthesized content over original sources.

If you lack paired human-vs-LLM examples for training the debiased constraint.

Failure Modes

Over-penalizing can lower relevance for LLM content that is legitimately higher quality.

Adversarial prompt styles may evade detection and maintain bias.

Core Entities

Models

TF-IDF, BM25, ANCE, BERM, TAS-B, Contriever, MiniLM, monoT5, BERT, Llama2, ChatGPT, OpenAI text-embedding-ada-002

Metrics

NDCG@K, MAP@K, Perplexity, Cosine similarity (embeddings), Jaccard term overlap, Singular values (SVD)

Datasets

SciFact+AIGC, NQ320K+AIGC, SciFact, NQ320K

Benchmarks

SciFact+AIGC, NQ320K+AIGC