Neural retrievers prefer LLM-generated text — datasets, causes, and a plug-in fix

Overview

Decision Snapshot: Ready For Pilot

Experiments use two public benchmarks extended with LLM rewrites, multiple retrievers and re-rankers, and human checks; results are clear on these datasets but broader web-scale effects are not proven.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, Gang Wang, Jun Xu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If search or recommendation systems prefer LLM-generated content, human creators may lose visibility and ranking can be manipulated; businesses must audit source bias to protect content quality and trust.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

The paper shows that modern neural retrieval models (dense retrievers and neural re-rankers) systematically prefer documents written or rewritten by LLMs over human-written text, even when both carry the same meaning. The authors build two new mixed-source benchmarks (SciFact+AIGC, NQ320K+AIGC) by prompting LLMs to rewrite human documents, diagnose the bias (called source bias) with embedding, SVD and perplexity analyses, and propose a simple plug-in debiased training constraint that reduces the bias by penalizing higher scores on LLM-generated documents.

Problem Statement

As the web fills with LLM-generated text, search engines now index both human and LLM text. The core question: do retrieval models favor LLM-generated content over human-written content even when semantics match, and why? The paper builds test data and measures this "source bias," investigates causes, and tests a mitigation.

Main Contribution

Constructed two mixed-source retrieval benchmarks (SciFact+AIGC, NQ320K+AIGC) by rewriting human documents with Llama2 and ChatGPT.

Discovered and measured "source bias": PLM-based neural retrievers and neural re-rankers rank LLM-generated documents higher than equivalent human-written ones.

Key Findings

Neural retrievers prefer LLM-generated documents over semantically equivalent human text.

Numbers: ANCE Relative Δ NDCG@1 = -47.0% (SciFact+AIGC), Contriever Relative Δ NDCG@1 = -25.5%

Practical Use: Expect dense retrievers to push LLM-written pages higher in search results; audit rankings if you care about source fairness.

Evidence Ref: Table 4 (SciFact+AIGC mixed-corpus results)
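
The Relative Δ values above compare a metric computed on human-written documents with the same metric computed on LLM-generated ones. The sketch below assumes the gap is normalized by the mean of the two scores and expressed as a percentage, which agrees with the sign convention used here (negative favors LLM-generated text). The exact formula, the simplified per-source NDCG@1, the function names, and the toy rankings are illustrative assumptions, not taken from the paper's code.

```python
from statistics import mean


def ndcg_at_1_for_source(ranking, relevant_ids, source):
    """Binary-relevance NDCG@1 counted for one source only (assumed setup).

    ranking: list of (doc_id, source) tuples, best match first.
    Returns 1.0 only if the top document is relevant AND comes from `source`.
    """
    if not ranking:
        return 0.0
    top_id, top_source = ranking[0]
    return 1.0 if top_source == source and top_id in relevant_ids else 0.0


def relative_delta(metric_human, metric_llm):
    """Percentage gap between the sources, normalized by their mean.

    Negative values mean the LLM-generated side scored higher.
    """
    avg = (metric_human + metric_llm) / 2.0
    if avg == 0:
        return 0.0
    return (metric_human - metric_llm) / avg * 100.0


# Toy mixed-corpus results: each query has a human document and its LLM rewrite.
queries = [
    ([("d1-llm", "llm"), ("d1", "human")], {"d1", "d1-llm"}),
    ([("d2-llm", "llm"), ("d2", "human")], {"d2", "d2-llm"}),
    ([("d3", "human"), ("d3-llm", "llm")], {"d3", "d3-llm"}),
    ([("d4-llm", "llm"), ("d4", "human")], {"d4", "d4-llm"}),
]

human_ndcg = mean(ndcg_at_1_for_source(r, rel, "human") for r, rel in queries)
llm_ndcg = mean(ndcg_at_1_for_source(r, rel, "llm") for r, rel in queries)
delta = relative_delta(human_ndcg, llm_ndcg)
print(f"human={human_ndcg:.2f} llm={llm_ndcg:.2f} relative_delta={delta:.1f}%")
```

In this toy run the LLM rewrite takes the top slot in three of four queries, so the gap comes out strongly negative. Normalizing by the mean keeps the value comparable across retrievers with different absolute NDCG levels.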

Bias is stronger in second-stage neural re-rankers than first-stage retrieval.

Numbers: monoT5 re-ranker NDCG@1 Relative Δ = -67.3% (favoring LLM-generated)

Practical Use: Re-ranking layers can amplify source bias; any fix should cover re-rankers, not just first-stage retrievers.

Evidence Ref: Table 6 (re-ranker results on SciFact+AIGC)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size (SciFact+AIGC)	5,183 documents per source	—	—	SciFact+AIGC	Table 1 reports 5,183 human-written and 5,183 LLM-generated docs.	Table 1
Dataset size (NQ320K+AIGC)	109,739 human-written documents	—	—	NQ320K+AIGC	Table 1 shows 109,739 human-written docs; Llama2-generated counterpart also created.	Table 1

What To Try In 7 Days

Rewrite a small sample of your corpus with an LLM (e.g., Llama2 or ChatGPT), then run your retriever on the mixed corpus and compare rankings by source.

Measure Relative Δ (NDCG@1) per source and flag large negative values (favoring LLM).

Train a small re-ranker with the paper's debiased penalty and tune coefficient α to see if bias reduces without losing relevance.
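
The last step above can be sketched as a penalty added to the usual ranking loss. This is a minimal sketch, assuming a hinge-style term that is positive only when an LLM-generated rewrite scores higher than its paired human-written document; the hinge form, the function names, and the example scores are assumptions for illustration and may differ from the paper's exact formulation. Only the idea of penalizing higher scores on LLM-generated documents, weighted by a coefficient α, comes from the summary.

```python
def debias_penalty(human_scores, llm_scores):
    """Mean hinge penalty over paired documents (assumed form).

    human_scores[i] and llm_scores[i] are the model's relevance scores for a
    human-written document and its LLM rewrite, for the same query.
    A pair contributes only when the LLM rewrite outscores the human original.
    """
    pairs = list(zip(human_scores, llm_scores))
    if not pairs:
        return 0.0
    return sum(max(0.0, llm - human) for human, llm in pairs) / len(pairs)


def total_loss(ranking_loss, human_scores, llm_scores, alpha):
    """Ranking loss plus the alpha-weighted debias penalty."""
    return ranking_loss + alpha * debias_penalty(human_scores, llm_scores)


# Two paired examples: the first rewrite outscores its human original by 0.5,
# the second does not, so only the first pair is penalized.
human = [2.0, 1.0]
llm = [2.5, 0.5]
penalty = debias_penalty(human, llm)
loss = total_loss(ranking_loss=1.0, human_scores=human, llm_scores=llm, alpha=2.0)
print(penalty, loss)
```

Setting alpha to 0 recovers the plain ranking loss, so sweeping alpha upward from 0 while tracking both Relative Δ and overall NDCG is one way to look for a setting that reduces bias without hurting relevance.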

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/KID-22/Source-Bias

Data URLs

https://github.com/KID-22/Source-Bias

Risks & Boundaries

Limitations

Synthetic LLM-generated texts come from rewriting prompts, which may differ from organic AIGC on the web.

Experiments focus on two seed datasets (SciFact and NQ320K); generalization to other domains is untested.

When Not To Use

When the application intentionally prefers LLM-generated summaries or synthesized content over original sources.

If you lack paired human-vs-LLM examples for training the debiased constraint.

Failure Modes

Over-penalizing can lower relevance for LLM content that is legitimately higher quality.

Adversarial prompt styles may evade detection and maintain bias.

Core Entities

Models

TF-IDF, BM25, ANCE, BERM, TAS-B, Contriever, MiniLM, monoT5, BERT, Llama2, ChatGPT, OpenAI text-embedding-ada-002

Metrics

NDCG@K, MAP@K, Perplexity, Cosine similarity (embeddings), Jaccard term overlap, Singular values (SVD)
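
Two of the metrics listed above, Jaccard term overlap and embedding cosine similarity, are simple enough to sketch directly. A plausible use, stated here as an assumption, is checking that a rewrite keeps the meaning of its source (high cosine similarity) while changing the wording (lower term overlap). The whitespace tokenizer and the hand-written vectors below are placeholders; in practice the vectors would come from an embedding model.

```python
import math


def jaccard_term_overlap(text_a, text_b):
    """Share of distinct lowercase terms that the two texts have in common."""
    terms_a = set(text_a.lower().split())
    terms_b = set(text_b.lower().split())
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


human_text = "the cell divides rapidly"
llm_text = "the cell splits quickly"
overlap = jaccard_term_overlap(human_text, llm_text)  # 2 shared terms of 6

# Placeholder embeddings; a real check would embed both texts with a model.
similarity = cosine_similarity([3.0, 4.0], [4.0, 3.0])
print(round(overlap, 3), round(similarity, 3))
```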

Datasets

SciFact+AIGC, NQ320K+AIGC, SciFact, NQ320K

Benchmarks

SciFact+AIGC, NQ320K+AIGC
