Neural retrievers prefer LLM-generated text — datasets, causes, and a plug-in fix

October 31, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

5

Authors

Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, Gang Wang, Jun Xu

Why It Matters For Business

If search or recommendation systems prefer LLM-generated content, human creators may lose visibility and ranking can be manipulated; businesses must audit source bias to protect content quality and trust.

Summary TLDR

The paper shows that modern neural retrieval models (dense retrievers and neural re-rankers) systematically prefer documents written or rewritten by LLMs over human-written text, even when both carry the same meaning. The authors build two mixed-source benchmarks (SciFact+AIGC, NQ320K+AIGC) by prompting LLMs to rewrite human documents, and diagnose the bias, which they call source bias, with embedding, SVD, and perplexity analyses. They also propose a simple plug-in debiased training constraint that reduces the bias by penalizing cases where an LLM-generated document outscores its human-written counterpart.

Problem Statement

As the web fills with LLM-generated text, search engines now index both human and LLM text. The core question: do retrieval models favor LLM-generated content over human-written content even when semantics match, and why? The paper builds test data and measures this "source bias," investigates causes, and tests a mitigation.

Main Contribution

Constructed two mixed-source retrieval benchmarks (SciFact+AIGC, NQ320K+AIGC) by rewriting human documents with Llama2 and ChatGPT.

Discovered and measured "source bias": PLM-based neural retrievers and neural re-rankers rank LLM-generated documents higher than equivalent human-written ones.

Analyzed causes via embeddings, SVD (topic concentration), and PLM perplexity; showed that LLM text is more topically focused and has lower perplexity under PLMs.

Proposed a simple plug-in debiased constraint for training that penalizes cases where an LLM-generated document scores higher than its human counterpart.
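The debiased constraint described above can be sketched as a ranking loss plus a weighted pairwise penalty. This is a minimal illustration, not the authors' code: the hinge form `max(0, s_llm - s_human)`, the summation over pairs, and the names `debiased_loss`, `rank_loss`, and `alpha` are assumptions chosen to match the description (a penalty that fires only when the LLM-generated document outscores its human counterpart, scaled by a coefficient α).

```python
def debiased_loss(rank_loss, human_scores, llm_scores, alpha=0.1):
    """Add a source-bias penalty to an existing ranking loss.

    human_scores[i] and llm_scores[i] are the model's relevance scores for
    the human-written document and its LLM rewrite, for the same query.
    The hinge form below is an assumption, not taken from the paper's code.
    """
    if len(human_scores) != len(llm_scores):
        raise ValueError("scores must come in matched human/LLM pairs")
    # The penalty is non-zero only when the LLM rewrite outscores the original.
    penalty = sum(max(0.0, g - h) for h, g in zip(human_scores, llm_scores))
    return rank_loss + alpha * penalty


if __name__ == "__main__":
    # First pair: LLM rewrite scores higher (penalized). Second pair: it does not.
    print(debiased_loss(1.0, [0.8, 0.5], [0.9, 0.2], alpha=0.5))
```

With `alpha=0` the function returns the original ranking loss unchanged, which is what makes the constraint a plug-in: it can be added to an existing training objective without altering the rest of the pipeline.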

Key Findings

Neural retrievers prefer LLM-generated documents over semantically equivalent human text.

Numbers: ANCE Relative Δ NDCG@1 = -47.0% (SciFact+AIGC), Contriever Relative Δ NDCG@1 = -25.5%

Bias is stronger in second-stage neural re-rankers than first-stage retrieval.

Numbers: monoT5 re-ranker NDCG@1 Relative Δ = -67.3% (favoring LLM-generated)

LLM-generated texts are more semantically concentrated and easier for PLMs to model.

Numbers: Embedding cosine similarity between LLM rewrite and original >0.95; LLM texts show larger top singular values and lower PLM perplexity

A plug-in debiased penalty reduces source bias while keeping or improving human-only retrieval performance.

Numbers: Debiased coefficient α shifts Relative Δ from negative to positive across metrics (plots in Figures 8–9)
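To make the Relative Δ figures in this section easier to read, here is a small sketch of how such a value could be computed from two metric scores. It assumes a symmetric definition, the human-minus-LLM difference divided by the mean of the two scores and expressed as a percentage, which is consistent with negative values favoring LLM-generated text. The function name `relative_delta` is hypothetical.

```python
def relative_delta(metric_human, metric_llm):
    """Relative difference (in percent) between human- and LLM-targeted scores.

    Assumed definition: (human - llm) / mean(human, llm) * 100.
    Negative values mean the retriever ranks LLM-generated documents higher.
    """
    mean = (metric_human + metric_llm) / 2.0
    if mean == 0:
        return 0.0  # both scores are zero, so there is no measurable preference
    return (metric_human - metric_llm) / mean * 100.0


if __name__ == "__main__":
    # Toy values, not results from the paper.
    print(relative_delta(0.25, 0.75))  # negative: favors LLM-generated text
    print(relative_delta(0.75, 0.25))  # positive: favors human-written text
```

Normalizing by the mean rather than by the human score keeps the measure symmetric, so swapping the two sources flips only the sign.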

Results

Dataset size (SciFact+AIGC)

Value: 5,183 documents per source

Dataset size (NQ320K+AIGC)

Value: 109,739 human-written documents

Neural retriever bias (example)

Value: ANCE Relative Δ NDCG@1 = -47.0% (favoring LLM)

Baseline: human-written target

Re-ranker amplification (example)

Value: monoT5 NDCG@1 Relative Δ = -67.3% (favoring LLM)

Baseline: human-written target

Embedding semantic match

Value: cosine similarity mostly >0.95 between rewritten and original

Perplexity (PLM) comparison

Value: LLM-generated texts have lower perplexity under BERT

Baseline: human-written texts
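The semantic-match check reported above (embedding cosine similarity between each rewrite and its original) can be reproduced in spirit with a few lines of code, alongside a term-level comparison such as Jaccard overlap. The sketch below is illustrative only: the embedding vectors are assumed to come from whatever embedding model you use, and the lowercase whitespace tokenization in `jaccard_overlap` is a simplifying assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # undefined for zero vectors; treat as no similarity
    return dot / (norm_u * norm_v)


def jaccard_overlap(text_a, text_b):
    """Jaccard overlap of the two texts' term sets (whitespace tokens, lowercased)."""
    terms_a = set(text_a.lower().split())
    terms_b = set(text_b.lower().split())
    if not terms_a and not terms_b:
        return 1.0  # two empty texts are trivially identical
    return len(terms_a & terms_b) / len(terms_a | terms_b)


if __name__ == "__main__":
    # Toy inputs; real vectors would come from an embedding model.
    print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05]))
    print(jaccard_overlap("aspirin reduces fever", "aspirin lowers fever"))
```

Reading the two measures together is informative: a high embedding similarity with a modest term overlap indicates that a rewrite preserved the meaning while changing the wording.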

Who Should Care

What To Try In 7 Days

Generate LLM rewrites of a small sample of your corpus (use Llama2/ChatGPT) and run your retriever over the mixed corpus to compare rankings by source.

Measure NDCG@1 separately for human-written and LLM-generated targets, compute the Relative Δ between them, and flag large negative values (favoring LLM).

Train a small re-ranker with the paper's debiased penalty and tune coefficient α to see if bias reduces without losing relevance.
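The audit steps above rely on scoring the same ranking twice, once against human-written targets and once against their LLM rewrites. Below is a minimal sketch of that per-source evaluation, assuming binary relevance and a mixed corpus in which each relevant document exists in both versions. The function names and the toy document ids are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k for one query."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def ndcg_by_source(ranked_ids, relevant_by_source, k):
    """Score one mixed-corpus ranking separately against each source's targets."""
    return {
        source: ndcg_at_k(ranked_ids, relevant_ids, k)
        for source, relevant_ids in relevant_by_source.items()
    }


if __name__ == "__main__":
    # Toy ranking where the LLM rewrite of the relevant document is ranked first.
    ranking = ["d1_llm", "d1_human", "d7_human"]
    targets = {"human": {"d1_human"}, "llm": {"d1_llm"}}
    print(ndcg_by_source(ranking, targets, k=1))
    print(ndcg_by_source(ranking, targets, k=3))
```

Averaging each source's score over all queries gives the two numbers to compare. A consistent gap in favor of the `llm` targets is the signal worth flagging.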

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Synthetic LLM-generated texts come from rewriting prompts, which may differ from organic AIGC on the web.
  • Experiments focus on two seed datasets (SciFact and NQ320K); generalization to other domains is untested.
  • Debiasing tested on a few retrievers; effects on large production stacks or downstream user satisfaction need further validation.

When Not To Use

  • When the application intentionally prefers LLM-generated summaries or synthesized content over original sources.
  • If you lack paired human-vs-LLM examples for training the debiased constraint.

Failure Modes

  • Over-penalizing can lower relevance for LLM content that is legitimately higher quality.
  • Adversarial prompt styles may evade detection and maintain bias.
  • Debiasing depends on having matched human/LLM pairs; noisy pairs reduce effectiveness.

Core Entities

Models

  • TF-IDF
  • BM25
  • ANCE
  • BERM
  • TAS-B
  • Contriever
  • MiniLM
  • monoT5
  • BERT
  • Llama2
  • ChatGPT
  • OpenAI text-embedding-ada-002

Metrics

  • NDCG@K
  • MAP@K
  • Perplexity
  • Cosine similarity (embeddings)
  • Jaccard term overlap
  • Singular values (SVD)

Datasets

  • SciFact+AIGC
  • NQ320K+AIGC
  • SciFact
  • NQ320K

Benchmarks

  • SciFact+AIGC
  • NQ320K+AIGC