Pretraining memory and corpus-frequency biases drive much of LLM hallucination on inference tasks

May 23, 2023 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.5

Cost Impact Score

0.2

Citation Count

18

Authors

Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, Mark Steedman

Why It Matters For Business

LLMs can assert conclusions drawn from their training data or corpus statistics rather than the given context. That puts QA, summarization, and policy extraction at risk of silent misinformation; apply attestation checks and bias-controlled tests before deployment.

Summary TLDR

This paper runs controlled prompting experiments on LLaMA-65B, GPT-3.5 (text-davinci-003), and PaLM-540B and identifies two concrete sources of false-positive 'Entail' predictions (hallucination) in inference tasks. First, LLMs tend to affirm a conclusion when the hypothesis sentence appears in their training data (attestation bias). Second, they favor entailment when the hypothesis uses a predicate that is more frequent than the premise predicate (relative frequency bias). Both biases stem from pretraining statistics, and performance drops sharply on test examples whose gold labels conflict with them.
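The relative frequency bias can be expressed as a simple heuristic. The sketch below is an illustration only, not the paper's code: the function name `frequency_bias_predicts_entail`, the `corpus_counts` mapping, and the toy counts are all hypothetical, and a real test would take counts from a frequency source such as Google N-grams.

```python
from typing import Mapping


def frequency_bias_predicts_entail(
    premise_predicate: str,
    hypothesis_predicate: str,
    corpus_counts: Mapping[str, int],
) -> bool:
    """Return True when the frequency heuristic alone would favor 'Entail'.

    The heuristic fires only when the hypothesis predicate is strictly more
    frequent than the premise predicate; unseen predicates count as 0.
    """
    premise_freq = corpus_counts.get(premise_predicate, 0)
    hypothesis_freq = corpus_counts.get(hypothesis_predicate, 0)
    return hypothesis_freq > premise_freq


# Hypothetical counts for illustration only (not real corpus statistics).
toy_counts = {"buy": 900, "own": 1500, "purchase": 300}

# "purchase" -> "buy": the hypothesis predicate is more frequent, so the
# heuristic favors 'Entail' regardless of what the premise actually supports.
print(frequency_bias_predicts_entail("purchase", "buy", toy_counts))
```

An example is adversarial to this bias when the heuristic's answer disagrees with the gold label.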

Problem Statement

LLMs are trusted for inference tasks (e.g., question answering, summarization), but they sometimes hallucinate by asserting conclusions not supported by provided premises. The paper asks: which pretraining-derived biases cause these false positives, and how much do they harm real NLI performance?

Main Contribution

Show and measure an attestation bias: models more often predict 'Entail' when the hypothesis matches text the model likely saw in pretraining.

Show and measure a relative frequency bias: models favor entailment if the hypothesis predicate is more corpus-frequent than the premise predicate.

Quantify impact: construct adversarial subsets and show large drops in discriminative performance (AUC norm) when labels conflict with these biases.

Key Findings

Attestation (memorized sentence) strongly raises false positive entailments.

Numbers: A false 'Entail' prediction is 1.9x (LLaMA), 2.2x (GPT-3.5), and 2.0x (PaLM) more likely when the hypothesis is attested.

Relative term-frequency of predicates biases entailment decisions.

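Numbers: A false 'Entail' prediction is 1.6x (LLaMA), 1.8x (GPT-3.5), and 2.0x (PaLM) more likely when the hypothesis predicate is more frequent than the premise predicate.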

Bias-aligned vs. bias-adversarial examples cause large performance swings.

Numbers: AUC norm drops by roughly 47–74 points on attestation-adversarial subsets and by about 10 points on frequency-adversarial subsets.

Results

Attestation bias multiplicative effect

Value: 1.9x (LLaMA), 2.2x (GPT-3.5), 2.0x (PaLM)

Baseline: When hypothesis not attested

Relative frequency bias multiplicative effect

Value: 1.6x (LLaMA), 1.8x (GPT-3.5), 2.0x (PaLM)

Baseline: When premise frequency ≥ hypothesis frequency

AUC norm drop when attestation bias is adversarial

Value: 47–74 points drop (model-dependent)

Baseline: Bias-consistent subset AUC norm
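One plausible reading of the multiplicative effects above is a ratio of false-positive rates: how often the model wrongly predicts 'Entail' when the bias condition holds, divided by how often it does so under the baseline condition. The sketch below assumes that reading; the `Record` type, the function names, and the toy data are hypothetical, and the paper's exact estimation procedure may differ.

```python
from typing import NamedTuple, Sequence


class Record(NamedTuple):
    gold_entail: bool       # gold label is 'Entail'
    predicted_entail: bool  # model predicted 'Entail'
    bias_present: bool      # e.g. hypothesis attested, or hypothesis more frequent


def false_entail_rate(records: Sequence[Record]) -> float:
    """Fraction of gold non-entailment examples that were predicted 'Entail'."""
    negatives = [r for r in records if not r.gold_entail]
    if not negatives:
        raise ValueError("need at least one gold non-entailment example")
    return sum(r.predicted_entail for r in negatives) / len(negatives)


def bias_multiplier(records: Sequence[Record]) -> float:
    """Ratio of false-Entail rates: bias condition over baseline condition."""
    with_bias = [r for r in records if r.bias_present]
    without_bias = [r for r in records if not r.bias_present]
    baseline = false_entail_rate(without_bias)
    if baseline == 0:
        raise ValueError("baseline false-Entail rate is zero; ratio undefined")
    return false_entail_rate(with_bias) / baseline


# Toy data, invented for illustration: 2 of 4 false positives with the bias,
# 1 of 4 without it. The final gold-Entail record is ignored by the rate.
toy_records = (
    [Record(False, True, True)] * 2
    + [Record(False, False, True)] * 2
    + [Record(False, True, False)]
    + [Record(False, False, False)] * 3
    + [Record(True, True, True)]
)
print(bias_multiplier(toy_records))
```

A multiplier near 1.0 would indicate that the condition has little effect on false positives.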

What To Try In 7 Days

Add an attestation probe: ask the model whether the hypothesis is 'attested/unknown/false' before trusting outputs.

Run bias-controlled splits: evaluate models on examples adversarial to attestation and frequency biases.

Mask or canonicalize named entities in a staging test to see how much outputs rely on memorized entities.
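The attestation probe and the bias-controlled split could be prototyped along the following lines. This is a rough sketch under stated assumptions: the prompt wording, the `NliExample` fields, and both function names are hypothetical, and the call that sends the probe prompt to a model is left out on purpose. The split treats an example as bias-consistent when the gold label agrees with the attestation judgment and as adversarial when it disagrees.

```python
from typing import Dict, List, NamedTuple, Sequence


class NliExample(NamedTuple):
    premise: str
    hypothesis: str
    gold_entail: bool          # gold label is 'Entail'
    hypothesis_attested: bool  # result of the attestation probe


def build_attestation_prompt(hypothesis: str) -> str:
    """Build a premise-free probe asking whether the model 'knows' the hypothesis.

    The wording is an assumption, not the prompt used in the paper.
    """
    return (
        "Without any additional context, is the following statement "
        "attested, unknown, or false?\n"
        f"Statement: {hypothesis}\n"
        "Answer (attested/unknown/false):"
    )


def split_by_attestation_bias(
    examples: Sequence[NliExample],
) -> Dict[str, List[NliExample]]:
    """Partition examples by whether the gold label agrees with the bias."""
    splits: Dict[str, List[NliExample]] = {"consistent": [], "adversarial": []}
    for example in examples:
        agrees = example.gold_entail == example.hypothesis_attested
        splits["consistent" if agrees else "adversarial"].append(example)
    return splits
```

Reporting metrics separately on the two subsets shows how much of a model's score depends on the bias lining up with the label.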

Agent Features

Memory

  • Propositional memory (sentence-level memorization)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Paper tests two biases but does not claim to cover all hallucination sources.
  • Google N-grams is a proxy for pretraining frequency and may not match private pretraining corpora exactly.
  • Prompting choices and few-shot examples are fixed; other prompts might change magnitudes but core trends persist.

When Not To Use

  • Do not rely solely on model outputs for high-stakes inference tasks without bias controls.
  • Avoid using raw LLM predictions for knowledge extraction where user-provided context must be the only source of truth.

Failure Modes

  • Model asserts hypothesis because it matches memorized training sentences, not because the premise supports it.
  • Named entities act as memory indices, causing over-reliance on entity identity instead of predicate logic.
  • Models prefer entailment when hypothesis predicates are more corpus-frequent, producing systematic false affirmations.
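
To check how much a prediction leans on entity identity, a staging test can replace named entities with typed placeholders before prompting. The sketch below is illustrative only: `mask_entities` is a hypothetical helper, the placeholder format is an assumption, and the caller supplies the entity-to-type mapping (no NER library is used here). The same mapping should be applied to premise and hypothesis so shared entities stay aligned.

```python
import re
from typing import Dict, Mapping


def mask_entities(text: str, entity_types: Mapping[str, str]) -> str:
    """Replace each known entity name with a stable typed placeholder.

    Entities of the same type get distinct indices in mapping order,
    so two organizations become "[organization 1]" and "[organization 2]".
    """
    placeholders: Dict[str, str] = {}
    counters: Dict[str, int] = {}
    for name, entity_type in entity_types.items():
        counters[entity_type] = counters.get(entity_type, 0) + 1
        placeholders[name] = f"[{entity_type} {counters[entity_type]}]"
    if not placeholders:
        return text
    # Longest names first so "New York City" wins over "New York".
    names = sorted(placeholders, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    # A single substitution pass avoids re-matching inside inserted placeholders.
    return pattern.sub(lambda match: placeholders[match.group(0)], text)


entities = {"Google": "organization", "YouTube": "organization"}
print(mask_entities("Google bought YouTube.", entities))
print(mask_entities("Google owns YouTube.", entities))
```

If predictions change substantially after masking, the model was likely relying on memorized facts about the entities rather than on the premise.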

Core Entities

Models

  • LLaMA-65B
  • GPT-3.5 (text-davinci-003)
  • PaLM-540B
  • GPT-4 (analysis in Appendix F)

Metrics

  • AUC norm
  • Precision
  • Recall
  • F1
  • Folded Entail probability estimates

Datasets

  • Levy/Holt (directional NLI)
  • RTE-1
  • Google N-grams (1950-2019)
  • NewsCrawl

Benchmarks

  • Natural Language Inference (NLI) directional subset

Context Entities

Models

  • Alpaca
  • Vicuna
  • OPT, GPT-J (omitted)

Datasets

  • MMLU (excluded)
  • Natural Questions (excluded)
  • OpenBookQA (excluded)