Pretraining memory and corpus-frequency biases drive much of LLM hallucination on inference

Overview

Decision Snapshot: Needs Validation

The paper runs controlled behavioral tests across three major LLM families and shows consistent biases rooted in pretraining; numeric evidence and dataset-controlled splits back the claims.

Citations: 18

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 30%

Novelty: 50%

Authors

Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, Mark Steedman

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs can assert conclusions drawn from their training data or corpus statistics rather than the given context. That puts QA, summarization, and policy extraction at risk of silent misinformation; apply attestation checks and bias-controlled tests before deployment.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This paper runs controlled prompting tests on LLaMA-65B, GPT-3.5 (text-davinci-003), and PaLM-540B and identifies two concrete sources of false-positive 'Entail' predictions (hallucination) in inference tasks. First, LLMs tend to assert a conclusion when the hypothesis sentence appears in their training data (attestation bias). Second, they favor entailment when the hypothesis expresses a more frequent predicate than the premise (relative frequency bias). Both biases stem from pretraining statistics, and both substantially degrade inference performance on test examples constructed to be adversarial to them.

Problem Statement

LLMs are trusted for inference tasks (e.g., question answering, summarization), but they sometimes hallucinate by asserting conclusions not supported by provided premises. The paper asks: which pretraining-derived biases cause these false positives, and how much do they harm real NLI performance?

Main Contribution

Show and measure an attestation bias: models more often predict 'Entail' when the hypothesis matches text the model likely saw in pretraining.

Show and measure a relative frequency bias: models favor entailment if the hypothesis predicate is more corpus-frequent than the premise predicate.

Key Findings

Attestation (memorized sentence) strongly raises false positive entailments.

Numbers: A false Entail prediction is 1.9x (LLaMA), 2.2x (GPT-3.5), and 2.0x (PaLM) as likely when the hypothesis is attested as when it is not.

Practical Use: If a hypothesis appears in pretraining, the model may assert it regardless of the premise. Check model attestation before trusting entailment outputs.

Evidence Ref: Abstract; §5; Fig.2

Relative term-frequency of predicates biases entailment decisions.

Numbers: A false Entail prediction is 1.6x (LLaMA), 1.8x (GPT-3.5), and 2.0x (PaLM) as likely when the hypothesis predicate is more frequent than the premise predicate as when it is not.

Practical Use: When the hypothesis uses a more common predicate than the premise, expect the model to wrongly affirm entailment; controlling for frequency reduces this error source.

Evidence Ref: Abstract; §7; Fig.3
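The relative frequency condition can be made concrete with a small check. The sketch below is an illustration only, not the paper's code: the function name `frequency_condition`, the returned labels, and the counts in `toy_counts` are all hypothetical, and real counts would have to come from a corpus proxy such as Google N-grams.

```python
from typing import Mapping

def frequency_condition(premise_pred: str, hypothesis_pred: str,
                        counts: Mapping[str, int]) -> str:
    """Classify an example by the relative corpus frequency of its predicates.

    Returns 'hyp_more_frequent' when the hypothesis predicate is strictly more
    frequent than the premise predicate, 'prem_more_or_equal' otherwise, and
    'unknown' when either predicate is missing from the counts table.
    """
    p = counts.get(premise_pred)
    h = counts.get(hypothesis_pred)
    if p is None or h is None:
        return "unknown"
    return "hyp_more_frequent" if h > p else "prem_more_or_equal"

# Made-up numbers for illustration; these are NOT real corpus counts.
toy_counts = {"buy": 900, "purchase": 300}

print(frequency_condition("purchase", "buy", toy_counts))
print(frequency_condition("buy", "purchase", toy_counts))
```

Examples labeled `hyp_more_frequent` are the ones where, according to the finding above, a false Entail prediction is more likely.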

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Attestation bias multiplicative effect	1.9x (LLaMA), 2.2x (GPT-3.5), 2.0x (PaLM)	When hypothesis not attested	↑ false Entail	Levy/Holt random-premise (I_RandPrem)	Abstract; §5; Fig.2	§5
Relative frequency bias multiplicative effect	1.6x (LLaMA), 1.8x (GPT-3.5), 2.0x (PaLM)	When premise frequency ≥ hypothesis frequency	↑ false Entail	Levy/Holt I_GenArg_RandPrem	Abstract; §7; Fig.3	§7
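One way to read the multiplicative effects in the table is as a ratio of false Entail rates between the biased condition and its baseline. The sketch below assumes that reading; the `Example` fields, the function names, and the toy data are hypothetical and do not reproduce the paper's evaluation code or its numbers.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Example:
    gold_entail: bool  # gold label: does the premise entail the hypothesis?
    pred_entail: bool  # model prediction
    biased: bool       # True if the bias condition holds (e.g. hypothesis attested)

def false_entail_rate(examples: Iterable[Example]) -> float:
    """Fraction of gold non-entailments that the model labels Entail."""
    negatives: List[Example] = [e for e in examples if not e.gold_entail]
    if not negatives:
        raise ValueError("no gold non-entailment examples in this group")
    return sum(e.pred_entail for e in negatives) / len(negatives)

def bias_ratio(examples: Iterable[Example]) -> float:
    """False Entail rate under the bias condition divided by the baseline rate."""
    examples = list(examples)
    with_bias = false_entail_rate(e for e in examples if e.biased)
    baseline = false_entail_rate(e for e in examples if not e.biased)
    if baseline == 0:
        raise ValueError("baseline false Entail rate is zero; ratio undefined")
    return with_bias / baseline

# Toy data: every gold label is non-entailment, as in a random-premise setup.
toy = (
    [Example(False, True, True)] * 2 + [Example(False, False, True)] * 2 +
    [Example(False, True, False)] * 1 + [Example(False, False, False)] * 3
)
print(bias_ratio(toy))
```

A ratio above 1 on your own evaluation set would indicate that the model produces more false Entail predictions when the bias condition holds.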

What To Try In 7 Days

Add an attestation probe: ask the model whether the hypothesis is 'attested/unknown/false' before trusting outputs.

Run bias-controlled splits: evaluate models on examples adversarial to attestation and frequency biases.

Mask or canonicalize named entities in a staging test to see how much outputs rely on memorized entities.
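The first two steps above can be sketched together: probe each hypothesis out of context, then bucket examples by the model's answer so each bucket can be evaluated separately. This is a minimal sketch under stated assumptions. The prompt wording, the `ask_model` callable, and the example dictionary layout are hypothetical; `ask_model` stands in for whatever client you use to query a model and is not part of any library.

```python
from typing import Callable, Dict, List

LABELS = ("attested", "unknown", "false")

def build_probe(hypothesis: str) -> str:
    """Prompt that asks about the hypothesis alone, with no premise."""
    return (
        "Without any context, is the following statement true in the real world?\n"
        f"Statement: {hypothesis}\n"
        "Answer with exactly one word: attested, unknown, or false."
    )

def parse_label(reply: str) -> str:
    """Map a free-text reply to one of LABELS; anything else becomes 'unknown'."""
    words = reply.strip().lower().split()
    first = words[0].strip(".,:;!\"'") if words else ""
    return first if first in LABELS else "unknown"

def split_by_attestation(examples: List[dict],
                         ask_model: Callable[[str], str]) -> Dict[str, List[dict]]:
    """Group examples by the model's out-of-context judgment of the hypothesis."""
    buckets: Dict[str, List[dict]] = {label: [] for label in LABELS}
    for ex in examples:
        label = parse_label(ask_model(build_probe(ex["hypothesis"])))
        buckets[label].append(ex)
    return buckets

# Stub standing in for a real model call (hypothetical behavior).
def stub_model(prompt: str) -> str:
    return "Attested." if "Paris" in prompt else "no idea"

data = [
    {"premise": "p1", "hypothesis": "Paris is in France"},
    {"premise": "p2", "hypothesis": "Zorb visited Quux"},
]
buckets = split_by_attestation(data, stub_model)
print({label: len(items) for label, items in buckets.items()})
```

Comparing false Entail rates between the `attested` bucket and the others gives a quick read on how much a model leans on memorized statements.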

Agent Features

Memory

Propositional memory (sentence-level memorization)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Teddy-Li/LLM-NLI-Analysis

Data URLs

https://github.com/mjhosseini/entgraph_eval/tree/master/LevyHoltDS

RTE-1 (public RTE corpus)

https://books.google.com/ngrams (Google N-grams)

NewsCrawl (Barrault et al., 2019)

Risks & Boundaries

Limitations

Paper tests two biases but does not claim to cover all hallucination sources.

Google N-grams is a proxy for pretraining frequency and may not match private pretraining corpora exactly.

When Not To Use

Do not rely solely on model outputs for high-stakes inference tasks without bias controls.

Avoid using raw LLM predictions for knowledge extraction where user-provided context must be the only source of truth.

Failure Modes

Model asserts hypothesis because it matches memorized training sentences, not because the premise supports it.

Named entities act as memory indices, causing over-reliance on entity identity instead of predicate logic.
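To test how much a model relies on entity identity, the same premise/hypothesis pair can be evaluated with entities replaced by typed placeholders. The sketch below is one simple way to do that and is not the paper's procedure: the function `mask_pair`, the placeholder format, and the entity dictionary are hypothetical, and the entity list is assumed to be supplied by you (for example from an NER step).

```python
import re
from typing import Dict, Tuple

def mask_pair(premise: str, hypothesis: str,
              entities: Dict[str, str]) -> Tuple[str, str]:
    """Replace known entity mentions with typed placeholders.

    `entities` maps a surface string to a type label, e.g. {"Texas": "location"}.
    The same entity gets the same placeholder in both sentences.
    """
    replacements = []
    # Longer names first, so "New York City" is handled before "New York".
    for name in sorted(entities, key=len, reverse=True):
        # Lookarounds avoid matching inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)")
        if pattern.search(premise) or pattern.search(hypothesis):
            tag = f"{entities[name].upper()}_{len(replacements) + 1}"
            replacements.append((pattern, tag))
    for pattern, tag in replacements:
        premise = pattern.sub(tag, premise)
        hypothesis = pattern.sub(tag, hypothesis)
    return premise, hypothesis

# Toy example; the entity dictionary is hand-written for illustration.
masked = mask_pair(
    "George Bush was the Governor of Texas",
    "George Bush is a politician from Texas",
    {"George Bush": "person", "Texas": "location"},
)
print(masked)
```

If predictions change substantially between the original and masked versions of the same pair, the model is likely using the entities as a memory cue rather than reasoning over the predicates.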

Core Entities

Models

LLaMA-65B

GPT-3.5 (text-davinci-003)

PaLM-540B

GPT-4 (analysis in Appendix F)

Metrics

AUC norm

Precision

Recall

F1

Folded Entail probability estimates

Datasets

Levy/Holt (directional NLI)

RTE-1

Google N-grams (1950-2019)

NewsCrawl

Benchmarks

Natural Language Inference (NLI) directional subset

Context Entities

Models

Alpaca

Vicuna

OPT, GPT-J (omitted)

Datasets

MMLU (excluded)

Natural Questions (excluded)

OpenBookQA (excluded)

