FalseCite: a benchmark showing that fabricated citations make LLMs hallucinate more — and that internal activations trace a horn-like shape

Overview

Decision Snapshot: Needs Validation

The dataset and experiments convincingly show citation-driven increases in hallucination across three models. However, labeling relied on GPT-4.1 without web access and used no human labels, which weakens ground-truth reliability. Internal-state clustering is exploratory and not yet production-ready.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Nathan Mao, Varun Kaushik, Shreya Shivkumar, Parham Sharafoleslami, Kevin Zhu, Sunishchal Dev

Links

Abstract / PDF

Why It Matters For Business

Any system that injects or displays citations can materially change model outputs. Fabricated or unverified citations raise the risk of confident but false statements. This affects chatbots, summarizers, and automated reporting where downstream users assume citations imply truth.

Who Should Care

Product Manager, ML Engineer, CTO, Founder, Data Scientist

Summary TLDR

The authors build FalseCite, an 82k-example benchmark of false claims (from FEVER and SciQ) with optional fabricated citations. They show false citations raise hallucination rates across models (largest relative jumps in GPT-4o-mini). They also extract hidden-state and attention summaries and cluster them, finding a recurring horn-shaped trajectory and clusters with slightly higher hallucination density. Labeling relied on GPT-4.1 as an expert annotator (no web access), which limits verification.

Problem Statement

LLMs often invent facts. The paper asks whether adding deceptive or fabricated citations makes them more likely to hallucinate, and whether internal model activations reveal patterns tied to hallucination.

Main Contribution

FalseCite: a curated benchmark of ~82k false claims drawn from FEVER (47k false claims) and SciQ (35k false scientific statements).

Empirical finding: fabricated citations increase hallucination rates across tested models, with random (mismatched) citations often producing the largest effect.

Key Findings

Adding a fabricated citation substantially increases hallucination rates.

Numbers: GPT-4o-mini: 23.97% → 63.62% with random citations (+39.65 pts); Falcon-7B: 62.45% → 77.91% (+15.46 pts).

Practical Use: Don’t feed LLMs spurious citations. If your pipeline adds citations or metadata, verify source authenticity before that text reaches the model.

Evidence Ref: Table 2 / Appendix A (hallucination rates per model and citation condition).
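The practical advice above can be sketched as a small gate that removes unverified citation text before a prompt is built. This is a minimal sketch, not something described in the paper: the `[Source: ...]` citation format, the `split_citations` helper, and the idea of an allowlist of verified titles are all assumptions made for illustration.

```python
import re

# Hypothetical citation format: "[Source: <title>]". Adapt the pattern to
# whatever your pipeline actually emits.
CITATION_RE = re.compile(r"\[Source:\s*([^\]]+)\]")


def split_citations(text, verified_sources):
    """Remove citations whose title is not in `verified_sources`.

    `verified_sources` is assumed to be a set of lowercase titles.
    Returns (clean_text, rejected_titles).
    """
    rejected = []

    def _replace(match):
        title = match.group(1).strip()
        if title.lower() in verified_sources:
            return match.group(0)  # keep the verified citation unchanged
        rejected.append(title)
        return ""  # drop the unverified citation text

    clean = CITATION_RE.sub(_replace, text)
    # Collapse whitespace left behind by removed citations.
    clean = re.sub(r"\s{2,}", " ", clean).strip()
    return clean, rejected
```

A non-empty `rejected` list can be logged or routed to review instead of being dropped silently; how a source becomes "verified" is left open here.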

Random (semantically mismatched) citations often trigger larger increases than semantically matched citations for smaller models.

Numbers: Falcon-7B Δ: random +15.46 pts vs semantic +8.38 pts; Mistral-7B Δ: random +18.72 pts vs semantic +11.26 pts.

Practical Use: Even irrelevant or obviously wrong citations can push a model to invent supporting details. Treat any injected citation text as a potential risk vector in generation pipelines.

Evidence Ref: Table 2 / Appendix A (Falcon-7B and Mistral-7B deltas).

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Hallucination rate (No citation)	GPT-4o-mini 23.97%; Falcon-7B 62.45%; Mistral-7B 34.56%	No citation	—	FalseCite (uncited claims)	Table 2 / Appendix A	Table 2 / Appendix A
Hallucination rate (Random citation)	GPT-4o-mini 63.62%; Falcon-7B 77.91%; Mistral-7B 53.28%	No citation	GPT-4o-mini +39.65 pts; Falcon-7B +15.46 pts; Mistral-7B +18.72 pts	FalseCite (randomly paired citations)	Table 2 / Appendix A	Table 2 / Appendix A
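The deltas in the table are absolute percentage-point differences. As a quick check, the sketch below recomputes them from the reported rates and also derives the relative factor (cited rate divided by no-citation rate), which is not stated in the table. The `RATES` dictionary and `deltas` function are names chosen for this example.

```python
# Hallucination rates (%) copied from the Results table above.
RATES = {
    "GPT-4o-mini": {"none": 23.97, "random": 63.62},
    "Falcon-7B": {"none": 62.45, "random": 77.91},
    "Mistral-7B": {"none": 34.56, "random": 53.28},
}


def deltas(rates):
    """Return {model: (absolute_points, relative_factor)} for random vs none."""
    out = {}
    for model, r in rates.items():
        absolute = round(r["random"] - r["none"], 2)  # percentage points
        relative = round(r["random"] / r["none"], 2)  # multiple of baseline
        out[model] = (absolute, relative)
    return out


if __name__ == "__main__":
    for model, (pts, factor) in deltas(RATES).items():
        print(f"{model}: +{pts:.2f} pts, {factor:.2f}x the no-citation rate")
```

GPT-4o-mini shows the largest jump on both measures: +39.65 points, or roughly 2.65 times its no-citation rate, compared with about 1.25 times for Falcon-7B and 1.54 times for Mistral-7B.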

What To Try In 7 Days

Run an audit: sample outputs where your system inserts citation text and measure hallucination frequency versus a no-citation baseline.

Add a citation-verification step: block or flag generated outputs that cite non-verified sources.

Use token-level checks or a secondary verifier (RAG or human) before exposing citation-backed claims to end users.
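The audit in the first item can be sketched as a comparison of two labeled samples. This is a minimal sketch under stated assumptions: each output has already been labeled hallucinated or not (by a reviewer or a verifier of your choosing), and the pooled two-proportion z statistic is a standard choice added here for illustration, not a method taken from the paper. All function names are hypothetical.

```python
import math


def hallucination_rate(labels):
    """Fraction of outputs labeled as hallucinated (labels are booleans)."""
    return sum(labels) / len(labels)


def two_proportion_z(labels_a, labels_b):
    """Pooled z statistic for the difference in rates (b minus a)."""
    n_a, n_b = len(labels_a), len(labels_b)
    p_a, p_b = hallucination_rate(labels_a), hallucination_rate(labels_b)
    pooled = (sum(labels_a) + sum(labels_b)) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # both samples are all-True or all-False
    return (p_b - p_a) / se


def audit(no_citation_labels, with_citation_labels):
    """Summarize baseline vs citation-injected hallucination rates."""
    base = hallucination_rate(no_citation_labels)
    cited = hallucination_rate(with_citation_labels)
    return {
        "baseline_pct": 100 * base,
        "cited_pct": 100 * cited,
        "delta_pts": 100 * (cited - base),
        "z": two_proportion_z(no_citation_labels, with_citation_labels),
    }
```

With small samples the z statistic is only a rough guide; a larger sample or an exact test is the safer choice before acting on a small delta.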

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Primary labeling used GPT-4.1 as an expert annotator without Internet access; it cannot check whether a cited source actually exists.

No human annotation or RAG-based verification was used due to resource constraints.

When Not To Use

Do not rely on FalseCite results as definitive for safety-critical systems without human verification of labels.

Avoid using the activation-clustering procedure as a standalone hallucination detector in production.
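The summary reports only slightly higher hallucination density in some clusters. A minimal sketch of why that makes a weak standalone detector: given cluster assignments and hallucination labels, compute each cluster's density and then the precision and recall obtained by flagging everything in the densest cluster. The function names are hypothetical, and the clustering itself is assumed to have been done elsewhere.

```python
from collections import defaultdict


def cluster_densities(cluster_ids, hallucinated):
    """Map each cluster id to its fraction of hallucinated examples."""
    counts = defaultdict(lambda: [0, 0])  # cluster -> [hallucinated, total]
    for cid, is_hall in zip(cluster_ids, hallucinated):
        counts[cid][0] += int(is_hall)
        counts[cid][1] += 1
    return {cid: h / n for cid, (h, n) in counts.items()}


def flag_densest_cluster(cluster_ids, hallucinated):
    """Precision and recall if every example in the densest cluster is flagged."""
    dens = cluster_densities(cluster_ids, hallucinated)
    worst = max(dens, key=dens.get)
    flagged = [h for c, h in zip(cluster_ids, hallucinated) if c == worst]
    precision = sum(flagged) / len(flagged)
    recall = sum(flagged) / max(1, sum(hallucinated))
    return worst, precision, recall
```

On a made-up toy case with two equal clusters at densities 0.5 and 0.6, flagging the denser one gives precision 0.6 and misses nearly half of the hallucinations, which is the kind of gap that argues against production use without a stronger signal.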

Failure Modes

Expert annotator mislabels plausible but fabricated citations as correct because it lacks web access.

Semantic pairing makes citations believable; models may amplify errors, confident in the fabricated support.

Core Entities

Models

GPT-4o-mini, GPT-4.1 (expert annotator), Falcon-7B, Mistral-7B

Metrics

hallucination rate (%), Accuracy, Δ hallucination (absolute percentage points)

Datasets

FalseCite (this paper, ~82k false claims), FEVER, SciQ

Benchmarks

FalseCite, TruthfulQA, HaluEval

Context Entities

Models

GPT-4o-mini, Falcon-7B, Mistral-7B, GPT-4.1

Metrics

hallucination rate

Datasets

FalseCite, FEVER, SciQ

Benchmarks

FalseCite
