FalseCite: a benchmark showing that fabricated citations make LLMs hallucinate more — and that internal activations trace a horn-like shape

January 18, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

The dataset and experiments show citation-driven increases in hallucination across all three tested models. However, labeling relied on GPT-4.1 without web access, and no human labels were collected, which weakens ground-truth reliability. The internal-state clustering is exploratory and not yet production-ready.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Nathan Mao, Varun Kaushik, Shreya Shivkumar, Parham Sharafoleslami, Kevin Zhu, Sunishchal Dev

Links

Abstract / PDF

Why It Matters For Business

Any system that injects or displays citations can materially change model outputs. Fabricated or unverified citations raise the risk of confident but false statements. This affects chatbots, summarizers, and automated reporting where downstream users assume citations imply truth.

Who Should Care

Summary TLDR

The authors build FalseCite, an 82k-example benchmark of false claims (from FEVER and SciQ) with optional fabricated citations. They show false citations raise hallucination rates across models (largest relative jumps in GPT-4o-mini). They also extract hidden-state and attention summaries and cluster them, finding a recurring horn-shaped trajectory and clusters with slightly higher hallucination density. Labeling relied on GPT-4.1 as an expert annotator (no web access), which limits verification.
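The "hallucination density per cluster" idea can be illustrated with a minimal sketch. It assumes cluster assignments have already been produced by some clustering step over hidden-state summaries; the function name and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def cluster_hallucination_density(cluster_ids, hallucinated):
    """Return {cluster_id: fraction of examples in that cluster labeled hallucinated}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for cluster_id, flag in zip(cluster_ids, hallucinated):
        totals[cluster_id] += 1
        hits[cluster_id] += int(flag)
    return {cluster_id: hits[cluster_id] / totals[cluster_id] for cluster_id in totals}


# Toy placeholder data (NOT results from the paper): 5 examples in 2 clusters.
toy_clusters = [0, 0, 1, 1, 1]
toy_labels = [True, False, True, True, False]

density = cluster_hallucination_density(toy_clusters, toy_labels)
overall = sum(toy_labels) / len(toy_labels)
for cluster_id, rate in sorted(density.items()):
    print(f"cluster {cluster_id}: density {rate:.2f} (overall {overall:.2f})")
```

A cluster whose density sits only slightly above the overall rate, as the summary describes, is a weak signal on its own, which is consistent with treating the clustering as exploratory.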

Problem Statement

LLMs often invent facts. The paper asks whether adding deceptive or fabricated citations makes them more likely to hallucinate, and whether internal model activations reveal patterns tied to hallucination.

Main Contribution

FalseCite: a curated benchmark of ~82k false claims drawn from FEVER (47k false claims) and SciQ (35k false scientific statements).

Empirical finding: fabricated citations increase hallucination rates across tested models, with random (mismatched) citations often producing the largest effect.
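To make the citation conditions concrete, here is a minimal sketch of how a single false claim could be paired with no citation, a random citation, and a semantically matched citation. Everything here is an assumption for illustration: the prompt format, the citation strings, and the use of word overlap as a stand-in for semantic matching. The paper's actual pairing procedure is not described in this summary.

```python
import random


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between lowercase word sets (a crude similarity stand-in)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def build_conditions(claim: str, citation_pool: list[str], rng: random.Random) -> dict[str, str]:
    """Return the claim text under three conditions: none, random, semantic."""
    random_cite = rng.choice(citation_pool)
    # Assumption: "semantic" pairing picks the citation most similar to the claim.
    semantic_cite = max(citation_pool, key=lambda c: token_overlap(claim, c))
    return {
        "none": claim,
        "random": f"{claim} ({random_cite})",
        "semantic": f"{claim} ({semantic_cite})",
    }


# Hypothetical placeholder claim and fabricated citations.
claim = "Water boils at 50 degrees Celsius at sea level."
pool = [
    "Doe (2021), Thermal properties of water at sea level",
    "Roe (2019), Migration patterns of arctic birds",
]
conditions = build_conditions(claim, pool, random.Random(0))
for name, prompt in conditions.items():
    print(f"{name}: {prompt}")
```

Seeding the random generator keeps the random pairing reproducible, which matters when comparing hallucination rates across conditions on the same claims.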

Key Findings

Adding a fabricated citation substantially increases hallucination rates.

Numbers: GPT-4o-mini 23.97% → 63.62% (random citation, +39.65 pts); Falcon-7B 62.45% → 77.91% (random citation, +15.46 pts).

Practical Use: Don’t feed LLMs spurious citations. If your pipeline adds citations or metadata, verify source authenticity before that text reaches the model.

Evidence Ref: Table 2 / Appendix A (hallucination rates per model and citation condition).
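One way to act on this finding is a gate that strips unverified citation markers before text is passed to the model. The sketch below is illustrative only: the `[source: <domain>]` marker format, the allowlist contents, and the function name are all assumptions, not something the paper specifies.

```python
import re

# Hypothetical allowlist of sources your pipeline has already verified.
VERIFIED_SOURCES = {"verified-archive.example"}

# Assumed citation marker format for this sketch: "[source: <domain>]".
CITATION_PATTERN = re.compile(r"\[source:\s*([^\]\s]+)\s*\]")


def strip_unverified_citations(text: str) -> tuple[str, list[str]]:
    """Remove citation markers whose source is not on the allowlist.

    Returns the cleaned text and the list of sources that were removed.
    """
    removed: list[str] = []

    def _replace(match: re.Match) -> str:
        source = match.group(1).lower()
        if source in VERIFIED_SOURCES:
            return match.group(0)  # keep verified citations untouched
        removed.append(source)
        return ""

    cleaned = CITATION_PATTERN.sub(_replace, text)
    # Collapse whitespace left behind by removed markers.
    cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()
    return cleaned, removed


text = (
    "First claim [source: unknown-journal.example] and "
    "second claim [source: verified-archive.example]."
)
cleaned, removed = strip_unverified_citations(text)
print(cleaned)
print("removed:", removed)
```

Returning the removed sources alongside the cleaned text lets the pipeline log or flag them instead of silently dropping them.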

Random (semantically mismatched) citations often trigger larger increases than semantically matched citations for smaller models.

Numbers: Falcon-7B Δ: random +15.46 pts vs semantic +8.38 pts; Mistral-7B Δ: random +18.72 pts vs semantic +11.26 pts.

Practical Use: Even irrelevant or obviously wrong citations can push a model to invent supporting details. Treat any injected citation text as a potential risk vector in generation pipelines.

Evidence Ref: Table 2 / Appendix A (Falcon-7B and Mistral-7B deltas).

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Hallucination rate (No citation) | GPT-4o-mini 23.97%; Falcon-7B 62.45%; Mistral-7B 34.56% | No citation | n/a | FalseCite (uncited claims) | Table 2 / Appendix A | Table 2 / Appendix A
Hallucination rate (Random citation) | GPT-4o-mini 63.62%; Falcon-7B 77.91%; Mistral-7B 53.28% | No citation | GPT-4o-mini +39.65 pts; Falcon-7B +15.46 pts; Mistral-7B +18.72 pts | FalseCite (randomly paired citations) | Table 2 / Appendix A | Table 2 / Appendix A
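The deltas in the results are absolute percentage-point differences against the no-citation baseline. The short sketch below recomputes them from the reported rates and also derives the relative increase, which is not reported in the results and is added here only for illustration; the function name is hypothetical.

```python
# Hallucination rates (%) as reported in the results above.
NO_CITATION = {"GPT-4o-mini": 23.97, "Falcon-7B": 62.45, "Mistral-7B": 34.56}
RANDOM_CITATION = {"GPT-4o-mini": 63.62, "Falcon-7B": 77.91, "Mistral-7B": 53.28}


def citation_effect(baseline: float, cited: float) -> tuple[float, float]:
    """Return (absolute delta in percentage points, relative increase in %)."""
    delta_pts = cited - baseline
    relative_pct = 100.0 * delta_pts / baseline
    return round(delta_pts, 2), round(relative_pct, 1)


for model, base in NO_CITATION.items():
    delta_pts, relative_pct = citation_effect(base, RANDOM_CITATION[model])
    print(f"{model}: +{delta_pts:.2f} pts ({relative_pct:.1f}% relative increase)")
```

The absolute deltas match the reported +39.65, +15.46, and +18.72 points. In relative terms GPT-4o-mini's rate more than doubles, because it starts from the lowest baseline.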

What To Try In 7 Days

Run an audit: sample outputs where your system inserts citation text and measure hallucination frequency versus a no-citation baseline.

Add a citation-verification step: block or flag generated outputs that cite non-verified sources.

Use token-level checks or a secondary verifier (RAG or human) before exposing citation-backed claims to end users.
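The audit step can be sketched as a comparison of hallucination rates between a no-citation sample and a citation-inserted sample. This is a minimal illustration using a standard two-proportion z-test; the sample sizes and labels are hypothetical placeholders, and how each output gets labeled as hallucinated (human review, a verifier model) is left to your pipeline.

```python
import math


def hallucination_rate(labels: list[bool]) -> float:
    """Fraction of sampled outputs judged to be hallucinated."""
    return sum(labels) / len(labels)


def two_proportion_z(base: list[bool], cited: list[bool]) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for cited vs. baseline rates."""
    n1, n2 = len(base), len(cited)
    p1, p2 = hallucination_rate(base), hallucination_rate(cited)
    pooled = (sum(base) + sum(cited)) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # identical all-True or all-False samples
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical audit sample (NOT data from the paper): 200 outputs per condition.
baseline_labels = [True] * 50 + [False] * 150
cited_labels = [True] * 110 + [False] * 90

delta_pts = 100 * (hallucination_rate(cited_labels) - hallucination_rate(baseline_labels))
z, p = two_proportion_z(baseline_labels, cited_labels)
print(f"delta: {delta_pts:+.1f} pts, z = {z:.2f}, p = {p:.2g}")
```

A significance test is worth including because a small audit sample can show a few points of difference by chance alone.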

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Primary labeling used GPT-4.1 as an expert annotator without Internet access; it cannot check whether a cited source actually exists.

No human annotation or RAG-based verification was used due to resource constraints.

When Not To Use

Do not rely on FalseCite results as definitive for safety-critical systems without human verification of labels.

Avoid using the activation-clustering procedure as a standalone hallucination detector in production.

Failure Modes

Expert annotator mislabels plausible but fabricated citations as correct because it lacks web access.

Semantic pairing makes citations believable, so models may amplify errors when they treat the fabricated support as credible.

Core Entities

Models

GPT-4o-mini, GPT-4.1 (expert annotator), Falcon-7B, Mistral-7B

Metrics

hallucination rate (%), Accuracy, Δ hallucination (absolute percentage points)

Datasets

FalseCite (this paper, ~82k false claims), FEVER, SciQ

Benchmarks

FalseCite, TruthfulQA, HaluEval

Context Entities

Models

GPT-4o-mini, Falcon-7B, Mistral-7B, GPT-4.1

Metrics

hallucination rate

Datasets

FalseCite, FEVER, SciQ

Benchmarks

FalseCite