Overview
The dataset and experiments convincingly show citation-driven increases in hallucination across three models. However, labeling relied on GPT-4.1 without web access and used no human labels, which weakens ground-truth reliability. Internal-state clustering is exploratory and not yet production-ready.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.78
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 2/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 50%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
Any system that injects or displays citations can materially change model outputs. Fabricated or unverified citations raise the risk of confident but false statements. This affects chatbots, summarizers, and automated reporting where downstream users assume citations imply truth.
Summary TLDR
The authors build FalseCite, an 82k-example benchmark of false claims (from FEVER and SciQ) with optional fabricated citations. They show false citations raise hallucination rates across models (largest relative jumps in GPT-4o-mini). They also extract hidden-state and attention summaries and cluster them, finding a recurring horn-shaped trajectory and clusters with slightly higher hallucination density. Labeling relied on GPT-4.1 as an expert annotator (no web access), which limits verification.
Problem Statement
LLMs often invent facts. The paper asks whether adding deceptive or fabricated citations makes them more likely to hallucinate, and whether internal model activations reveal patterns tied to hallucination.
Main Contribution
FalseCite: a curated benchmark of ~82k false claims drawn from FEVER (47k false claims) and SciQ (35k false scientific statements).
Empirical finding: fabricated citations increase hallucination rates across tested models, with random (mismatched) citations often producing the largest effect.
Key Findings
Adding a fabricated citation substantially increases hallucination rates.
Random (semantically mismatched) citations often trigger larger increases than semantically matched citations for smaller models.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Hallucination rate (No citation) | GPT-4o-mini 23.97%; Falcon-7B 62.45%; Mistral-7B 34.56% | No citation | — | FalseCite (uncited claims) | Table 2 / Appendix A | Table 2 / Appendix A |
| Hallucination rate (Random citation) | GPT-4o-mini 63.62%; Falcon-7B 77.91%; Mistral-7B 53.28% | No citation | GPT-4o-mini +39.65 pts; Falcon-7B +15.46 pts; Mistral-7B +18.72 pts | FalseCite (randomly paired citations) | Table 2 / Appendix A | Table 2 / Appendix A |
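The Delta column is a percentage-point difference against the no-citation baseline. A minimal sketch of that arithmetic follows, using only the rates from the table. The relative-increase figure is a derived quantity added here for illustration and is not a number reported in the source; the function name `deltas` is hypothetical.

```python
# Hallucination rates (%) copied from the Results table above.
no_citation = {"GPT-4o-mini": 23.97, "Falcon-7B": 62.45, "Mistral-7B": 34.56}
random_citation = {"GPT-4o-mini": 63.62, "Falcon-7B": 77.91, "Mistral-7B": 53.28}


def deltas(baseline, treated):
    """Return {model: (percentage-point delta, relative increase in %)}."""
    out = {}
    for model, base in baseline.items():
        pts = treated[model] - base          # absolute change in points
        rel = 100.0 * pts / base             # change relative to baseline
        out[model] = (round(pts, 2), round(rel, 1))
    return out


result = deltas(no_citation, random_citation)
for model, (pts, rel) in result.items():
    print(f"{model}: +{pts:.2f} pts ({rel:.1f}% relative to baseline)")
```

Reporting both views is useful because a model with a low baseline can show a modest-looking point change that is still a large relative jump, and vice versa.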
What To Try In 7 Days
Run an audit: sample outputs where your system inserts citation text and measure hallucination frequency versus a no-citation baseline.
Add a citation-verification step: block or flag generated outputs that cite non-verified sources.
Use token-level checks or a secondary verifier (RAG or human) before exposing citation-backed claims to end users.
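The audit in the first step can be sketched as a comparison of hallucination rates between outputs with and without inserted citation text. This is a minimal sketch under stated assumptions: `AuditRecord`, `hallucination_rate`, and `audit` are hypothetical names, the `hallucinated` label is assumed to come from a human reviewer or secondary verifier, and the sample data is made up purely to show the call shape.

```python
from dataclasses import dataclass


@dataclass
class AuditRecord:
    has_citation: bool   # did the system insert citation text into this output?
    hallucinated: bool   # label from a human reviewer or secondary verifier


def hallucination_rate(records, has_citation):
    """Fraction of hallucinated outputs in one arm, or None if the arm is empty."""
    subset = [r for r in records if r.has_citation == has_citation]
    if not subset:
        return None
    return sum(r.hallucinated for r in subset) / len(subset)


def audit(records):
    """Compare the cited arm against the no-citation baseline."""
    base = hallucination_rate(records, has_citation=False)
    cited = hallucination_rate(records, has_citation=True)
    delta = None if base is None or cited is None else cited - base
    return {"no_citation": base, "with_citation": cited, "delta": delta}


# Toy labels (not real data): 4 uncited outputs, 4 cited outputs.
sample = (
    [AuditRecord(False, h) for h in (True, False, False, False)]
    + [AuditRecord(True, h) for h in (True, True, True, False)]
)
report = audit(sample)
print(report)
```

A real audit would need a sample large enough for the difference to be meaningful, and ideally the same prompts in both arms so that the citation is the only thing that changes.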
Risks & Boundaries
Limitations
Primary labeling used GPT-4.1 as an expert annotator without Internet access; it cannot check whether a cited source actually exists.
No human annotation or RAG-based verification was used due to resource constraints.
When Not To Use
Do not rely on FalseCite results as definitive for safety-critical systems without human verification of labels.
Avoid using the activation-clustering procedure as a standalone hallucination detector in production.
Failure Modes
Expert annotator mislabels plausible but fabricated citations as correct because it lacks web access.
Semantic pairing makes citations believable; models may amplify errors because they treat the fabricated citation as confident support.

