Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Hallucinations can damage trust, cause legal/financial harm, and break workflows. Because some hallucination is inevitable, companies must design systems that detect, ground, and escalate risky outputs rather than assuming perfect model truthfulness.
Summary TLDR
This 55-page survey formalizes what 'hallucination' means for LLMs, argues (with computability proofs) that some hallucination is inevitable, maps many hallucination subtypes (factual, intrinsic/extrinsic, temporal, ethical, code, multimodal, etc.), surveys benchmarks and metrics, and reviews mitigation patterns (RAG, tool calls, fine-tuning, guardrails, human-in-the-loop review). Practical message: hallucination cannot be fully eliminated, so focus on detection, grounding, hybrid safeguards, and human oversight.
Problem Statement
LLMs often generate plausible but incorrect or fabricated content. The field lacks a unified taxonomy and reliable, task-aware evaluation methods. The paper asks: what kinds of hallucination exist, why do they happen, how should we measure them, and how should practitioners mitigate their harms?
Main Contribution
A formal definition of hallucination and theoretical proofs that hallucination is unavoidable for computable LLMs.
A detailed, practical taxonomy that separates intrinsic vs extrinsic and factuality vs faithfulness, followed by many concrete subtypes (temporal, ethical, amalgamated, code, multimodal).
A survey of evaluation datasets and metrics (TruthfulQA, HalluLens, FActScore, Q2/QuestEval, MedHallu, CodeHaluEval, HALLUCINOGEN) and their blind spots.
A catalog of mitigation strategies and a practical recommendation: hybrid systems combining retrieval, tool use, fine-tuning, symbolic guardrails, and human-in-the-loop controls.
A curated list of web resources and leaderboards (Artificial Analysis, Vectara Hallucination Leaderboard, Epoch AI, LM Arena) for monitoring model behavior over time.
Key Findings
Hallucination is provably unavoidable for computable LLMs.
Logical inconsistencies form a non-trivial share of hallucinations (reported 19%).
Temporal errors account for a measurable share (reported 12%) and stem from outdated training data.
Ethical or harmful hallucinations are present and were measured at ~6% in some analyses.
Existing benchmarks and metrics are fragmented and miss subtle hallucinations; lack of standardization is a major limitation.
Combining approaches reduces risk: RAG, tool calls, fine-tuning, and symbolic guardrails work best together.
Model accuracy improves with compute and algorithmic progress; this correlates with lower hallucination tendency.
Results
Logical inconsistency share: 19% (reported)
Temporal disorientation share: 12% (reported)
Ethical/harmful hallucination share: ~6% (reported in some analyses)
Accuracy
Who Should Care
What To Try In 7 Days
Add a simple retrieval step (search Wikipedia or internal docs) before answering time-sensitive user queries.
Surface source links and timestamps on model claims so users can verify quickly.
Implement a rule-based fallback: when confidence is low, ask for clarification or route to human review instead of inventing answers.
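The three steps above (retrieve first, surface sources and timestamps, fall back when support is weak) can be sketched in a few lines of Python. This is a minimal illustration, not a method from the survey: the in-memory `CORPUS`, the `example.com` URLs, the keyword-overlap scoring, and the `min_score` threshold are all hypothetical stand-ins for a real search index and a calibrated confidence signal.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Passage:
    text: str
    source_url: str
    retrieved_on: date


# Hypothetical corpus standing in for Wikipedia or internal docs.
CORPUS = [
    Passage("The refund window is 30 days from delivery.",
            "https://example.com/policy/refunds", date(2024, 1, 15)),
    Passage("Support hours are 9am to 5pm on weekdays.",
            "https://example.com/support/hours", date(2024, 1, 15)),
]


def _tokens(text):
    """Lowercased words longer than three characters, punctuation stripped."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def retrieve(query, corpus=CORPUS):
    """Return (best passage, score), where score is the fraction of query words matched."""
    q = _tokens(query)
    best, best_score = None, 0.0
    if not q:
        return best, best_score
    for passage in corpus:
        score = len(q & _tokens(passage.text)) / len(q)
        if score > best_score:
            best, best_score = passage, score
    return best, best_score


def answer(query, min_score=0.5):
    """Answer only from a retrieved passage; otherwise escalate instead of guessing."""
    passage, score = retrieve(query)
    if passage is None or score < min_score:
        return {"action": "escalate",
                "message": "No supporting source found; routing to human review."}
    return {"action": "answer",
            "message": passage.text,
            "source": passage.source_url,
            "as_of": passage.retrieved_on.isoformat()}
```

Calling `answer("What is the refund window?")` returns the policy passage together with its source link and retrieval date, while a query with no matching passage takes the `escalate` branch. In a real deployment the overlap score would be replaced by whatever retrieval or confidence signal the system already exposes.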
Agent Features
Tool Use
- Toolformer-style API/tool calls
- External calculator/code execution
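As a rough illustration of delegating arithmetic to an external calculator rather than trusting generated digits, the sketch below evaluates a model-proposed expression with Python's `ast` module. The tool-call dictionary format (`{"tool": ..., "input": ...}`) and the `run_tool_call` helper are assumptions made for this example; they do not correspond to Toolformer's actual call syntax or to any specific vendor API.

```python
import ast
import operator

# Whitelisted arithmetic operators; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def calculate(expression):
    """Evaluate a plain arithmetic expression without using eval()."""
    def _eval(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


# Hypothetical tool registry keyed by tool name.
TOOLS = {"calculator": calculate}


def run_tool_call(call):
    """Dispatch a tool call of the assumed form {'tool': name, 'input': text}."""
    tool = TOOLS.get(call["tool"])
    if tool is None:
        raise KeyError("unknown tool: " + call["tool"])
    return tool(call["input"])
```

For example, `run_tool_call({"tool": "calculator", "input": "12 * (3 + 4)"})` returns `84`, and an input such as `__import__('os')` raises `ValueError` because function calls are not on the whitelist.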
Optimization Features
System Optimization
- Hybrid mitigation combining RAG, tools, and guardrails
Training Optimization
- Fine-tuning with synthetic/adversarially filtered data
Inference Optimization
- Tool delegation to external services
- Retrieval grounding to reduce parametric reliance
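One simple way to check whether an answer actually leans on retrieved text, rather than on parametric memory, is to flag sentences whose content words are mostly absent from the evidence. The sketch below is an assumption-laden heuristic for illustration only: the word-overlap test, the `min_support` threshold, and the function names are hypothetical, and production systems would more likely use an entailment or QA-based checker.

```python
import re


def content_words(text):
    """Lowercased alphanumeric tokens longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def unsupported_sentences(answer, passages, min_support=0.6):
    """Return answer sentences whose content words are poorly covered by the passages."""
    evidence = set()
    for passage in passages:
        evidence |= content_words(passage)

    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if not words:
            continue  # nothing checkable in this sentence
        support = len(words & evidence) / len(words)
        if support < min_support:
            flagged.append(sentence)
    return flagged


passages = ["The Eiffel Tower was completed in 1889 and stands in Paris."]
answer = "The Eiffel Tower stands in Paris. It was designed by Leonardo da Vinci."
flagged = unsupported_sentences(answer, passages)
```

Here `flagged` contains only the second sentence, since none of its content words appear in the retrieved passage. Flagged sentences can then be dropped, rewritten, or sent for review.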
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Theoretical inevitability is proven in a formal computability setting; practical models and tasks may behave differently.
- Benchmarks are fragmented and task-dependent; no single metric detects all hallucination types.
- Empirical statistics cited are aggregated from multiple sources with varying annotation protocols.
- No shared code or data accompanies this taxonomy, so cross-benchmark comparisons cannot be reproduced.
When Not To Use
- Do not rely on vanilla LLM outputs alone for high-stakes medical, legal, or financial decisions.
- Avoid autonomous deployment without retrieval grounding and human oversight for safety-critical tasks.
Failure Modes
- Overconfidence: fluent but incorrect outputs that mislead users.
- Adversarial or fabricated prompts that induce false elaboration.

- Temporal drift: outdated training data leading to stale facts.
- Knowledge overshadowing and amalgamation that mix unrelated facts.
Core Entities
Models
- GPT-4/GPT-4o (referenced examples)
- Google Bard
- Claude (Anthropic)
- LLaMA family
- DeepSeek-R1
- o3-mini / o4-mini (OpenAI examples)
Metrics
- ROUGE
- BLEU
- BERTScore
- FactCC
- SummaC
- KILT-style grounding checks
- Retrieval-Augmented Evaluation (RAE)
- Human correctness/faithfulness/coherence/harmfulness labels
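To make concrete why lexical-overlap metrics can miss hallucinations, the sketch below computes a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is not the official ROUGE implementation, and the example sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(reference, candidate):
    """F1 over whitespace-separated lowercase tokens, counting repeated words."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the drug was approved in 2019 for adults"
faithful = "the drug was approved for adults in 2019"
hallucinated = "the drug was approved in 2009 for children"

faithful_score = unigram_f1(reference, faithful)          # 1.0
hallucinated_score = unigram_f1(reference, hallucinated)  # 0.75
```

The hallucinated sentence changes both the year and the population yet still scores 0.75, because six of its eight tokens match the reference. This is the kind of blind spot that motivates pairing overlap metrics with factuality-oriented checks and human labels.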
Datasets
- TruthfulQA
- HalluLens
- FActScore
- Q2 (factual consistency via question generation and question answering)
- QuestEval
- MedHallu
- MedHallBench
- Med-HALT
- CodeHaluEval
- HALLUCINOGEN
- KILT
- PubMedQA (source for MedHallu)
Benchmarks
- TruthfulQA
- HalluLens
- FActScore
- Q2/QuestEval
- MedHallu/MedHallBench/Med-HALT
- CodeHaluEval
- HALLUCINOGEN
- LM Arena (open battles)
- Epoch AI benchmarking dashboard
- Vectara Hallucination Leaderboard
- Artificial Analysis Intelligence Index

