Overview
The survey compiles broad evidence and references, making it actionable for engineers; specific solutions vary in maturity and cost, so apply recommendations selectively and validate on your workloads.
Citations: 207
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 2/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/2
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 40%
Why It Matters For Business
Hallucinations make LLM outputs untrustworthy for decisions or customer-facing answers; mapping causes to fixes helps reduce risk in search, chatbots, and recommendations.
Summary TLDR
This survey organizes what we know about hallucinations in large language models (LLMs). It proposes a clear two-part taxonomy (factuality vs faithfulness), traces causes across data, training, and inference, reviews detection methods and benchmarks, and maps mitigation techniques (data filtering, model editing, retrieval-augmentation, decoding and training fixes) to those causes. The paper flags practical gaps: retrieval-augmented systems still fail when retrieval or generation is weak, model editing and large-scale data filtering do not scale well, and vision-language models and knowledge-boundary probing need more work.
Problem Statement
LLMs often produce plausible but false or unverifiable text (hallucinations). Existing task-specific categories and defenses are incomplete for open-ended, instruction-following LLMs. We need a unified taxonomy, an account of root causes, robust detection benchmarks, and mitigation methods matched to causes.
Main Contribution
A clarified LLM-focused taxonomy splitting hallucinations into factuality and faithfulness types
A systematic analysis of causes across data, training, and inference stages
Key Findings
The paper reclassifies LLM hallucinations into two main types: factuality (mismatch with real-world facts) and faithfulness (deviation from the user's instructions or the provided context).
Large-scale evaluation collections exist and vary widely in size and focus; for example, HaluEval 2.0 contains 8,770 hallucination-prone questions across five domains.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence Ref |
|---|---|---|---|---|---|
| HaluEval 2.0 dataset size | 8,770 questions across 5 domains | n/a | n/a | HaluEval 2.0 | Table 4 / §4.2 |
| TruthfulQA dataset size | 817 adversarial questions | n/a | n/a | TruthfulQA | Table 4 / §4.2 |
What To Try In 7 Days
Run your model on an adversarial benchmark (TruthfulQA) and a domain benchmark (HaluEval/FreshQA) to find weak areas
Enable retrieval only for low-confidence answers (adaptive retrieval) to limit noisy context injection
Add a lightweight uncertainty check (e.g., a minimum token-probability threshold or a sampling-consistency test) before exposing facts to users
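The last two items can be combined into one gate: score the model's confidence first, and call retrieval only when that score is low. The Python sketch below is illustrative and is not the survey's method. The `generate` and `retrieve` callables, their signatures, and both threshold values are hypothetical placeholders to be replaced with your own model client, retriever, and values tuned on your workload.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical signature: generate(question, context) -> (answer, token_logprobs)
Generate = Callable[[str, Optional[str]], Tuple[str, List[float]]]
# Hypothetical signature: retrieve(question) -> context string
Retrieve = Callable[[str], str]


def min_token_prob(token_logprobs: Sequence[float]) -> float:
    """Lowest per-token probability in an answer (0.0 if no tokens)."""
    if not token_logprobs:
        return 0.0
    return min(math.exp(lp) for lp in token_logprobs)


def sampling_consistency(answers: Sequence[str]) -> float:
    """Fraction of sampled answers that match the most common answer."""
    normalized = [a.strip().lower() for a in answers]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


def is_low_confidence(
    token_logprobs: Sequence[float],
    answers: Sequence[str],
    prob_threshold: float = 0.2,       # assumed value, tune per model
    agreement_threshold: float = 0.6,  # assumed value, tune per task
) -> bool:
    """Flag an answer if either uncertainty signal falls below its threshold."""
    return (
        min_token_prob(token_logprobs) < prob_threshold
        or sampling_consistency(answers) < agreement_threshold
    )


def answer_with_adaptive_retrieval(
    question: str,
    generate: Generate,
    retrieve: Retrieve,
    n_samples: int = 5,
) -> Tuple[str, bool]:
    """Return (answer, used_retrieval); retrieve only when confidence is low."""
    samples = [generate(question, None) for _ in range(n_samples)]
    first_answer, first_logprobs = samples[0]
    texts = [text for text, _ in samples]
    if not is_low_confidence(first_logprobs, texts):
        return first_answer, False
    grounded_answer, _ = generate(question, retrieve(question))
    return grounded_answer, True
```

Exact string matching is a crude agreement measure for free-form answers; a semantic similarity check would likely be more robust, at the price of an extra model call.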
Risks & Boundaries
Limitations
The survey summarizes many studies but does not release code or a unifying benchmark suite.
Model-editing methods do not scale cleanly to large, continuous updates.
When Not To Use
Do not rely solely on LLM internal checks (parametric-only) for high-stakes facts
Avoid blind retrieval for trivial or memorized facts; it can introduce noise
Failure Modes
Sycophancy: model favors pleasing answers over truth (RLHF-induced)
Over-confidence: high-probability hallucinated tokens propagate errors

