Overview
Production Readiness
0.7
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
207
Why It Matters For Business
Hallucinations make LLM outputs untrustworthy for decisions or customer-facing answers; mapping causes to fixes helps reduce risk in search, chatbots, and recommendations.
Summary TLDR
This survey organizes what we know about hallucinations in large language models (LLMs). It proposes a clear two-part taxonomy (factuality vs faithfulness), traces causes across data, training, and inference, reviews detection methods and benchmarks, and maps mitigation techniques (data filtering, model editing, retrieval-augmentation, decoding and training fixes) to those causes. The paper flags practical gaps: retrieval-augmented systems still fail when retrieval or generation is weak, model editing and large-scale data filtering do not scale well, and vision-language models and knowledge-boundary probing need more work.
Problem Statement
LLMs often produce plausible but false or unverifiable text (hallucinations). Existing task-specific categories and defenses are incomplete for open-ended, instruction-following LLMs. We need a unified taxonomy, an account of root causes, robust detection benchmarks, and mitigation methods matched to causes.
Main Contribution
A clarified LLM-focused taxonomy splitting hallucinations into factuality and faithfulness types
A systematic analysis of causes across data, training, and inference stages
A structured review of detection methods and a catalogue of existing benchmarks
A survey of mitigation techniques mapped to their causal origins (data, training, inference)
An in-depth discussion of limits in retrieval-augmented generation (RAG) and future directions (vision-language models, knowledge boundaries)
Key Findings
The paper reclassifies LLM hallucinations into two main types: factuality (mismatch with real-world facts) and faithfulness (deviation from the user's instructions or the provided context).
Large-scale evaluation collections exist and vary widely in size and focus; for example HaluEval 2.0 contains 8,770 hallucination-prone questions across five domains.
TruthfulQA, an adversarial benchmark targeting imitative falsehoods, contains 817 questions designed to elicit false answers that models pick up from their training data.
Retrieval-augmented generation (RAG) helps fill knowledge gaps, but two dominant failure modes remain: retrieval failure and a generation bottleneck, where the model ignores or misuses the retrieved evidence.
Results
HaluEval 2.0 dataset size: 8,770 questions
TruthfulQA dataset size: 817 questions
Who Should Care
What To Try In 7 Days
Run your model on an adversarial benchmark (TruthfulQA) and a domain benchmark (HaluEval/FreshQA) to find weak areas
Enable retrieval only for low-confidence answers (adaptive retrieval) to limit noisy context injection
Add a lightweight uncertainty check (e.g., a token-probability threshold or a sampling-consistency test) before exposing facts to users
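The sampling-consistency check and the adaptive-retrieval gate suggested above can be sketched together. This is a minimal illustration under stated assumptions, not the survey's method: the helper names `consistency_score` and `should_retrieve`, the 0.6 threshold, and the toy answers are all hypothetical, and agreement is measured by exact match after normalization rather than by a semantic similarity model.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def consistency_score(samples: list[str]) -> float:
    """Fraction of sampled answers that agree with the most common answer."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(s) for s in samples)
    return counts.most_common(1)[0][1] / len(samples)


def should_retrieve(samples: list[str], threshold: float = 0.6) -> bool:
    """Hypothetical gate: trigger retrieval only when the samples disagree."""
    return consistency_score(samples) < threshold


# In practice the samples would come from several stochastic decodes
# of the same prompt; here they are hard-coded toy values.
stable = ["Paris", "paris", " Paris ", "Paris"]
unstable = ["1947", "1952", "1949", "1947"]
print(consistency_score(stable), should_retrieve(stable))
print(consistency_score(unstable), should_retrieve(unstable))
```

Exact-match agreement only suits short factual answers; for longer outputs, an entailment or embedding-based comparison would be a more reasonable choice, at higher cost.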
Agent Features
Memory
- parametric (model weights)
- non-parametric (retrieval datastore)
Tool Use
- RAG (retriever + generator)
- external verifiers
- knowledge graphs (KG prompting)
Frameworks
- RLHF
- SFT
Architectures
- transformer
- autoregressive
- encoder-decoder
Optimization Features
Token Efficiency
- context compression and summarization
- selective retrieval
Model Optimization
- attention-sharpening regularizers
- bidirectional autoregressive variants (BatGPT)
System Optimization
- post-hoc verify-and-edit pipelines
- nearest-neighbor speculative decoding
Training Optimization
- in-context pretraining
- topic-prefix factuality augmentation
- up-sampling factual data
Inference Optimization
- factual-nucleus sampling
- contrastive decoding
- DoLa (layer-contrast decoding)
- inference-time activation intervention (ITI)
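As one illustration of the decoding-side methods listed above, the sketch below shows the decaying top-p schedule usually described for factual-nucleus sampling, p_t = max(ω, p · λ^(t−1)), where t counts tokens since the start of the current sentence. The hyperparameter values, the toy distribution, and the function names `decayed_top_p` and `nucleus` are assumptions made for illustration; a real implementation would operate on model logits and reset t at each sentence boundary.

```python
def decayed_top_p(step: int, p: float = 0.9, decay: float = 0.9, floor: float = 0.3) -> float:
    """Top-p value for the step-th token of a sentence (step is 1-based)."""
    if step < 1:
        raise ValueError("step is 1-based")
    return max(floor, p * decay ** (step - 1))


def nucleus(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose mass reaches top_p, renormalized."""
    kept: dict[str, float] = {}
    mass = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        mass += prob
        if mass >= top_p:
            break
    return {token: prob / mass for token, prob in kept.items()}


# Toy next-token distribution (assumed values, not model output).
toy = {"Paris": 0.5, "Lyon": 0.3, "Nice": 0.15, "Oslo": 0.05}
for step in (1, 5, 12):
    top_p = decayed_top_p(step)
    print(step, round(top_p, 3), sorted(nucleus(toy, top_p)))
```

The candidate set shrinks as the sentence progresses, which reflects the intuition that randomness is less harmful at the start of a sentence than later on, where factual content tends to be committed.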
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Survey summarizes many studies but does not release code or a unifying benchmark suite.
- Model-editing methods do not scale cleanly to large, continuous updates.
- Data-filtering solutions face efficiency and coverage gaps at web scale.
- RAG solutions still suffer from retriever/source quality and generation misuse.
When Not To Use
- Do not rely solely on LLM internal checks (parametric-only) for high-stakes facts
- Avoid blind retrieval for trivial or memorized facts; it can introduce noise
- Avoid trusting LLM self-revision without external evidence for safety-critical content
Failure Modes
- Sycophancy: model favors pleasing answers over truth (RLHF-induced)
- Over-confidence: high-probability hallucinated tokens propagate errors
- Snowballing errors: early wrong token leads to cascading hallucination
- Retrieval bias: retrievers prefer LLM-generated or semantically incomplete contexts
- Lost-in-the-middle: important context in long windows is under-attended
Core Entities
Models
- GPT-3
- GPT-4
- LLaMA
- Llama-2
- Claude
- Gemini
- PaLM
Metrics
- Accuracy
- AUROC
- Balanced Acc
- Precision/Recall/F1
- Likelihood score
- Human judgment
- LLM-judge Likert scores
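Two of the metrics above, AUROC and balanced accuracy, are commonly used to score a hallucination detector against human labels. The following is a minimal pure-Python sketch; the labels, detector scores, and the 0.5 decision threshold are made-up illustrative values, not results from the survey.

```python
def auroc(labels: list[int], scores: list[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(labels: list[int], preds: list[int]) -> float:
    """Mean of recall on the positive class and recall on the negative class."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one example of each class")
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    return 0.5 * (tp / n_pos + tn / n_neg)


# 1 = hallucinated, 0 = supported (toy labels and detector scores).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
preds = [1 if s >= 0.5 else 0 for s in scores]
print(round(auroc(labels, scores), 4), round(balanced_accuracy(labels, preds), 4))
```

AUROC is threshold-free, while balanced accuracy depends on the chosen cut-off and is useful when hallucinated examples are a small minority of the evaluation set.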
Datasets
- The Pile
- TruthfulQA
- REALTIMEQA
- FreshQA
- HaluEval
- HaluEval 2.0
- Med-HALT
- SelfCheckGPT-Wikibio
- PopQA
- Head-to-Tail
- BAMBOO
Benchmarks
- TruthfulQA
- REALTIMEQA
- FreshQA
- HaluEval
- HaluEval 2.0
- Med-HALT
- SelfCheckGPT-Wikibio
- BAMBOO
- FELM
- PHD
- LSum
- SAC³
Context Entities
Models
- BART
- PEGASUS
- T5
Metrics
- Entity overlap
- Relation triple overlap
- NLI-based entailment scores
- Question-Answer matching scores
Datasets
- Wiki-derived QA sets
- ExpertQA
- Med-HALT
- PopQA
Benchmarks
- FEQA / QuestEval (QA-based faithfulness)
- FActScore (FACTSCORE)
- REALTIMEQA

