Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
233
Why It Matters For Business
Hallucinations create real risks (misinformation, legal/medical errors, loss of trust). Businesses should treat factuality as a first-class metric when deploying LLMs in production.
Summary TLDR
This is a focused, up-to-date survey of hallucination in large language models (LLMs). It (1) defines three concrete hallucination types (input-, context-, and fact-conflicting), (2) catalogs benchmarks and metrics used today, (3) traces likely causes across the LLM lifecycle (pretrain→SFT→RLHF→inference), and (4) reviews mitigation techniques from data curation and alignment to retrieval, decoding tricks, uncertainty estimation, and multi-agent checks. The survey highlights practical gaps: evaluation mismatches with humans, multi-lingual and multi-modal blind spots, and trade-offs that can create over-conservative or unstable models.
Problem Statement
LLMs can produce plausible but false output (hallucinations) that harm trust and safety. This paper asks: how do we define and measure LLM hallucination, what causes it across model development stages, and which mitigation methods are practical and effective today?
Main Contribution
A clear taxonomy of LLM hallucinations: input-conflicting, context-conflicting, and fact-conflicting.
A consolidated review of benchmarks and evaluation methods for factuality and hallucination.
A life-cycle analysis of hallucination sources (pretraining, SFT, RLHF, decoding), with mapped mitigation techniques.
A practical inventory of mitigation methods: data curation, honesty-oriented SFT/RL, retrieval/tool use, decoding interventions, uncertainty estimation, and multi-agent checks.
A public pointer to continuously updated resources and code at the authors' GitHub repo.
Key Findings
Hallucination is multi-dimensional: input-, context-, and fact-conflicting types require different tests and fixes.
Most benchmark effort focuses on fact-conflicting hallucination, not input- or context-conflicts.
Retrieval and tool-augmented methods (RAG, search, tool calls) consistently reduce factual errors in practice, but they introduce trade-offs in verification effort and efficiency.
Model alignment via RLHF can boost truthfulness but may create over-conservatism where the model refuses answerable queries.
Decoding choices and uncertainty estimation are low-cost inference-time tools: greedy or factuality-aware decoding reduces hallucinations, while self-consistency checks and verbalized confidence help flag them.
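As a rough illustration of the consistency idea in the last finding, the sketch below flags a prompt as risky when several sampled answers disagree. It is a minimal sketch under stated assumptions, not a method taken from the survey: the function names, the exact-match normalization, and the 0.6 threshold are all hypothetical choices, and the sampled answers are assumed to come from whatever model you are testing.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def consistency_check(samples: list[str], threshold: float = 0.6) -> dict:
    """Flag a prompt as risky when its sampled answers disagree.

    `samples` holds several answers sampled for the same prompt.
    `threshold` is an illustrative cutoff (an assumption, not a survey value).
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(s) for s in samples)
    majority, votes = counts.most_common(1)[0]
    agreement = votes / len(samples)
    return {
        "majority_answer": majority,
        "agreement": agreement,
        "flagged": agreement < threshold,
    }


if __name__ == "__main__":
    # Three of four samples agree, so the prompt is not flagged.
    print(consistency_check(["Paris", "paris ", "Paris", "Lyon"]))
```

Exact string matching only suits short answers; for longer outputs a semantic comparison would likely be needed, which this sketch does not attempt.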
Results
Accuracy
Pretraining data scale examples
Who Should Care
What To Try In 7 Days
Run a truthfulness benchmark (e.g., TruthfulQA or HaluEval) on your chosen LLM to get a baseline.
Enable retrieval (search/Wikipedia) for knowledge-heavy queries and track sources with every answer.
Switch decoding to greedy or factual-nucleus sampling for high-stakes outputs, and compare factuality against diversity on a small test set.
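The comparison in the last step can be sketched with a small harness like the one below. Everything here is an assumption for illustration: `generate` stands in for any function that calls your model with a given decoding setting, the substring match is only a crude proxy for factuality, and distinct-1 (unique tokens divided by total tokens) is one simple diversity signal among many. It does not reproduce the scoring used by TruthfulQA or HaluEval.

```python
from typing import Callable, Iterable


def factual_hit(answer: str, references: Iterable[str]) -> bool:
    """Crude proxy: the answer mentions at least one accepted reference string."""
    text = answer.lower()
    return any(ref.lower() in text for ref in references)


def distinct_1(answers: list[str]) -> float:
    """Share of unique tokens across all answers; a rough diversity signal."""
    tokens = [tok for a in answers for tok in a.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def evaluate(
    generate: Callable[[str], str],
    test_set: list[tuple[str, list[str]]],
) -> dict:
    """Run `generate` on each question and report factuality and diversity.

    `test_set` pairs each question with a list of accepted answer strings.
    """
    if not test_set:
        raise ValueError("test_set must not be empty")
    answers = [generate(question) for question, _ in test_set]
    hits = sum(
        factual_hit(answer, refs)
        for answer, (_, refs) in zip(answers, test_set)
    )
    return {
        "factual_rate": hits / len(test_set),
        "distinct_1": distinct_1(answers),
    }
```

To use it, wrap each decoding configuration in its own `generate` function, call `evaluate` once per configuration on the same test set, and compare the two result dictionaries side by side.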
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Survey focuses heavily on fact-conflicting hallucination; input- and context-conflicts receive less coverage.
- Most referenced benchmarks and evaluations are English-centric and may not generalize to low-resource languages.
- As a survey, it summarizes work but does not provide new empirical experiments or single-source benchmarks.
When Not To Use
- Not a step-by-step implementation manual for a single mitigation pipeline.
- Not a substitute for domain expert review in high-risk fields like medicine or law.
Failure Modes
- Automatic metrics can misalign with human judgments across domains and LLMs.
- RAG can introduce conflicting evidence or run-time latency that degrades UX.
- RLHF or honesty tuning can make models over-conservative and refuse valid answers.
Core Entities
Models
- GPT-4
- GPT-4o
- gpt-4-turbo
- GPT-3.5-Turbo
- LLaMA
- Llama 2
- Llama 3.1
- Claude-3
Metrics
- Accuracy
- Micro F1
- ROUGE-L (F1)
- BERTScore
- AlignScore
- Self-contrast / Self-consistency
- Precision/Recall/F1 (detection)
Datasets
- TruthfulQA
- FActScore
- HaluEval
- SimpleQA
- KoLA-KC
- SAFE
- HalluQA
- FELM
- Alpaca
Benchmarks
- TruthfulQA
- FActScore
- HaluEval
- SimpleQA
- KoLA-KC
- SAFE

