Overview
This survey compiles and synthesizes published findings and practitioner methods. It works as a practical reference, not as a plug-and-play implementation guide.
Citations: 233
Evidence Strength: 0.70
Confidence: 0.90
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 2/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
Hallucinations create real risks (misinformation, legal/medical errors, loss of trust). Businesses should treat factuality as a first-class metric when deploying LLMs in production.
Summary TLDR
This is a focused survey of hallucination in large language models (LLMs). It (1) defines three concrete hallucination types (input-, context-, and fact-conflicting), (2) catalogs the benchmarks and metrics in current use, (3) traces likely causes across the LLM lifecycle (pretraining → SFT → RLHF → inference), and (4) reviews mitigation techniques ranging from data curation and alignment to retrieval, decoding strategies, uncertainty estimation, and multi-agent checks. The survey highlights practical gaps: automatic evaluations that disagree with human judgments, multilingual and multimodal blind spots, and trade-offs that can make models over-conservative or unstable.
Problem Statement
LLMs can produce plausible but false output (hallucinations) that harm trust and safety. This paper asks: how do we define and measure LLM hallucination, what causes it across model development stages, and which mitigation methods are practical and effective today?
Main Contribution
A clear taxonomy of LLM hallucinations: input-conflicting, context-conflicting, and fact-conflicting.
A consolidated review of benchmarks and evaluation methods for factuality and hallucination.
Key Findings
Hallucination is multi-dimensional: input-, context-, and fact-conflicting types require different tests and fixes.
Most benchmark effort focuses on fact-conflicting hallucination, not input- or context-conflicts.
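Because the three types call for different tests, it can help to tag each reviewed output with its type and check how your own evaluation set is distributed. The following is a minimal sketch under stated assumptions: the names `HallucinationType`, `LabeledOutput`, and `count_by_type` are hypothetical, and the one-line definitions in the comments follow the way the taxonomy is commonly described rather than wording taken from this summary.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HallucinationType(Enum):
    # Definitions below are the commonly used ones (an assumption here).
    INPUT_CONFLICTING = "input-conflicting"      # contradicts the user's input
    CONTEXT_CONFLICTING = "context-conflicting"  # contradicts the model's own earlier output
    FACT_CONFLICTING = "fact-conflicting"        # contradicts established world knowledge


@dataclass
class LabeledOutput:
    text: str
    label: Optional[HallucinationType]  # None means no hallucination was found


def count_by_type(records):
    """Count annotated outputs per hallucination type, skipping clean ones."""
    return Counter(r.label for r in records if r.label is not None)
```

A count that is dominated by one type suggests the test set may share the benchmark skew noted above.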
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | gpt-4o 87.9%; gpt-4-turbo 86.0%; Llama 3.1 70B 87.0% | — | — | HaluEval (table) | Table 6: reported model scores on HaluEval | Table 6 |
| Accuracy | gpt-4-turbo 59.0%; Mistral-Instruct-7B 52.3% | — | — | TruthfulQA (table) | Table 6 reports TruthfulQA results | Table 6 |
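The Baseline and Delta columns in the table above are empty because the reported scores are not paired with a reference system. If you measure your own baseline, both values are simple to compute. This is a minimal sketch, assuming exact-match scoring of one predicted label per benchmark item; the function names `accuracy` and `delta_points` are hypothetical, and real benchmarks may define correctness differently.

```python
def accuracy(predictions, references):
    """Percentage of items where the predicted label equals the reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def delta_points(candidate_accuracy, baseline_accuracy):
    """Difference in percentage points; positive means the candidate scores higher."""
    return candidate_accuracy - baseline_accuracy
```

Reporting the delta in percentage points on the same split keeps the comparison unambiguous.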
What To Try In 7 Days
Run a truthfulness benchmark (e.g., TruthfulQA or HaluEval) on your chosen LLM to get a baseline.
Enable retrieval (search/Wikipedia) for knowledge-heavy queries and track sources with every answer.
Switch decoding to greedy or factual-nucleus for high-stakes outputs and compare factuality vs. diversity trade-offs on a small test set.
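For the decoding step, the sketch below assumes the commonly described form of factual-nucleus sampling, in which the top-p threshold decays with each token inside a sentence and resets at the next sentence: p_t = max(floor, p * decay^(t-1)). The function names and the default values (`p=0.9`, `decay=0.9`, `floor=0.3`) are illustrative assumptions, not settings taken from this survey. The idea is that early tokens in a sentence keep some diversity while later tokens are sampled more conservatively.

```python
def factual_nucleus_top_p(step_in_sentence, p=0.9, decay=0.9, floor=0.3):
    """Top-p threshold for the token at a 1-indexed position within the current sentence."""
    if step_in_sentence < 1:
        raise ValueError("step_in_sentence is 1-indexed")
    return max(floor, p * decay ** (step_in_sentence - 1))


def nucleus_filter(probs, top_p):
    """Return indices of the smallest set of tokens whose total probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept
```

At each generation step, the caller would compute the threshold from the position in the current sentence, filter the next-token distribution with it, renormalize, and sample. A very small threshold keeps only the most likely token, which behaves like greedy decoding.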
Risks & Boundaries
Limitations
Survey focuses heavily on fact-conflicting hallucination; input- and context-conflicts receive less coverage.
Most referenced benchmarks and evaluations are English-centric and may not generalize to low-resource languages.
When Not To Use
Not a step-by-step implementation manual for a single mitigation pipeline.
Not a substitute for domain expert review in high-risk fields like medicine or law.
Failure Modes
Automatic metrics can misalign with human judgments across domains and LLMs.
RAG can introduce conflicting evidence or run-time latency that degrades UX.

