Practical survey of why LLMs hallucinate, how we measure it, and what fixes work today

Overview

Decision Snapshot: Needs Validation

This survey compiles and synthesizes many published findings and practitioner methods; it is ready as a practical reference but not a plug-and-play implementation guide.

Citations: 233

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 2/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Chen Xu, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi

Links

Abstract / PDF / Code

Why It Matters For Business

Hallucinations create real risks (misinformation, legal/medical errors, loss of trust). Businesses should treat factuality as a first-class metric when deploying LLMs in production.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

This is a focused, up-to-date survey of hallucination in large language models (LLMs). It (1) defines three concrete hallucination types (input-, context-, and fact-conflicting), (2) catalogs benchmarks and metrics used today, (3) traces likely causes across the LLM lifecycle (pretrain→SFT→RLHF→inference), and (4) reviews mitigation techniques from data curation and alignment to retrieval, decoding tricks, uncertainty estimation, and multi-agent checks. The survey highlights practical gaps: evaluation mismatches with humans, multi-lingual and multi-modal blind spots, and trade-offs that can create over-conservative or unstable models.

Problem Statement

LLMs can produce plausible but false output (hallucinations) that harm trust and safety. This paper asks: how do we define and measure LLM hallucination, what causes it across model development stages, and which mitigation methods are practical and effective today?

Main Contribution

A clear taxonomy of LLM hallucinations: input-conflicting, context-conflicting, and fact-conflicting.

A consolidated review of benchmarks and evaluation methods for factuality and hallucination.

Key Findings

Hallucination is multi-dimensional: input-, context-, and fact-conflicting types require different tests and fixes.

Practical Use: Pick evaluation and mitigation methods that match the hallucination type (e.g., check source alignment for input-conflict; use consistency checks for context-conflict; use retrieval for fact-conflict).

Evidence Ref: §2.2, Table 1
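The pairing of hallucination type and check described above can be sketched as a small lookup table. This is a minimal illustration, not code from the survey: the check names (`source_alignment`, `consistency_check`, `retrieval_verification`) and the `pick_check` helper are hypothetical labels for whatever implementations an application provides.

```python
from enum import Enum


class HallucinationType(Enum):
    INPUT_CONFLICTING = "input"      # output deviates from the user's input or source text
    CONTEXT_CONFLICTING = "context"  # output contradicts the model's own earlier output
    FACT_CONFLICTING = "fact"        # output contradicts established world knowledge


# Hypothetical check names; each would map to a real checker in an application.
CHECKS = {
    HallucinationType.INPUT_CONFLICTING: "source_alignment",
    HallucinationType.CONTEXT_CONFLICTING: "consistency_check",
    HallucinationType.FACT_CONFLICTING: "retrieval_verification",
}


def pick_check(kind: HallucinationType) -> str:
    """Return the name of the check to run for a given hallucination type."""
    return CHECKS[kind]
```

Keeping the mapping in one place makes it easy to see which hallucination types an application does not yet test for.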

Most benchmark effort focuses on fact-conflicting hallucination, not input- or context-conflicts.

Numbers: Benchmark list: TruthfulQA, FActScore, HaluEval, SimpleQA, etc.

Practical Use: Expect more evaluation tools for factuality; add your own tests for input/context consistency when building apps.

Evidence Ref: §2.2, §3.1, Table 3
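As one example of a home-grown input-consistency test, the sketch below flags numbers that appear in a model output but not in the source text it was asked to summarize. This is a deliberately crude heuristic chosen for illustration, not a method from the survey; the function name `unsupported_numbers` is hypothetical, and the check will miss paraphrased or reformatted figures (e.g., "twelve" vs. "12").

```python
import re

# Matches integers and simple decimals such as "12" or "4.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(source: str, output: str) -> list[str]:
    """Return numbers found in `output` that never occur in `source`."""
    source_numbers = set(NUMBER.findall(source))
    return [n for n in NUMBER.findall(output) if n not in source_numbers]


source = "Revenue grew 12% to 4.5 million in 2023."
summary = "Revenue grew 15% to 4.5 million in 2023."
print(unsupported_numbers(source, summary))  # ['15']
```

A non-empty result is a signal to review the output, not proof of an input-conflicting hallucination.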

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	gpt-4o 87.9%; gpt-4-turbo 86.0%; Llama 3.1 70b 87.0%	—	—	HaluEval (table)	Table 6: reported model scores on HaluEval	Table 6
Accuracy	gpt-4-turbo 59.0%; Mistral-Instruct-7B 52.3%	—	—	TruthfulQA (table)	Table 6 reports TruthfulQA results	Table 6

What To Try In 7 Days

Run a truthfulness benchmark (e.g., TruthfulQA or HaluEval) on your chosen LLM to get a baseline.

Enable retrieval (search/Wikipedia) for knowledge-heavy queries and track sources with every answer.

Switch decoding to greedy or factual-nucleus for high-stakes outputs and compare factuality vs. diversity trade-offs on a small test set.
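A baseline measurement of the kind suggested above can start from a very small harness. The sketch below assumes a hypothetical `ask_model` callable that wraps whichever LLM and decoding settings are under test, and it scores by case-insensitive exact match, which is a simplification: benchmarks such as TruthfulQA and HaluEval define their own scoring procedures.

```python
from typing import Callable


def accuracy(ask_model: Callable[[str], str],
             examples: list[tuple[str, str]]) -> float:
    """Fraction of (question, reference answer) pairs answered with an exact match."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(
        ask_model(question).strip().lower() == answer.strip().lower()
        for question, answer in examples
    )
    return correct / len(examples)


# Stand-in model for illustration only; replace with a real API or local call.
def stub_model(question: str) -> str:
    canned = {"Capital of France?": "Paris", "Largest planet?": "Saturn"}
    return canned.get(question, "")


examples = [("Capital of France?", "paris"), ("Largest planet?", "Jupiter")]
print(accuracy(stub_model, examples))  # 0.5
```

Running the same examples through two wrappers, for instance one using greedy decoding and one using sampling, gives a like-for-like comparison on the small test set.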

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/HillZhang1999/llm-hallucination-survey

Risks & Boundaries

Limitations

Survey focuses heavily on fact-conflicting hallucination; input- and context-conflicts receive less coverage.

Most referenced benchmarks and evaluations are English-centric and may not generalize to low-resource languages.

When Not To Use

Not a step-by-step implementation manual for a single mitigation pipeline.

Not a substitute for domain expert review in high-risk fields like medicine or law.

Failure Modes

Automatic metrics can misalign with human judgments across domains and LLMs.

RAG can introduce conflicting evidence or run-time latency that degrades UX.
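One way to check whether an automatic metric tracks human judgment on your own data is to label a small sample both ways and compute chance-corrected agreement. The sketch below uses Cohen's kappa, which is a standard agreement statistic but is an illustrative choice here rather than something the survey prescribes; the label lists are made-up examples.

```python
def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa between two equal-length lists of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items where both raters give the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both raters labelled independently at their own base rates.
    labels = set(a) | set(b)
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


human = [1, 1, 0, 0]    # 1 = hallucinated, per human annotators (illustrative)
metric = [1, 0, 0, 0]   # 1 = hallucinated, per the automatic metric (illustrative)
print(cohen_kappa(human, metric))  # 0.5
```

A low kappa on a given domain suggests the automatic metric should not be used there without human spot checks.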

Core Entities

Models

GPT-4, GPT-4o, gpt-4-turbo, GPT-3.5-Turbo, LLaMA, Llama 2, Llama 3.1, Claude-3

Metrics

Accuracy, Micro F1, ROUGE-L (F1), BERTScore, AlignScore, Self-contrast / Self-consistency, Precision/Recall/F1 (detection)
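For the detection metrics in this list, precision, recall, and F1 treat "hallucinated" as the positive class. A minimal sketch follows, with a hypothetical helper name `detection_prf` and made-up labels; zero denominators are mapped to 0.0 by convention.

```python
def detection_prf(gold: list[bool], predicted: list[bool]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for hallucination detection (True = hallucinated)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(g and p for g, p in zip(gold, predicted))        # correctly flagged
    fp = sum((not g) and p for g, p in zip(gold, predicted))  # flagged but actually fine
    fn = sum(g and (not p) for g, p in zip(gold, predicted))  # missed hallucinations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [True, True, False, False]
predicted = [True, False, True, False]
print(detection_prf(gold, predicted))  # (0.5, 0.5, 0.5)
```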

Datasets

TruthfulQA, FActScore, HaluEval, SimpleQA, KoLA-KC, SAFE, HalluQA, FELM, Alpaca

Benchmarks

TruthfulQA, FActScore, HaluEval, SimpleQA, KoLA-KC, SAFE

