Practical survey of why LLMs hallucinate, how we measure it, and what fixes work today

September 3, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

This survey compiles and synthesizes many published findings and practitioner methods; it is ready as a practical reference but not a plug-and-play implementation guide.

Citations: 233

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 2/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Chen Xu, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi

Why It Matters For Business

Hallucinations create real risks (misinformation, legal/medical errors, loss of trust). Businesses should treat factuality as a first-class metric when deploying LLMs in production.

Summary TLDR

This is a focused survey of hallucination in large language models (LLMs). It (1) defines three concrete hallucination types (input-, context-, and fact-conflicting), (2) catalogs the benchmarks and metrics in current use, (3) traces likely causes across the LLM lifecycle (pretraining→SFT→RLHF→inference), and (4) reviews mitigation techniques, from data curation and alignment to retrieval, decoding strategies, uncertainty estimation, and multi-agent checks. The survey highlights practical gaps: automatic evaluations that disagree with human judgments, multilingual and multimodal blind spots, and trade-offs that can make models over-conservative or unstable.

Problem Statement

LLMs can produce plausible but false output (hallucinations) that harm trust and safety. This paper asks: how do we define and measure LLM hallucination, what causes it across model development stages, and which mitigation methods are practical and effective today?

Main Contribution

A clear taxonomy of LLM hallucinations: input-conflicting, context-conflicting, and fact-conflicting.

A consolidated review of benchmarks and evaluation methods for factuality and hallucination.

Key Findings

Hallucination is multi-dimensional: input-, context-, and fact-conflicting types require different tests and fixes.

Practical Use: Pick evaluation and mitigation methods that match the hallucination type (e.g., check source alignment for input-conflict; use consistency checks for context-conflict; use retrieval for fact-conflict).

Evidence Ref: §2.2, Table 1

Most benchmark effort focuses on fact-conflicting hallucination, not input- or context-conflicts.

Numbers: Benchmarks listed include TruthfulQA, FActScore, HaluEval, and SimpleQA, among others.

Practical Use: Expect more evaluation tools for factuality; add your own tests for input/context consistency when building apps.

Evidence Ref: §2.2, §3.1, Table 3
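
The pairing of hallucination type to check in the first finding can be written down as a small lookup, with one toy check for the input-conflicting case. This is a minimal sketch, not something defined in the survey: the enum, the `CHECK_FOR_TYPE` mapping, and the `unsupported_numbers` helper are hypothetical names, and comparing numbers is only one crude form of source alignment.

```python
import re
from enum import Enum
from typing import Set


class HallucinationType(Enum):
    INPUT_CONFLICTING = "input"      # output contradicts the user-provided source
    CONTEXT_CONFLICTING = "context"  # output contradicts what the model said earlier
    FACT_CONFLICTING = "fact"        # output contradicts world knowledge


# Mapping taken from the "Practical Use" note; the string labels are illustrative.
CHECK_FOR_TYPE = {
    HallucinationType.INPUT_CONFLICTING: "source alignment",
    HallucinationType.CONTEXT_CONFLICTING: "consistency check",
    HallucinationType.FACT_CONFLICTING: "retrieval",
}

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(source: str, output: str) -> Set[str]:
    """Toy source-alignment check (assumption): numbers that appear in the
    output but nowhere in the source text."""
    return set(_NUMBER.findall(output)) - set(_NUMBER.findall(source))


if __name__ == "__main__":
    source = "Revenue rose 12% to 3.4 million in 2022."
    summary = "Revenue rose 15% to 3.4 million."
    print(CHECK_FOR_TYPE[HallucinationType.INPUT_CONFLICTING])
    print(unsupported_numbers(source, summary))
```

A non-empty result flags the summary for review; it does not prove a hallucination, since a number may be legitimately derived from the source.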

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | gpt-4o 87.9%; gpt-4-turbo 86.0%; Llama 3.1 70B 87.0% | - | - | HaluEval (table) | Table 6: reported model scores on HaluEval | Table 6
Accuracy | gpt-4-turbo 59.0%; Mistral-Instruct-7B 52.3% | - | - | TruthfulQA (table) | Table 6 reports TruthfulQA results | Table 6
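
For orientation, an accuracy figure like those above is the fraction of benchmark items a model labels correctly. The sketch below shows only that arithmetic, under stated assumptions: the `(question, gold_label)` pair format, the `normalize` helper, and the `answer_fn` callable are hypothetical stand-ins. The real benchmarks ship their own data formats and scoring scripts, which should be used for any reported number.

```python
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and trim so that ' Yes ' and 'yes' compare equal."""
    return text.strip().lower()


def accuracy(
    examples: Iterable[Tuple[str, str]],
    answer_fn: Callable[[str], str],
) -> float:
    """Exact-match accuracy of answer_fn over (question, gold_label) pairs."""
    items = list(examples)
    if not items:
        raise ValueError("need at least one example")
    correct = sum(
        normalize(answer_fn(question)) == normalize(gold)
        for question, gold in items
    )
    return correct / len(items)


if __name__ == "__main__":
    # Stub standing in for a real model call (assumption for illustration).
    canned = {"q1": "yes", "q2": "yes", "q3": " Yes ", "q4": "No"}
    data = [("q1", "Yes"), ("q2", "no"), ("q3", "yes"), ("q4", "no")]
    print(accuracy(data, lambda q: canned[q]))
```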

What To Try In 7 Days

Run a truthfulness benchmark (e.g., TruthfulQA or HaluEval) on your chosen LLM to get a baseline.

Enable retrieval (search/Wikipedia) for knowledge-heavy queries and track sources with every answer.

Switch decoding to greedy or factual-nucleus for high-stakes outputs and compare factuality vs. diversity trade-offs on a small test set.
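
To make the decoding comparison in the last item concrete, here is a minimal sketch over a toy next-token distribution. It assumes that "factual-nucleus" refers to top-p sampling whose `p` shrinks as a sentence progresses. The decay formula `max(floor, p * decay**(t-1))`, the default values, and all function names are assumptions for illustration, not details taken from this summary.

```python
import random
from typing import Dict, List


def greedy(probs: Dict[str, float]) -> str:
    """Always pick the single most probable token."""
    return max(probs, key=probs.get)


def nucleus_set(probs: Dict[str, float], top_p: float) -> List[str]:
    """Smallest set of top-ranked tokens whose cumulative probability
    reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append(token)
        total += p
        if total >= top_p:
            break
    return kept


def sample_nucleus(probs: Dict[str, float], top_p: float, rng: random.Random) -> str:
    """Sample from the nucleus, weighting tokens by their original probability."""
    kept = nucleus_set(probs, top_p)
    return rng.choices(kept, weights=[probs[t] for t in kept], k=1)[0]


def decayed_top_p(step_in_sentence: int, p: float = 0.9,
                  decay: float = 0.9, floor: float = 0.3) -> float:
    """Assumed schedule: p shrinks with the 1-based position inside a
    sentence and never drops below floor."""
    return max(floor, p * decay ** (step_in_sentence - 1))


if __name__ == "__main__":
    probs = {"Paris": 0.6, "Lyon": 0.25, "Nice": 0.1, "Rome": 0.05}
    rng = random.Random(0)
    print("greedy:", greedy(probs))
    for step in (1, 5, 12):
        p = decayed_top_p(step)
        print(step, round(p, 3), nucleus_set(probs, p), sample_nucleus(probs, p, rng))
```

As the step grows, the nucleus narrows toward the greedy choice, which is the factuality versus diversity trade-off the item asks you to measure on a small test set.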

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Survey focuses heavily on fact-conflicting hallucination; input- and context-conflicts receive less coverage.

Most referenced benchmarks and evaluations are English-centric and may not generalize to low-resource languages.

When Not To Use

Not a step-by-step implementation manual for a single mitigation pipeline.

Not a substitute for domain expert review in high-risk fields like medicine or law.

Failure Modes

Automatic metrics can misalign with human judgments across domains and LLMs.

RAG can introduce conflicting evidence or run-time latency that degrades UX.
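
The first failure mode can be checked on your own data by having people label a small sample and comparing those labels with the automatic metric's verdicts. The sketch below uses Cohen's kappa, a standard chance-corrected agreement statistic. The choice of kappa, the binary label encoding, and the sample data are assumptions for illustration and are not prescribed by the survey.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(a)
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both raters labeled independently at their own rates.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # 1 = "hallucinated", 0 = "faithful" (made-up labels for illustration).
    human = [1, 1, 0, 0, 1, 0, 1, 0]
    metric = [1, 0, 0, 0, 1, 1, 1, 0]
    print(cohen_kappa(human, metric))
```

A kappa near 0 means the metric agrees with humans no better than chance on that sample, which is a signal to avoid relying on it alone for that domain.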

Core Entities

Models

GPT-4, GPT-4o, gpt-4-turbo, GPT-3.5-Turbo, LLaMA, Llama 2, Llama 3.1, Claude-3

Metrics

Accuracy, Micro F1, ROUGE-L (F1), BERTScore, AlignScore, Self-contrast / Self-consistency, Precision/Recall/F1 (detection)

Datasets

TruthfulQA, FActScore, HaluEval, SimpleQA, KoLA-KC, SAFE, HalluQA, FELM, Alpaca

Benchmarks

TruthfulQA, FActScore, HaluEval, SimpleQA, KoLA-KC, SAFE