Practical survey: taxonomy, causes, detection, benchmarks, and fixes for hallucination in LLMs

Overview

Decision Snapshot: Needs Validation

The survey compiles broad evidence and references, making it actionable for engineers; specific solutions vary in maturity and cost, so apply recommendations selectively and validate on your workloads.

Citations: 207

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/2

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 40%

Authors

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, Ting Liu

Why It Matters For Business

Hallucinations make LLM outputs untrustworthy for decisions or customer-facing answers; mapping causes to fixes helps reduce risk in search, chatbots, and recommendations.

Who Should Care

ML Engineer, Product Manager, Data Scientist, CTO

Summary TLDR

This survey organizes what we know about hallucinations in large language models (LLMs). It proposes a clear two-part taxonomy (factuality vs faithfulness), traces causes across data, training, and inference, reviews detection methods and benchmarks, and maps mitigation techniques (data filtering, model editing, retrieval-augmentation, decoding and training fixes) to those causes. The paper flags practical gaps: retrieval-augmented systems still fail when retrieval or generation is weak, model editing and large-scale data filtering do not scale well, and vision-language models and knowledge-boundary probing need more work.

Problem Statement

LLMs often produce plausible but false or unverifiable text (hallucinations). Existing task-specific categories and defenses are incomplete for open-ended, instruction-following LLMs. We need a unified taxonomy, an account of root causes, robust detection benchmarks, and mitigation methods matched to causes.

Main Contribution

A clarified LLM-focused taxonomy splitting hallucinations into factuality and faithfulness types

A systematic analysis of causes across data, training, and inference stages

Key Findings

The paper redefines hallucination for LLMs into two main types: factuality (real-world fact mismatch) and faithfulness (deviation from instructions or context).

Practical Use: Detect and fix factual errors and instruction/context mismatches with different tools: fact-checking or retrieval for factuality, and context-aware checks for faithfulness.

Evidence Ref: §2.3
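
The split between the two types can be made concrete as a small routing step. The sketch below is illustrative only: the `Request` fields, the `asserts_facts` flag, and the check names are hypothetical and do not come from the survey.

```python
from dataclasses import dataclass
from enum import Enum


class HallucinationType(Enum):
    FACTUALITY = "factuality"      # output conflicts with real-world facts
    FAITHFULNESS = "faithfulness"  # output deviates from instructions or context


# Hypothetical mapping from hallucination type to the kind of check to run.
CHECKS = {
    HallucinationType.FACTUALITY: ["external fact-check", "retrieval lookup"],
    HallucinationType.FAITHFULNESS: ["context-consistency check"],
}


@dataclass
class Request:
    prompt: str
    context: str = ""           # supplied documents or conversation history
    asserts_facts: bool = True  # assumed to be set by an upstream classifier


def plan_checks(req: Request) -> list[str]:
    """Return the checks to run for a request, faithfulness checks first."""
    plan: list[str] = []
    if req.context:
        # Instruction-only prompts could be checked too; omitted for brevity.
        plan += CHECKS[HallucinationType.FAITHFULNESS]
    if req.asserts_facts:
        plan += CHECKS[HallucinationType.FACTUALITY]
    return plan
```

A summarization request with a source document and no open-world claims would receive only the context-consistency check, while a closed-book question would receive only the factuality checks.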

Large-scale evaluation collections exist and vary widely in size and focus; for example, HaluEval 2.0 contains 8,770 hallucination-prone questions across five domains.

Numbers: 8,770 questions

Practical Use: Use multi-domain benchmarks like HaluEval 2.0 when stress-testing models across realistic failure modes rather than small hand-crafted sets.

Evidence Ref: Table 4 / §4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
HaluEval 2.0 dataset size	8,770 questions across 5 domains	—	—	HaluEval 2.0	Table 4 / §4.2	Table 4 / §4.2
TruthfulQA dataset size	817 adversarial questions	—	—	TruthfulQA	Table 4 / §4.2	Table 4 / §4.2

What To Try In 7 Days

Run your model on an adversarial benchmark (TruthfulQA) and a domain benchmark (HaluEval/FreshQA) to find weak areas

Enable retrieval only for low-confidence answers (adaptive retrieval) to limit noisy context injection

Add a lightweight uncertainty check (e.g., low token-prob threshold or sampling-consistency) before exposing facts to users
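
The last two suggestions above combine naturally: compute a cheap confidence signal, and call the retriever only when it is low. The following is a minimal sketch under stated assumptions. The callables `generate`, `sample`, and `retrieve` are hypothetical stand-ins for your own model and retriever, and the two thresholds are placeholder values that would need tuning on your workload.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def min_token_prob(token_logprobs: Sequence[float]) -> float:
    """Lowest per-token probability in a generated answer."""
    return min(math.exp(lp) for lp in token_logprobs)


def sampling_agreement(samples: Sequence[str]) -> float:
    """Fraction of sampled answers that match the most common one."""
    normalized = [s.strip().lower() for s in samples]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


def is_low_confidence(
    token_logprobs: Sequence[float],
    samples: Sequence[str],
    prob_threshold: float = 0.2,       # assumed value, tune per workload
    agreement_threshold: float = 0.6,  # assumed value, tune per workload
) -> bool:
    """Flag an answer if any token is unlikely or the samples disagree."""
    return (
        min_token_prob(token_logprobs) < prob_threshold
        or sampling_agreement(samples) < agreement_threshold
    )


def answer_with_adaptive_retrieval(
    question: str,
    generate: Callable[[str], tuple[str, list[float]]],  # -> (answer, token log-probs)
    sample: Callable[[str, int], list[str]],             # -> n sampled answers
    retrieve: Callable[[str], str],                      # -> retrieved context
    n_samples: int = 5,
) -> tuple[str, bool]:
    """Return (answer, used_retrieval); retrieve only when confidence is low."""
    draft, logprobs = generate(question)
    samples = sample(question, n_samples)
    if not is_low_confidence(logprobs, samples):
        return draft, False
    context = retrieve(question)
    grounded, _ = generate(f"Context: {context}\n\nQuestion: {question}")
    return grounded, True
```

Exact string matching is a crude agreement measure; for free-form answers, an entailment or embedding-based comparison would likely be needed instead.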

Agent Features

Memory

parametric (model weights), non-parametric (retrieval datastore)

Tool Use

RAG (retriever + generator), external verifiers, knowledge graphs (KG prompting)

Frameworks

RLHF, SFT

Architectures

transformer, autoregressive, encoder-decoder

Optimization Features

Token Efficiency

context compression and summarization, selective retrieval

Model Optimization

attention-sharpening regularizers, bidirectional autoregressive variants (BATGPT)

System Optimization

post-hoc verify-and-edit pipelines, speculative decoding with nearest-neighbor

Training Optimization

in-context pretraining, topic-prefix factuality augmentation, up-sampling factual data

Inference Optimization

factual-nucleus sampling, contrastive decoding, DoLa (layer-contrast decoding), inference-time activation intervention (ITI)
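
Of the inference-time techniques listed, factual-nucleus sampling is the easiest to illustrate without a model. The sketch below assumes the commonly cited schedule p_t = max(ω, p · λ^(t−1)), where t is the token position within the current sentence; the default values for `p`, `decay` (λ), and `floor` (ω) are illustrative assumptions, and the toy dictionary distribution stands in for real model output.

```python
import random


def factual_nucleus_p(step_in_sentence: int, p: float = 0.9,
                      decay: float = 0.9, floor: float = 0.3) -> float:
    """Top-p value for the t-th token of a sentence (t starts at 1)."""
    return max(floor, p * decay ** (step_in_sentence - 1))


def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest high-probability set reaching top_p, renormalized."""
    kept: dict[str, float] = {}
    mass = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        mass += prob
        if mass >= top_p:
            break
    return {token: prob / mass for token, prob in kept.items()}


def sample_token(probs: dict[str, float], step_in_sentence: int,
                 rng: random.Random) -> str:
    """Sample one token with a nucleus that narrows later in the sentence."""
    filtered = top_p_filter(probs, factual_nucleus_p(step_in_sentence))
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

The caller is expected to reset `step_in_sentence` to 1 after each sentence-ending token, so the start of every sentence keeps more diversity while later tokens are sampled more conservatively.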

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Survey summarizes many studies but does not release code or a unifying benchmark suite.

Model-editing methods do not scale cleanly to large, continuous updates.

When Not To Use

Do not rely solely on LLM internal checks (parametric-only) for high-stakes facts

Avoid blind retrieval for trivial or memorized facts; it can introduce noise

Failure Modes

Sycophancy: model favors pleasing answers over truth (RLHF-induced)

Over-confidence: high-probability hallucinated tokens propagate errors

Core Entities

Models

GPT-3, GPT-4, LLaMA, Llama-2, Claude, Gemini, PaLM

Metrics

Accuracy, AUROC, Balanced Acc, Precision/Recall/F1, Likelihood score, Human judgment, LLM-judge Likert scores
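
For a hallucination detector that outputs a score per response, AUROC and balanced accuracy from the list above can be computed directly from labeled examples. This is a minimal sketch, assuming label 1 means "hallucinated" and a higher score means "more likely hallucinated"; the quadratic pairwise loop is only suitable for small evaluation sets.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def balanced_accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Mean of recall on the positive class and recall on the negative class."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    return 0.5 * (tp / n_pos + tn / n_neg)
```

Balanced accuracy is useful when hallucinated responses are rare, since plain accuracy would reward a detector that never flags anything.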

Datasets

The Pile, TruthfulQA, REALTIMEQA, FreshQA, HaluEval, HaluEval 2.0, Med-HALT, SelfCheckGPT-Wikibio, PopQA, Head-to-Tail, BAMBOO

Benchmarks

TruthfulQA, REALTIMEQA, FreshQA, HaluEval, HaluEval 2.0, Med-HALT, SelfCheckGPT-Wikibio, BAMBOO, FELM, PHD, LSum, SAC 3

Context Entities

Models

BART, PEGASUS, T5

Metrics

Entity overlap, Relation triple overlap, NLI-based entailment scores, Question-Answer matching scores
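
Entity overlap is the simplest of these faithfulness metrics to sketch. The example below assumes entities have already been extracted by some upstream NER step (not shown) and scores the fraction of output entities that also appear in the source; the function name and the lowercase normalization are illustrative choices, not a standard definition.

```python
def entity_precision(source_entities: set[str], output_entities: set[str]) -> float:
    """Fraction of entities in the output that are supported by the source."""
    source = {e.strip().lower() for e in source_entities}
    output = {e.strip().lower() for e in output_entities}
    if not output:
        return 1.0  # nothing asserted, so nothing unsupported
    supported = output & source
    return len(supported) / len(output)
```

A low score points to entities the model introduced that the source never mentions, which is a common signature of unfaithful summaries.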

Datasets

Wiki-derived QA sets, ExpertQA, Med-HALT, PopQA

Benchmarks

FEQA / QuestEval (QA-based faithfulness), FActScore (FACTSCORE), REALTIMEQA
