Practical survey: taxonomy, causes, detection, benchmarks, and fixes for hallucination in LLMs

November 9, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The survey compiles broad evidence and references, making it actionable for engineers; specific solutions vary in maturity and cost, so apply recommendations selectively and validate on your workloads.

Citations: 207

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/2

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 40%

Authors

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, Ting Liu

Links

Abstract / PDF

Why It Matters For Business

Hallucinations make LLM outputs untrustworthy for decisions or customer-facing answers; mapping causes to fixes helps reduce risk in search, chatbots, and recommendations.

Who Should Care

Summary TLDR

This survey organizes what we know about hallucinations in large language models (LLMs). It proposes a clear two-part taxonomy (factuality vs faithfulness), traces causes across data, training, and inference, reviews detection methods and benchmarks, and maps mitigation techniques (data filtering, model editing, retrieval-augmentation, decoding and training fixes) to those causes. The paper flags practical gaps: retrieval-augmented systems still fail when retrieval or generation is weak, model editing and large-scale data filtering do not scale well, and vision-language models and knowledge-boundary probing need more work.

Problem Statement

LLMs often produce plausible but false or unverifiable text (hallucinations). Existing task-specific categories and defenses are incomplete for open-ended, instruction-following LLMs. We need a unified taxonomy, an account of root causes, robust detection benchmarks, and mitigation methods matched to causes.

Main Contribution

A clarified LLM-focused taxonomy splitting hallucinations into factuality and faithfulness types

A systematic analysis of causes across data, training, and inference stages

Key Findings

The paper redefines hallucination for LLMs into two main types: factuality (real-world fact mismatch) and faithfulness (deviation from instructions or context).

Practical Use: Detect and fix factual errors and instruction/context mismatches with different tools: fact-checking or retrieval for factuality, and context-aware checks for faithfulness.

Evidence Ref: §2.3

Large-scale evaluation collections exist and vary widely in size and focus; for example HaluEval 2.0 contains 8,770 hallucination-prone questions across five domains.

Numbers: 8,770 questions

Practical Use: Use multi-domain benchmarks like HaluEval 2.0 when stress-testing models across realistic failure modes rather than small hand-crafted sets.

Evidence Ref: Table 4 / §4.2
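When stress-testing on a multi-domain benchmark, a per-domain breakdown shows where a model is weakest more clearly than a single aggregate score. The sketch below is a minimal illustration: the record format (a domain label paired with a boolean hallucination flag) and the function name are assumptions made for this example, not the format of HaluEval 2.0 or any other benchmark.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def hallucination_rate_by_domain(
    records: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """Return the fraction of flagged answers per domain.

    `records` is a hypothetical format: one (domain, is_hallucinated)
    pair per benchmark question, produced by whatever judge you use.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for domain, is_hallucinated in records:
        totals[domain] += 1
        if is_hallucinated:
            flagged[domain] += 1
    return {domain: flagged[domain] / totals[domain] for domain in totals}


# Placeholder domains and labels, for illustration only.
example = [("finance", True), ("finance", False), ("science", False)]
rates = hallucination_rate_by_domain(example)
```

Sorting the resulting dictionary by rate gives a quick shortlist of domains that deserve retrieval support or extra verification.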

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
HaluEval 2.0 dataset size | 8,770 questions across 5 domains | - | - | HaluEval 2.0 | Table 4 / §4.2 | Table 4 / §4.2
TruthfulQA dataset size | 817 adversarial questions | - | - | TruthfulQA | Table 4 / §4.2 | Table 4 / §4.2

What To Try In 7 Days

Run your model on an adversarial benchmark (TruthfulQA) and a domain benchmark (HaluEval/FreshQA) to find weak areas

Enable retrieval only for low-confidence answers (adaptive retrieval) to limit noisy context injection

Add a lightweight uncertainty check (e.g., low token-prob threshold or sampling-consistency) before exposing facts to users
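The last two items can share one gate: estimate uncertainty from token probabilities and from agreement across sampled answers, then trigger retrieval only when either signal is weak. The sketch below is one possible shape for that gate. The function names, the thresholds (0.2 and 0.6), and the exact-match agreement measure are all assumptions for illustration and would need tuning on your own workload.

```python
import math
from collections import Counter
from typing import List, Sequence


def min_token_prob(token_logprobs: Sequence[float]) -> float:
    """Smallest per-token probability in a generated answer."""
    if not token_logprobs:
        return 0.0
    return math.exp(min(token_logprobs))


def sampling_consistency(answers: List[str]) -> float:
    """Fraction of sampled answers that match the most common answer.

    Exact match after lowercasing is a crude stand-in; a semantic
    similarity or NLI check would be more robust.
    """
    if not answers:
        return 0.0
    normalized = [a.strip().lower() for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


def needs_retrieval(
    token_logprobs: Sequence[float],
    sampled_answers: List[str],
    prob_threshold: float = 0.2,       # assumed value, tune per model
    agreement_threshold: float = 0.6,  # assumed value, tune per model
) -> bool:
    """Return True when either uncertainty signal is below its threshold."""
    low_prob = min_token_prob(token_logprobs) < prob_threshold
    low_agreement = sampling_consistency(sampled_answers) < agreement_threshold
    return low_prob or low_agreement
```

Answers that pass the gate can be served from the model alone. Answers that fail it are routed through retrieval or withheld, which keeps noisy context away from questions the model already answers consistently.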

Agent Features

Memory
parametric (model weights), non-parametric (retrieval datastore)
Tool Use
RAG (retriever + generator), external verifiers, knowledge graphs (KG prompting)
Frameworks
RLHF, SFT
Architectures
transformer, autoregressive, encoder-decoder

Optimization Features

Token Efficiency
context compression and summarization, selective retrieval
Model Optimization
attention-sharpening regularizers, bidirectional autoregressive variants (BATGPT)
System Optimization
post-hoc verify-and-edit pipelines, speculative decoding with nearest-neighbor
Training Optimization
in-context pretraining, topic-prefix factuality augmentation, up-sampling factual data
Inference Optimization
factual-nucleus sampling, contrastive decoding, DoLa (layer-contrast decoding), inference-time activation intervention (ITI)
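To make one of the inference-time items concrete, the sketch below illustrates contrastive decoding as it is commonly described. Each candidate token is scored by the log-probability from a stronger "expert" model minus the log-probability from a weaker "amateur" model, and only tokens the expert itself finds plausible are kept. The function names, the `alpha` default, and the toy distributions are assumptions for illustration. A real implementation would operate on full-vocabulary logits at every decoding step.

```python
import math
from typing import Dict


def contrastive_scores(
    expert_logprobs: Dict[str, float],
    amateur_logprobs: Dict[str, float],
    alpha: float = 0.1,  # assumed plausibility cutoff
) -> Dict[str, float]:
    """Score tokens as expert log-prob minus amateur log-prob.

    Tokens whose expert probability is below alpha times the expert's
    top probability are dropped, so an implausible token cannot win
    just because the amateur rates it even lower.
    Both dicts are assumed to cover the same candidate tokens.
    """
    cutoff = max(expert_logprobs.values()) + math.log(alpha)
    return {
        token: logprob - amateur_logprobs[token]
        for token, logprob in expert_logprobs.items()
        if logprob >= cutoff
    }


def pick_token(
    expert_logprobs: Dict[str, float],
    amateur_logprobs: Dict[str, float],
    alpha: float = 0.1,
) -> str:
    """Greedy choice over the contrastive scores."""
    scores = contrastive_scores(expert_logprobs, amateur_logprobs, alpha)
    return max(scores, key=scores.get)


# Toy next-token distributions, invented for illustration.
expert = {"Paris": math.log(0.50), "London": math.log(0.49), "zzz": math.log(0.01)}
amateur = {"Paris": math.log(0.399), "London": math.log(0.60), "zzz": math.log(0.001)}
choice = pick_token(expert, amateur)
```

In the toy case the token "zzz" has the largest expert-to-amateur ratio but is removed by the plausibility cutoff, which is why the cutoff matters. Layer-contrast methods such as DoLa are usually described as applying a similar contrast between layers of a single model rather than between two models.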

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Survey summarizes many studies but does not release code or a unifying benchmark suite.

Model-editing methods do not scale cleanly to large, continuous updates.

When Not To Use

Do not rely solely on LLM internal checks (parametric-only) for high-stakes facts

Avoid blind retrieval for trivial or memorized facts; it can introduce noise

Failure Modes

Sycophancy: model favors pleasing answers over truth (RLHF-induced)

Over-confidence: high-probability hallucinated tokens propagate errors

Core Entities

Models

GPT-3, GPT-4, LLaMA, Llama-2, Claude, Gemini, PaLM

Metrics

Accuracy, AUROC, Balanced Acc, Precision/Recall/F1, Likelihood score, Human judgment, LLM-judge Likert scores

Datasets

The Pile, TruthfulQA, REALTIMEQA, FreshQA, HaluEval, HaluEval 2.0, Med-HALT, SelfCheckGPT-Wikibio, PopQA, Head-to-Tail, BAMBOO

Benchmarks

TruthfulQA, REALTIMEQA, FreshQA, HaluEval, HaluEval 2.0, Med-HALT, SelfCheckGPT-Wikibio, BAMBOO, FELM, PHD, LSum, SAC 3

Context Entities

Models

BART, PEGASUS, T5

Metrics

Entity overlap, Relation triple overlap, NLI-based entailment scores, Question-Answer matching scores

Datasets

Wiki-derived QA sets, ExpertQA, Med-HALT, PopQA

Benchmarks

FEQA / QuestEval (QA-based faithfulness), FActScore (FACTSCORE), REALTIMEQA