Practical survey: taxonomy, causes, detection, benchmarks, and fixes for hallucination in LLMs

November 9, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

207

Authors

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, Ting Liu

Why It Matters For Business

Hallucinations make LLM outputs untrustworthy for decisions or customer-facing answers; mapping causes to fixes helps reduce risk in search, chatbots, and recommendations.

Summary TLDR

This survey organizes what we know about hallucinations in large language models (LLMs). It proposes a clear two-part taxonomy (factuality vs faithfulness), traces causes across data, training, and inference, reviews detection methods and benchmarks, and maps mitigation techniques (data filtering, model editing, retrieval-augmentation, decoding and training fixes) to those causes. The paper flags practical gaps: retrieval-augmented systems still fail when retrieval or generation is weak, model editing and large-scale data filtering do not scale well, and vision-language models and knowledge-boundary probing need more work.

Problem Statement

LLMs often produce plausible but false or unverifiable text (hallucinations). Existing task-specific categories and defenses are incomplete for open-ended, instruction-following LLMs. We need a unified taxonomy, an account of root causes, robust detection benchmarks, and mitigation methods matched to causes.

Main Contribution

A clarified LLM-focused taxonomy splitting hallucinations into factuality and faithfulness types

A systematic analysis of causes across data, training, and inference stages

A structured review of detection methods and a catalogue of existing benchmarks

A survey of mitigation techniques mapped to their causal origins (data, training, inference)

An in-depth discussion of limits in retrieval-augmented generation (RAG) and future directions (vision-language models, knowledge boundaries)

Key Findings

The paper divides LLM hallucination into two main types: factuality hallucination (output conflicts with real-world facts) and faithfulness hallucination (output deviates from the user's instructions, the provided context, or its own internal logic).

Evaluation benchmarks vary widely in size and focus; for example, HaluEval 2.0 contains 8,770 hallucination-prone questions across five domains.

Numbers: 8,770 questions

TruthfulQA, an adversarial benchmark targeting imitative falsehoods (false answers a model picks up from human-written text), contains 817 questions designed to elicit training-data-driven hallucinations.

Numbers: 817 questions

Retrieval-augmented generation (RAG) helps fill knowledge gaps but two dominant failure modes remain: retrieval failure and a generation bottleneck (model ignoring or misusing retrieved evidence).
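
The two failure modes can be told apart on a labeled evaluation set. The following is a minimal sketch, assuming each test item comes with a gold answer and a gold evidence snippet; the function name, the labels, and the substring matching are hypothetical simplifications rather than anything prescribed by the survey.

```python
from typing import List


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so substring checks are less brittle."""
    return " ".join(text.lower().split())


def triage_rag_answer(
    passages: List[str],
    gold_evidence: str,
    answer: str,
    gold_answer: str,
) -> str:
    """Classify one RAG output as ok, retrieval_failure, or generation_bottleneck.

    Assumption: substring matching is a crude proxy for answer correctness
    and for evidence coverage; a real pipeline would use a stronger matcher.
    """
    if _norm(gold_answer) in _norm(answer):
        return "ok"
    evidence_retrieved = any(_norm(gold_evidence) in _norm(p) for p in passages)
    # Evidence was available but the answer is still wrong: the generator
    # ignored or misused it. Otherwise the retriever never surfaced it.
    return "generation_bottleneck" if evidence_retrieved else "retrieval_failure"
```

Counting the two labels over a test set indicates whether effort is better spent on the retriever or on how the generator uses its context.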

Results

HaluEval 2.0 dataset size

Value: 8,770 questions across 5 domains

TruthfulQA dataset size

Value: 817 adversarial questions

Who Should Care

What To Try In 7 Days

Run your model on an adversarial benchmark (TruthfulQA) and a domain benchmark (HaluEval/FreshQA) to find weak areas

Enable retrieval only for low-confidence answers (adaptive retrieval) to limit noisy context injection

Add a lightweight uncertainty check (e.g., flag answers whose token probabilities fall below a threshold, or whose repeated samples disagree) before exposing facts to users
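
A minimal sketch of the uncertainty check in the last item, assuming your serving stack exposes per-token log-probabilities and lets you sample several answers to the same question. The function names and both thresholds are hypothetical and would need tuning on your own data; exact-match agreement is a crude stand-in for the semantic comparison that sampling-consistency methods normally use.

```python
import math
from collections import Counter
from typing import List


def min_token_prob(token_logprobs: List[float]) -> float:
    """Probability of the least likely token in the generated answer."""
    return min(math.exp(lp) for lp in token_logprobs)


def sampling_agreement(samples: List[str]) -> float:
    """Fraction of sampled answers equal to the most common one.

    Assumption: answers are short enough that normalized exact match
    is a usable proxy for "the samples say the same thing".
    """
    normalized = [" ".join(s.lower().split()) for s in samples]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


def should_expose(
    token_logprobs: List[float],
    samples: List[str],
    prob_floor: float = 0.2,       # hypothetical threshold
    agreement_floor: float = 0.6,  # hypothetical threshold
) -> bool:
    """Expose the answer only if both uncertainty signals look acceptable."""
    return (
        min_token_prob(token_logprobs) >= prob_floor
        and sampling_agreement(samples) >= agreement_floor
    )
```

Answers that fail the gate can be routed to retrieval, to a human, or to an "I am not sure" response instead of being shown as fact.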

Agent Features

Memory

  • parametric (model weights)
  • non-parametric (retrieval datastore)

Tool Use

  • RAG (retriever + generator)
  • external verifiers
  • knowledge graphs (KG prompting)

Frameworks

  • RLHF
  • SFT

Architectures

  • transformer
  • autoregressive
  • encoder-decoder

Optimization Features

Token Efficiency

  • context compression and summarization
  • selective retrieval
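
Selective retrieval can be sketched as a confidence gate in front of the retriever. This is an illustration under stated assumptions: `generate` and `retrieve` are hypothetical callables you would supply (the first returns an answer plus a confidence in [0, 1]), and the threshold and prompt wording are placeholders.

```python
from typing import Callable, List, Tuple


def answer_with_selective_retrieval(
    question: str,
    generate: Callable[[str], Tuple[str, float]],  # hypothetical: prompt -> (answer, confidence)
    retrieve: Callable[[str], List[str]],          # hypothetical: query -> ranked passages
    confidence_floor: float = 0.7,                 # placeholder threshold
    max_passages: int = 3,
) -> Tuple[str, bool]:
    """Return (answer, used_retrieval).

    Retrieval is skipped when the closed-book draft is confident, which
    saves context tokens and avoids injecting noisy passages.
    """
    draft, confidence = generate(question)
    if confidence >= confidence_floor:
        return draft, False

    passages = retrieve(question)[:max_passages]
    if not passages:
        # Nothing useful came back; fall back to the draft.
        return draft, False

    context = "\n".join(f"- {p}" for p in passages)
    prompt = f"Use only the evidence below.\n{context}\n\nQuestion: {question}"
    answer, _ = generate(prompt)
    return answer, True
```

Capping `max_passages` keeps the injected context short, which also limits exposure to the lost-in-the-middle effect on long inputs.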

Model Optimization

  • attention-sharpening regularizers
  • bidirectional autoregressive variants (BATGPT)

System Optimization

  • post-hoc verify-and-edit pipelines
  • speculative decoding with nearest-neighbor

Training Optimization

  • in-context pretraining
  • topic-prefix factuality augmentation
  • up-sampling factual data

Inference Optimization

  • factual-nucleus sampling
  • contrastive decoding
  • DoLa (layer-contrast decoding)
  • inference-time activation intervention (ITI)
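
As an illustration of the first item, factual-nucleus sampling is usually described as shrinking the nucleus size within each sentence, p_t = max(ω, p · λ^(t−1)), with t reset at every sentence start. The sketch below assumes that formulation; the default values and function names are illustrative, and the token distribution is a plain dict rather than real model logits.

```python
from typing import Dict


def factual_nucleus_p(
    step_in_sentence: int,
    p: float = 0.9,       # starting nucleus size (illustrative)
    decay: float = 0.9,   # lambda: per-token decay (illustrative)
    floor: float = 0.3,   # omega: lower bound on the nucleus (illustrative)
) -> float:
    """Nucleus size for the t-th token of the current sentence (t starts at 1)."""
    return max(floor, p * decay ** (step_in_sentence - 1))


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest high-probability token set reaching top_p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept: Dict[str, float] = {}
    total = 0.0
    for token, prob in ranked:
        kept[token] = prob
        total += prob
        if total >= top_p:
            break
    return {token: prob / total for token, prob in kept.items()}
```

Early tokens in a sentence are sampled from a wide nucleus, which preserves diversity, while later tokens are sampled more greedily, where a random pick is more likely to produce a wrong fact.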

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Survey summarizes many studies but does not release code or a unifying benchmark suite.
  • Model-editing methods do not scale cleanly to large, continuous updates.
  • Data-filtering solutions face efficiency and coverage gaps at web scale.
  • RAG solutions still suffer from retriever/source quality and generation misuse.

When Not To Use

  • Do not rely solely on LLM internal checks (parametric-only) for high-stakes facts
  • Avoid blind retrieval for trivial or memorized facts; it can introduce noise
  • Avoid trusting LLM self-revision without external evidence for safety-critical content

Failure Modes

  • Sycophancy: model favors pleasing answers over truth (RLHF-induced)
  • Over-confidence: high-probability hallucinated tokens propagate errors
  • Snowballing errors: early wrong token leads to cascading hallucination
  • Retrieval bias: retrievers prefer LLM-generated or semantically incomplete contexts
  • Lost-in-the-middle: important context in long windows is under-attended

Core Entities

Models

  • GPT-3
  • GPT-4
  • LLaMA
  • Llama-2
  • Claude
  • Gemini
  • PaLM

Metrics

  • Accuracy
  • AUROC
  • Balanced Acc
  • Precision/Recall/F1
  • Likelihood score
  • Human judgment
  • LLM-judge Likert scores
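
For hallucination detectors that emit a score per response, the threshold-free and thresholded metrics above can be computed as follows. This is a dependency-free sketch; it assumes label 1 means "hallucinated" and that higher scores mean "more likely hallucinated".

```python
from typing import List, Tuple


def auroc(scores: List[float], labels: List[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        (1.0 if p > n else 0.5 if p == n else 0.0) for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def confusion(preds: List[int], labels: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary predictions."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    return tp, fp, fn, tn


def precision_recall_f1(preds: List[int], labels: List[int]) -> Tuple[float, float, float]:
    tp, fp, fn, _ = confusion(preds, labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def balanced_accuracy(preds: List[int], labels: List[int]) -> float:
    """Mean of recall on the positive class and recall on the negative class."""
    tp, fp, fn, tn = confusion(preds, labels)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return (tpr + tnr) / 2
```

Balanced accuracy is the safer headline number when hallucinated responses are a small share of the test set, since plain accuracy rewards always predicting the majority class.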

Datasets

  • The Pile
  • TruthfulQA
  • REALTIMEQA
  • FreshQA
  • HaluEval
  • HaluEval 2.0
  • Med-HALT
  • SelfCheckGPT-Wikibio
  • PopQA
  • Head-to-Tail
  • BAMBOO

Benchmarks

  • TruthfulQA
  • REALTIMEQA
  • FreshQA
  • HaluEval
  • HaluEval 2.0
  • Med-HALT
  • SelfCheckGPT-Wikibio
  • BAMBOO
  • FELM
  • PHD
  • LSum
  • SAC³

Context Entities

Models

  • BART
  • PEGASUS
  • T5

Metrics

  • Entity overlap
  • Relation triple overlap
  • NLI-based entailment scores
  • Question-Answer matching scores
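
Entity overlap, the simplest of these faithfulness metrics, can be sketched as follows. The regex-based extractor is a deliberately crude, hypothetical stand-in for a real named-entity recognizer (it also picks up capitalized sentence-initial words), so treat the output as illustrative only.

```python
import re
from typing import Set

# Capitalized word runs ("Marie Curie") and numbers ("1903", "8,770").
_ENTITY_PATTERN = re.compile(
    r"\b(?:[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*|\d+(?:[.,]\d+)*)\b"
)


def naive_entities(text: str) -> Set[str]:
    """Crude entity extraction; swap in a proper NER model for real use."""
    return {match.lower() for match in _ENTITY_PATTERN.findall(text)}


def unsupported_entities(source: str, generated: str) -> Set[str]:
    """Entities in the generated text that never appear in the source."""
    return naive_entities(generated) - naive_entities(source)


def entity_precision(source: str, generated: str) -> float:
    """Share of generated entities that are also present in the source."""
    generated_entities = naive_entities(generated)
    if not generated_entities:
        return 1.0  # nothing was asserted, so nothing is unsupported
    supported = generated_entities & naive_entities(source)
    return len(supported) / len(generated_entities)
```

For example, a summary that changes "1903" to "1911" keeps the names but introduces one unsupported entity, which lowers the precision and surfaces the altered year for review.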

Datasets

  • Wiki-derived QA sets
  • ExpertQA
  • Med-HALT
  • PopQA

Benchmarks

  • FEQA / QuestEval (QA-based faithfulness)
  • FActScore (FACTSCORE)
  • REALTIMEQA