Practical survey of why LLMs hallucinate, how we measure it, and what fixes work today

September 3, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

233

Authors

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Chen Xu, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi

Why It Matters For Business

Hallucinations create real risks (misinformation, legal/medical errors, loss of trust). Businesses should treat factuality as a first-class metric when deploying LLMs in production.

Summary TLDR

This is a focused, up-to-date survey of hallucination in large language models (LLMs). It (1) defines three concrete hallucination types (input-, context-, and fact-conflicting), (2) catalogs benchmarks and metrics used today, (3) traces likely causes across the LLM lifecycle (pretrain→SFT→RLHF→inference), and (4) reviews mitigation techniques from data curation and alignment to retrieval, decoding tricks, uncertainty estimation, and multi-agent checks. The survey highlights practical gaps: evaluation mismatches with humans, multi-lingual and multi-modal blind spots, and trade-offs that can create over-conservative or unstable models.

Problem Statement

LLMs can produce plausible but false output (hallucinations) that harm trust and safety. This paper asks: how do we define and measure LLM hallucination, what causes it across model development stages, and which mitigation methods are practical and effective today?

Main Contribution

A clear taxonomy of LLM hallucinations: input-conflicting, context-conflicting, and fact-conflicting.

A consolidated review of benchmarks and evaluation methods for factuality and hallucination.

A life-cycle analysis of hallucination sources (pretraining, SFT, RLHF, decoding), with mapped mitigation techniques.

A practical inventory of mitigation methods: data curation, honesty-oriented SFT/RL, retrieval/tool use, decoding interventions, uncertainty estimation, and multi-agent checks.

A public pointer to continuously updated resources and code at the authors' GitHub repo.
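
One way to make the three-way taxonomy above usable in an evaluation set is to encode it as labels. The sketch below is illustrative only: the class and function names are hypothetical, and the one-line glosses in the comments are a paraphrase of the three types, not wording taken from this summary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HallucinationType(Enum):
    # Output deviates from what the user supplied (e.g., the text to summarize).
    INPUT_CONFLICTING = "input-conflicting"
    # Output contradicts something the model itself generated earlier.
    CONTEXT_CONFLICTING = "context-conflicting"
    # Output contradicts established world knowledge.
    FACT_CONFLICTING = "fact-conflicting"


@dataclass
class LabeledOutput:
    prompt: str
    output: str
    label: Optional[HallucinationType]  # None means no hallucination was found


def count_by_type(examples):
    """Count labeled hallucinations per type; unlabeled outputs are skipped."""
    counts = {t: 0 for t in HallucinationType}
    for ex in examples:
        if ex.label is not None:
            counts[ex.label] += 1
    return counts
```

Keeping the type on every labeled example makes it easy to see whether a test set covers only fact-conflicting cases, which is the coverage gap the survey points out.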

Key Findings

Hallucination is multi-dimensional: input-, context-, and fact-conflicting types require different tests and fixes.

Most benchmark effort focuses on fact-conflicting hallucination, not input- or context-conflicts.

Numbers: Benchmark list: TruthfulQA, FActScore, HaluEval, SimpleQA, etc.

Retrieval and tool-augmented methods (RAG, search, tool calls) consistently reduce factual errors in practice, but introduce verifier and efficiency trade-offs.

Model alignment via RLHF can boost truthfulness but may create over-conservatism where the model refuses answerable queries.

Numbers: GPT-4 accuracy uplift on TruthfulQA, ~30% → ~60% reported

Decoding choices and uncertainty detection are low-cost, effective inference-time tools: greedy or factuality-aware decoding helps reduce hallucinations, while consistency checks and verbalized confidence help flag them.
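
The consistency idea in the last finding can be sketched as follows: sample the same question several times and treat low agreement among the answers as a warning sign. This is a minimal sketch under stated assumptions, not a method prescribed by the survey. The function names, the exact-match normalization, and the 0.6 threshold are all hypothetical, and the sampled answers are assumed to come from your own model calls.

```python
import re
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", answer.lower())
    return " ".join(cleaned.split())


def consistency_score(samples):
    """Return the most common normalized answer and the share of samples that match it."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(normalize(s) for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def flag_if_inconsistent(samples, threshold=0.6):
    """Flag a question when agreement falls below the (hypothetical) threshold."""
    answer, score = consistency_score(samples)
    return {"answer": answer, "agreement": score, "flagged": score < threshold}
```

Exact string matching only suits short, closed-form answers. For free-form text, the agreement test would need a semantic comparison instead, which this sketch does not attempt.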

Results

Accuracy

Value: gpt-4o 87.9%; gpt-4-turbo 86.0%; Llama 3.1 70b 87.0%

Accuracy

Value: gpt-4-turbo 59.0%; Mistral-Instruct-7B 52.3%

Pretraining data scale examples

Value: Llama 2 ~2T tokens, LLaMA ~1.4T, GPT-3 ~300B

Who Should Care

What To Try In 7 Days

Run a truthfulness benchmark (e.g., TruthfulQA or HaluEval) on your chosen LLM to get a baseline.

Enable retrieval (search/Wikipedia) for knowledge-heavy queries and track sources with every answer.

Switch decoding to greedy or factual-nucleus for high-stakes outputs and compare factuality vs. diversity trade-offs on a small test set.
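
For the decoding comparison in the last step, it may help to see what "factual-nucleus" changes relative to plain nucleus (top-p) sampling. As commonly described, the top-p cutoff decays with each token inside a sentence, down to a floor, and resets at the start of the next sentence. The sketch below assumes that schedule. The default values are illustrative rather than recommended settings, and the toy probability table stands in for a real model's next-token distribution.

```python
def factual_nucleus_p(step_in_sentence, p=0.9, decay=0.9, floor=0.3):
    """Top-p cutoff for the given token position (1-based) within the current sentence.

    Assumed schedule: p_t = max(floor, p * decay ** (t - 1)); the caller resets
    step_in_sentence to 1 whenever a new sentence starts.
    """
    if step_in_sentence < 1:
        raise ValueError("step_in_sentence is 1-based")
    return max(floor, p * decay ** (step_in_sentence - 1))


def nucleus_filter(probs, top_p):
    """Keep the highest-probability tokens until their mass reaches top_p, then renormalize."""
    kept = {}
    total = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= top_p:
            break
    return {token: prob / total for token, prob in kept.items()}


# Usage: the candidate set narrows as the sentence progresses.
toy_probs = {"Paris": 0.6, "Lyon": 0.3, "Nice": 0.1}
early = nucleus_filter(toy_probs, factual_nucleus_p(1))
late = nucleus_filter(toy_probs, factual_nucleus_p(20))
```

Greedy decoding is the limiting case where only the single most likely token is kept at every step, which is why it trades diversity for factuality most aggressively.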

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Survey focuses heavily on fact-conflicting hallucination; input- and context-conflicts receive less coverage.
  • Most referenced benchmarks and evaluations are English-centric and may not generalize to low-resource languages.
  • As a survey, it summarizes work but does not provide new empirical experiments or single-source benchmarks.

When Not To Use

  • Not a step-by-step implementation manual for a single mitigation pipeline.
  • Not a substitute for domain expert review in high-risk fields like medicine or law.

Failure Modes

  • Automatic metrics can misalign with human judgments across domains and LLMs.
  • RAG can introduce conflicting evidence or run-time latency that degrades UX.
  • RLHF or honesty tuning can make models over-conservative and refuse valid answers.
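
The over-conservatism failure mode is easier to catch if refusals are tracked separately from wrong answers. The following sketch assumes each evaluation record carries the response text, a correctness judgment, and a flag saying whether the question was answerable. The phrase list used to detect refusals is hypothetical and would need tuning for a real model.

```python
# Hypothetical refusal phrases; real models phrase refusals in many other ways.
REFUSAL_MARKERS = ("i don't know", "i do not know", "i cannot answer", "i'm not sure")


def is_refusal(response):
    """Crude substring check for a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def summarize(records):
    """records: iterable of (response, is_correct, is_answerable) tuples."""
    answerable = [r for r in records if r[2]]
    if not answerable:
        raise ValueError("need at least one answerable question")
    refused = [r for r in answerable if is_refusal(r[0])]
    attempted = [r for r in answerable if not is_refusal(r[0])]
    correct = sum(1 for r in attempted if r[1])
    return {
        # Share of answerable questions the model declined to answer.
        "over_refusal_rate": len(refused) / len(answerable),
        # Accuracy computed only over the questions the model attempted.
        "accuracy_when_attempted": correct / len(attempted) if attempted else 0.0,
    }
```

Reporting both numbers before and after an alignment or honesty-tuning change shows whether a gain in accuracy was bought with more refusals on answerable questions.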

Core Entities

Models

  • GPT-4
  • GPT-4o
  • gpt-4-turbo
  • GPT-3.5-Turbo
  • LLaMA
  • Llama 2
  • Llama 3.1
  • Claude-3

Metrics

  • Accuracy
  • Micro F1
  • ROUGE-L (F1)
  • BERTScore
  • AlignScore
  • Self-contrast / Self-consistency
  • Precision/Recall/F1 (detection)
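
For the detection metrics in the list above, precision, recall, and F1 are computed over binary "is this output hallucinated?" labels. A minimal sketch, assuming gold labels from human annotation and predictions from whatever detector is being evaluated (the function name is hypothetical):

```python
def detection_prf(gold, predicted):
    """Precision, recall, and F1 for hallucination detection.

    gold and predicted are equal-length lists of booleans, where True means
    the output is (or is predicted to be) hallucinated.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Low recall means hallucinations slip through unflagged. Low precision means correct outputs get flagged, which in a production pipeline shows up as unnecessary review or regeneration cost.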

Datasets

  • TruthfulQA
  • FActScore
  • HaluEval
  • SimpleQA
  • KoLA-KC
  • SAFE
  • HalluQA
  • FELM
  • Alpaca

Benchmarks

  • TruthfulQA
  • FActScore
  • HaluEval
  • SimpleQA
  • KoLA-KC
  • SAFE