Large-scale tests show where hallucinations come from, when common fixes help, and when they backfire

January 6, 2024 · 10 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

7

Authors

Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen

Links

Abstract / PDF

Why It Matters For Business

Hallucinations cause real-world harm such as wrong facts and bad decisions. The paper gives practical, tested levers (retrieve relevant documents, apply RLHF, tune the instruction mix, and be careful with quantization and aggressive sampling) so teams can reduce factual errors quickly.

Summary TLDR

This paper builds HaluEval 2.0 (8,770 fact-focused questions across biomedicine, finance, science, education, and open domain) and runs many LLMs through a GPT-4-based two-step detector (extract facts, then judge their truth). Key takeaways: the detector matches human labels (~92–95% per domain); pretraining scale alone helps little, but domain-specific pretraining and frequently seen facts reduce hallucinations; instruction tuning and RLHF often help, but the effect depends on instruction style and domain; retrieval strongly reduces hallucinations for smaller models; sampling, quantization, and self-reflexion can either help or hurt depending on model size and domain. Code and data are released.

Problem Statement

LLMs often produce believable but false statements (factual hallucinations). We need a reliable way to measure hallucination, understand which training/use factors cause it, and test common fixes across domains and models.

Main Contribution

HaluEval 2.0: an 8,770-question benchmark spanning biomedicine, finance, science, education, and open domain for factual-hallucination evaluation.

A simple, automatic GPT-4-based detection pipeline: extract factual statements from responses and judge them (True/False/Unknown).

A systematic empirical study tracing hallucination sources across pretraining, supervised fine-tuning, prompting, and inference.

A broad empirical comparison of mitigation strategies: RLHF, retrieval augmentation, self-reflexion, decoding methods, quantization effects, and prompt improvements.

Open release of code and data for replication and follow-up.
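The two-step detection idea and the two metrics used throughout (MaHR, MiHR) can be sketched as below. This is a minimal illustration, not the paper's released code. The callables `extract_facts` and `judge_fact` are hypothetical stand-ins for the GPT-4 prompts, and the toy versions exist only so the example runs. The metric definitions are assumptions stated here because the summary does not spell them out: MaHR is taken as the share of responses with at least one false statement, and MiHR as the mean per-response share of false statements. Treating only "False" verdicts as hallucinations is also an assumption.

```python
from typing import Callable, List, Tuple

def detect(response: str,
           extract_facts: Callable[[str], List[str]],
           judge_fact: Callable[[str], str]) -> List[str]:
    """Step 1: extract factual statements. Step 2: judge each one.
    Each verdict is expected to be "True", "False", or "Unknown"."""
    return [judge_fact(fact) for fact in extract_facts(response)]

def hallucination_rates(verdicts_per_response: List[List[str]]) -> Tuple[float, float]:
    """Return (MaHR, MiHR) in percent.
    Assumption: only "False" counts as a hallucination; a response with
    no extracted facts contributes 0 to both rates."""
    n = len(verdicts_per_response)
    if n == 0:
        raise ValueError("need at least one response")
    macro_hits = 0
    micro_sum = 0.0
    for verdicts in verdicts_per_response:
        false_count = sum(v == "False" for v in verdicts)
        macro_hits += false_count > 0
        if verdicts:
            micro_sum += false_count / len(verdicts)
    return 100 * macro_hits / n, 100 * micro_sum / n

# Toy stand-ins (hypothetical): a real setup would call a strong LLM here.
def toy_extract(response: str) -> List[str]:
    return [s.strip() for s in response.split(".") if s.strip()]

def toy_judge(fact: str) -> str:
    return "False" if "moon is made of cheese" in fact.lower() else "True"
```

Keeping extraction and judgement as separate callables makes it easy to swap the judge model or to cache extracted facts while re-judging them.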

Key Findings

The GPT-4-based two-step detector (fact extraction + fact judgement) matches human labels at high rates.

Numbers: Agreement 91.5%–94.7% across five domains
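A per-domain agreement figure of this kind can be computed as in the sketch below. It assumes agreement means simple label matching between the detector and a human annotator on the same statements; the paper's exact protocol is not described in this summary, and the record layout is hypothetical.

```python
from collections import defaultdict

def agreement_by_domain(records):
    """records: iterable of (domain, detector_label, human_label) tuples.
    Returns {domain: percent of statements where both labels match}."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for domain, detector_label, human_label in records:
        totals[domain] += 1
        matches[domain] += detector_label == human_label
    return {domain: 100 * matches[domain] / totals[domain] for domain in totals}
```

Running this on a small hand-labelled sample from your own domain is a cheap check before trusting an LLM judge at scale.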

Pretraining on more tokens gives only a marginal and unstable reduction in hallucination; domain-specific pretraining helps the targeted domain significantly.

Numbers: Baichuan 2 checkpoints (0.2→2.4T tokens) showed oscillating hallucination rates; models trained on scientific corpora (e

Entity frequency in pretraining correlates with hallucination: frequent entities produce far fewer hallucinations.

Numbers: Top-frequency group (≈10% of entities) had the lowest hallucination rates; long-tail entities showed much higher rates
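The frequency analysis can be approximated on your own data with a sketch like the one below. It assumes you have some count of how often each entity appears in a reference corpus (`entity_counts`, hypothetical) and a per-question flag for whether the answer about that entity was judged hallucinated. The 10% head cutoff mirrors the group size quoted above; the paper's exact bucketing may differ.

```python
def rate_by_frequency_group(entity_counts, outcomes, top_fraction=0.10):
    """entity_counts: {entity: corpus frequency}.
    outcomes: iterable of (entity, hallucinated) pairs.
    Returns hallucination rate in percent for the "head" (most frequent
    entities) and the "tail" (everything else, including unknown entities)."""
    ranked = sorted(entity_counts, key=entity_counts.get, reverse=True)
    cutoff = max(1, round(len(ranked) * top_fraction))
    head = set(ranked[:cutoff])
    stats = {"head": [0, 0], "tail": [0, 0]}  # [hallucinated, total]
    for entity, hallucinated in outcomes:
        group = "head" if entity in head else "tail"
        stats[group][0] += bool(hallucinated)
        stats[group][1] += 1
    return {g: (100 * h / t if t else None) for g, (h, t) in stats.items()}
```

A large gap between the two groups suggests that long-tail questions are where retrieval or abstention is most worth adding.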

Instruction tuning type and complexity affect hallucinations: daily-chat style lowers hallucinations; overly complex or poorly balanced synthetic instructions raise them.

Numbers: LLaMA 7B tuned on ShareGPT (daily-chat) has lower MaHR than FLAN-T5 (task-focused) in several domains (see Table 5)

RLHF reduces hallucinations, but the effect is domain-dependent.

Numbers: Alpaca 7B open-domain MaHR 65.34 → 55.29 after RLHF (≈10-point drop); other domains see smaller gains (Table 10)

Retrieval augmentation substantially reduces hallucinations, especially for smaller models.

Numbers: ChatGPT biomed MaHR 48.75 → 23.98; Llama 2-Chat 7B biomed MaHR 69.12 → 45.13 (Table 11)
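Retrieval augmentation at prompt level can be sketched as follows. This is an illustration under stated assumptions: the keyword-overlap ranker is a deliberately simple stand-in for a real search engine or retriever, and the prompt wording is hypothetical rather than taken from the paper.

```python
import re

def tokenize(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_k_snippets(question, snippets, k=2):
    """Rank snippets by word overlap with the question (toy ranker).
    sorted() is stable, so ties keep their original order."""
    q_tokens = tokenize(question)
    ranked = sorted(snippets, key=lambda s: len(q_tokens & tokenize(s)), reverse=True)
    return ranked[:k]

def build_prompt(question, snippets, k=2):
    """Prepend the top-k snippets to the question as numbered references."""
    chosen = top_k_snippets(question, snippets, k)
    context = "\n".join(f"[{i}] {s}" for i, s in enumerate(chosen, 1))
    return ("Answer using the reference snippets below. "
            "If they do not contain the answer, say so.\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:")
```

Keeping `k` small limits the chance that low-relevance snippets add noise, which matters because irrelevant context can itself raise hallucination rates.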

Decoding and generation choices change hallucination patterns: diversity sampling raises hallucinations in professional domains; greedy search can worsen open-ended domains; beam search often balances both.

Numbers: Llama 2-Chat 7B science MaHR greedy 49.25 → top-p 50.20 (MiHR 14.05 → 15.13); open-domain greedy MaHR 77.35 (Table 7)
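The difference between greedy search and top-p (nucleus) sampling can be seen on a single next-token distribution. The sketch below uses a made-up distribution over candidate tokens; it shows the standard top-p procedure in plain Python and does not reproduce any model's decoding code.

```python
import random

def greedy(probs):
    """Pick the single most probable token."""
    return max(probs, key=probs.get)

def nucleus(probs, p=0.9):
    """Smallest set of highest-probability tokens whose total mass reaches p."""
    kept, mass = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        mass += prob
        if mass >= p:
            break
    return kept

def sample_top_p(probs, p=0.9, rng=random):
    """Sample one token from the nucleus, weighted by probability."""
    kept = nucleus(probs, p)
    tokens = list(kept)
    return rng.choices(tokens, weights=[kept[t] for t in tokens], k=1)[0]
```

With a distribution such as `{"1869": 0.6, "1871": 0.25, "1896": 0.1, "1969": 0.05}`, greedy always returns "1869", while top-p with `p=0.8` may also return "1871". When only one candidate is factually right, that extra diversity is one plausible route to more errors in fact-heavy domains.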

Quantization can increase hallucinations: 8-bit is reported to have a small impact and 4-bit often harms factuality, although in the biomedicine example below both raise MaHR by about 7 points.

Numbers: Llama 2-Chat 7B MaHR INT16 69.12 → INT8 76.84 (+7.7) and INT4 76.16 (+7.0) in biomedicine examples (Table 8)
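Why fewer bits lose information can be illustrated with a generic symmetric absmax round-trip on a handful of weights. This is a textbook-style sketch, not the scheme used by bitsandbytes or by the paper, and weight rounding error does not translate directly into a hallucination rate; it only shows that the 4-bit grid is much coarser than the 8-bit one.

```python
def quantize_dequantize(weights, bits):
    """Symmetric absmax round-trip: map weights onto a signed integer grid,
    round to the nearest level, then map back to floats."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / levels
    if scale == 0:
        return list(weights)
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    restored = quantize_dequantize(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)
```

Because the size of the rounding error alone does not predict factuality, measuring MaHR and MiHR on a domain sample before and after quantization remains the more reliable check.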

Self-reflexion helps only large models; small models' reflection can degrade factuality.

Numbers: Only Llama 70B showed hallucination reduction with self-reflexion; 7B and 13B got worse (Section 6.3, Figure 8)
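A self-reflexion loop can be sketched as below. The `generate` and `revise` callables are hypothetical wrappers around model calls, and the toy versions are there only to make the example runnable; the paper's actual prompts are not reproduced here. The loop keeps every intermediate answer so the first and final answers can both be scored, which matters when reflection might make a small model worse.

```python
def self_reflect(question, generate, revise, max_rounds=2):
    """Generate an answer, then ask the model to revise it up to max_rounds times.
    Stops early when a revision leaves the answer unchanged.
    Returns (final_answer, history_of_answers)."""
    answer = generate(question)
    history = [answer]
    for _ in range(max_rounds):
        revised = revise(question, answer)
        if revised == answer:  # the model made no change
            break
        answer = revised
        history.append(answer)
    return answer, history

# Toy stand-ins (hypothetical) so the loop can be exercised without a model.
def toy_generate(question):
    return "The capital of Australia is Sydney."

def toy_revise(question, answer):
    return answer.replace("Sydney", "Canberra")
```

Scoring `history[0]` against `history[-1]` with the same detector gives a direct before/after comparison for a given model size.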

Prompt improvements (detailed task descriptions, in-context examples, chain-of-thought) help inconsistently and are model-dependent.

Numbers: ChatGPT biomed MaHR base 48.75 → manual demo 42.71 (-6.0); Llama 2-Chat 7B domain info sometimes increased MaHR (Table 6)

Results

Detector human agreement

Value: 91.5%–94.7% per domain

Retrieval effect (MaHR)

Value: ChatGPT biomed 48.75 → 23.98; Llama 2-Chat 7B biomed 69.12 → 45.13

Baseline: no retrieval

RLHF effect (MaHR open domain)

Value: Alpaca 7B open 65.34 → 55.29

Baseline: unaligned Alpaca 7B

Quantization impact (MaHR)

Value: Llama 2-Chat 7B biomed INT16 69.12 → INT8 76.84 → INT4 76.16

Baseline: INT16 (original)

Decoding sensitivity (science MiHR)

Value: Llama 2-Chat 7B science MiHR greedy 14.05 → top-p 15.13

Baseline: greedy search

Instruction style effect (MaHR)

Value: LLaMA 7B ShareGPT (daily-chat) biomed MaHR 66.11 vs FLAN-T5 (task) 73.12

Baseline: FLAN-T5
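The before/after figures above are easier to compare as absolute and relative changes. The short script below does that arithmetic using only MaHR values quoted in this summary; the dictionary keys are descriptive labels added for illustration.

```python
# (before, after) MaHR values as quoted in this summary.
REPORTED = {
    "ChatGPT biomed, retrieval": (48.75, 23.98),
    "Llama 2-Chat 7B biomed, retrieval": (69.12, 45.13),
    "Alpaca 7B open-domain, RLHF": (65.34, 55.29),
    "Llama 2-Chat 7B biomed, INT8": (69.12, 76.84),
    "Llama 2-Chat 7B biomed, INT4": (69.12, 76.16),
}

def change(before, after):
    """Return (absolute change in points, relative change in percent)."""
    delta = after - before
    return round(delta, 2), round(100 * delta / before, 1)

if __name__ == "__main__":
    for label, (before, after) in REPORTED.items():
        points, percent = change(before, after)
        print(f"{label}: {points:+.2f} points ({percent:+.1f}%)")
```

In relative terms, retrieval roughly halves ChatGPT's biomedicine MaHR (about a 51% reduction), while the RLHF drop for Alpaca 7B is about 15%.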

Who Should Care

What To Try In 7 Days

Add top-2 document retrieval snippets into prompts for fact questions and measure hallucination drop.

Run the paper's two-step detection (extract facts, judge with a strong LLM) to audit existing LLM outputs.

If using quantized models, compare INT8 vs INT16 factuality on a domain sample before deployment.
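For any of these quick experiments, a paired comparison on the same question set is more informative than two separate averages. The sketch below assumes each run has been reduced to one boolean per question (True if the response contained at least one hallucinated statement); the function and field names are hypothetical.

```python
def compare_runs(baseline, variant):
    """baseline, variant: equal-length lists of booleans, one per question,
    True if that response was judged to contain a hallucination.
    Returns MaHR for both runs plus counts of questions fixed and broken."""
    if not baseline or len(baseline) != len(variant):
        raise ValueError("runs must be non-empty and cover the same questions")
    n = len(baseline)
    fixed = sum(b and not v for b, v in zip(baseline, variant))
    broken = sum(v and not b for b, v in zip(baseline, variant))
    return {
        "baseline_mahr": 100 * sum(baseline) / n,
        "variant_mahr": 100 * sum(variant) / n,
        "fixed": fixed,
        "broken": broken,
    }
```

The `broken` count is worth watching: a change such as retrieval or quantization can lower the average while still introducing new errors on questions the baseline answered correctly.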

Agent Features

Tool Use

  • retrieval (Bing snippets)
  • RLHF (PPO)

Optimization Features

Infra Optimization

  • quantization for memory/speed trade-offs

Training Optimization

  • RLHF (PPO reward fine-tuning)
  • instruction tuning mixes

Inference Optimization

  • quantization (bitsandbytes 4/8-bit)
  • advanced decoding (greedy-nucleus, factual-nucleus)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Pre-training and SFT analysis is limited by lack of full training details and compute to train from scratch.
  • The detection method uses GPT-4 as judge and may inherit its biases or errors.
  • Experiments focus on selected open-source and closed-source models and a curated hallucination-heavy dataset; real-world rates may differ.
  • No new mitigation algorithm is proposed; the work is an empirical comparison with practical guidance.

When Not To Use

  • Do not generalize these numbers to casual everyday chat; the dataset is curated for hallucination evaluation.
  • Avoid applying self-reflexion loops to small models without testing; the paper shows harm below the 70B scale.
  • Do not assume 4-bit quantization is safe for fact-critical applications without re-evaluation.

Failure Modes

  • LLM-based detector may mislabel facts when GPT-4 lacks up-to-date knowledge or shows bias.
  • Retrieval with low-relevance documents can increase hallucination by adding noise.
  • Aggressive quantization (4-bit) can materially increase factual errors in sensitive domains.
  • Prompt or CoT improvements can backfire on smaller or weaker models and increase hallucinations.

Core Entities

Models

  • ChatGPT
  • Claude
  • Claude 2
  • text-davinci-002
  • text-davinci-003
  • Alpaca 7B
  • Vicuna 7B
  • Vicuna 13B
  • YuLan-Chat 13B
  • Llama 2-Chat 7B
  • Llama 2-Chat 13B
  • Falcon 40B
  • Galactica 30B
  • GPT-NeoX 20B
  • Baichuan 2
  • Llama 2-Chat 70B

Metrics

  • MaHR
  • MiHR
  • BERTScore

Datasets

  • HaluEval 2.0
  • HaluEval
  • BioASQ
  • NFCorpus
  • FiQA-2018
  • SciFact
  • LearningQ (TED-Ed)
  • HotpotQA
  • Wikipedia

Benchmarks

  • HaluEval 2.0
  • HaluEval