HaluEval: 35k test cases (human + synthetic) to measure whether LLMs spot made-up facts.

May 19, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

29

Authors

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen

Links

Abstract / PDF

Why It Matters For Business

Models can produce believable but false facts. That creates risk for customer-facing apps, search, and decision tools. HaluEval lets you measure how often your model fabricates facts and whether it can flag them.

Summary TLDR

HaluEval is a 35,000-sample benchmark: 5,000 human-annotated ChatGPT responses plus 30,000 automatically generated hallucinated examples across QA, dialogue, and summarization. The authors show that many top LLMs both produce hallucinations and struggle to detect them. Supplying retrieved facts helps detection; showing ground-truth examples alongside hallucinated ones for contrast makes detection worse.

Problem Statement

Large language models sometimes produce plausible but false statements (hallucinations). There is no large, multi-task benchmark that tests when and how models hallucinate and whether models can recognize their own hallucinations.

Main Contribution

A publicly released benchmark, HaluEval, with 35,000 examples: 5,000 human-annotated ChatGPT responses and 30,000 auto-generated hallucinated counterparts across QA, knowledge-grounded dialogue, and summarization.

A two-stage automatic generation pipeline (sampling-then-filtering) that uses ChatGPT to create diverse hallucinated samples and then selects the most plausible/difficult examples.
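The sampling-then-filtering idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: `generate_candidates` and `plausibility` are hypothetical stand-ins for the ChatGPT sampling step and the ChatGPT-based selection step, and toy functions replace both so the sketch runs offline.

```python
from typing import Callable, List


def sample_then_filter(
    question: str,
    right_answer: str,
    generate_candidates: Callable[[str, str], List[str]],
    plausibility: Callable[[str, str], float],
) -> str:
    """Stage 1: sample several hallucinated candidates.
    Stage 2: keep the candidate scored as most plausible (hardest to spot)."""
    candidates = generate_candidates(question, right_answer)
    # Drop candidates that accidentally reproduce the correct answer.
    candidates = [
        c for c in candidates
        if c.strip().lower() != right_answer.strip().lower()
    ]
    if not candidates:
        raise ValueError("no hallucinated candidate survived filtering")
    return max(candidates, key=lambda c: plausibility(question, c))


# Toy stand-ins (assumptions): a real pipeline would call an LLM for both steps.
def toy_generate(question: str, right_answer: str) -> List[str]:
    return ["Paris", "Lyon", "A city somewhere in Europe", "Rome"]


def toy_plausibility(question: str, candidate: str) -> float:
    # Pretend that shorter, more specific answers look more plausible.
    return 1.0 / len(candidate)


picked = sample_then_filter(
    "What is the capital of Italy?", "Rome", toy_generate, toy_plausibility
)
print(picked)
```

Passing the generator and the scorer in as callables keeps the two stages swappable, so a stronger filter can be tried without touching the sampling code.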

Human labeling protocol: three annotators per ChatGPT response, max-vote labels, and span-level hallucination markup; annotator agreement Fleiss' Kappa = 0.811.
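For readers unfamiliar with the agreement statistic, the sketch below computes max-vote labels and Fleiss' Kappa for a tiny, made-up set of ratings. The label names and the four toy items are illustrative assumptions, not HaluEval data.

```python
from collections import Counter
from typing import List, Sequence


def majority_label(votes: Sequence[str]) -> str:
    """Return the label chosen by the most annotators (max-vote)."""
    return Counter(votes).most_common(1)[0][0]


def fleiss_kappa(ratings: List[Sequence[str]]) -> float:
    """Fleiss' kappa; every item must be rated by the same number of annotators."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # counts[i][j] = number of annotators who assigned item i to category j
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Observed agreement, averaged over items
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Agreement expected by chance, from the overall category proportions
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        return 1.0  # every rating is identical, so agreement is trivially perfect
    return (p_bar - p_exp) / (1.0 - p_exp)


# Toy ratings (made up): three annotators per response.
toy_ratings = [
    ["hallucinated", "hallucinated", "hallucinated"],
    ["factual", "factual", "factual"],
    ["hallucinated", "hallucinated", "factual"],
    ["factual", "factual", "factual"],
]
labels = [majority_label(votes) for votes in toy_ratings]
kappa = fleiss_kappa(toy_ratings)
print(labels)
print(round(kappa, 3))
```

With three annotators and two labels a tie cannot occur, which is one practical reason to use an odd number of annotators for max-vote labeling.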

Baseline experiments evaluating 10+ LLMs on the benchmark and simple improvement tests: retrieval augmentation, chain-of-thought, and contrastive examples.

Key Findings

ChatGPT produces unverifiable or conflicting statements in a sizable fraction of real user responses.

Numbers: 977 of 5,000 annotated responses (19.5%)

Even strong LLMs struggle to tell factual vs hallucinated outputs on HaluEval.

Numbers: ChatGPT accuracy: QA 62.59%, Dialogue 72.40%, Summarization 58.53% (Table 5)

Retrieving and giving relevant external facts improves a model's ability to spot hallucinations.

Numbers: ChatGPT QA accuracy rises from 62.59% to 76.83% (+14.24 points)

Contrastive presentation of ground-truth plus hallucinated examples can confuse LLMs and hurt detection.

Numbers: ChatGPT QA accuracy drops to 49.19% with contrast examples (Table 8)
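The headline figures above can be checked with simple arithmetic. The snippet uses only the numbers quoted in this summary; the variable names are illustrative.

```python
annotated_total = 5_000
hallucinated = 977

# Share of annotated ChatGPT responses flagged as hallucinated
rate = 100 * hallucinated / annotated_total

# ChatGPT QA detection accuracy under the three settings quoted above
qa_base, qa_retrieval, qa_contrast = 62.59, 76.83, 49.19

print(f"hallucination rate: {rate:.1f}%")                          # 19.5%
print(f"retrieval gain: {qa_retrieval - qa_base:+.2f} points")     # +14.24 points
print(f"contrast change: {qa_contrast - qa_base:+.2f} points")     # -13.40 points
```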

Results

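  • Accuracy (ChatGPT, QA): 62.59%
  • Accuracy (ChatGPT, dialogue): 72.40%
  • Accuracy (ChatGPT, summarization): 58.53%
  • ChatGPT hallucination rate (human-annotated user queries): 19.5%
  • Accuracy (ChatGPT, QA with retrieved knowledge): 76.83% (baseline: ChatGPT QA 62.59%)
  • Accuracy (ChatGPT, QA with contrast examples): 49.19% (baseline: ChatGPT QA 62.59%)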

What To Try In 7 Days

Run HaluEval on your model to get a baseline hallucination-detection accuracy (use the provided code).
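The released code is the place to start, but the metric itself is simple enough to sketch. The example below is a rough illustration under stated assumptions: the record fields (`question`, `right_answer`, `hallucinated_answer`) and the `judge` callable are hypothetical, and showing each record's right or hallucinated answer at random is one plausible way to build a yes/no test, not necessarily the official protocol.

```python
import random
from typing import Callable, Dict, List

Record = Dict[str, str]  # hypothetical layout, see the lead-in


def detection_accuracy(
    records: List[Record],
    judge: Callable[[str, str], bool],
    seed: int = 0,
) -> float:
    """Show the judge either the right or the hallucinated answer (picked at
    random per record) and score how often its verdict matches the truth."""
    if not records:
        raise ValueError("records must not be empty")
    rng = random.Random(seed)
    correct = 0
    for rec in records:
        is_hallucinated = rng.random() < 0.5
        answer = rec["hallucinated_answer"] if is_hallucinated else rec["right_answer"]
        verdict = judge(rec["question"], answer)  # True means "hallucinated"
        correct += int(verdict == is_hallucinated)
    return correct / len(records)


# Toy data and judges (made up); a real judge would wrap a call to your model.
toy_records = [
    {"question": "Capital of France?", "right_answer": "Paris",
     "hallucinated_answer": "Lyon"},
    {"question": "Author of Hamlet?", "right_answer": "William Shakespeare",
     "hallucinated_answer": "Christopher Marlowe"},
]
truth = {r["question"]: r["right_answer"] for r in toy_records}


def oracle_judge(question: str, answer: str) -> bool:
    return answer != truth[question]


def inverted_judge(question: str, answer: str) -> bool:
    return answer == truth[question]


oracle_acc = detection_accuracy(toy_records, oracle_judge)
inverted_acc = detection_accuracy(toy_records, inverted_judge)
print(oracle_acc, inverted_acc)
```

Fixing the seed keeps the right/hallucinated split identical across runs, so two models (or two prompts) are scored on exactly the same test items.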

Add a lightweight retrieval step (Wikipedia or domain sources) and re-run detection; expect notable improvement for factual QA.
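A retrieval step can be as small as prepending a relevant passage to the detection prompt. The sketch below is an assumption-laden toy: the word-overlap retriever, the in-memory corpus, and the prompt wording are all illustrative and are not taken from the paper or its code.

```python
import re
from typing import Dict, List, Optional


def _words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, corpus: Dict[str, str], top_k: int = 1) -> List[str]:
    """Toy retriever: rank passages by word overlap with the question."""
    q_words = _words(question)
    ranked = sorted(
        corpus.values(),
        key=lambda passage: len(q_words & _words(passage)),
        reverse=True,
    )
    return ranked[:top_k]


def build_detection_prompt(
    question: str, answer: str, knowledge: Optional[List[str]] = None
) -> str:
    """Assemble a yes/no detection prompt, optionally grounded in passages."""
    parts = []
    if knowledge:
        parts.append("Knowledge: " + " ".join(knowledge))
    parts.append(f"Question: {question}")
    parts.append(f"Answer: {answer}")
    parts.append("Does the answer contain hallucinated information? Reply Yes or No.")
    return "\n".join(parts)


# Made-up corpus; in practice this would be Wikipedia or domain documents.
toy_corpus = {
    "hamlet": "Hamlet is a tragedy written by William Shakespeare.",
    "eiffel": "The Eiffel Tower is a landmark in Paris.",
}
question = "Who wrote Hamlet?"
passages = retrieve(question, toy_corpus)
plain_prompt = build_detection_prompt(question, "Christopher Marlowe")
grounded_prompt = build_detection_prompt(question, "Christopher Marlowe", passages)
print(grounded_prompt)
```

Building both prompts from the same function makes the before/after comparison fair: the only difference between the two runs is the added knowledge line.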

Scan the benchmark topic breakdown to find domain blind spots (e.g., technology, climate) and prioritize guardrails there.
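If each evaluated example carries a topic tag, a per-topic breakdown takes only a few lines. The sketch assumes you already have `(topic, was_correct)` pairs from your own run; the topic names and outcomes below are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_topic(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group per-example outcomes by topic and return accuracy per topic."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for topic, was_correct in results:
        totals[topic] += 1
        hits[topic] += int(was_correct)
    return {topic: hits[topic] / totals[topic] for topic in totals}


# Made-up outcomes: (topic tag, did the model judge this example correctly?)
toy_results = [
    ("technology", False),
    ("technology", True),
    ("film", True),
    ("film", True),
    ("climate", False),
]
per_topic = accuracy_by_topic(toy_results)
overall = sum(ok for _, ok in toy_results) / len(toy_results)
weakest = min(per_topic, key=per_topic.get)
print(per_topic, overall, weakest)
```

In this toy run the overall accuracy of 0.6 hides a topic at 0.0, which is the kind of blind spot an aggregate score can mask.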

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Generated hallucinated samples depend on ChatGPT's ability to follow the hallucination instructions; sample quality may reflect ChatGPT biases.
  • Benchmark focuses on detection of hallucination, not on diagnosing why models hallucinate or fixing generation models.
  • High similarity between hallucinated and ground-truth samples may create misuse risks; authors recommend monitoring distributional use.

When Not To Use

  • If you want causal analysis of why models hallucinate (this benchmark measures detection only).
  • For multimodal or non-text tasks (HaluEval is text-only).
  • As training data to teach models to hallucinate: the samples are purposely deceptive and could be misused.

Failure Modes

  • The benchmark reflects ChatGPT's hallucination patterns because ChatGPT generated the 30,000 synthetic samples.
  • Contrastive testing can confuse models and produce misleadingly low scores.
  • Topic sensitivity: models fail more in specific domains (technology, climate, language) so aggregate accuracy can hide domain blind spots.

Core Entities

Models

  • ChatGPT (gpt-3.5-turbo)
  • GPT-3 (davinci)
  • text-davinci-002
  • text-davinci-003
  • Claude
  • Claude 2
  • Llama 2-Chat (7B)
  • Vicuna (7B)
  • Alpaca (7B)
  • Falcon (7B)
  • ChatGLM (7B)

Metrics

  • Accuracy
  • BERTScore (for similarity filtering)
  • Fleiss' Kappa (annotator agreement)

Datasets

  • HotpotQA
  • OpenDialKG
  • CNN/DailyMail
  • Alpaca instruction tuning dataset

Benchmarks

  • HaluEval

Context Entities

Models

  • Instruction-tuned models (InstructGPT)
  • Open-source chat models (Vicuna, Alpaca)

Metrics

  • topic clustering via LDA

Datasets

  • 52K Alpaca instructions (used to select user queries)