HaluEval: 35k test cases (human + synthetic) to measure whether LLMs spot made-up facts.

May 19, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Benchmark is ready to run and useful for evaluation (production_readiness 0.7). The generation/filtering pipeline is a practical contribution (novelty 0.6). Applying retrieval has measurable benefit (evidence_strength 0.8). Some risk remains from generation bias and potential misuse.

Citations: 29

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Models can produce believable but false facts. That creates risk for customer-facing apps, search, and decision tools. HaluEval lets you measure how often your model fabricates facts and whether it can flag them.

Who Should Care

Summary TLDR

HaluEval is a 35,000-sample benchmark (5k human-annotated ChatGPT responses + 30k automatically generated hallucinated examples across QA, dialogue, and summarization). The authors show that many top LLMs both produce hallucinations and struggle to detect them. Supplying retrieved facts helps detection, whereas contrasting a response against the ground truth makes detection worse.

Problem Statement

Large language models sometimes produce plausible but false statements (hallucinations). Before HaluEval, there was no large, multi-task benchmark that tested when and how models hallucinate and whether they can recognize their own hallucinations.

Main Contribution

A publicly released benchmark, HaluEval, with 35,000 examples: 5,000 human-annotated ChatGPT responses and 30,000 auto-generated hallucinated counterparts across QA, knowledge-grounded dialogue, and summarization.

A two-stage automatic generation pipeline (sampling-then-filtering) that uses ChatGPT to create diverse hallucinated samples and then selects the most plausible/difficult examples.
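The sampling-then-filtering idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_then_filter`, the sampler callables, and the `plausibility` scorer are hypothetical names, and the toy functions below stand in for the ChatGPT calls that the real pipeline relies on.

```python
from typing import Callable, List


def sample_then_filter(
    source: str,
    samplers: List[Callable[[str], str]],
    plausibility: Callable[[str, str], float],
) -> str:
    """Return the candidate hallucination judged most plausible.

    samplers: hypothetical callables, each wrapping one prompting strategy.
    plausibility: hypothetical scorer; a higher score means harder to spot.
    """
    # Stage 1 (sampling): collect one candidate per strategy.
    candidates = [sampler(source) for sampler in samplers]
    if not candidates:
        raise ValueError("at least one sampler is required")
    # Stage 2 (filtering): keep the single most plausible candidate.
    return max(candidates, key=lambda cand: plausibility(source, cand))


# Toy stand-ins so the sketch runs without any API access.
def swap_entity(question: str) -> str:
    return "The Eiffel Tower is in Lyon."


def absurd_claim(question: str) -> str:
    return "The Eiffel Tower is on the Moon."


def toy_plausibility(question: str, candidate: str) -> float:
    # Pretend scorer: claims that mention the Moon are easy to reject.
    return 0.1 if "Moon" in candidate else 0.9


best = sample_then_filter(
    "Where is the Eiffel Tower?",
    [swap_entity, absurd_claim],
    toy_plausibility,
)
print(best)
```

Passing the samplers and the scorer in as callables keeps the two stages separate, so either one could be swapped for a real model call without touching the selection logic.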

Key Findings

ChatGPT produces unverifiable or conflicting statements in a sizable fraction of real user responses.

Numbers: 977 of 5,000 annotated responses (19.5%)

Practical Use: Expect roughly one in five real ChatGPT-style responses in this sample to contain at least one hallucinated span; add checks when using models for factual tasks.

Evidence Ref: Section 2.3; Table 4

Even strong LLMs struggle to tell factual vs hallucinated outputs on HaluEval.

Numbers: ChatGPT accuracy: QA 62.59%, Dialogue 72.40%, Summarization 58.53% (Table 5)

Practical Use: Do not trust LLMs to reliably detect hallucinations out of the box; accuracy can be close to random on some tasks (e.g., summarization).

Evidence Ref: Table 5
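To make "close to random" concrete, recognition accuracy is simply the share of samples where a yes/no judgment matches the gold label. The sketch below is illustrative only: `detection_accuracy` and the `judge` callable are hypothetical names, the four samples are invented toy data rather than HaluEval entries, and the 50% chance level assumes a roughly balanced mix of factual and hallucinated samples.

```python
from typing import Callable, Iterable, Tuple


def detection_accuracy(
    samples: Iterable[Tuple[str, bool]],
    judge: Callable[[str], bool],
) -> float:
    """Fraction of samples where judge(text) matches the gold label.

    Each sample is (text, is_hallucinated). `judge` is a hypothetical
    wrapper around a model that returns True for "hallucinated".
    """
    total = 0
    correct = 0
    for text, is_hallucinated in samples:
        total += 1
        correct += int(judge(text) == is_hallucinated)
    if total == 0:
        raise ValueError("no samples provided")
    return correct / total


# Invented toy data: half factual, half hallucinated.
toy_samples = [
    ("Paris is the capital of France.", False),
    ("Paris is the capital of Spain.", True),
    ("Water boils at 100 C at sea level.", False),
    ("Water boils at 10 C at sea level.", True),
]


def always_yes(text: str) -> bool:
    # Degenerate judge that flags everything as hallucinated.
    return True


acc = detection_accuracy(toy_samples, always_yes)
# Assumed chance level for a balanced binary task.
print(f"accuracy={acc:.2%}, margin over chance={acc - 0.5:+.2%}")
```

A judge that always answers "yes" lands exactly on the assumed chance level for a balanced set, which is the reference point to keep in mind when reading a score such as 58.53%.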

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 62.59% | - | - | HaluEval QA set | Table 5 (hallucination recognition accuracy) | Table 5
Accuracy | 72.40% | - | - | HaluEval Dialogue set | Table 5 (hallucination recognition accuracy) | Table 5

What To Try In 7 Days

Run HaluEval on your model to get a baseline hallucination-detection accuracy (use the provided code).

Add a lightweight retrieval step (Wikipedia or domain sources) and re-run detection; expect notable improvement for factual QA.

Scan the benchmark topic breakdown to find domain blind spots (e.g., technology, climate) and prioritize guardrails there.
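The first two steps can be prototyped by building the detection prompt once without evidence and once with retrieved passages. This is a rough sketch under stated assumptions: the prompt wording, `build_detection_prompt`, `keyword_retrieve`, and the tiny in-memory corpus are all hypothetical and are not taken from the HaluEval code; a real setup would replace the keyword matcher with a proper retriever over Wikipedia or domain sources.

```python
import re
from typing import List, Optional


def keyword_retrieve(query: str, corpus: List[str], k: int = 1) -> List[str]:
    """Toy retriever: rank passages by word overlap with the query."""
    terms = set(re.findall(r"\w+", query.lower()))
    ranked = sorted(
        corpus,
        key=lambda doc: len(terms & set(re.findall(r"\w+", doc.lower()))),
        reverse=True,
    )
    return ranked[:k]


def build_detection_prompt(
    question: str,
    answer: str,
    passages: Optional[List[str]] = None,
) -> str:
    """Assemble a yes/no hallucination-detection prompt (hypothetical wording)."""
    lines = []
    if passages:
        lines.append("Reference passages:")
        lines.extend(f"- {p}" for p in passages)
    lines.append(f"Question: {question}")
    lines.append(f"Answer: {answer}")
    lines.append("Does the answer contain hallucinated information? Reply Yes or No.")
    return "\n".join(lines)


# Invented two-passage corpus for illustration.
corpus = [
    "William Shakespeare wrote Hamlet around 1600.",
    "The Eiffel Tower is in Paris.",
]
question = "Who wrote Hamlet?"
answer = "Hamlet was written by Christopher Marlowe."

baseline_prompt = build_detection_prompt(question, answer)
retrieved = keyword_retrieve(question, corpus)
augmented_prompt = build_detection_prompt(question, answer, retrieved)
print(augmented_prompt)
```

Sending both prompts to the same model over the same samples, and comparing the resulting accuracies, gives a like-for-like view of how much the retrieval step contributes.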

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Generated hallucinated samples depend on ChatGPT's ability to follow the hallucination instructions; sample quality may reflect ChatGPT biases.

The benchmark focuses on detecting hallucinations, not on diagnosing why models hallucinate or on fixing the models that generate them.

When Not To Use

If you want causal analysis of why models hallucinate (this benchmark measures detection only).

For multimodal or non-text tasks (HaluEval is text-only).

Failure Modes

The benchmark reflects ChatGPT's hallucination patterns because ChatGPT generated many of the samples.

Contrastive testing can confuse models and produce misleadingly low scores.

Core Entities

Models

ChatGPT (gpt-3.5-turbo), GPT-3 (davinci), text-davinci-002, text-davinci-003, Claude, Claude 2, Llama 2-Chat (7B), Vicuna (7B), Alpaca (7B), Falcon (7B), ChatGLM (7B)

Metrics

Accuracy, BERTScore (for similarity filtering), Fleiss' Kappa (annotator agreement)

Datasets

HotpotQA, OpenDialKG, CNN/DailyMail, Alpaca instruction tuning dataset

Benchmarks

HaluEval

Context Entities

Models

Instruction-tuned models (InstructGPT); Open-source chat models (Vicuna, Alpaca)

Metrics

Topic clustering via LDA

Datasets

52K Alpaca instructions (for user queries selection)