HaluEval: 35k test cases (human + synthetic) to measure whether LLMs spot made-up facts.

Overview

Decision Snapshot: Ready For Pilot

Benchmark is ready to run and useful for evaluation (production_readiness 0.7). The generation/filtering pipeline is a practical contribution (novelty 0.6). Applying retrieval has measurable benefit (evidence_strength 0.8). Some risk remains from generation bias and potential misuse.

Citations: 29

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Models can produce believable but false facts. That creates risk for customer-facing apps, search, and decision tools. HaluEval lets you measure how often your model fabricates facts and whether it can flag them.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist, Founder

Summary TLDR

HaluEval is a 35,000-sample benchmark (5k human-annotated ChatGPT responses + 30k automatically generated hallucinated examples across QA, dialogue, and summarization). The authors show many top LLMs both produce hallucinations and struggle to detect them. Supplying retrieved facts helps detection; contrasting against ground truth makes detection worse.

Problem Statement

Large language models sometimes produce plausible but false statements (hallucinations). There is no large, multi-task benchmark that tests when and how models hallucinate and whether models can recognize their own hallucinations.

Main Contribution

A publicly released benchmark, HaluEval, with 35,000 examples: 5,000 human-annotated ChatGPT responses and 30,000 auto-generated hallucinated counterparts across QA, knowledge-grounded dialogue, and summarization.

A two-stage automatic generation pipeline (sampling-then-filtering) that uses ChatGPT to create diverse hallucinated samples and then selects the most plausible/difficult examples.
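The sampling-then-filtering idea can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `sample_then_filter`, `generate`, and `score_plausibility` are hypothetical names, and the toy stand-ins below replace the ChatGPT calls that the pipeline relies on.

```python
from typing import Callable, List


def sample_then_filter(
    generate: Callable[[str], str],
    score_plausibility: Callable[[str, str], float],
    question: str,
    instructions: List[str],
) -> str:
    """Stage 1: sample one hallucinated candidate per instruction.
    Stage 2: keep the candidate the scorer rates as most plausible."""
    candidates = [
        generate(f"{instruction}\n\nQuestion: {question}")
        for instruction in instructions
    ]
    return max(candidates, key=lambda c: score_plausibility(question, c))


# Toy stand-ins (assumptions) so the sketch runs without any API access.
def fake_generate(prompt: str) -> str:
    # A real generator would be an LLM call; this one keys off the instruction.
    return "Paris, 1899" if "subtle" in prompt else "The Moon, 3000 BC"


def fake_score(question: str, candidate: str) -> float:
    # A real scorer would judge how hard the candidate is to spot as false.
    return 1.0 if "Paris" in candidate else 0.1


best = sample_then_filter(
    fake_generate,
    fake_score,
    "Where and when was the Eiffel Tower completed?",
    ["Write an obviously wrong answer.", "Write a subtle, plausible wrong answer."],
)
print(best)
```

In this toy run the subtly wrong candidate is the one that survives filtering, which mirrors the stated goal of keeping the most plausible and difficult examples.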

Key Findings

ChatGPT produces unverifiable or conflicting statements in a sizable fraction of real user responses.

Numbers: 977 of 5,000 annotated responses (19.5%)

Practical Use: Expect roughly one in five real ChatGPT-style responses in this sample to contain at least one hallucinated span; add checks when using models for factual tasks.

Evidence Ref: Section 2.3; Table 4

Even strong LLMs struggle to tell factual vs hallucinated outputs on HaluEval.

Numbers: ChatGPT accuracy: QA 62.59%, Dialogue 72.40%, Summarization 58.53% (Table 5)

Practical Use: Do not trust LLMs to reliably detect hallucinations out of the box; accuracy can be close to random on some tasks (e.g., summarization).

Evidence Ref: Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	62.59%	—	—	HaluEval QA set	Table 5 (hallucination recognition accuracy)	Table 5
Accuracy	72.40%	—	—	HaluEval Dialogue set	Table 5 (hallucination recognition accuracy)	Table 5

What To Try In 7 Days

Run HaluEval on your model to get a baseline hallucination-detection accuracy (use the provided code).

Add a lightweight retrieval step (Wikipedia or domain sources) and re-run detection; expect notable improvement for factual QA.

Scan the benchmark topic breakdown to find domain blind spots (e.g., technology, climate) and prioritize guardrails there.
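The first two steps above can be sketched as a small evaluation loop. This is a hedged illustration only: the sample fields (`question`, `answer`, `hallucinated`), the prompt wording, and the helpers `build_prompt`, `detection_accuracy`, `toy_judge`, and `toy_retrieve` are assumptions for the example, not the schema or scripts shipped in the HaluEval repository.

```python
from typing import Callable, Dict, List, Optional


def build_prompt(question: str, answer: str, knowledge: Optional[str] = None) -> str:
    """Assemble a yes/no detection prompt, optionally prefixed with retrieved text."""
    parts = []
    if knowledge:
        parts.append(f"Knowledge: {knowledge}")
    parts.append(f"Question: {question}")
    parts.append(f"Answer: {answer}")
    parts.append("Does the answer contain hallucinated information? Reply Yes or No.")
    return "\n".join(parts)


def detection_accuracy(
    judge: Callable[[str], str],
    samples: List[Dict],
    retrieve: Optional[Callable[[str], str]] = None,
) -> float:
    """Fraction of samples where the judge's Yes/No matches the gold label."""
    correct = 0
    for s in samples:
        knowledge = retrieve(s["question"]) if retrieve else None
        reply = judge(build_prompt(s["question"], s["answer"], knowledge))
        predicted = reply.strip().lower().startswith("yes")
        correct += predicted == s["hallucinated"]
    return correct / len(samples)


# Toy judge (assumption): only flags an answer when retrieved text contradicts it.
def toy_judge(prompt: str) -> str:
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    knowledge = fields.get("Knowledge")
    if knowledge is None:
        return "No"
    return "No" if fields["Answer"] in knowledge else "Yes"


# Toy retriever (assumption): a real one would query Wikipedia or domain sources.
def toy_retrieve(question: str) -> str:
    return "The Eiffel Tower was completed in 1889."


samples = [
    {"question": "When was the Eiffel Tower completed?", "answer": "1889", "hallucinated": False},
    {"question": "When was the Eiffel Tower completed?", "answer": "1899", "hallucinated": True},
]

baseline = detection_accuracy(toy_judge, samples)
with_retrieval = detection_accuracy(toy_judge, samples, retrieve=toy_retrieve)
print(baseline, with_retrieval)
```

To use this with a real model, replace `toy_judge` with a function that sends the prompt to your model and returns its text reply, and compare the two accuracy figures on the same samples.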

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/RUCAIBox/HaluEval

Data URLs

https://github.com/RUCAIBox/HaluEval

Risks & Boundaries

Limitations

Generated hallucinated samples depend on ChatGPT's ability to follow the hallucination instructions; sample quality may reflect ChatGPT biases.

Benchmark focuses on detection of hallucination, not on diagnosing why models hallucinate or fixing generation models.

When Not To Use

If you want causal analysis of why models hallucinate (this benchmark measures detection only).

For multimodal or non-text tasks (HaluEval is text-only).

Failure Modes

The benchmark reflects ChatGPT's hallucination patterns because ChatGPT generated many of the samples.

Contrastive testing can confuse models and produce misleadingly low scores.

Core Entities

Models

ChatGPT (gpt-3.5-turbo), GPT-3 (davinci), text-davinci-002, text-davinci-003, Claude, Claude 2, Llama 2-Chat (7B), Vicuna (7B), Alpaca (7B), Falcon (7B), ChatGLM (7B)

Metrics

Accuracy, BERTScore (for similarity filtering), Fleiss' Kappa (annotator agreement)

Datasets

HotpotQA, OpenDialKG, CNN/DailyMail, Alpaca instruction tuning dataset

Benchmarks

HaluEval

Context Entities

Models

Instruction-tuned models (InstructGPT), Open-source chat models (Vicuna, Alpaca)

Metrics

Topic clustering via LDA

Datasets

52K Alpaca instructions (for user queries selection)
