Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
29
Why It Matters For Business
Models can produce believable but false statements. That creates risk for customer-facing apps, search, and decision tools. HaluEval lets you measure how often your model fabricates facts and whether it can flag them.
Summary TLDR
HaluEval is a 35,000-sample benchmark (5k human-annotated ChatGPT responses + 30k automatically generated hallucinated examples across QA, dialogue, and summarization). The authors show many top LLMs both produce hallucinations and struggle to detect them. Supplying retrieved facts helps detection; contrasting against ground truth makes detection worse.
Problem Statement
Large language models sometimes produce plausible but false statements (hallucinations). Before HaluEval, there was no large, multi-task benchmark for testing when and how models hallucinate and whether they can recognize their own hallucinations.
Main Contribution
A publicly released benchmark, HaluEval, with 35,000 examples: 5,000 human-annotated ChatGPT responses and 30,000 auto-generated hallucinated counterparts across QA, knowledge-grounded dialogue, and summarization.
A two-stage automatic generation pipeline (sampling-then-filtering) that uses ChatGPT to create diverse hallucinated samples and then selects the most plausible/difficult examples.
Human labeling protocol: three annotators per ChatGPT response, final labels decided by max voting, and span-level hallucination markup; inter-annotator agreement is Fleiss' Kappa = 0.811.
Baseline experiments evaluating 10+ LLMs on the benchmark and simple improvement tests: retrieval augmentation, chain-of-thought, and contrastive examples.
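The sampling-then-filtering pipeline listed above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: `generate_candidates` and `plausibility` are hypothetical stand-ins for the ChatGPT sampling call and the ChatGPT-based selection step, and the toy scoring rule is an assumption chosen only so the example runs offline.

```python
from typing import Callable, List


def sample_then_filter(
    question: str,
    right_answer: str,
    generate_candidates: Callable[[str, str], List[str]],
    plausibility: Callable[[str, str], float],
) -> str:
    """Return the candidate hallucinated answer judged most plausible."""
    # Stage 1 (sampling): collect several hallucinated candidates.
    candidates = generate_candidates(question, right_answer)
    # Drop candidates that accidentally reproduce the right answer.
    target = right_answer.strip().lower()
    candidates = [c for c in candidates if c.strip().lower() != target]
    if not candidates:
        raise ValueError("no hallucinated candidates were produced")
    # Stage 2 (filtering): keep the most plausible, hardest-to-spot candidate.
    return max(candidates, key=lambda c: plausibility(question, c))


# Hypothetical stand-ins so the sketch runs without calling an LLM.
def toy_generate(question: str, right_answer: str) -> List[str]:
    return ["Paris", "Lyon, the capital since 1790", "Marseille"]


def toy_plausibility(question: str, candidate: str) -> float:
    # Toy rule (an assumption): shorter answers look more like typical QA answers.
    return 1.0 / len(candidate)


picked = sample_then_filter(
    "What is the capital of France?", "Paris", toy_generate, toy_plausibility
)
```

In a real run, both callables would wrap prompts to an LLM; keeping them as parameters makes it easy to swap the scorer without touching the selection logic.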
Key Findings
ChatGPT produces unverifiable or conflicting statements in a sizable fraction of real user responses (about 19.5%, according to the paper's abstract).
Even strong LLMs struggle to tell factual vs hallucinated outputs on HaluEval.
Retrieving and giving relevant external facts improves a model's ability to spot hallucinations.
Contrastive presentation of ground-truth plus hallucinated examples can confuse LLMs and hurt detection.
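To make the retrieval finding concrete, here is a small sketch of how retrieved facts might be placed into a detection prompt. Everything in it is an assumption for illustration: the in-memory `CORPUS`, the word-overlap `retrieve` function, and the prompt wording are hypothetical and are not taken from the paper or its code.

```python
import re
from typing import List, Set

# Tiny in-memory corpus standing in for Wikipedia or a domain knowledge base.
CORPUS = [
    "The Eiffel Tower is located in Paris, France.",
    "Mount Everest is the highest mountain above sea level.",
    "The Great Wall of China is in northern China.",
]


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: List[str], k: int = 1) -> List[str]:
    """Rank passages by word overlap with the query (toy retriever)."""
    q = _tokens(query)
    ranked = sorted(corpus, key=lambda p: len(q & _tokens(p)), reverse=True)
    return ranked[:k]


def build_detection_prompt(question: str, answer: str, facts: List[str]) -> str:
    """Assemble a yes/no hallucination-detection prompt with retrieved facts."""
    knowledge = "\n".join(f"- {fact}" for fact in facts)
    return (
        "Knowledge:\n" + knowledge + "\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Does the answer contain hallucinated information? Reply Yes or No."
    )


question = "Where is the Eiffel Tower located?"
facts = retrieve(question, CORPUS)
prompt = build_detection_prompt(question, "Berlin, Germany", facts)
```

A production setup would replace the overlap ranking with a proper search index or dense retriever; the prompt-assembly step stays the same.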
Results
Reported metrics: Accuracy; ChatGPT hallucination rate (human-annotated user queries)
Who Should Care
What To Try In 7 Days
Run HaluEval on your model to get a baseline hallucination-detection accuracy (use the provided code).
Add a lightweight retrieval step (Wikipedia or domain sources) and re-run detection; expect notable improvement for factual QA.
Scan the benchmark topic breakdown to find domain blind spots (e.g., technology, climate) and prioritize guardrails there.
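For the baseline step, detection accuracy can be computed with a few lines once you have the model's replies and the gold labels. The sketch below does not reproduce the released evaluation scripts; the Yes/No reply format, the `parse_judgement` helper, and the rule that unparseable replies count as wrong are all assumptions.

```python
from typing import List, Optional


def parse_judgement(raw: str) -> Optional[bool]:
    """Map a free-text reply to True (hallucinated), False, or None if unclear."""
    text = raw.strip().lower()
    if text.startswith("yes"):
        return True
    if text.startswith("no"):
        return False
    return None


def detection_accuracy(replies: List[str], labels: List[bool]) -> float:
    """Fraction of samples where the parsed reply matches the gold label.

    Unparseable replies count as wrong (one possible convention).
    """
    if len(replies) != len(labels):
        raise ValueError("replies and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    correct = sum(parse_judgement(r) == y for r, y in zip(replies, labels))
    return correct / len(labels)


# Made-up replies and labels, for illustration only.
replies = ["Yes", "No.", "yes, it does", "I am not sure"]
labels = [True, False, False, True]
acc = detection_accuracy(replies, labels)
```

Running the same function before and after adding retrieval gives a like-for-like comparison, provided the sample set and parsing rule stay fixed.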
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Generated hallucinated samples depend on ChatGPT's ability to follow the hallucination instructions; sample quality may reflect ChatGPT biases.
- Benchmark focuses on detection of hallucination, not on diagnosing why models hallucinate or fixing generation models.
- High similarity between hallucinated and ground-truth samples may create misuse risks; authors recommend monitoring distributional use.
When Not To Use
- If you want causal analysis of why models hallucinate (this benchmark measures detection only).
- For multimodal or non-text tasks (HaluEval is text-only).
- As training data to teach models to hallucinate: the samples are purposely deceptive and could be misused.
Failure Modes
- The benchmark reflects ChatGPT's hallucination patterns because ChatGPT created many of the samples.
- Contrastive testing can confuse models and produce misleadingly low scores.
- Topic sensitivity: models fail more in specific domains (technology, climate, language) so aggregate accuracy can hide domain blind spots.
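The topic-sensitivity point can be checked by breaking accuracy down per topic instead of reporting a single number. A minimal sketch follows; the topic names and records are made up for illustration and are not results from the paper, and `accuracy_by_topic` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_topic(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy per topic from (topic, detection_was_correct) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for topic, correct in records:
        totals[topic] += 1
        hits[topic] += int(correct)
    return {topic: hits[topic] / totals[topic] for topic in totals}


# Made-up records for illustration only; not results from the paper.
records = [
    ("film", True), ("film", True), ("film", True), ("film", True),
    ("technology", True), ("technology", False),
    ("climate", False), ("climate", False),
]
per_topic = accuracy_by_topic(records)
overall = sum(correct for _, correct in records) / len(records)
```

Here the overall figure sits well above the weakest topic, which is exactly the situation where an aggregate score hides a blind spot.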
Core Entities
Models
- ChatGPT (gpt-3.5-turbo)
- GPT-3 (davinci)
- text-davinci-002
- text-davinci-003
- Claude
- Claude 2
- Llama 2-Chat (7B)
- Vicuna (7B)
- Alpaca (7B)
- Falcon (7B)
- ChatGLM (7B)
Metrics
- Accuracy
- BERTScore (for similarity filtering)
- Fleiss' Kappa (annotator agreement)
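As a reference for the agreement metric, the sketch below computes a majority-vote label and Fleiss' kappa from raw annotator labels using the standard formula. The example ratings are invented, and the function names are hypothetical; this is not the authors' annotation tooling.

```python
from collections import Counter
from typing import List, Sequence


def majority_label(votes: Sequence[str]) -> str:
    """Label chosen by the most annotators (no ties with 3 raters, 2 labels)."""
    return Counter(votes).most_common(1)[0][0]


def fleiss_kappa(ratings: List[Sequence[str]]) -> float:
    """Fleiss' kappa for items each rated by the same number of annotators."""
    if not ratings:
        raise ValueError("need at least one rated item")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    categories = sorted({label for r in ratings for label in r})
    counts = [Counter(r) for r in ratings]
    # Per-item agreement P_i: share of rater pairs that agree on the item.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement P_e from the overall category proportions.
    p_cat = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)


# Invented ratings: four responses, three annotators each.
ratings = [
    ["yes", "yes", "yes"],
    ["no", "no", "no"],
    ["yes", "yes", "no"],
    ["no", "no", "no"],
]
final_labels = [majority_label(r) for r in ratings]
kappa = fleiss_kappa(ratings)
```

Values near 1 indicate strong agreement beyond chance, which is how the reported 0.811 should be read.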
Datasets
- HotpotQA
- OpenDialKG
- CNN/DailyMail
- Alpaca instruction tuning dataset
Benchmarks
- HaluEval
Context Entities
Models
- Instruction-tuned models (InstructGPT)
- Open-source chat models (Vicuna, Alpaca)
Metrics
- topic clustering via LDA
Datasets
- 52K Alpaca instructions (for user queries selection)

