Overview
Production Readiness: 0.6
Novelty Score: 0.5
Cost Impact Score: 0.6
Citation Count: 6
Why It Matters For Business
Hallucinated outputs cause misinformation and user harm. The leaderboard helps pick models that are less likely to invent facts or contradict source text by benchmarking both factuality and faithfulness.
Summary TLDR
This paper introduces the Hallucinations Leaderboard, an open evaluation suite that measures two types of LLM hallucination: factuality (errors against world knowledge) and faithfulness (errors relative to a given input). The leaderboard runs 15 tasks (QA, summarisation, reading comprehension, fact-checking, hallucination detection, instruction following) on ~20 open-source LLMs using the EleutherAI Language Model Evaluation Harness. Main findings: instruction fine-tuning tends to improve faithfulness but does not reliably improve factuality; scaling generally helps factuality more than faithfulness; models cluster by family. The leaderboard and preliminary prompt-robustness checks are publicly available in a Hugging Face Space.
Problem Statement
Practitioners lack a consistent, multi-task way to compare how models hallucinate. With many models, sizes, and training regimes, teams need a shared benchmark to quantify hallucination risk across tasks and make model choices.
Main Contribution
- A publicly available Hallucinations Leaderboard on Hugging Face that aggregates multiple hallucination benchmarks
- A unified evaluation split into factuality (closed-book QA, fact-checking) and faithfulness (summarisation, reading comprehension, instruction following)
- A simple pair of aggregate scores: a factuality score and a faithfulness score (each an average of task metrics)
- An empirical sweep over ~20 open-source LLMs across 15 tasks, plus analyses of instruction fine-tuning, model scaling, prompt robustness, and family effects
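The two aggregate scores can be sketched as plain averages. This is a minimal illustration that assumes an unweighted mean over per-task metrics already on a common 0 to 1 scale; the task names and values below are hypothetical placeholders, not results from the paper.

```python
from statistics import mean

# Hypothetical per-task scores in [0, 1]. Names and values are placeholders
# for illustration only; they are not taken from the leaderboard.
FACTUALITY_TASKS = {"closed_book_qa": 0.42, "fact_checking": 0.61}
FAITHFULNESS_TASKS = {
    "summarisation": 0.35,
    "reading_comprehension": 0.58,
    "instruction_following": 0.47,
}


def aggregate(task_scores: dict[str, float]) -> float:
    """Return the unweighted mean of per-task metric values."""
    if not task_scores:
        raise ValueError("need at least one task score")
    return mean(list(task_scores.values()))


factuality_score = aggregate(FACTUALITY_TASKS)
faithfulness_score = aggregate(FAITHFULNESS_TASKS)
print(f"factuality={factuality_score:.3f} faithfulness={faithfulness_score:.3f}")
```

An unweighted mean treats every task as equally important, so it is worth checking whether that matches the tasks your application actually depends on.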
Key Findings
- Instruction fine-tuning consistently increases faithfulness scores.
- Instruction fine-tuning does not reliably improve factuality; changes vary widely by model.
- Larger models tend to improve factuality more than faithfulness.
- Models cluster by family (Llama, GPT-Neo, Bloom) rather than by fine-tuning type.
- Models are better at judging (detecting) hallucinations than at producing factually correct or faithful outputs.
- In the cases tested, varying the prompt template produced only small changes in evaluation scores.
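One way to quantify prompt sensitivity for a single model and task is to score it under several templates and look at the spread. The sketch below is an assumption about how such a check could be run, not the paper's procedure; `template_spread` and the template names are hypothetical.

```python
from statistics import mean, pstdev


def template_spread(scores_by_template: dict[str, float]) -> dict[str, float]:
    """Summarise how much one task metric moves across prompt templates."""
    values = list(scores_by_template.values())
    if len(values) < 2:
        raise ValueError("need scores for at least two templates")
    return {
        "mean": mean(values),
        "range": max(values) - min(values),  # worst-case swing between templates
        "std": pstdev(values),               # population standard deviation
    }


# Hypothetical scores for the same model and task under three paraphrased prompts.
spread = template_spread({"template_a": 0.61, "template_b": 0.63, "template_c": 0.60})
print(spread)
```

A small range relative to the gap between two candidate models suggests the ranking is unlikely to be an artefact of the template.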
Results
[Figure: result charts for Faithfulness, Factuality, and Faithfulness / Factuality; values not recoverable]
Who Should Care
What To Try In 7 Days
- Run the leaderboard's evaluations (Hugging Face Space) on your candidate open-source models to get faithfulness and factuality baselines
- If you need context grounding (summaries, QA on given documents), evaluate instruction-tuned variants, as they usually increase faithfulness
- Measure factuality separately; do not assume instruction tuning improves factual correctness. Run a closed-book QA test set you care about first
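For the closed-book QA check, exact match (EM) is simple enough to compute on your own test set. The sketch below assumes SQuAD-style answer normalisation (lowercasing, stripping punctuation and English articles), which is a common convention and not necessarily the exact normalisation used by the leaderboard; the function names are illustrative.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the prediction equals any accepted answer after normalisation."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Fraction of predictions that exactly match one of their gold answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("need at least one example")
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)


# Hypothetical model outputs and gold answers.
print(em_score(["The Eiffel Tower.", "Rome"], [["Eiffel Tower"], ["Madrid"]]))
```

Running the same questions through a base model and its instruction-tuned variant gives a direct view of whether tuning helped or hurt factuality on your domain.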
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Fixed prompt templates for most tasks; only limited prompt robustness checks were run
- Shot counts for in-context learning were not broadly explored; results may change with different numbers of demonstrations
- Only open-source models were evaluated; closed-source models (e.g. GPT-4) were not included
- Some datasets can contain demographic or collection biases that affect reported scores
When Not To Use
- For final high-stakes factual verification without human oversight; the leaderboard is a diagnostic tool, not a full safety system
- If you must evaluate closed-source models; they were excluded from this work
- If your application uses non-standard prompts or many-shot setups not covered by the templates
Failure Modes
- Aggregate scores can hide per-task weaknesses; a model with a decent average may fail critical tasks
- Instruction tuning can make models more faithful to input but simultaneously reduce factual accuracy
- Prompt sensitivity outside tested paraphrases may flip results
- Dataset biases in benchmarks can misrepresent real-world performance
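To guard against an average hiding a weak task, per-task floors can be checked alongside the aggregate. The sketch below is illustrative: `weak_tasks`, the task names, the scores, and the thresholds are all hypothetical and should be replaced with values that reflect your own requirements.

```python
from statistics import mean


def weak_tasks(task_scores: dict[str, float], floors: dict[str, float]) -> dict[str, float]:
    """Return the tasks whose score falls below the required minimum."""
    return {
        task: score
        for task, score in task_scores.items()
        if task in floors and score < floors[task]
    }


# Hypothetical scores: the average looks acceptable, but one task is far below par.
scores = {"summarisation": 0.80, "reading_comprehension": 0.80, "instruction_following": 0.20}
floors = {"instruction_following": 0.50}  # hypothetical minimum for a critical task

print(f"average={mean(scores.values()):.2f}")
print("below floor:", weak_tasks(scores, floors))
```

Tasks without an entry in `floors` are ignored, so only the tasks you mark as critical can block a model.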
Core Entities
Models
- Llama-2-7b
- Llama-2-7b-chat
- Llama-2-13b
- Llama-2-13b-chat
- Vicuna-7b-v1.5
- Mistral-7B-v0.1
- Mistral-7B-Instruct-v0.1
- Falcon-7b
- Falcon-7b-Instruct
- OpenHermes-2.5-Mistral-7B
- Zephyr-7b-beta
- GPT-J-6B
- GPT-Neo-125M
- GPT-Neo-1.3B
- GPT-Neo-2.7B
- Bloom-560M
- Bloom-1b7
- Bloom-7b1
Metrics
- EM
- ROUGE-L
- Accuracy
- MC2
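As a reference point for the summarisation metric, ROUGE-L is based on the longest common subsequence (LCS) between a candidate and a reference. The sketch below computes an F1-style ROUGE-L under simplifying assumptions: lowercased whitespace tokenisation, no stemming, and equal weight on precision and recall. Standard packages may tokenise and weight differently, so scores will not match them exactly.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (simplified: no stemming)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical candidate summary and reference.
print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))
```

Because ROUGE-L rewards word overlap with a reference, a high score does not by itself show that a summary is faithful to its source.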
Datasets
- Natural Questions
- TriviaQA
- PopQA
- TruthfulQA (MC2)
- FEVER
- True-False
- XSum
- CNN/DailyMail
- RACE
- SQuAD v2
- NQ-Swap
- MemoTrap
- IFEval
- FaithDial
- HaluEval (QA)
- FactKB
Benchmarks
- Natural Questions
- TriviaQA
- PopQA
- TruthfulQA
- FEVER
- True-False
- XSum
- CNN/DailyMail
- RACE
- SQuAD v2
- NQ-Swap
- MemoTrap
- IFEval
- FaithDial
- HaluEval
- FactKB

