Overview
The leaderboard is a practical benchmark with an open dashboard. Its evidence covers only open-source models and a selected set of prompts; the findings are empirically supported but should be checked against more prompts and closed-source models before any high-stakes deployment.
Citations: 6
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/6
Findings with evidence refs: 6/6
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Hallucinated outputs cause misinformation and user harm. The leaderboard helps pick models that are less likely to invent facts or contradict source text by benchmarking both factuality and faithfulness.
Summary TLDR
This paper introduces the Hallucinations Leaderboard, an open evaluation suite that measures two types of LLM hallucination: factuality (errors against world knowledge) and faithfulness (errors against a given input). The leaderboard runs 15 tasks (QA, summarisation, reading comprehension, fact-checking, hallucination detection, instruction following) on ~20 open-source LLMs using the EleutherAI Language Model Evaluation Harness. Main findings: instruction fine-tuning tends to improve faithfulness but does not reliably improve factuality; scaling generally helps factuality more than faithfulness; models cluster by family. The leaderboard and preliminary prompt-robustness checks are publicly available as a Hugging Face Space.
Problem Statement
Practitioners lack a consistent, multi-task way to compare how models hallucinate. With many models, sizes, and training regimes, teams need a shared benchmark to quantify hallucination risk across tasks and make model choices.
Main Contribution
A publicly available Hallucinations Leaderboard on Hugging Face that aggregates multiple hallucination benchmarks
A unified evaluation split into factuality (closed-book QA, fact-checking) and faithfulness (summarisation, reading comprehension, instruction following)
Key Findings
Instruction fine-tuning tends to increase faithfulness scores.
Instruction fine-tuning does not reliably improve factuality; changes vary widely by model.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Faithfulness | Llama-2-7b-chat 38.69 | Llama-2-7b 37.94 | +0.8 | Aggregate (15 tasks) | Instruction-tuned Llama-2-7b-chat shows +0.8 faithfulness vs base | Table 1 |
| Factuality | Vicuna-7b-v1.5 51.42 | Llama-2-7b 40.12 | +11.3 | Aggregate (15 tasks) | Vicuna shows a large factuality improvement versus base in Table 1 | Table 1 |
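The Delta column is the candidate score minus the baseline score. As a quick arithmetic check, the short sketch below recomputes both deltas from the values in the table; the function name `score_delta` is illustrative and not part of the leaderboard code.

```python
# Scores copied from the Results table.
rows = [
    # (metric, candidate model, candidate score, baseline model, baseline score)
    ("Faithfulness", "Llama-2-7b-chat", 38.69, "Llama-2-7b", 37.94),
    ("Factuality", "Vicuna-7b-v1.5", 51.42, "Llama-2-7b", 40.12),
]


def score_delta(candidate: float, baseline: float) -> float:
    """Absolute difference in score points, rounded to two decimals."""
    return round(candidate - baseline, 2)


deltas = {metric: score_delta(cand, base) for metric, _, cand, _, base in rows}

for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.2f}")
# Faithfulness: +0.75
# Factuality: +11.30
```

Rounded to one decimal, these match the +0.8 and +11.3 reported in the table. Note the difference in scale: the faithfulness gain is under one point, while the factuality gain is over eleven.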
What To Try In 7 Days
Run the leaderboard (Hugging Face space) on your candidate open-source models to get faithfulness and factuality baselines
If you need context grounding (summaries, QA on given docs), evaluate instruction-tuned variants, which usually score higher on faithfulness
Measure factuality separately; do not assume instruction tuning improves factual correctness. Run a closed-book QA test set you care about first
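For the last step, a closed-book QA check on your own questions can be as small as the sketch below. It assumes exact-match scoring after light normalization, which is a common convention for closed-book QA but not necessarily the metric the leaderboard uses for every task. `toy_model` is a hypothetical stand-in for your own inference call, and the two questions are illustrative only.

```python
import re
import string
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_accuracy(
    examples: Iterable[tuple[str, list[str]]],
    answer_fn: Callable[[str], str],
) -> float:
    """Fraction of questions whose normalized answer equals any gold answer."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    hits = 0
    for question, gold_answers in examples:
        prediction = normalize(answer_fn(question))
        if any(prediction == normalize(gold) for gold in gold_answers):
            hits += 1
    return hits / len(examples)


# Hypothetical stand-in for a model call; replace with your own inference code.
def toy_model(question: str) -> str:
    canned = {
        "What is the capital of France?": "Paris.",
        "Who wrote Hamlet?": "Christopher Marlowe",
    }
    return canned.get(question, "")


# Illustrative question set: (question, list of acceptable answers).
qa_set = [
    ("What is the capital of France?", ["Paris"]),
    ("Who wrote Hamlet?", ["William Shakespeare", "Shakespeare"]),
]

accuracy = exact_match_accuracy(qa_set, toy_model)
print(f"exact match: {accuracy:.2f}")
```

Running the same question set against a base model and its instruction-tuned variant gives a direct, domain-specific read on whether tuning helped or hurt factuality for your use case.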
Risks & Boundaries
Limitations
Fixed prompt templates for most tasks; only limited prompt robustness checks were run
Shot counts for in-context learning were not broadly explored; results may change with different numbers of demonstrations
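One way to gauge how much these limitations matter for a given model is to score it under several prompt templates (or shot counts) and look at the spread. The sketch below only summarizes scores you have already collected; the template names and numbers are hypothetical and are not results from the paper.

```python
from statistics import mean, pstdev


def prompt_sensitivity(scores_by_template: dict[str, float]) -> dict[str, float]:
    """Summarize how much one model's score moves across prompt templates."""
    if len(scores_by_template) < 2:
        raise ValueError("need scores for at least two templates")
    values = list(scores_by_template.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),  # population standard deviation
        "range": max(values) - min(values),
    }


# Hypothetical scores for one model on one task under three templates.
scores = {"template_a": 40.0, "template_b": 44.0, "template_c": 36.0}

summary = prompt_sensitivity(scores)
print(summary)
```

If the range across templates is comparable to the gap between two models on the leaderboard, the ranking between those models should be treated as unsettled.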
When Not To Use
For final high-stakes factual verification without human oversight; the leaderboard is a diagnostic tool, not a full safety system
For evaluating closed-source models, which were excluded from this work
Failure Modes
Aggregate scores can hide per-task weaknesses; a model with a decent average may fail critical tasks
Instruction tuning can make models more faithful to input but simultaneously reduce factual accuracy
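
The first failure mode is easy to guard against by checking per-task scores next to the average. The sketch below flags tasks that fall under a chosen floor; the task names, scores, and threshold are hypothetical and chosen only to illustrate the effect.

```python
from statistics import mean


def weak_tasks(task_scores: dict[str, float], floor: float) -> list[str]:
    """Return tasks scoring below `floor`, weakest first."""
    below = (task for task, score in task_scores.items() if score < floor)
    return sorted(below, key=task_scores.get)


# Hypothetical per-task scores for a single model.
scores = {
    "closed_book_qa": 62.0,
    "summarisation": 58.0,
    "fact_checking": 65.0,
    "hallucination_detection": 15.0,
}

average = mean(scores.values())
flagged = weak_tasks(scores, floor=30.0)

print(f"average: {average:.1f}")
print(f"below floor: {flagged}")
```

Here the average looks acceptable while one task sits far below the others, which is exactly the case an aggregate score hides.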

