An open leaderboard that measures LLM hallucinations across 15 tasks and 20 models

Overview

Decision Snapshot: Needs Validation

The leaderboard is a practical, usable benchmark and an open dashboard. Evidence is limited to open-source models and selected prompts; findings are empirically supported but need broader prompt and closed-source model checks before high-stakes deployment.

Citations: 6

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Giwon Hong, Aryo Pradipta Gema, Rohit Saxena, Xiaotang Du, Ping Nie, Yu Zhao, Laura Perez-Beltrachini, Max Ryabinin, Xuanli He, Clémentine Fourrier, Pasquale Minervini

Links

Abstract / PDF / Code

Why It Matters For Business

Hallucinated outputs cause misinformation and user harm. The leaderboard helps pick models that are less likely to invent facts or contradict source text by benchmarking both factuality and faithfulness.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Engineering Lead

Summary TLDR

This paper introduces the Hallucinations Leaderboard, an open evaluation suite that measures two types of LLM hallucination: factuality (errors against world knowledge) and faithfulness (errors against a given input). The leaderboard runs 15 tasks (QA, summarisation, reading comprehension, fact-checking, hallucination detection, instruction following) on ~20 open-source LLMs using the EleutherAI Language Model Evaluation Harness. Main findings: instruction fine-tuning tends to improve faithfulness but not reliably factuality; scaling generally helps factuality more than faithfulness; models cluster by family. The leaderboard and preliminary prompt-robustness checks are public on the Hugging Face space.

Problem Statement

Practitioners lack a consistent, multi-task way to compare how models hallucinate. With many models, sizes, and training regimes, teams need a shared benchmark to quantify hallucination risk across tasks and make model choices.

Main Contribution

A publicly available Hallucinations Leaderboard on Hugging Face that aggregates multiple hallucination benchmarks

A unified evaluation split into factuality (closed-book QA, fact-checking) and faithfulness (summarisation, reading comprehension, instruction following)
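The two-way split described above can be sketched as a small aggregation step. This is only an illustration: the grouping of task types follows the summary, but the unweighted mean, the `category_scores` helper, and the example scores are assumptions, and the leaderboard's own aggregation may differ.

```python
from statistics import mean

# Task-type grouping as stated in the summary above.
CATEGORY_OF_TASK_TYPE = {
    "closed_book_qa": "factuality",
    "fact_checking": "factuality",
    "summarisation": "faithfulness",
    "reading_comprehension": "faithfulness",
    "instruction_following": "faithfulness",
}


def category_scores(task_type_scores):
    """Average per-task-type scores into one score per category.

    Assumption: an unweighted mean; the real leaderboard may weight tasks differently.
    """
    grouped = {}
    for task_type, score in task_type_scores.items():
        category = CATEGORY_OF_TASK_TYPE[task_type]
        grouped.setdefault(category, []).append(score)
    return {category: mean(scores) for category, scores in grouped.items()}


# Hypothetical scores on a 0-100 scale, for illustration only (not from the paper).
example_scores = {
    "closed_book_qa": 30.0,
    "fact_checking": 50.0,
    "summarisation": 20.0,
    "reading_comprehension": 40.0,
    "instruction_following": 60.0,
}

print(category_scores(example_scores))
```

Keeping the two categories as separate numbers, rather than one overall score, matches the paper's point that factuality and faithfulness can move in different directions.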

Key Findings

Instruction fine-tuning consistently increases Faithfulness scores.

Numbers: up to +5.3 points on Faithfulness (OpenHermes vs base, Table 1)

Practical Use: If you need models to stick to a given input (summaries or context grounding), apply instruction tuning and re-test using faithfulness tasks.

Evidence Ref: Table 1

Instruction fine-tuning does not reliably improve factuality; changes vary widely by model.

Numbers: Vicuna +11.3 pts factuality vs Mistral-Instruct −4.7 pts (Table 1)

Practical Use: Don’t assume instruction tuning fixes factual errors; measure factuality separately after tuning and before deployment.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Faithfulness	Llama-2-7b-chat 38.69	Llama-2-7b 37.94	+0.8	Aggregate (15 tasks)	Instruction-tuned Llama-2-7b-chat shows +0.8 faithfulness vs base	Table 1
Factuality	Vicuna-7b-v1.5 51.42	Llama-2-7b 40.12	+11.3	Aggregate (15 tasks)	Vicuna shows a large factuality improvement versus base in Table 1	Table 1
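The Delta column can be re-derived from the Value and Baseline columns. The sketch below is a minimal check that uses only the numbers in the table above; the `delta` helper and the choice of half-up rounding to one decimal place are assumptions made for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

# (metric, model, value, baseline model, baseline) copied from the Results table.
ROWS = [
    ("Faithfulness", "Llama-2-7b-chat", "38.69", "Llama-2-7b", "37.94"),
    ("Factuality", "Vicuna-7b-v1.5", "51.42", "Llama-2-7b", "40.12"),
]


def delta(value, baseline):
    """Return value - baseline rounded to one decimal place.

    Decimal is used so that 38.69 - 37.94 is exactly 0.75 before rounding,
    which binary floats do not guarantee.
    """
    diff = Decimal(value) - Decimal(baseline)
    return diff.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for metric, model, value, base_model, baseline in ROWS:
    print(f"{metric}: {model} vs {base_model}: {delta(value, baseline):+}")
```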

What To Try In 7 Days

Run the leaderboard (Hugging Face space) on your candidate open-source models to get faithfulness and factuality baselines

If you need context grounding (summaries, QA on given docs), evaluate instruction-tuned variants as they usually increase faithfulness

Measure factuality separately; do not assume instruction tuning improves factual correctness. Start with a closed-book QA test set that matters to your use case.
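For the closed-book QA check, a minimal exact-match (EM) scorer might look like the sketch below. It is not the leaderboard's implementation: the normalisation follows the widely used SQuAD-style convention (lowercase, strip punctuation and articles), and `em_score`, `stub_model`, and the two-question dataset are hypothetical placeholders for your own model call and test set.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the prediction equals any gold answer after normalisation."""
    return any(normalize(prediction) == normalize(gold) for gold in gold_answers)


def em_score(answer_fn, dataset):
    """Percentage of questions answered with an exact match.

    answer_fn is any callable mapping a question string to an answer string
    (hypothetical stand-in for a real model call).
    """
    hits = sum(exact_match(answer_fn(question), golds) for question, golds in dataset)
    return 100.0 * hits / len(dataset)


# Hypothetical toy data and stub model, for illustration only.
toy_dataset = [
    ("What is the capital of France?", ["Paris"]),
    ("Who wrote Hamlet?", ["William Shakespeare", "Shakespeare"]),
]
stub_answers = {
    "What is the capital of France?": "Paris.",
    "Who wrote Hamlet?": "Christopher Marlowe",
}


def stub_model(question):
    return stub_answers[question]


print(em_score(stub_model, toy_dataset))
```

Running the same scorer on a base model and its instruction-tuned variant gives a like-for-like factuality comparison on questions you care about.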

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard

Risks & Boundaries

Limitations

Fixed prompt templates for most tasks; only limited prompt robustness checks were run

Shot counts for in-context learning were not broadly explored; results may change with different numbers of demonstrations

When Not To Use

For final high-stakes factual verification without human oversight; the leaderboard is a diagnostic tool, not a full safety system

If you must evaluate closed-source models, which were excluded from this work

Failure Modes

Aggregate scores can hide per-task weaknesses; a model with a decent average may fail critical tasks

Instruction tuning can make models more faithful to input but simultaneously reduce factual accuracy
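One way to guard against the first failure mode is to report per-task floors alongside the average. The sketch below assumes per-task scores on a 0-100 scale; the `weak_tasks` helper, the threshold, and the scores are hypothetical and chosen only to show a decent mean masking one weak task.

```python
from statistics import mean


def weak_tasks(task_scores, floor):
    """Return the names of tasks scoring below the floor, sorted alphabetically."""
    return sorted(task for task, score in task_scores.items() if score < floor)


# Hypothetical per-task scores for one model (not taken from the paper).
model_scores = {"XSum": 62.0, "RACE": 71.0, "MemoTrap": 18.0, "IFEval": 65.0}

average = mean(model_scores.values())
flagged = weak_tasks(model_scores, floor=40.0)

print(f"average={average:.1f}, below floor: {flagged}")
```

A non-empty flagged list is a cue to inspect those tasks before relying on the aggregate score for model selection.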

Core Entities

Models

Llama-2-7b, Llama-2-7b-chat, Llama-2-13b, Llama-2-13b-chat, Vicuna-7b-v1.5, Mistral-7B-v0.1, Mistral-7B-Instruct-v0.1, Falcon-7b, Falcon-7b-Instruct, OpenHermes-2.5-Mistral-7B, Zephyr-7b-beta, GPT-J-6B, GPT-Neo-125M, GPT-Neo-1.3B, GPT-Neo-2.7B, Bloom-560M, Bloom-1b7, Bloom-7b1

Metrics

EM, ROUGE-L, Accuracy, MC2

Datasets

Natural Questions, TriviaQA, PopQA, TruthfulQA (MC2), FEVER, True-False, XSum, CNN/DailyMail, RACE, SQuAD v2, NQ-Swap, MemoTrap, IFEval, FaithDial, HaluEval (QA), FactKB

Benchmarks

Natural Questions, TriviaQA, PopQA, TruthfulQA, FEVER, True-False, XSum, CNN/DM, RACE, SQuADv2, NQ-Swap, MemoTrap, IFEval, FaithDial, HaluEval, FactKB

