An open leaderboard that measures LLM hallucinations across 15 tasks and 20 models

April 8, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The leaderboard is a practical, usable benchmark and an open dashboard. Evidence is limited to open-source models and selected prompts; findings are empirically supported but need broader prompt and closed-source model checks before high-stakes deployment.

Citations: 6

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Giwon Hong, Aryo Pradipta Gema, Rohit Saxena, Xiaotang Du, Ping Nie, Yu Zhao, Laura Perez-Beltrachini, Max Ryabinin, Xuanli He, Clémentine Fourrier, Pasquale Minervini

Links

Abstract / PDF / Code

Why It Matters For Business

Hallucinated outputs cause misinformation and user harm. The leaderboard helps pick models that are less likely to invent facts or contradict source text by benchmarking both factuality and faithfulness.

Who Should Care

Summary TLDR

This paper introduces the Hallucinations Leaderboard: an open evaluation suite that measures two types of LLM hallucinations, factuality (errors against world knowledge) and faithfulness (errors against a given input). The leaderboard runs 15 tasks (QA, summarisation, reading comprehension, fact-checking, hallucination detection, instruction-following) on ~20 open-source LLMs using the EleutherAI Language Model Evaluation Harness. Main findings: instruction fine-tuning tends to improve faithfulness but not reliably factuality; scaling generally helps factuality more than faithfulness; models cluster by family. The leaderboard and preliminary prompt-robustness checks are publicly available in a Hugging Face space.

Problem Statement

Practitioners lack a consistent, multi-task way to compare how models hallucinate. With many models, sizes, and training regimes, teams need a shared benchmark to quantify hallucination risk across tasks and make model choices.

Main Contribution

A publicly available Hallucinations Leaderboard on Hugging Face that aggregates multiple hallucination benchmarks

A unified evaluation split into factuality (closed-book QA, fact-checking) and faithfulness (summarisation, reading comprehension, instruction following)

Key Findings

Instruction fine-tuning tends to increase Faithfulness scores, though the size of the gain varies by model.

Numbers: up to +5.3 points on Faithfulness (OpenHermes vs base, Table 1)

Practical Use: If you need models to stick to a given input (summaries or context grounding), apply instruction tuning and re-test using faithfulness tasks.

Evidence Ref: Table 1

Instruction fine-tuning does not reliably improve factuality; changes vary widely by model.

Numbers: Vicuna +11.3 pts factuality vs Mistral-Instruct −4.7 pts (Table 1)

Practical Use: Don’t assume instruction tuning fixes factual errors; measure factuality separately after tuning, before deployment.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
--- | --- | --- | --- | --- | --- | ---
Faithfulness | Llama-2-7b-chat 38.69 | Llama-2-7b 37.94 | +0.8 | Aggregate (15 tasks) | Instruction-tuned Llama-2-7b-chat shows +0.8 faithfulness vs base | Table 1
Factuality | Vicuna-7b-v1.5 51.42 | Llama-2-7b 40.12 | +11.3 | Aggregate (15 tasks) | Vicuna shows a large factuality improvement versus base in Table 1 | Table 1
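The Delta values in the results above are the model score minus the baseline score, reported to one decimal place. The short check below recomputes them from the listed numbers. It is only a sketch: the helper names `delta` and `format_delta` are illustrative, and half-up rounding is an assumption chosen because it reproduces the reported +0.8 and +11.3.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores copied from the two result rows above (Table 1 of the paper).
# Kept as strings so Decimal parses them exactly, with no float error.
ROWS = [
    ("Faithfulness", "Llama-2-7b-chat", "38.69", "Llama-2-7b", "37.94"),
    ("Factuality", "Vicuna-7b-v1.5", "51.42", "Llama-2-7b", "40.12"),
]


def delta(value: str, baseline: str) -> Decimal:
    """Return value - baseline, rounded half-up to one decimal place (assumed rounding)."""
    diff = Decimal(value) - Decimal(baseline)
    return diff.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def format_delta(d: Decimal) -> str:
    """Format a delta with an explicit sign, e.g. +0.8."""
    return f"{d:+.1f}"


if __name__ == "__main__":
    for metric, model, value, base_model, baseline in ROWS:
        d = delta(value, baseline)
        print(f"{metric}: {model} vs {base_model}: {format_delta(d)}")
```

Using `Decimal` rather than floats avoids a rounding surprise on the first row, where the raw difference is exactly 0.75.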

What To Try In 7 Days

Run the leaderboard (Hugging Face space) on your candidate open-source models to get faithfulness and factuality baselines

If you need context grounding (summaries, QA on given docs), evaluate instruction-tuned variants as they usually increase faithfulness

Measure factuality separately: do not assume instruction tuning improves factual correctness. Run a closed-book QA test set you care about first.
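For the last step, a closed-book QA test set can be scored with exact match (EM), one of the metrics the leaderboard reports. The sketch below is a minimal scorer under stated assumptions: the normalisation (lowercasing, stripping punctuation and English articles) follows the common SQuAD-style convention and may differ from what the leaderboard's harness applies, and the function names and sample answers are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed SQuAD-style normalisation; not necessarily the leaderboard's.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalised prediction equals any normalised gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that exactly match one of their gold answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)


if __name__ == "__main__":
    # Hypothetical model outputs and gold answers, for illustration only.
    predictions = ["The Eiffel Tower.", "1969", "Mount Kilimanjaro"]
    references = [["Eiffel Tower"], ["1969", "July 1969"], ["Mount Everest", "Everest"]]
    print(f"EM: {em_score(predictions, references):.1f}")
```

Running the same scorer on a base model and its instruction-tuned variant gives a direct, domain-specific check of whether tuning changed factual accuracy.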

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Fixed prompt templates for most tasks; only limited prompt robustness checks were run

Shot counts for in-context learning were not broadly explored; results may change with different numbers of demonstrations

When Not To Use

For final high-stakes factual verification without human oversight: the leaderboard is a diagnostic tool, not a full safety system

If you must evaluate closed-source models: they were excluded from this work

Failure Modes

Aggregate scores can hide per-task weaknesses; a model with a decent average may fail critical tasks

Instruction tuning can make models more faithful to input but simultaneously reduce factual accuracy
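One way to guard against the first failure mode is to report the weakest task alongside the average and to flag any task that falls below a floor you set. The sketch below illustrates this. The `summarize` helper, the task names, the scores, and the floor of 40 are all made-up placeholders, not leaderboard results.

```python
from statistics import mean


def summarize(task_scores: dict[str, float], floor: float) -> dict:
    """Return the average, the worst task, and every task scoring below `floor`."""
    if not task_scores:
        raise ValueError("task_scores must not be empty")
    worst_task = min(task_scores, key=task_scores.get)
    below_floor = sorted(t for t, s in task_scores.items() if s < floor)
    return {
        "average": mean(task_scores.values()),
        "worst_task": worst_task,
        "worst_score": task_scores[worst_task],
        "below_floor": below_floor,
    }


if __name__ == "__main__":
    # Placeholder scores for illustration only; not taken from the paper.
    scores = {"task_a": 70.0, "task_b": 60.0, "task_c": 20.0, "task_d": 50.0}
    report = summarize(scores, floor=40.0)
    print(f"average={report['average']:.1f}")
    print(f"worst={report['worst_task']} ({report['worst_score']:.1f})")
    print(f"below floor: {report['below_floor']}")
```

In this toy case the average of 50 looks acceptable, yet one task sits at 20. That is exactly the situation an aggregate score hides.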

Core Entities

Models

Llama-2-7b, Llama-2-7b-chat, Llama-2-13b, Llama-2-13b-chat, Vicuna-7b-v1.5, Mistral-7B-v0.1, Mistral-7B-Instruct-v0.1, Falcon-7b, Falcon-7b-Instruct, OpenHermes-2.5-Mistral-7B, Zephyr-7b-beta, GPT-J-6B, GPT-Neo-125M, GPT-Neo-1.3B, GPT-Neo-2.7B, Bloom-560M, Bloom-1b7, Bloom-7b1

Metrics

EM, ROUGE-L, Accuracy, MC2

Datasets

Natural Questions, TriviaQA, PopQA, TruthfulQA (MC2), FEVER, True-False, XSum, CNN/DailyMail, RACE, SQuAD v2, NQ-Swap, MemoTrap, IFEval, FaithDial, HaluEval (QA), FactKB

Benchmarks

Natural Questions, TriviaQA, PopQA, TruthfulQA, FEVER, True-False, XSum, CNN/DM, RACE, SQuADv2, NQ-Swap, MemoTrap, IFEval, FaithDial, HaluEval, FactKB