An open leaderboard that measures LLM hallucinations across 15 tasks and 20 models

April 8, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

6

Authors

Giwon Hong, Aryo Pradipta Gema, Rohit Saxena, Xiaotang Du, Ping Nie, Yu Zhao, Laura Perez-Beltrachini, Max Ryabinin, Xuanli He, Clémentine Fourrier, Pasquale Minervini

Links

Abstract / PDF

Why It Matters For Business

Hallucinated outputs cause misinformation and user harm. The leaderboard helps pick models that are less likely to invent facts or contradict source text by benchmarking both factuality and faithfulness.

Summary TLDR

This paper introduces the Hallucinations Leaderboard: an open evaluation suite that measures two types of LLM hallucinations, factuality (errors about world knowledge) and faithfulness (errors relative to a given input). The leaderboard runs 15 tasks (QA, summarisation, reading comprehension, fact-checking, hallucination detection, instruction following) on ~20 open-source LLMs using the EleutherAI Language Model Evaluation Harness. Main findings: instruction fine-tuning tends to improve faithfulness but does not reliably improve factuality; scaling generally helps factuality more than faithfulness; models cluster by family. The leaderboard and preliminary prompt-robustness checks are public on the Hugging Face space.

Problem Statement

Practitioners lack a consistent, multi-task way to compare how models hallucinate. With many models, sizes, and training regimes, teams need a shared benchmark to quantify hallucination risk across tasks and make model choices.

Main Contribution

A publicly available Hallucinations Leaderboard on Hugging Face that aggregates multiple hallucination benchmarks

A unified evaluation split into factuality (closed-book QA, fact-checking) and faithfulness (summarisation, reading comprehension, instruction following)

A simple pair of aggregate scores: factuality score and faithfulness score (averages of task metrics)

An empirical sweep over ~20 open-source LLMs across 15 tasks, plus analyses of instruction fine-tuning, model scaling, prompt robustness, and family effects
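The aggregate scores described above are averages of task metrics. A minimal sketch of that averaging follows, assuming an unweighted mean over tasks that share a 0-100 scale; the task names and values are hypothetical placeholders, not results from the paper, and the leaderboard's exact task grouping may differ.

```python
from statistics import mean

# Hypothetical per-task scores on a 0-100 scale.
# Placeholders for illustration only; not numbers from the paper.
task_scores = {
    "factuality": {"closed_book_qa": 40.0, "fact_checking": 60.0},
    "faithfulness": {
        "summarisation": 30.0,
        "reading_comprehension": 50.0,
        "instruction_following": 40.0,
    },
}


def aggregate(scores_by_category):
    """Return the unweighted mean of task metrics for each category."""
    return {
        category: mean(tasks.values())
        for category, tasks in scores_by_category.items()
    }


aggregates = aggregate(task_scores)
print(aggregates)
```

An unweighted mean treats every task equally, so a category with few tasks is more sensitive to any single benchmark.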

Key Findings

Instruction fine-tuning consistently increases Faithfulness scores.

Numbers: up to +5.3 points on Faithfulness (OpenHermes vs base, Table 1)

Instruction fine-tuning does not reliably improve factuality; changes vary widely by model.

Numbers: Vicuna +11.3 pts factuality vs Mistral-Instruct −4.7 pts (Table 1)

Larger models tend to improve factuality more than faithfulness.

Numbers: Factuality gains up to +5.6 pts across sizes (Table 2)

Models cluster by family (Llama, GPT-Neo, Bloom) rather than by fine-tuning type.

Models are better at judging (detecting) hallucinations than at producing factually correct or faithful outputs.

In the cases tested, varying the prompt template changed evaluation scores only slightly.

Numbers: TruthfulQA std ≈ 0.01; NQ changes small (Table 3)
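A prompt-robustness check of this kind can be summarised by the spread of one model's score across templates. The sketch below is one way to do that; the template names and scores are hypothetical placeholders, and the use of the population standard deviation is an assumption, since the summary does not say which estimator the paper used.

```python
from statistics import mean, pstdev

# Hypothetical scores (0-1 scale) for one model on one task under
# three prompt templates. Placeholders, not values from the paper.
scores_by_template = {
    "template_a": 0.40,
    "template_b": 0.41,
    "template_c": 0.39,
}


def prompt_spread(scores):
    """Summarise how much a score moves across prompt templates."""
    values = list(scores.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),  # population std; an assumption
        "range": max(values) - min(values),
    }


spread = prompt_spread(scores_by_template)
print(spread)
```

A small standard deviation relative to the gaps between models suggests the ranking is stable for the templates tested; it says nothing about templates that were not tried.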

Results

Faithfulness

Value: Llama-2-7b-chat 38.69

Baseline: Llama-2-7b 37.94

Factuality

Value: Vicuna-7b-v1.5 51.42

Baseline: Llama-2-7b 40.12

Faithfulness / Factuality

Value: Mistral-7B-Instruct, Faithfulness 43.26 | Factuality 50.74

Baseline: Mistral-7B, Faithfulness 38.62 | Factuality 55.41
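The instruction-tuning deltas quoted under Key Findings can be recomputed from the values listed in this Results section. The sketch below only subtracts the reported baseline from the reported value; the dictionary layout and function name are illustrative choices.

```python
# Scores copied from the Results section above.
results = {
    "Llama-2-7b": {"faithfulness": 37.94, "factuality": 40.12},
    "Llama-2-7b-chat": {"faithfulness": 38.69},
    "Vicuna-7b-v1.5": {"factuality": 51.42},
    "Mistral-7B": {"faithfulness": 38.62, "factuality": 55.41},
    "Mistral-7B-Instruct": {"faithfulness": 43.26, "factuality": 50.74},
}

# (tuned model, baseline) pairs as listed in the Results section.
pairs = [
    ("Llama-2-7b-chat", "Llama-2-7b"),
    ("Vicuna-7b-v1.5", "Llama-2-7b"),
    ("Mistral-7B-Instruct", "Mistral-7B"),
]


def deltas(tuned, base):
    """Return tuned-minus-base for every metric reported for both models."""
    shared = results[tuned].keys() & results[base].keys()
    return {
        metric: round(results[tuned][metric] - results[base][metric], 2)
        for metric in sorted(shared)
    }


for tuned, base in pairs:
    print(f"{tuned} vs {base}: {deltas(tuned, base)}")
```

The Mistral pair shows the trade-off directly: faithfulness rises while factuality falls for the same instruction-tuned model.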

Who Should Care

What To Try In 7 Days

Run the leaderboard (Hugging Face space) on your candidate open-source models to get faithfulness and factuality baselines

If you need context grounding (summaries, QA on given docs), evaluate instruction-tuned variants as they usually increase faithfulness

Measure factuality separately; do not assume instruction tuning improves factual correctness. Run a closed-book QA test set you care about before deciding.
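For the closed-book QA check suggested above, exact match (EM) is the usual metric. The sketch below is a minimal EM scorer under stated assumptions: the normalisation (lowercasing, stripping punctuation and English articles) follows a common SQuAD-style convention and is not necessarily what the leaderboard's harness applies, and the questions and answers are toy examples.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """Return 1.0 if the prediction matches any gold answer, else 0.0."""
    target = normalize(prediction)
    return float(any(target == normalize(gold) for gold in gold_answers))


def em_score(predictions, references):
    """Return mean exact match over a test set, as a percentage."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("need equal-length, non-empty inputs")
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


# Toy model outputs and gold answers, for illustration only.
predictions = ["The Eiffel Tower.", "1969", "Jupiter"]
references = [["Eiffel Tower"], ["1969", "July 1969"], ["Saturn"]]

score = em_score(predictions, references)
print(f"EM: {score:.1f}")
```

Running the same scorer on a base model and its instruction-tuned variant gives a factuality comparison that is independent of any faithfulness gain.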

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Fixed prompt templates for most tasks; only limited prompt robustness checks were run
  • Shot counts for in-context learning were not broadly explored; results may change with different numbers of demonstrations
  • Only open-source models were evaluated; closed-source models (e.g. GPT-4) were not included
  • Some datasets can contain demographic or collection biases that affect reported scores

When Not To Use

  • For final high-stakes factual verification without human oversight; the leaderboard is a diagnostic tool, not a full safety system
  • If you must evaluate closed-source models, which were excluded from this work
  • If your application uses non-standard prompts or many-shot setups not covered by the templates

Failure Modes

  • Aggregate scores can hide per-task weaknesses; a model with a decent average may fail critical tasks
  • Instruction tuning can make models more faithful to input but simultaneously reduce factual accuracy
  • Prompt sensitivity outside tested paraphrases may flip results
  • Dataset biases in benchmarks can misrepresent real-world performance
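The first failure mode above, an acceptable average hiding a weak task, can be guarded against with a per-task floor. The sketch below is illustrative only: the task names, scores, and the threshold of 30.0 are hypothetical and would need to be set per application.

```python
# Hypothetical per-task scores (0-100) for one model.
# Placeholders for illustration; not results from the paper.
task_scores = {"task_a": 70.0, "task_b": 65.0, "task_c": 15.0}


def weak_tasks(scores, floor):
    """Return the names of tasks scoring below the floor, sorted."""
    return sorted(task for task, score in scores.items() if score < floor)


average = sum(task_scores.values()) / len(task_scores)
flagged = weak_tasks(task_scores, floor=30.0)

print(f"average={average:.1f}, below floor: {flagged}")
```

Here the average looks moderate, yet one task falls well below the floor, which is the case an aggregate-only comparison would miss.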

Core Entities

Models

  • Llama-2-7b
  • Llama-2-7b-chat
  • Llama-2-13b
  • Llama-2-13b-chat
  • Vicuna-7b-v1.5
  • Mistral-7B-v0.1
  • Mistral-7B-Instruct-v0.1
  • Falcon-7b
  • Falcon-7b-Instruct
  • OpenHermes-2.5-Mistral-7B
  • Zephyr-7b-beta
  • GPT-J-6B
  • GPT-Neo-125M
  • GPT-Neo-1.3B
  • GPT-Neo-2.7B
  • Bloom-560M
  • Bloom-1b7
  • Bloom-7b1

Metrics

  • EM
  • ROUGE-L
  • Accuracy
  • MC2

Datasets

  • Natural Questions
  • TriviaQA
  • PopQA
  • TruthfulQA (MC2)
  • FEVER
  • True-False
  • XSum
  • CNN/DailyMail
  • RACE
  • SQuAD v2
  • NQ-Swap
  • MemoTrap
  • IFEval
  • FaithDial
  • HaluEval (QA)
  • FactKB

Benchmarks

  • Natural Questions
  • TriviaQA
  • PopQA
  • TruthfulQA
  • FEVER
  • True-False
  • XSum
  • CNN/DM
  • RACE
  • SQuADv2
  • NQ-Swap
  • MemoTrap
  • IFEval
  • FaithDial
  • HaluEval
  • FactKB