FACT-BENCH: a 20K-question benchmark that reveals when LLMs forget facts and how exemplars can make them lie

April 24, 2024 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.3

Citation Count

0

Authors

Jiaqing Yuan, Lin Pan, Chung-Wei Hang, Jiang Guo, Jiarong Jiang, Bonan Min, Patrick Ng, Zhiguo Wang

Why It Matters For Business

FACT-BENCH reveals real-world gaps in LLM factual recall: larger models help, but instruction tuning and bad exemplars can reduce factual accuracy. Test models on verified QA sets, and avoid training or prompting with incorrect facts.

Summary TLDR

The authors build FACT-BENCH, a 20k closed-book QA benchmark (20 domains, 134 property types, entities/dates/numbers) to test what facts LLMs recall from pretraining. They evaluate 31 models and find: larger models recall more facts; instruction-tuning often reduces factual recall; counterfactual in-context examples that contradict a model's existing knowledge can sharply break large models; and fine-tuning on knowledge the model already knows helps, while fine-tuning on unknown facts harms factuality. The benchmark and controlled tests expose practical risks when using exemplars or mixed/unknown training data.

Problem Statement

We need a broad, clean test of whether LLMs actually remember factual knowledge learned in pretraining. Existing benchmarks use limited properties or rigid templates and leave four questions open: coverage across domains and answer types, the effect of instruction-tuning, sensitivity to in-context exemplars that conflict with model memory, and how fine-tuning on known vs. unknown facts changes factuality.

Main Contribution

FACT-BENCH: a 20K closed-book QA benchmark covering 20 domains, 134 properties, and 3 answer types (entities, dates, numbers).

A large-scale evaluation of 31 models across 10 families, showing scaling benefits and instruction-tuning costs on factual recall.

Controlled counterfactual in-context learning (ICL) experiments that isolate when exemplars damage factual recall.

Fine-tuning experiments that compare training on examples the model already knows vs unknown or mixed examples.

Key Findings

Model scale improves factual recall.

Numbers: GPT-4 10-shot EM 65.90% vs GPT-3.5 10-shot EM 53.55% (Δ +12.35)

Instruction-tuned models often score lower on factual recall than their pretraining-only versions.

Numbers: LLaMA-33B 10-shot EM 48.90% vs Vicuna-33B 10-shot EM 44.00% (Δ -4.9)

Counterfactual in-context exemplars that contradict a model's known facts can cause large drops in recall.

Numbers: LLaMA-65B 10-shot EM drops 52.45% → 29.45% (Δ -23.0); Falcon-180B 53.45% → 37.05% (Δ -16.4)

Degradation from counterfactual exemplars is strongest when exemplars contradict model-known facts and grows with the number of such exemplars.

Numbers: known-shuffle LLaMA-65B 10-shot 52.45% → 26.60% vs unknown-shuffle 52.45% → 42.90%; larger k magnifies the gap
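The summary does not spell out how the counterfactual exemplars are built. As a rough illustration only, the sketch below assumes that "shuffle" means permuting answers among the few-shot exemplars so that every question ends up paired with a different question's answer. The function names `shuffle_answers` and `build_prompt`, the `Q:`/`A:` prompt layout, and the dictionary keys are all hypothetical, not taken from the paper.

```python
import random


def shuffle_answers(exemplars, seed=0, max_tries=1000):
    """Return new exemplars whose answers are permuted so none keeps its own.

    Assumption: each exemplar is a dict with "question" and "answer" keys.
    """
    if len(exemplars) < 2:
        raise ValueError("need at least two exemplars to swap answers")
    rng = random.Random(seed)
    answers = [ex["answer"] for ex in exemplars]
    order = list(range(len(exemplars)))
    # Reshuffle until every question receives an answer that differs from
    # its original one (comparison is on answer text, so duplicates are safe).
    for _ in range(max_tries):
        rng.shuffle(order)
        if all(answers[j] != answers[i] for i, j in enumerate(order)):
            break
    else:
        raise ValueError("could not find a permutation with all answers changed")
    return [
        {"question": ex["question"], "answer": answers[j]}
        for ex, j in zip(exemplars, order)
    ]


def build_prompt(exemplars, question):
    """Format exemplars plus the test question as a simple Q/A few-shot prompt."""
    blocks = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)
```

Running the same test questions through a prompt built from the original exemplars and one built from `shuffle_answers(...)` gives a regular vs. counterfactual comparison. Drawing the exemplars from questions the model answers correctly, or from ones it misses, would approximate the known vs. unknown contrast described above.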

Fine-tuning on facts the model already knows improves factual recall; fine-tuning on unknown facts hurts and counterfactual fine-tuning is damaging.

Numbers: fine-tune on known EM 33.00% vs unknown EM 27.55% (Table 4); counterfactual fine-tune EM 10.75% vs regular fine-tune 28.75%

Even the best model falls well short of the human-validated upper bound.

Numbers: GPT-4 10-shot EM 65.90% vs PREMIUM2K human upper-bound ≈ 90%

Results

  • 10-shot Exact Match (EM): GPT-4 65.90%
  • 10-shot Exact Match (EM): GPT-3.5-turbo 53.55%
  • 10-shot Exact Match (EM): LLaMA-65B 52.45%
  • 10-shot Exact Match (EM): LLaMA-33B 48.90%
  • Effect of counterfactual exemplars (10-shot EM drop): LLaMA-65B −23.0 pp (baseline: regular 10-shot EM 52.45%)
  • Fine-tune known vs unknown (zero-shot EM): Known 33.00% vs Unknown 27.55%
  • Human upper-bound (PREMIUM2K): ≈ 90% EM

What To Try In 7 Days

Run your model on a subset of FACT-BENCH or PREMIUM2K to measure factual EM and 'Contains' metrics.

Audit few-shot exemplars: remove or correct any exemplar that contradicts known facts the model answers correctly.

If fine-tuning, create a ‘known-good’ subset (examples the model already answers correctly) and prioritize it over unknown/mixed data.
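The second and third steps can be sketched roughly as follows. This is a minimal illustration, not the paper's procedure: `simple_match`, `split_known_unknown`, `audit_exemplars`, and `stub_model` are hypothetical names, the record layout is assumed, and `stub_model` stands in for whatever call returns your model's closed-book answer.

```python
def simple_match(prediction, gold_answers):
    """Case-insensitive exact match; swap in a stricter or looser metric as needed."""
    pred = prediction.strip().lower()
    return any(pred == gold.strip().lower() for gold in gold_answers)


def split_known_unknown(examples, answer_fn, is_correct=simple_match):
    """Split QA examples by whether the model already answers them correctly.

    Assumption: each example is {"question": str, "answers": [str, ...]} and
    answer_fn(question) returns the model's closed-book answer as a string.
    """
    known, unknown = [], []
    for ex in examples:
        prediction = answer_fn(ex["question"])
        (known if is_correct(prediction, ex["answers"]) else unknown).append(ex)
    return known, unknown


def audit_exemplars(exemplars, verified_answers, is_correct=simple_match):
    """Return few-shot exemplars whose answer disagrees with a verified reference.

    Assumption: verified_answers maps question -> list of accepted answers.
    Exemplars with no reference entry are left unflagged.
    """
    flagged = []
    for ex in exemplars:
        gold = verified_answers.get(ex["question"])
        if gold is not None and not is_correct(ex["answer"], gold):
            flagged.append(ex)
    return flagged


def stub_model(question):
    """Placeholder for a real model call; returns canned answers for the demo."""
    canned = {"What is the capital of France?": "Paris"}
    return canned.get(question, "I don't know")
```

Note that "known" here is only a proxy: an example counts as known when the model happens to answer it correctly under the chosen metric, which is the same approximation listed under Limitations.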

Reproducibility

Data Urls

  • planned public release of FACT-BENCH (paper states intent)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Model selection limited to representative and available models at test time; not exhaustive (Section 8).
  • Known vs unknown knowledge approximated by whether the model answered correctly; this proxy can misclassify edge cases (Section 8).
  • Benchmark is closed-book: it measures parametric recall, not retrieval-augmented or multi-hop reasoning.

When Not To Use

  • To evaluate retrieval-augmented systems or open-book QA.
  • For complex multi-hop or reasoning-heavy tasks where recall is not the bottleneck.
  • When pretraining corpora differ sharply from Wikipedia-based grounding assumptions.

Failure Modes

  • Instruction tuning reduces exact-match factual recall for some families.
  • Feeding counterfactual exemplars can teach models to repeat false answers.
  • Fine-tuning on unknown or corrupted facts increases hallucinations.

Core Entities

Models

  • GPT-4
  • GPT-3.5-turbo
  • LLaMA-7B
  • LLaMA-13B
  • LLaMA-33B
  • LLaMA-65B
  • Vicuna-7B
  • Vicuna-13B
  • Vicuna-33B
  • BLOOM-7B
  • BLOOMZ-7B
  • FLAN-T5-XXL
  • T0++
  • UL2
  • FLAN-UL2
  • Falcon-7B
  • Falcon-40B
  • Falcon-180B
  • MPT-7B
  • MPT-30B
  • Pythia-6.9B
  • Pythia-12B
  • Mistral-7B

Metrics

  • Exact Match (EM)
  • F1
  • Contains (substring match)
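
For readers who want to reproduce these three metrics locally, here is a minimal sketch. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the summary does not state the paper's exact normalization, so scores from this code may not match the reported numbers. All function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def contains(prediction, gold_answers):
    """True if any non-empty normalized gold answer is a substring of the prediction."""
    pred = normalize(prediction)
    golds = (normalize(gold) for gold in gold_answers)
    return any(gold and gold in pred for gold in golds)


def token_f1(prediction, gold_answers):
    """Best token-level F1 between the prediction and any gold answer."""
    pred_tokens = normalize(prediction).split()
    best = 0.0
    for gold in gold_answers:
        gold_tokens = normalize(gold).split()
        # Multiset intersection counts each shared token at most min(count) times.
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Contains is the most lenient of the three: a verbose answer such as "It is Paris, France" passes Contains for the gold answer "Paris" but fails Exact Match, which is one reason to report both.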

Datasets

  • FACT-BENCH
  • PREMIUM2K (2k human-validated subset)

Benchmarks

  • FACT-BENCH (20k closed-book QA)