FACT-BENCH: a 20K-question benchmark that reveals when LLMs forget facts and how exemplars can make them lie

April 24, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The benchmark is well designed and includes a 2k human-validated subset (PREMIUM2K). Results hold across 31 models and multiple controlled tests, but conclusions are limited to closed-book factual recall and rely on a proxy for known vs unknown facts: whether the model already answers the question correctly.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/7

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 60%

Authors

Jiaqing Yuan, Lin Pan, Chung-Wei Hang, Jiang Guo, Jiarong Jiang, Bonan Min, Patrick Ng, Zhiguo Wang

Links

Abstract / PDF / Data

Why It Matters For Business

FACT-BENCH reveals real-world gaps in LLM factual recall: larger models help, but instruction tuning and bad exemplars can reduce factual accuracy. Test models on verified QA sets and avoid training or prompting with incorrect facts.

Who Should Care

Summary TLDR

The authors build FACT-BENCH, a 20k closed-book QA benchmark (20 domains, 134 property types, entities/dates/numbers) to test what facts LLMs recall from pretraining. They evaluate 31 models and find: larger models recall more facts; instruction-tuning often reduces factual recall; counterfactual in-context examples that contradict a model's existing knowledge can sharply break large models; and fine-tuning on knowledge the model already knows helps, while fine-tuning on unknown facts harms factuality. The benchmark and controlled tests expose practical risks when using exemplars or mixed/unknown training data.

Problem Statement

We need a broad, clean test of whether LLMs actually remember factual knowledge learned in pretraining. Existing benchmarks use limited properties or rigid templates and leave open questions about coverage across domains and answer types, the effects of instruction-tuning, sensitivity to in-context exemplars that conflict with model memory, and how fine-tuning on known vs unknown facts changes factuality.

Main Contribution

FACT-BENCH: a 20K closed-book QA benchmark covering 20 domains, 134 properties, and 3 answer types (entities, dates, numbers).

A large-scale evaluation of 31 models across 10 families, showing scaling benefits and instruction-tuning costs on factual recall.

Key Findings

Model scale improves factual recall.

Numbers: GPT-4 10-shot EM 65.90% vs GPT-3.5 10-shot EM 53.55% (+12.35 points)

Practical Use: Prefer larger models when factual recall matters; expect double-digit EM gains moving from GPT-3.5 to GPT-4 on this benchmark.

Evidence Ref: Table 1 (10-shot EM)

Instruction-tuned models often score lower on factual recall than their pretraining-only versions.

Numbers: LLaMA-33B 10-shot EM 48.90% vs Vicuna-33B 10-shot EM 44.00% (-4.90 points)

Practical Use: If you need raw factual memory, test both pretraining-only and instruction-tuned variants; alignment can reduce exact-match accuracy.

Evidence Ref: Table 1 (10-shot EM)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
10-shot Exact Match (EM) | GPT-4 65.90% | - | - | PREMIUM2K (2k validated subset) | Table 1 (10-shot EM) | Table 1
10-shot Exact Match (EM) | GPT-3.5-turbo 53.55% | - | - | PREMIUM2K | Table 1 (10-shot EM) | Table 1
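The rows above leave the Delta column empty. As a small worked example, the deltas quoted under Key Findings can be recomputed from the reported 10-shot EM scores. This is only a sketch: the `scores` dictionary and the `delta_points` helper are illustrative names, and the values are copied from this summary, not re-read from the paper's Table 1. `Decimal` is used so the subtraction is exact.

```python
from decimal import Decimal

# 10-shot EM scores (percent) as quoted in this summary.
scores = {
    "GPT-4": Decimal("65.90"),
    "GPT-3.5-turbo": Decimal("53.55"),
    "LLaMA-33B": Decimal("48.90"),
    "Vicuna-33B": Decimal("44.00"),
}


def delta_points(model: str, baseline: str) -> Decimal:
    """Return the EM difference (model minus baseline) in percentage points."""
    return scores[model] - scores[baseline]


print(delta_points("GPT-4", "GPT-3.5-turbo"))   # 12.35
print(delta_points("Vicuna-33B", "LLaMA-33B"))  # -4.90
```

The differences are absolute percentage points, not relative improvements.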

What To Try In 7 Days

Run your model on a subset of FACT-BENCH or PREMIUM2K to measure factual EM and 'Contains' metrics.

Audit few-shot exemplars: remove or correct any exemplar that contradicts known facts the model answers correctly.

If fine-tuning, create a ‘known-good’ subset (examples the model already answers correctly) and prioritize it over unknown/mixed data.
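The exemplar audit in the second step could look roughly like the sketch below. It assumes you keep a verified mapping from question to accepted answers. The `reference_answers` structure, the `audit_exemplars` helper, and the simple case-folding comparison are all illustrative choices, not something specified by the paper.

```python
def normalize(text: str) -> str:
    """Case-fold and collapse whitespace (a deliberately simple comparison rule)."""
    return " ".join(text.casefold().split())


def audit_exemplars(exemplars, reference_answers):
    """Split few-shot exemplars into (kept, flagged).

    exemplars: list of (question, answer) pairs used in the prompt.
    reference_answers: dict mapping question -> list of accepted answers
        (hypothetical structure; adapt it to your verified QA set).
    """
    kept, flagged = [], []
    for question, answer in exemplars:
        accepted = reference_answers.get(question)
        if accepted is None:
            # Cannot verify this exemplar, so surface it for manual review.
            flagged.append((question, answer, "no reference answer"))
        elif normalize(answer) in {normalize(a) for a in accepted}:
            kept.append((question, answer))
        else:
            flagged.append((question, answer, "contradicts reference"))
    return kept, flagged


# Toy data for illustration only.
exemplars = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Kyoto"),
]
references = {
    "What is the capital of France?": ["Paris"],
    "What is the capital of Japan?": ["Tokyo"],
}
kept, flagged = audit_exemplars(exemplars, references)
for question, answer, reason in flagged:
    print(f"{reason}: {question!r} -> {answer!r}")
```

Exemplars without a reference are flagged, not silently kept, since an unverifiable exemplar may still be counterfactual.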

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

planned public release of FACT-BENCH (paper states intent)

Risks & Boundaries

Limitations

Model selection limited to representative and available models at test time; not exhaustive (Section 8).

Known vs unknown knowledge approximated by whether the model answered correctly; this proxy can misclassify edge cases (Section 8).
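A minimal sketch of this proxy is shown below: an example counts as "known" if the model's closed-book answer matches an accepted answer, and "unknown" otherwise. The `answer_fn` callable, the example dictionary layout, and the case-folding match rule are assumptions made for illustration; the paper's actual procedure may differ.

```python
def normalize(text: str) -> str:
    """Case-fold and collapse whitespace before comparing answers."""
    return " ".join(text.casefold().split())


def split_known_unknown(examples, answer_fn):
    """Partition examples by whether the model already answers them correctly.

    examples: list of dicts with "question" (str) and "answers" (list of str).
    answer_fn: hypothetical callable mapping a question to the model's answer.
    """
    known, unknown = [], []
    for example in examples:
        prediction = normalize(answer_fn(example["question"]))
        accepted = {normalize(a) for a in example["answers"]}
        if prediction in accepted:
            known.append(example)
        else:
            unknown.append(example)
    return known, unknown


# Stub standing in for a real model call, for illustration only.
canned = {"What is the capital of France?": "paris"}


def stub_model(question: str) -> str:
    return canned.get(question, "")


examples = [
    {"question": "What is the capital of France?", "answers": ["Paris"]},
    {"question": "What is the capital of Peru?", "answers": ["Lima"]},
]
known, unknown = split_known_unknown(examples, stub_model)
print(len(known), len(unknown))  # 1 1
```

Because the split depends on a single sampled answer and a string match, a lucky guess lands in "known" and a correct but differently phrased answer lands in "unknown", which is the misclassification risk the limitation describes.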

When Not To Use

To evaluate retrieval-augmented systems or open-book QA.

For complex multi-hop or reasoning-heavy tasks where recall is not the bottleneck.

Failure Modes

Instruction tuning reduces exact-match factual recall for some families.

Feeding counterfactual exemplars can teach models to repeat false answers.

Core Entities

Models

GPT-4, GPT-3.5-turbo, LLaMA-7B, LLaMA-13B, LLaMA-33B, LLaMA-65B, Vicuna-7B, Vicuna-13B, Vicuna-33B, BLOOM-7B, BLOOMZ-7B, FLAN-T5-XXL, T0++, UL2, FLAN-UL2, Falcon-7B, Falcon-40B, Falcon-180B, MPT-7B, MPT-30B, Pythia-6.9B, Pythia-12B, Mistral-7B

Metrics

Exact Match (EM), F1, Contains (substring match)
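For orientation, the three metrics can be sketched as follows. This assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles) and token-level F1; the summary does not state the paper's exact normalization rules, so treat the function names and details here as assumptions.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace.

    Assumption: SQuAD-style rules. Note that stripping punctuation also
    alters numbers such as "3.14", so numeric answers may need special care.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: list) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


def contains(prediction: str, references: list) -> bool:
    """True if any non-empty normalized reference is a substring of the prediction."""
    pred = normalize(prediction)
    normalized_refs = (normalize(ref) for ref in references)
    return any(ref and ref in pred for ref in normalized_refs)


def token_f1(prediction: str, references: list) -> float:
    """Best token-overlap F1 between the prediction and any reference."""
    pred_tokens = normalize(prediction).split()
    best = 0.0
    for ref in references:
        ref_tokens = normalize(ref).split()
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))  # True
print(contains("It is in Paris, France", ["Paris"]))       # True
print(exact_match("It is in Paris, France", ["Paris"]))    # False
```

Contains is the most lenient of the three: a verbose answer that includes the reference string passes Contains but fails EM, which is one reason to report both.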

Datasets

FACT-BENCH, PREMIUM2K (2k human-validated subset)

Benchmarks

FACT-BENCH (20k closed-book QA)