FACT-BENCH: a 20K-question benchmark that reveals when LLMs forget facts and how exemplars can make them lie

Overview

Decision Snapshot: Needs Validation

The benchmark is well-designed and validated (2k human-checked subset). Results are robust across 31 models and multiple controlled tests, but conclusions are limited to closed-book factual recall and rely on a proxy (whether the model already answers a question correctly) to separate known from unknown facts.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/7

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 60%

Authors

Jiaqing Yuan, Lin Pan, Chung-Wei Hang, Jiang Guo, Jiarong Jiang, Bonan Min, Patrick Ng, Zhiguo Wang

Links

Abstract / PDF / Data

Why It Matters For Business

FACT-BENCH reveals real-world gaps in LLM factual recall: larger models help, but instruction tuning and bad exemplars can reduce factual accuracy. Test models on verified QA sets, and avoid training or prompting with incorrect facts.

Who Should Care

ML Engineer, Product Manager, Data Scientist, CTO

Summary TLDR

The authors build FACT-BENCH, a 20K-question closed-book QA benchmark (20 domains, 134 property types, and three answer types: entities, dates, numbers) to test what facts LLMs recall from pretraining. They evaluate 31 models and find: larger models recall more facts; instruction-tuning often reduces factual recall; counterfactual in-context examples that contradict a model's existing knowledge can sharply degrade the accuracy of large models; and fine-tuning on facts the model already knows helps, while fine-tuning on unknown facts harms factuality. The benchmark and controlled tests expose practical risks when using exemplars or mixed/unknown training data.

Problem Statement

We need a broad, clean test of whether LLMs actually remember factual knowledge learned in pretraining. Existing benchmarks use limited properties or rigid templates and leave questions about: coverage across domains and answer types, effects of instruction-tuning, sensitivity to in-context exemplars that conflict with model memory, and how fine-tuning on known vs unknown facts changes factuality.

Main Contribution

FACT-BENCH: a 20K closed-book QA benchmark covering 20 domains, 134 properties, and 3 answer types (entities, dates, numbers).

A large-scale evaluation of 31 models across 10 families, showing scaling benefits and instruction-tuning costs on factual recall.

Key Findings

Model scale improves factual recall.

Numbers: GPT-4 10-shot EM 65.90% vs GPT-3.5 10-shot EM 53.55% (Δ +12.35)

Practical Use: Prefer larger models when factual recall matters; expect double-digit EM gains moving from GPT-3.5 to GPT-4 on this benchmark.

Evidence Ref: Table 1 (10-shot EM)

Instruction-tuned models often score lower on factual recall than their pretraining-only versions.

Numbers: LLaMA-33B 10-shot EM 48.90% vs Vicuna-33B 10-shot EM 44.00% (Δ -4.90)

Practical Use: If you need raw factual memory, test both pretraining-only and instruction-tuned variants; alignment can reduce exact-match accuracy.

Evidence Ref: Table 1 (10-shot EM)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
10-shot Exact Match (EM)	GPT-4 65.90%	—	—	PREMIUM2K (2k validated subset)	Table 1 (10-shot EM)	Table 1
10-shot Exact Match (EM)	GPT-3.5-turbo 53.55%	—	—	PREMIUM2K	Table 1 (10-shot EM)	Table 1
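
The Delta column above is empty, but the deltas quoted under Key Findings can be re-derived from the reported 10-shot EM scores. A minimal sketch of that arithmetic follows; the `scores` dictionary and the `em_delta` helper are illustrative names, not part of the paper or any released code, and only the four scores quoted in this summary are used.

```python
# 10-shot EM scores (percent) as quoted in this summary (Table 1 of the paper).
scores = {
    "GPT-4": 65.90,
    "GPT-3.5-turbo": 53.55,
    "LLaMA-33B": 48.90,
    "Vicuna-33B": 44.00,
}


def em_delta(model: str, baseline: str) -> float:
    """Return the EM difference (model minus baseline) in percentage points.

    Rounded to two decimals to hide floating-point noise.
    """
    return round(scores[model] - scores[baseline], 2)


# Scale effect: GPT-4 relative to GPT-3.5-turbo.
print("GPT-4 vs GPT-3.5-turbo:", em_delta("GPT-4", "GPT-3.5-turbo"))

# Instruction-tuning effect: Vicuna-33B relative to its base model LLaMA-33B.
print("Vicuna-33B vs LLaMA-33B:", em_delta("Vicuna-33B", "LLaMA-33B"))
```

The deltas are differences in percentage points, not relative changes, which is how the Δ values in Key Findings are expressed.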

What To Try In 7 Days

Run your model on a subset of FACT-BENCH or PREMIUM2K to measure factual EM and 'Contains' metrics.

Audit few-shot exemplars: remove or correct any exemplar that contradicts known facts the model answers correctly.

If fine-tuning, create a ‘known-good’ subset (examples the model already answers correctly) and prioritize it over unknown/mixed data.
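
The exemplar audit and the known-good split above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' procedure: the record shape (`question`/`answer` keys), the helpers `split_known_unknown` and `audit_exemplars`, and the `toy_model` stand-in are all hypothetical, and the case-insensitive string comparison is a simplification of whatever matching rule you adopt. In practice `answer_fn` would wrap a call to your own model.

```python
from typing import Callable, Dict, List, Tuple

QA = Dict[str, str]  # hypothetical record shape: {"question": ..., "answer": ...}


def same_answer(a: str, b: str) -> bool:
    """Simplified match: case-insensitive comparison after trimming whitespace."""
    return a.strip().lower() == b.strip().lower()


def split_known_unknown(
    examples: List[QA], answer_fn: Callable[[str], str]
) -> Tuple[List[QA], List[QA]]:
    """Partition examples by whether the model already answers them correctly.

    This is the same proxy for "known" facts described under Limitations,
    so it can misclassify edge cases (lucky guesses, formatting mismatches).
    """
    known, unknown = [], []
    for ex in examples:
        if same_answer(answer_fn(ex["question"]), ex["answer"]):
            known.append(ex)
        else:
            unknown.append(ex)
    return known, unknown


def audit_exemplars(exemplars: List[QA], verified: Dict[str, str]) -> List[QA]:
    """Return exemplars whose answer contradicts a verified reference answer."""
    flagged = []
    for ex in exemplars:
        reference = verified.get(ex["question"])
        if reference is not None and not same_answer(reference, ex["answer"]):
            flagged.append(ex)
    return flagged


# Toy stand-in for a real model call (assumption for illustration only).
TOY_MEMORY = {"What is the capital of France?": "Paris"}


def toy_model(question: str) -> str:
    return TOY_MEMORY.get(question, "unknown")


examples = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is the capital of Australia?", "answer": "Canberra"},
]
known, unknown = split_known_unknown(examples, toy_model)

exemplars = [{"question": "What is the capital of France?", "answer": "Lyon"}]
verified = {"What is the capital of France?": "Paris"}
flagged = audit_exemplars(exemplars, verified)

print(len(known), "known;", len(unknown), "unknown;", len(flagged), "exemplar(s) flagged")
```

Exemplars with no verified reference are left unflagged here; a stricter audit might route them to manual review instead.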

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

No URL listed; the paper states an intent to release FACT-BENCH publicly.

Risks & Boundaries

Limitations

Model selection limited to representative and available models at test time; not exhaustive (Section 8).

Known vs unknown knowledge approximated by whether the model answered correctly; this proxy can misclassify edge cases (Section 8).

When Not To Use

To evaluate retrieval-augmented systems or open-book QA.

For complex multi-hop or reasoning-heavy tasks where recall is not the bottleneck.

Failure Modes

Instruction tuning reduces exact-match factual recall for some families.

Feeding counterfactual exemplars can teach models to repeat false answers.

Core Entities

Models

GPT-4, GPT-3.5-turbo, LLaMA-7B, LLaMA-13B, LLaMA-33B, LLaMA-65B, Vicuna-7B, Vicuna-13B, Vicuna-33B, BLOOM-7B, BLOOMZ-7B, FLAN-T5-XXL, T0++, UL2, FLAN-UL2, Falcon-7B, Falcon-40B, Falcon-180B, MPT-7B, MPT-30B, Pythia-6.9B, Pythia-12B, Mistral-7B

Metrics

Exact Match (EM), F1, Contains (substring match)
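
For readers who want to score their own outputs, here is a rough sketch of the three metrics. The normalization (lowercasing, stripping punctuation and English articles) follows the common SQuAD-style convention and is an assumption; the paper's exact normalization and answer-alias handling may differ. The function names are illustrative.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, golds: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    p = normalize(pred)
    return any(p == normalize(g) for g in golds)


def contains(pred: str, golds: List[str]) -> bool:
    """True if any non-empty normalized gold answer is a substring of the prediction."""
    p = normalize(pred)
    return any(ng and ng in p for ng in (normalize(g) for g in golds))


def f1(pred: str, golds: List[str]) -> float:
    """Best token-level F1 between the prediction and any gold answer."""
    p_tokens = normalize(pred).split()
    best = 0.0
    for g in golds:
        g_tokens = normalize(g).split()
        overlap = sum((Counter(p_tokens) & Counter(g_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(p_tokens)
        recall = overlap / len(g_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


gold = ["Eiffel Tower"]
print(exact_match("The Eiffel Tower", gold))              # short answer
print(contains("It is the Eiffel Tower in Paris.", gold))  # verbose answer
print(f1("Eiffel Tower Paris", gold))                      # partial overlap
```

Contains is more forgiving than EM for verbose answers, which matters when comparing instruction-tuned models that tend to answer in full sentences against base models that emit short spans.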

Datasets

FACT-BENCH, PREMIUM2K (2k human-validated subset)

Benchmarks

FACT-BENCH (20k closed-book QA)
