Overview
The benchmark is well designed and validated on a 2k human-checked subset. Results are robust across 31 models and multiple controlled tests, but the conclusions are limited to closed-book factual recall and rely on a proxy (whether the model answers correctly) to separate facts the model knows from facts it does not.
Citations: 0
Evidence Strength: 0.85
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 3/7
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 30%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
FACT-BENCH reveals real-world gaps in LLM factual recall: larger models help, but instruction tuning and bad exemplars can reduce factual accuracy. Test models on verified QA sets, and avoid training or prompting with incorrect facts.
Summary TLDR
The authors build FACT-BENCH, a 20k closed-book QA benchmark (20 domains, 134 property types, and entity, date, and number answers) to test which facts LLMs recall from pretraining. They evaluate 31 models and find that larger models recall more facts; instruction tuning often reduces factual recall; counterfactual in-context examples that contradict a model's existing knowledge can sharply degrade the accuracy of large models; and fine-tuning on facts the model already knows helps, while fine-tuning on unknown facts harms factuality. The benchmark and controlled tests expose practical risks when using exemplars or mixed/unknown training data.
Problem Statement
We need a broad, clean test of whether LLMs actually remember factual knowledge learned in pretraining. Existing benchmarks use limited properties or rigid templates and leave open questions about coverage across domains and answer types, the effects of instruction tuning, sensitivity to in-context exemplars that conflict with model memory, and how fine-tuning on known vs unknown facts changes factuality.
Main Contribution
FACT-BENCH: a 20K closed-book QA benchmark covering 20 domains, 134 properties, and 3 answer types (entities, dates, numbers).
A large-scale evaluation of 31 models across 10 families, showing scaling benefits and instruction-tuning costs on factual recall.
Key Findings
Model scale improves factual recall.
Instruction-tuned models often score lower on factual recall than their pretraining-only versions.
Results
| Model | Metric | Value | Baseline | Delta | Split / Dataset | Evidence |
|---|---|---|---|---|---|---|
| GPT-4 | 10-shot Exact Match (EM) | 65.90% | n/a | n/a | PREMIUM2K (2k validated subset) | Table 1 (10-shot EM) |
| GPT-3.5-turbo | 10-shot Exact Match (EM) | 53.55% | n/a | n/a | PREMIUM2K (2k validated subset) | Table 1 (10-shot EM) |
What To Try In 7 Days
Run your model on a subset of FACT-BENCH or PREMIUM2K to measure factual EM and 'Contains' metrics.
Audit few-shot exemplars: remove or correct any exemplar that contradicts known facts the model answers correctly.
If fine-tuning, create a ‘known-good’ subset (examples the model already answers correctly) and prioritize it over unknown/mixed data.
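As a starting point for the first step above, here is a minimal sketch of how EM and 'Contains' scores might be computed. The normalization (lowercasing, stripping punctuation and English articles) is an assumption borrowed from common QA evaluation practice, not the benchmark's confirmed scoring code, and the function names are hypothetical.

```python
import re
import string

_PUNCT = set(string.punctuation)

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed scheme)."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)

def contains(prediction: str, gold_answers: list[str]) -> bool:
    """True if any non-empty normalized gold answer is a substring of the prediction."""
    pred = normalize(prediction)
    return any(normalize(gold) in pred for gold in gold_answers if normalize(gold))

def score(predictions: list[str], references: list[list[str]]) -> dict[str, float]:
    """Average EM and Contains over paired predictions and gold-answer lists."""
    n = len(predictions)
    em = sum(exact_match(p, r) for p, r in zip(predictions, references)) / n
    ct = sum(contains(p, r) for p, r in zip(predictions, references)) / n
    return {"em": em, "contains": ct}

# Toy data for illustration only; not taken from FACT-BENCH.
preds = ["Paris", "It was in 1969.", "Berlin"]
golds = [["Paris"], ["1969"], ["Munich"]]
result = score(preds, golds)
```

Substring matching is lenient: a short gold answer such as a single digit would match many unrelated predictions, so treat 'Contains' as an upper bound and spot-check the matches by hand.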
Risks & Boundaries
Limitations
Model selection limited to representative and available models at test time; not exhaustive (Section 8).
Known vs unknown knowledge approximated by whether the model answered correctly; this proxy can misclassify edge cases (Section 8).
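The known-vs-unknown proxy described above can be sketched as a simple partition: an example counts as "known" if the model's closed-book answer matches a gold answer. The record fields (`prediction`, `answers`) and the case-insensitive comparison are illustrative assumptions, not the authors' implementation.

```python
def split_known_unknown(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition records by whether the model's prediction matches a gold answer.

    Each record is assumed to hold a 'prediction' string and an 'answers' list
    (hypothetical field names chosen for this sketch).
    """
    known, unknown = [], []
    for rec in records:
        pred = rec["prediction"].strip().lower()
        golds = {g.strip().lower() for g in rec["answers"]}
        (known if pred in golds else unknown).append(rec)
    return known, unknown

# Toy records for illustration only.
records = [
    {"question": "Capital of France?", "answers": ["Paris"], "prediction": "paris"},
    {"question": "Capital of Australia?", "answers": ["Canberra"], "prediction": "Sydney"},
]
known, unknown = split_known_unknown(records)
```

Because the split depends on a single generation per question, a lucky guess lands in "known" and a correct answer in an unexpected surface form lands in "unknown", which is the edge-case misclassification the limitation refers to.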
When Not To Use
To evaluate retrieval-augmented systems or open-book QA.
For complex multi-hop or reasoning-heavy tasks where recall is not the bottleneck.
Failure Modes
Instruction tuning reduces exact-match factual recall for some families.
Feeding counterfactual exemplars can teach models to repeat false answers.

