Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.3
Citation Count
0
Why It Matters For Business
FACT-BENCH reveals real-world gaps in LLM factual recall. Larger models recall more facts, but instruction tuning and incorrect exemplars can reduce factual accuracy, so test models on verified QA sets and avoid training or prompting with incorrect facts.
Summary TLDR
The authors build FACT-BENCH, a 20K closed-book QA benchmark (20 domains, 134 property types, entities/dates/numbers) to test what facts LLMs recall from pretraining. They evaluate 31 models and find: larger models recall more facts; instruction-tuning often reduces factual recall; counterfactual in-context examples that contradict a model's existing knowledge can sharply reduce the recall of large models; and fine-tuning on facts the model already knows helps, while fine-tuning on unknown facts harms factuality. The benchmark and controlled tests expose practical risks when using exemplars or mixed/unknown training data.
Problem Statement
We need a broad, clean test of whether LLMs actually remember factual knowledge learned in pretraining. Existing benchmarks use limited properties or rigid templates and leave several questions open: coverage across domains and answer types, the effects of instruction-tuning, sensitivity to in-context exemplars that conflict with model memory, and how fine-tuning on known vs unknown facts changes factuality.
Main Contribution
FACT-BENCH: a 20K closed-book QA benchmark covering 20 domains, 134 properties, and 3 answer types (entities, dates, numbers).
A large-scale evaluation of 31 models across 10 families, showing scaling benefits and instruction-tuning costs on factual recall.
Controlled counterfactual in-context learning (ICL) experiments that isolate when exemplars damage factual recall.
Fine-tuning experiments that compare training on examples the model already knows vs unknown or mixed examples.
Key Findings
Model scale improves factual recall.
Instruction-tuned models often score lower on factual recall than their pretraining-only versions.
Counterfactual in-context exemplars that contradict a model's known facts can cause large drops in recall.
Degradation from counterfactual exemplars is strongest when exemplars contradict model-known facts and grows with the number of such exemplars.
Fine-tuning on facts the model already knows improves factual recall; fine-tuning on unknown facts hurts it, and fine-tuning on counterfactual facts is damaging.
Even the best model falls well short of the human-validated upper bound.
Results
10-shot Exact Match (EM)
Effect of counterfactual exemplars (10-shot EM drop)
Fine-tune known vs unknown (zero-shot EM)
Human upper-bound (PREMIUM2K)
Who Should Care
What To Try In 7 Days
Run your model on a subset of FACT-BENCH or PREMIUM2K to measure factual EM and 'Contains' metrics.
Audit few-shot exemplars: remove or correct any exemplar that contradicts known facts the model answers correctly.
If fine-tuning, create a ‘known-good’ subset (examples the model already answers correctly) and prioritize it over unknown/mixed data.
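The exemplar audit and the known-good split above can be sketched in a few lines. This is a minimal illustration, not the paper's procedure: `answer_fn` is a hypothetical stand-in for whatever function queries your model, the dictionary layout of examples and exemplars is an assumption, and the case-insensitive string comparison is a simplification of exact match.

```python
from typing import Callable, Dict, List, Tuple

def _norm(text: str) -> str:
    # Simplified normalization (assumption): trim whitespace and lowercase.
    return text.strip().lower()

def split_known_unknown(
    examples: List[dict],
    answer_fn: Callable[[str], str],  # hypothetical: question -> model answer
) -> Tuple[List[dict], List[dict]]:
    """Split examples by whether the model already answers them correctly.

    Each example is assumed to look like
    {"question": str, "answers": [str, ...]}.
    """
    known, unknown = [], []
    for ex in examples:
        prediction = _norm(answer_fn(ex["question"]))
        golds = {_norm(a) for a in ex["answers"]}
        (known if prediction in golds else unknown).append(ex)
    return known, unknown

def flag_bad_exemplars(
    exemplars: List[dict],
    verified: Dict[str, List[str]],
) -> List[dict]:
    """Return exemplars whose stated answer contradicts a verified answer.

    Each exemplar is assumed to look like {"question": str, "answer": str};
    `verified` maps a question to its list of verified answers.
    Exemplars with no verified entry are left unflagged.
    """
    flagged = []
    for ex in exemplars:
        golds = verified.get(ex["question"])
        if golds is None:
            continue
        if _norm(ex["answer"]) not in {_norm(g) for g in golds}:
            flagged.append(ex)
    return flagged
```

Correctness of the model's own answer is only a proxy for "known", so borderline cases may land in the wrong bucket; a stricter answer comparison would shift examples from known to unknown.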
Reproducibility
Data Urls
- planned public release of FACT-BENCH (paper states intent)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Model selection limited to representative and available models at test time; not exhaustive (Section 8).
- Known vs unknown knowledge approximated by whether the model answered correctly; this proxy can misclassify edge cases (Section 8).
- Benchmark is closed-book: it measures parametric recall, not retrieval-augmented or multi-hop reasoning.
When Not To Use
- To evaluate retrieval-augmented systems or open-book QA.
- For complex multi-hop or reasoning-heavy tasks where recall is not the bottleneck.
- When pretraining corpora differ sharply from Wikipedia-based grounding assumptions.
Failure Modes
- Instruction tuning reduces exact-match factual recall for some families.
- Feeding counterfactual exemplars can teach models to repeat false answers.
- Fine-tuning on unknown or corrupted facts increases hallucinations.
Core Entities
Models
- GPT-4
- GPT-3.5-turbo
- LLaMA-7B
- LLaMA-13B
- LLaMA-33B
- LLaMA-65B
- Vicuna-7B
- Vicuna-13B
- Vicuna-33B
- BLOOM-7B
- BLOOMZ-7B
- FLAN-T5-XXL
- T0++
- UL2
- FLAN-UL2
- Falcon-7B
- Falcon-40B
- Falcon-180B
- MPT-7B
- MPT-30B
- Pythia-6.9B
- Pythia-12B
- Mistral-7B
Metrics
- Exact Match (EM)
- F1
- Contains (substring match)
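For readers who want to compute the three metrics listed above on their own outputs, the sketch below shows one common way to do it. The normalization (lowercasing, stripping punctuation and English articles) follows the widely used SQuAD-style convention and is an assumption here; the summary does not state the benchmark's exact normalization, so scores may differ from the paper's.

```python
import re
import string
from collections import Counter
from typing import List

def normalize(text: str) -> str:
    # Assumed SQuAD-style normalization: lowercase, drop punctuation,
    # drop English articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, golds: List[str]) -> bool:
    # True if the normalized prediction equals any normalized gold answer.
    pred = normalize(prediction)
    return any(pred == normalize(g) for g in golds)

def contains(prediction: str, golds: List[str]) -> bool:
    # True if any non-empty normalized gold answer is a substring
    # of the normalized prediction.
    pred = normalize(prediction)
    normalized_golds = [normalize(g) for g in golds]
    return any(g != "" and g in pred for g in normalized_golds)

def token_f1(prediction: str, gold: str) -> float:
    # Token-level F1 between one prediction and one gold answer.
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction: str, golds: List[str]) -> float:
    # With several acceptable answers, take the best-scoring one.
    return max((token_f1(prediction, g) for g in golds), default=0.0)
```

Contains is the most lenient of the three: a verbose answer such as "It is Paris, France." fails Exact Match against "Paris" but passes Contains, which is why reporting both helps separate formatting failures from recall failures.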
Datasets
- FACT-BENCH
- PREMIUM2K (2k human-validated subset)
Benchmarks
- FACT-BENCH (20k closed-book QA)

