Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
14
Why It Matters For Business
FACTSCORE gives a concrete, scalable way to measure how much of a long model output is actually supported by a trusted source; use it to audit model factuality, compare model variants, and prioritize fixes where unsupported claims can cause harm or liability.
Summary TLDR
FACTSCORE measures factual precision by splitting long model outputs into short "atomic facts" and checking each against a chosen knowledge source (here, English Wikipedia). Human annotation on biographies finds state-of-the-art commercial models score poorly (InstructGPT 42.5%, ChatGPT 58.3%, PerplexityAI 71.5%). The paper also builds an automated estimator (retrieval + LM + a nonparametric check) that reproduces human FACTSCORE with about a 2% aggregate error and is used to evaluate 13 models at scale.
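To make the definition concrete, here is a minimal sketch of the arithmetic once each atomic fact has been labeled. It assumes the model-level score is the mean of per-response scores over responses where the model did not abstain. The names `response_score` and `summarize` and the example labels are illustrative and do not come from the released package.

```python
from typing import Dict, List, Optional

# One entry per model response: a list of per-fact labels
# (True = supported by the knowledge source), or None if the model abstained.
Labels = Optional[List[bool]]


def response_score(labels: List[bool]) -> float:
    """Fraction of atomic facts in one response that are supported."""
    return sum(labels) / len(labels)


def summarize(responses: List[Labels]) -> Dict[str, float]:
    """Model-level FACTSCORE, abstention rate, and average facts per response."""
    answered = [r for r in responses if r]  # drop abstentions and empty outputs
    if not answered:
        raise ValueError("no answered responses to score")
    return {
        "factscore": sum(response_score(r) for r in answered) / len(answered),
        "abstain_rate": 1 - len(answered) / len(responses),
        "avg_atomic_facts": sum(len(r) for r in answered) / len(answered),
    }


# Made-up labels for illustration only.
example = [[True, True, False, True], [True, False], None]
report = summarize(example)
```

In this toy input the two answered responses score 3/4 and 1/2, and the third response counts only toward the abstention rate.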
Problem Statement
Long-form LM outputs mix true and false statements, so a single binary judgment or sentence-level check hides partial correctness. Human validation is accurate but slow and costly. We need a fine-grained, scalable metric to quantify how much of a long output is actually supported by a reliable knowledge source.
Main Contribution
FACTSCORE: a clear definition that decomposes long outputs into atomic facts (one fact per short statement) and reports the percentage supported by a chosen knowledge source.
A human-annotated dataset of biographies of people, evaluated against English Wikipedia; results show substantial factual errors in commercial models.
An automated FACTSCORE estimator combining retrieval, LM prompting ("True or False?"), and a nonparametric masked-LM check that matches human scores closely (about 2% aggregate error).
A large-scale case study: automatic FACTSCORE applied to 6,500 bios from 13 models to compare factual precision at scale.
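The estimator's control flow can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's implementation. The word-overlap retriever, the prompt layout, `toy_lm`, and the always-true nonparametric stub are all hypothetical stand-ins. The rule that a fact counts as Supported only when both the LM check and the nonparametric check agree is an assumed way of combining the two.

```python
from typing import Callable, List


def retrieve(fact: str, passages: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank passages by word overlap with the fact."""
    words = set(fact.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(fact: str, evidence: List[str]) -> str:
    """Put retrieved evidence before the fact and ask for True or False."""
    context = "\n".join(evidence)
    return f"{context}\n\nInput: {fact} True or False?\nOutput:"


def estimate_supported(
    fact: str,
    passages: List[str],
    lm: Callable[[str], str],
    np_check: Callable[[str], bool],
) -> bool:
    """Assumed ensemble: Supported only if the LM and the NP check both agree."""
    answer = lm(build_prompt(fact, retrieve(fact, passages)))
    return answer.strip().lower().startswith("true") and np_check(fact)


def toy_lm(prompt: str) -> str:
    """Hypothetical stand-in for an LM: True if every fact word is in the context."""
    context, _, rest = prompt.partition("\n\nInput: ")
    fact = rest.partition(" True or False?")[0]
    ok = set(fact.lower().split()) <= set(context.lower().split())
    return "True" if ok else "False"


# Made-up knowledge source for illustration only.
passages = [
    "Marie Curie was born in Warsaw in 1867",
    "Marie Curie won two Nobel Prizes",
]
always_yes = lambda fact: True  # placeholder for the nonparametric check

good = estimate_supported("Marie Curie was born in Warsaw", passages, toy_lm, always_yes)
bad = estimate_supported("Marie Curie was born in Paris", passages, toy_lm, always_yes)
```

Swapping `toy_lm` and `always_yes` for real components leaves the control flow unchanged, which makes it straightforward to compare validation with and without retrieval.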
Key Findings
Commercial LMs have low factual precision on people biographies
Automated estimator approximates human FACTSCORE closely
Factual errors increase for rare entities and later facts in the text
Retrieval greatly improves automatic validation
Citations provided by search-augmented models are not a reliable proxy for factuality
Results
FACTSCORE (human annotation)
Atomic facts per response (human)
Automated estimator aggregate Error Rate (ER)
Estimator ranking correlation
Coverage error when using Wikipedia as source
Who Should Care
What To Try In 7 Days
Run FACTSCORE (pip install factscore) on a small sample of your model's long outputs (50–200 docs) and report FACTSCORE, %abstain, and avg #atomic facts.
Add a retrieval-backed FACTSCORE estimator (Retrieve→LM + NP) to your CI checks to catch regressions quickly without human labeling.
Focus testing on rare entities and later output positions, where errors concentrate, and compare with/without retrieval.
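To check whether errors concentrate later in your own outputs, one option is to bucket atomic facts by their relative position in each response and compare precision per bucket. The sketch below assumes facts are listed in output order; the function name `precision_by_position` and the sample labels are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def precision_by_position(
    responses: List[List[bool]], n_buckets: int = 2
) -> Dict[int, float]:
    """Supported-fact rate per relative-position bucket (0 = earliest)."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [supported, total]
    for labels in responses:
        n = len(labels)
        for i, supported in enumerate(labels):
            bucket = i * n_buckets // n  # maps index i in [0, n) to [0, n_buckets)
            counts[bucket][0] += int(supported)
            counts[bucket][1] += 1
    return {b: s / t for b, (s, t) in sorted(counts.items())}


# Made-up labels for illustration only (True = supported).
sample = [[True, True, True, False], [True, False]]
by_position = precision_by_position(sample)
```

A clearly lower rate in the later buckets suggests the positional effect is present in your data. The same bucketing idea can be applied to entity frequency if you have a rarity estimate for each entity.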
Reproducibility
Code Urls
- https://github.com/shmsw25/FActScore
- pip package: factscore
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Experiments focus on people biographies and English Wikipedia; results may differ in subjective, nuanced, or fast-moving domains.
- FACTSCORE measures precision only, not recall (it rewards supported facts but not coverage of required facts).
- Automated estimator is strong at aggregate scores but imperfect on per-fact judgments and depends on retrieval quality.
- Using a single knowledge source can penalize correct facts that are absent from that source (coverage blind spot).
When Not To Use
- For subjective or debatable content where support is not binary.
- When a reliable, comprehensive knowledge source for the domain is unavailable or highly inconsistent.
- If you need recall/coverage metrics (what the model omitted) rather than precision.
Failure Modes
- Retrieval misses direct evidence (no supporting passage among the retrieved snippets), causing false Not-supported labels.
- LM-based evaluators can be biased toward Supported (overestimating FACTSCORE) or Not-supported (underestimating it), depending on model priors.
- Citation presence does not imply support; models can copy irrelevant or contradictory passages.
- Annotator disagreement for borderline or inferential facts (interpretation-dependent cases).
Core Entities
Models
- InstructGPT (text-davinci-003)
- ChatGPT
- GPT-4
- PerplexityAI
- Alpaca (7B/13B/65B)
- Vicuna (7B/13B)
- Dolly 12B
- Oasst-pythia 12B
- StableLM-tuned-alpha 7B
- MPT Chat 7B
- LLAMA (7B and 65B variants)
Metrics
- FACTSCORE (fraction of atomic facts supported)
- Error Rate (ER) between human and auto FACTSCORE
- F1MICRO (segment-level per-fact F1 for Not-supported)
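As a rough illustration of how the estimator can be checked against human labels, the sketch below takes Error Rate to be the absolute gap between human and estimated FACTSCORE for each model (an assumption about the exact definition). It also checks whether the estimator ranks models in the same order as humans. Model names and scores are placeholders, not results from the paper.

```python
from typing import Dict, List


def error_rate(human: float, estimated: float) -> float:
    """Absolute gap between human and estimated FACTSCORE, in percentage points."""
    return abs(human - estimated)


def ranking(scores: Dict[str, float]) -> List[str]:
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Placeholder scores in percent; not taken from the paper.
human_scores = {"model_a": 60.0, "model_b": 40.0, "model_c": 75.0}
auto_scores = {"model_a": 62.0, "model_b": 39.0, "model_c": 74.0}

per_model_er = {m: error_rate(human_scores[m], auto_scores[m]) for m in human_scores}
same_ranking = ranking(human_scores) == ranking(auto_scores)
```

A small per-model gap together with an unchanged ranking is the pattern you would want before relying on the estimator in place of human annotation.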
Datasets
- English Wikipedia (April 2023 snapshot)
- Wikidata (for entity sampling)
- ACL Anthology (small proof-of-concept test)
Benchmarks
- FACTSCORE (this work)

