Break long model outputs into atomic facts and score the share supported by a knowledge source (FACTSCORE); an automatic estimator matches humans

Overview

Decision Snapshot: Needs Validation

The method is practical: humans validated a real dataset and a retrieval+LM+NP estimator reproduces human FACTSCORE with low aggregate error; limitations remain for nuanced domains and individual fact judgments.

Citations: 14

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi

Why It Matters For Business

FACTSCORE gives a concrete, scalable way to measure how much of a long model output is actually supported by a trusted source; use it to audit model factuality, compare model variants, and prioritize fixes where unsupported claims can cause harm or liability.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Engineering Lead

Summary TLDR

FACTSCORE measures factual precision by splitting long model outputs into short "atomic facts" and checking each against a chosen knowledge source (here, English Wikipedia). Human annotation on people biographies finds that commercial models leave many facts unsupported (InstructGPT 42.5%, ChatGPT 58.3%, PerplexityAI 71.5%). The paper also builds an automated estimator (retrieval + LM + a nonparametric check) that reproduces human FACTSCORE with under 2% aggregate error, and uses it to evaluate 13 models at scale.

Problem Statement

Long-form LM outputs mix true and false statements, so a single binary judgment or sentence-level check hides partial correctness. Human validation is accurate but slow and costly. We need a fine-grained, scalable metric to quantify how much of a long output is actually supported by a reliable knowledge source.

Main Contribution

FACTSCORE: a clear definition that decomposes long outputs into atomic facts (one fact per short statement) and reports the percentage supported by a chosen knowledge source.

A human-annotated dataset of people biographies evaluated against English Wikipedia; results show substantial factual errors in commercial models.
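The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes each response has already been split into atomic facts and labelled, that the score is averaged per response, and that abstentions (marked `None` here) are excluded from the average and reported separately. The function names and the example labels are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def response_precision(labels: Sequence[bool]) -> float:
    """Share of one response's atomic facts that were labelled Supported."""
    if not labels:
        raise ValueError("a non-abstaining response needs at least one atomic fact")
    return sum(labels) / len(labels)


def factscore(responses: Sequence[Optional[Sequence[bool]]]) -> Tuple[float, float]:
    """Return (score, abstain_rate) over a batch of responses.

    Each response is a list of per-fact labels (True = Supported), or None
    when the model abstained. Abstentions are excluded from the score
    (an assumption of this sketch).
    """
    if not responses:
        raise ValueError("no responses given")
    answered = [r for r in responses if r is not None]
    if not answered:
        raise ValueError("every response abstained; the score is undefined")
    score = sum(response_precision(r) for r in answered) / len(answered)
    abstain_rate = 1 - len(answered) / len(responses)
    return score, abstain_rate


# Hypothetical labels for three responses; the last one is an abstention.
example = [[True, True, False, True], [True, False], None]
score, abstain_rate = factscore(example)
print(f"FACTSCORE={score:.3f}, abstain={abstain_rate:.3f}")
```

For the example labels, the two answered responses have precision 0.75 and 0.5, so the batch score is 0.625 with one abstention in three.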

Key Findings

Commercial LMs have low factual precision on people biographies

Numbers: FACTSCORE: InstructGPT 42.5%, ChatGPT 58.3%, PerplexityAI 71.5% (Table 1)

Practical Use: Expect many unsupported facts in long outputs; use FACTSCORE to measure and compare models rather than trusting fluency alone.

Evidence Ref: Table 1; Section 3.4

Automated estimator approximates human FACTSCORE closely

Numbers: Aggregate estimation error < 2% and Pearson r ≈ 0.99 across 13 models

Practical Use: You can scale factual evaluation with a retrieval+LM+NP pipeline instead of costly human checks for many models or prompts.

Evidence Ref: Abstract; Sections 4.2 and 4.3
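To check an estimator against human labels on your own models, the two quantities cited here can be computed as below. This sketch assumes the error rate is the absolute gap between the human and estimated FACTSCORE for a model, and that Pearson r is taken over model-level scores. The model names and all scores in the example are made-up placeholders, not results from the paper.

```python
import math
from typing import Sequence


def error_rate(human: float, estimated: float) -> float:
    """Absolute gap between human and estimated scores (assumed definition)."""
    return abs(human - estimated)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores (percent) for three hypothetical models.
human_scores = {"model_a": 40.0, "model_b": 60.0, "model_c": 70.0}
estimated_scores = {"model_a": 41.0, "model_b": 59.0, "model_c": 71.5}

names = sorted(human_scores)
gaps = {m: error_rate(human_scores[m], estimated_scores[m]) for m in names}
r = pearson([human_scores[m] for m in names],
            [estimated_scores[m] for m in names])
print(gaps, round(r, 3))
```

A small per-model gap together with a high r suggests the estimator preserves both the level and the ranking of models; neither number says anything about agreement on individual facts.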

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
FACTSCORE (human annotation)	InstructGPT 42.5%; ChatGPT 58.3%; PerplexityAI 71.5%	n/a	n/a	People biographies vs. English Wikipedia (183 entities annotated)	Table 1; Section 3.4	Table 1
Atomic facts per response (human)	InstructGPT 26.3; ChatGPT 34.7; PerplexityAI 40.8 (avg facts/response)	n/a	n/a	People biographies	Table 1 statistics	Table 1

What To Try In 7 Days

Run FACTSCORE (pip install factscore) on a small sample of your model's long outputs (50–200 docs) and report FACTSCORE, %abstain, and avg #atomic facts.

Add a retrieval-backed FACTSCORE estimator (Retrieve→LM + NP) to your CI checks to catch regressions quickly without human labeling.

Focus testing on rare entities and later output positions, where errors concentrate, and compare with/without retrieval.
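The reporting and CI steps above could look roughly like the following. This is a sketch under stated assumptions: it does not call the `factscore` package (whose interface is not described here) and instead takes per-fact labels from whatever estimator you run. `FactualityReport`, `summarize`, `passes_gate`, and the 0.02 tolerance are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FactualityReport:
    factscore: float         # mean per-response share of supported facts
    abstain_rate: float      # share of outputs with no facts to score
    avg_atomic_facts: float  # mean number of atomic facts per scored output


def summarize(labelled: Sequence[Optional[Sequence[bool]]]) -> FactualityReport:
    """Build the three-number report from per-fact labels (True = Supported).

    Assumption: an output that is None or has no extracted facts counts
    as an abstention and is left out of the score.
    """
    if not labelled:
        raise ValueError("no outputs given")
    scored = [r for r in labelled if r]
    if not scored:
        raise ValueError("nothing to score: every output abstained")
    return FactualityReport(
        factscore=sum(sum(r) / len(r) for r in scored) / len(scored),
        abstain_rate=1 - len(scored) / len(labelled),
        avg_atomic_facts=sum(len(r) for r in scored) / len(scored),
    )


def passes_gate(current: FactualityReport, baseline: FactualityReport,
                max_drop: float = 0.02) -> bool:
    """CI check: fail when the score drops more than max_drop below baseline."""
    return current.factscore >= baseline.factscore - max_drop


# Hypothetical labels for a baseline model and a candidate model.
baseline = summarize([[True, True, True, False], [True, True]])
candidate = summarize([[True, False, False, False], [True, True], None])
print(baseline, candidate, passes_gate(candidate, baseline))
```

Reporting the abstain rate and fact count next to the score matters because a model can raise its score by answering less; a gate on the score alone would not catch that.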

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/shmsw25/FActScore

pip package: factscore

Data URLs

https://github.com/shmsw25/FActScore (annotated data released)

Risks & Boundaries

Limitations

Experiments focus on people biographies and English Wikipedia; results may differ in subjective, nuanced, or fast-moving domains.

FACTSCORE measures precision only, not recall (it rewards supported facts but not coverage of required facts).

When Not To Use

For subjective or debatable content where support is not binary.

When a reliable, comprehensive knowledge source for the domain is unavailable or highly inconsistent.

Failure Modes

Retrieval misses direct evidence (no supporting passage in the retrieved snippets), causing false Not-supported labels.

LM-based evaluators can be biased toward Supported (overestimating FACTSCORE) or toward Not-supported (underestimating it), depending on model priors.
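One common way to dampen an evaluator that leans toward Supported is to require agreement with a second, independent check before accepting a fact. The sketch below shows that conjunction rule. Whether the retrieval+LM+NP estimator combines its parts exactly this way is an assumption here, and both toy labelers are hypothetical stand-ins for real components.

```python
from typing import Callable, Iterable

# A labeler takes one atomic fact and returns True when it judges it Supported.
Labeler = Callable[[str], bool]


def supported_by_all(fact: str, labelers: Iterable[Labeler]) -> bool:
    """Label a fact Supported only when every labeler agrees (AND rule)."""
    checks = list(labelers)
    if not checks:
        raise ValueError("at least one labeler is required")
    return all(check(fact) for check in checks)


# Toy stand-ins: a lenient evaluator that accepts everything, and a strict
# check that accepts only facts found in a small reference set.
REFERENCE_FACTS = {"Ada Lovelace was born in 1815."}


def lenient_evaluator(fact: str) -> bool:
    return True


def strict_check(fact: str) -> bool:
    return fact in REFERENCE_FACTS


facts = ["Ada Lovelace was born in 1815.", "Ada Lovelace was born in 1915."]
labels = [supported_by_all(f, [lenient_evaluator, strict_check]) for f in facts]
print(labels)
```

The AND rule trades one bias for another: it lowers false Supported labels but inherits every false Not-supported label from the stricter component, so it helps most when the two components err in opposite directions.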

Core Entities

Models

InstructGPT (text-davinci-003); ChatGPT; GPT-4; PerplexityAI; Alpaca (7B/13B/65B); Vicuna (7B/13B); Dolly 12B; Oasst-pythia 12B; StableLM-tuned-alpha 7B; MPT Chat 7B; LLAMA (7B, 65B variants)

Metrics

FACTSCORE (fraction of atomic facts supported); Error Rate (ER) between human and auto FACTSCORE; F1_MICRO (segment-level per-fact F1 for Not-supported)

Datasets

English Wikipedia (April 2023 snapshot); Wikidata (for entity sampling); ACL Anthology (small proof-of-concept test)

Benchmarks

FACTSCORE (this work)
