Break long model outputs into atomic facts and score the share supported by a knowledge source (FACTSCORE); an automatic estimator matches humans

May 23, 2023 · 9 min

Overview

Decision Snapshot: Needs Validation

The method is practical: humans validated a real dataset and a retrieval+LM+NP estimator reproduces human FACTSCORE with low aggregate error; limitations remain for nuanced domains and individual fact judgments.

Citations: 14

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FACTSCORE gives a concrete, scalable way to measure how much of a long model output is actually supported by a trusted source; use it to audit model factuality, compare model variants, and prioritize fixes where unsupported claims can cause harm or liability.

Summary TLDR

FACTSCORE measures factual precision by splitting long model outputs into short "atomic facts" and checking each against a chosen knowledge source (here, English Wikipedia). Human annotation on biographies finds state-of-the-art commercial models score poorly (InstructGPT 42.5%, ChatGPT 58.3%, PerplexityAI 71.5%). The paper also builds an automated estimator (retrieval + LM + a nonparametric check) that reproduces human FACTSCORE with about a 2% aggregate error and is used to evaluate 13 models at scale.

Problem Statement

Long-form LM outputs mix true and false statements, so a single binary judgment or sentence-level check hides partial correctness. Human validation is accurate but slow and costly. We need a fine-grained, scalable metric to quantify how much of a long output is actually supported by a reliable knowledge source.

Main Contribution

FACTSCORE: a clear definition that decomposes long outputs into atomic facts (one fact per short statement) and reports the percentage supported by a chosen knowledge source.

A human-annotated dataset of people biographies evaluated against English Wikipedia; results show substantial factual errors in commercial models.
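The definition in the first contribution can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `AtomicFact` type and the `factscore` function are hypothetical names, and it assumes the score is averaged per response over responses that contain at least one atomic fact.

```python
from dataclasses import dataclass


@dataclass
class AtomicFact:
    text: str
    supported: bool  # label from a human annotator or an automatic judge


def factscore(responses: list[list[AtomicFact]]) -> float:
    """Mean, over responses, of the fraction of atomic facts that are supported.

    Assumption: responses with no atomic facts (e.g. abstentions) are skipped.
    """
    per_response = [
        sum(fact.supported for fact in facts) / len(facts)
        for facts in responses
        if facts
    ]
    if not per_response:
        raise ValueError("no response contains atomic facts")
    return sum(per_response) / len(per_response)
```

Because only the supported share is counted, a response that states few facts can still score highly, which is why the metric reflects precision rather than coverage.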

Key Findings

Commercial LMs have low factual precision on people biographies

Numbers: FACTSCORE: InstructGPT 42.5%, ChatGPT 58.3%, PerplexityAI 71.5% (Table 1)

Practical Use: Expect many unsupported facts in long outputs; use FACTSCORE to measure and compare models rather than trusting fluency alone.

Evidence Ref: Table 1; Section 3.4

Automated estimator approximates human FACTSCORE closely

Numbers: Aggregate estimation error < 2% and Pearson r ≈ 0.99 across 13 models

Practical Use: You can scale factual evaluation with a retrieval+LM+NP pipeline instead of costly human checks for many models or prompts.

Evidence Ref: Abstract; Section 4.2 and 4.3
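To check an estimator against human labels on your own data, the two quantities cited in this finding can be computed as below. This is a sketch under assumptions: the error is taken as the signed gap between estimated and human FACTSCORE (positive means overestimate), the function names are illustrative, and the `estimated` values are made-up placeholders, not results from the paper.

```python
import math


def estimation_error(human: float, estimated: float) -> float:
    """Signed gap in percentage points; positive means the estimator overestimates."""
    return estimated - human


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Human FACTSCOREs as reported in Table 1 of the summary.
human = {"InstructGPT": 42.5, "ChatGPT": 58.3, "PerplexityAI": 71.5}
# Placeholder estimator outputs, invented purely for illustration.
estimated = {"InstructGPT": 41.0, "ChatGPT": 60.0, "PerplexityAI": 70.5}

models = list(human)
errors = {m: estimation_error(human[m], estimated[m]) for m in models}
r = pearson_r([human[m] for m in models], [estimated[m] for m in models])
```

A high correlation with a small aggregate error says the estimator ranks models well; it does not guarantee that individual fact judgments agree.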

Results

Metric: FACTSCORE (human annotation)
Value: InstructGPT 42.5% | ChatGPT 58.3% | PerplexityAI 71.5%
Baseline: not given
Delta: not given
Split / Dataset: People biographies vs. English Wikipedia (183 entities annotated)
Evidence: Table 1; Section 3.4
Evidence Ref: Table 1

Metric: Atomic facts per response (human)
Value: InstructGPT 26.3 | ChatGPT 34.7 | PerplexityAI 40.8 (avg facts/response)
Baseline: not given
Delta: not given
Split / Dataset: People biographies
Evidence: Table 1 statistics
Evidence Ref: Table 1

What To Try In 7 Days

Run FACTSCORE (pip install factscore) on a small sample of your model's long outputs (50–200 docs) and report FACTSCORE, %abstain, and avg #atomic facts.

Add a retrieval-backed FACTSCORE estimator (Retrieve→LM + NP) to your CI checks to catch regressions quickly without human labeling.

Focus testing on rare entities and later output positions, where errors concentrate, and compare with/without retrieval.
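Once each sampled output has per-fact labels, the three suggestions above reduce to a small report plus a regression gate. The sketch below assumes a hypothetical input format (one entry per output: `None` for an abstention, otherwise a list of booleans in output order) and illustrative helper names and threshold; it does not use the `factscore` package API.

```python
from statistics import mean


def summarize(labels_per_response):
    """Report FACTSCORE, % abstain, and average number of atomic facts."""
    if not labels_per_response:
        raise ValueError("empty sample")
    answered = [r for r in labels_per_response if r]  # drops None and empty lists
    if not answered:
        raise ValueError("every response abstained")
    total = len(labels_per_response)
    return {
        "factscore": mean([sum(r) / len(r) for r in answered]),
        "pct_abstain": 100 * (total - len(answered)) / total,
        "avg_facts": mean([len(r) for r in answered]),
    }


def precision_by_position(labels_per_response, buckets=2):
    """Share of supported facts per position bucket (early to late in the output)."""
    hits = [0] * buckets
    totals = [0] * buckets
    for labels in labels_per_response:
        if not labels:
            continue
        for i, supported in enumerate(labels):
            bucket = i * buckets // len(labels)
            hits[bucket] += supported
            totals[bucket] += 1
    return [h / t if t else None for h, t in zip(hits, totals)]


def passes_gate(current, baseline, max_drop=0.02):
    """CI check: fail when FACTSCORE drops more than max_drop below the baseline."""
    return current >= baseline - max_drop
```

Running the same functions separately on a rare-entity subset and a frequent-entity subset gives the comparison suggested in the last item.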

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Experiments focus on people biographies and English Wikipedia; results may differ in subjective, nuanced, or fast-moving domains.

FACTSCORE measures precision only, not recall (it rewards supported facts but not coverage of required facts).

When Not To Use

For subjective or debatable content where support is not binary.

When a reliable, comprehensive knowledge source for the domain is unavailable or highly inconsistent.

Failure Modes

Retrieval misses direct evidence (no supporting passage in the retrieved snippets), causing false Not-supported labels.

Evaluator LMs (LM_EVAL) can be biased toward assigning Supported (overestimating FACTSCORE) or Not-supported (underestimating it), depending on model priors.
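One way to temper an evaluator that leans toward Supported is to require agreement from a second, independent check. The sketch below assumes the retrieval+LM judgment and the nonparametric (NP) check are combined with a logical AND; the function name, the likelihood input, and the threshold value are hypothetical placeholders, not values from the paper.

```python
def ensemble_label(lm_supported: bool, np_likelihood: float,
                   np_threshold: float = 0.3) -> str:
    """Label a fact Supported only when both checks agree.

    lm_supported:  verdict of the retrieval-augmented LM evaluator.
    np_likelihood: score from a nonparametric check (assumed to lie in [0, 1]).
    np_threshold:  illustrative cutoff; tune it on human-labeled data.
    """
    np_supported = np_likelihood >= np_threshold
    return "Supported" if (lm_supported and np_supported) else "Not-supported"
```

An AND rule can only lower the number of Supported labels, so it may help with overestimation but does nothing for facts wrongly marked Not-supported because retrieval missed the evidence.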

Core Entities

Models

InstructGPT (text-davinci-003), ChatGPT, GPT-4, PerplexityAI, Alpaca (7B/13B/65B), Vicuna (7B/13B), Dolly 12B, Oasst-pythia 12B, StableLM-tuned-alpha 7B, MPT Chat 7B, LLAMA (7B, 65B variants)

Metrics

FACTSCORE (fraction of atomic facts supported)

Error Rate (ER) between human and auto FACTSCORE

F1_MICRO (segment-level per-fact F1 for Not-supported)
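The per-fact F1 for the Not-supported class can be computed by pooling all atomic facts and treating Not-supported as the positive class. This is a minimal sketch with a hypothetical function name; it assumes labels are booleans where `True` means Supported, with human labels as the reference.

```python
def f1_micro_not_supported(human: list[bool], predicted: list[bool]) -> float:
    """F1 over pooled atomic facts, with Not-supported as the positive class."""
    if len(human) != len(predicted):
        raise ValueError("label lists must have the same length")
    tp = fp = fn = 0
    for h, p in zip(human, predicted):
        if not p and not h:
            tp += 1  # both say Not-supported
        elif not p and h:
            fp += 1  # estimator flags a fact humans marked Supported
        elif p and not h:
            fn += 1  # estimator misses an unsupported fact
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Unlike the aggregate error rate, this metric is sensitive to disagreements on individual facts, so it can be low even when the overall FACTSCORE estimate looks close.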

Datasets

English Wikipedia (April 2023 snapshot), Wikidata (for entity sampling), ACL Anthology (small proof-of-concept test)

Benchmarks

FACTSCORE (this work)