Break long model outputs into atomic facts and score the share supported by a knowledge source (FACTSCORE); an automatic estimator matches humans

May 23, 2023 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

14

Authors

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi

Why It Matters For Business

FACTSCORE gives a concrete, scalable way to measure how much of a long model output is actually supported by a trusted source; use it to audit model factuality, compare model variants, and prioritize fixes where unsupported claims can cause harm or liability.

Summary TLDR

FACTSCORE measures factual precision by splitting long model outputs into short "atomic facts" and checking each against a chosen knowledge source (here, English Wikipedia). Human annotation on biographies finds state-of-the-art commercial models score poorly (InstructGPT 42.5%, ChatGPT 58.3%, PerplexityAI 71.5%). The paper also builds an automated estimator (retrieval + LM + a nonparametric check) that reproduces human FACTSCORE with about a 2% aggregate error and is used to evaluate 13 models at scale.

Problem Statement

Long-form LM outputs mix true and false statements, so a single binary judgment or sentence-level check hides partial correctness. Human validation is accurate but slow and costly. We need a fine-grained, scalable metric to quantify how much of a long output is actually supported by a reliable knowledge source.

Main Contribution

FACTSCORE: a clear definition that decomposes long outputs into atomic facts (one fact per short statement) and reports the percentage supported by a chosen knowledge source.

A human-annotated dataset of biographies of people, evaluated against English Wikipedia; results show substantial factual errors in commercial models.

An automated FACTSCORE estimator combining retrieval, LM prompting ("True or False?"), and a nonparametric masked-LM check that matches human scores closely (< 2% aggregate error).

A large-scale case study: automatic FACTSCORE applied to 6,500 bios from 13 models to compare factual precision at scale.
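The definition above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each response has already been split into atomic facts and each fact has been labeled supported or not. The function names are hypothetical, and leaving abstained (empty) responses out of the average is an assumption about how abstentions are handled.

```python
from typing import Optional, Sequence


def response_precision(labels: Sequence[bool]) -> Optional[float]:
    """Fraction of atomic facts labeled supported; None if the response has no facts."""
    if not labels:
        return None
    return sum(labels) / len(labels)


def factscore(responses: Sequence[Sequence[bool]]) -> dict:
    """Average per-response precision over responses that contain facts.

    Assumption: empty responses count as abstentions and are reported
    separately instead of being scored as zero.
    """
    scores = [s for s in map(response_precision, responses) if s is not None]
    n = len(responses)
    return {
        "factscore": sum(scores) / len(scores) if scores else None,
        "abstain_rate": (n - len(scores)) / n if n else None,
        "avg_facts": sum(len(r) for r in responses if r) / len(scores) if scores else None,
    }


# Toy labels for three responses (illustrative only, not data from the paper).
report = factscore([[True, True, False, True], [True, False], []])
print(report)
```

Averaging per response, as opposed to pooling all facts, keeps a single very long response from dominating the score.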

Key Findings

Commercial LMs have low factual precision on biographies of people

Numbers: FACTSCORE of InstructGPT 42.5%, ChatGPT 58.3%, PerplexityAI 71.5% (Table 1)

Automated estimator approximates human FACTSCORE closely

Numbers: Aggregate estimation error < 2% and Pearson r ≈ 0.99 across 13 models

Factual errors increase for rare entities and later facts in the text

Numbers: ChatGPT FACTSCORE drops from ~80% to ~16% across frequency levels; later positions show consistent precision decline (F1

Retrieval greatly improves automatic validation

Numbers: Retrieve→LM variants outperform No-context LM; ensembles (Retrieve→LM+NP) further reduce Error Rate (best ER in many set

Citations provided by search-augmented models are not a reliable proxy for factuality

Numbers: PerplexityAI: 36.0% of supported and 37.6% of unsupported sentences have citations (no strong correlation)
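The Retrieve→LM+NP ensemble from the retrieval finding can be read as a conjunction of two checks. The sketch below assumes a fact counts as Supported only when both the retrieval-conditioned LM check and the nonparametric check agree. The retriever, both checkers, the threshold, and every name here are hypothetical stand-ins, not the paper's code.

```python
from typing import Callable, List

Retriever = Callable[[str, str], List[str]]   # (topic, fact) -> passages
LMJudge = Callable[[str, List[str]], bool]    # (fact, passages) -> True/False answer
NPScorer = Callable[[str, List[str]], float]  # (fact, passages) -> score in [0, 1]


def is_supported(topic: str, fact: str, retrieve: Retriever, lm_judge: LMJudge,
                 np_score: NPScorer, np_threshold: float = 0.3) -> bool:
    """Assumed ensemble rule: Supported only if both checks say so."""
    passages = retrieve(topic, fact)
    if not passages:
        return False  # no evidence retrieved, so treat as Not-supported
    return lm_judge(fact, passages) and np_score(fact, passages) >= np_threshold


# Toy stand-ins so the sketch runs without a retriever or a model.
KB = {"Ada Lovelace": ["Ada Lovelace was an English mathematician born in 1815."]}


def toy_retrieve(topic: str, fact: str) -> List[str]:
    return KB.get(topic, [])


def toy_lm_judge(fact: str, passages: List[str]) -> bool:
    # Stands in for prompting an LM with "<passages> <fact> True or False?".
    text = " ".join(passages).lower()
    return all(w in text for w in fact.lower().rstrip(".").split()[-2:])


def toy_np_score(fact: str, passages: List[str]) -> float:
    # Stands in for a nonparametric likelihood: share of fact tokens seen in passages.
    words = fact.lower().rstrip(".").split()
    seen = set(" ".join(passages).lower().replace(".", "").split())
    return sum(w in seen for w in words) / len(words)


good = is_supported("Ada Lovelace", "Ada Lovelace was born in 1815.",
                    toy_retrieve, toy_lm_judge, toy_np_score)
bad = is_supported("Ada Lovelace", "Ada Lovelace was born in 1820.",
                   toy_retrieve, toy_lm_judge, toy_np_score)
print(good, bad)
```

Requiring agreement trades some recall on Supported labels for fewer false Supported labels, which matters when the LM check leans toward saying "True".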

Results

FACTSCORE (human annotation)

Value: InstructGPT 42.5% | ChatGPT 58.3% | PerplexityAI 71.5%

Atomic facts per response (human)

Value: InstructGPT 26.3 | ChatGPT 34.7 | PerplexityAI 40.8 (avg facts/response)

Automated estimator aggregate Error Rate (ER)

Value: < 2% (reported aggregate estimation error vs human)

Estimator ranking correlation

Value: Pearson r ≈ 0.99 across 13 models

Coverage error when using Wikipedia as source

Value: ≈10% of sampled 'Not-supported' facts (rare entities) were actually supported on the wider web
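To reproduce the two estimator checks above on your own models, a sketch like the following may help. It assumes ER is the absolute gap between the human and the estimated FACTSCORE for each model, and it computes Pearson r by hand to avoid extra dependencies. The model names and scores are made up for illustration and are not results from the paper.

```python
import math
from typing import Sequence


def error_rate(human: float, estimated: float) -> float:
    """Absolute gap, in points, between human and estimated FACTSCORE (assumed definition)."""
    return abs(human - estimated)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical scores, in percentage points.
human = {"model_a": 40.0, "model_b": 60.0, "model_c": 70.0}
auto = {"model_a": 41.5, "model_b": 59.0, "model_c": 71.0}

errors = {m: error_rate(human[m], auto[m]) for m in human}
r = pearson_r([human[m] for m in human], [auto[m] for m in human])
print(errors, round(r, 3))
```

A high correlation with a small per-model gap suggests the estimator preserves model rankings; it does not show that individual fact labels are correct.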

Who Should Care

What To Try In 7 Days

Run FACTSCORE (pip install factscore) on a small sample of your model's long outputs (50–200 docs) and report the FACTSCORE, the abstention rate, and the average number of atomic facts per response.

Add a retrieval-backed FACTSCORE estimator (Retrieve→LM + NP) to your CI checks to catch regressions quickly without human labeling.

Focus testing on rare entities and later output positions, where errors concentrate, and compare with/without retrieval.
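One way to act on the last two steps is sketched below: group labeled facts by their relative position in the response, and fail a CI check when the overall score drops past a tolerance. This assumes fact labels are stored in generation order. The function names, the bucket count, and the 2-point tolerance are arbitrary, hypothetical choices.

```python
from collections import defaultdict
from typing import List, Optional, Sequence


def precision_by_position(responses: Sequence[Sequence[bool]],
                          n_buckets: int = 4) -> List[Optional[float]]:
    """Share of supported facts per relative-position bucket (0 = earliest)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for labels in responses:
        for i, ok in enumerate(labels):
            # Map fact index i to a bucket by its relative position in the response.
            bucket = min(i * n_buckets // len(labels), n_buckets - 1)
            totals[bucket] += 1
            hits[bucket] += bool(ok)
    return [hits[b] / totals[b] if totals[b] else None for b in range(n_buckets)]


def regression_gate(baseline: float, current: float, tolerance: float = 2.0) -> bool:
    """True if the current FACTSCORE (in points) is within tolerance of the baseline."""
    return current >= baseline - tolerance


# Toy labels (illustrative only): precision falls in the second half.
halves = precision_by_position(
    [[True, True, False, False],
     [True, False, True, False, True, True, False, False]],
    n_buckets=2,
)
print(halves)
print(regression_gate(70.0, 68.5), regression_gate(70.0, 67.0))
```

The same bucketing works for entity frequency if each response carries a frequency label instead of a position index.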

Reproducibility

Code Urls

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Experiments focus on people biographies and English Wikipedia; results may differ in subjective, nuanced, or fast-moving domains.
  • FACTSCORE measures precision only, not recall (it rewards supported facts but not coverage of required facts).
  • Automated estimator is strong at aggregate scores but imperfect on per-fact judgments and depends on retrieval quality.
  • Using a single knowledge source can penalize correct facts that are absent from that source (coverage blind spot).

When Not To Use

  • For subjective or debatable content where support is not binary.
  • When a reliable, comprehensive knowledge source for the domain is unavailable or highly inconsistent.
  • If you need recall/coverage metrics (what the model omitted) rather than precision.

Failure Modes

  • Retrieval misses direct evidence (no supporting passage in the retrieved snippets), causing false Not-supported labels.
  • The evaluator LM (LM_EVAL) can be biased toward Supported (overestimating FACTSCORE) or Not-supported (underestimating it), depending on model priors.
  • Citation presence does not imply support; models can copy irrelevant or contradictory passages.
  • Annotator disagreement for borderline or inferential facts (interpretation-dependent cases).

Core Entities

Models

  • InstructGPT (text-davinci-003)
  • ChatGPT
  • GPT-4
  • PerplexityAI
  • Alpaca (7B/13B/65B)
  • Vicuna (7B/13B)
  • Dolly 12B
  • Oasst-pythia 12B
  • StableLM-tuned-alpha 7B
  • MPT Chat 7B
  • LLaMA (7B, 65B variants)

Metrics

  • FACTSCORE (fraction of atomic facts supported)
  • Error Rate (ER) between human and auto FACTSCORE
  • F1_MICRO (segment-level per-fact F1 for Not-supported)
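
As a rough illustration of the per-fact metric, the sketch below computes F1 on the Not-supported label from parallel human and estimator labels pooled across all facts. Treating Not-supported as the positive class follows the description above; the function name and the zero-division handling are assumptions.

```python
from typing import Sequence


def f1_not_supported(human: Sequence[bool], predicted: Sequence[bool]) -> float:
    """F1 on the Not-supported label; inputs use True = Supported."""
    pairs = list(zip(human, predicted))
    tp = sum((not h) and (not p) for h, p in pairs)  # both say Not-supported
    fp = sum(h and (not p) for h, p in pairs)        # estimator wrongly rejects
    fn = sum((not h) and p for h, p in pairs)        # estimator wrongly accepts
    if tp == 0:
        return 0.0  # assumed convention when nothing is correctly flagged
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (illustrative only).
score = f1_not_supported(
    human=[True, False, False, True, False],
    predicted=[True, False, True, False, False],
)
print(round(score, 3))
```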

Datasets

  • English Wikipedia (April 2023 snapshot)
  • Wikidata (for entity sampling)
  • ACL Anthology (small proof-of-concept test)

Benchmarks

  • FACTSCORE (this work)