SUMMEDITS: a low-cost, reproducible benchmark showing most LLMs still fail at fine-grained factual consistency

Overview

Decision Snapshot: Ready For Pilot

The protocol and benchmark are practical and low-cost. Evidence includes multi-domain data, high inter-annotator agreement, and a public release. Models still lag humans, so further method work is needed.

Citations: 12

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Philippe Laban, Wojciech Kryściński, Divyansh Agarwal, Alexander R. Fabbri, Caiming Xiong, Shafiq Joty, Chien-Sheng Wu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Before relying on LLMs to flag factual errors, validate them on tough, domain-specific tests. In-house benchmarks built with the SUMMEDITS protocol catch real gaps and keep annotation cost low (about $300 per domain).

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

The paper tests many LLMs as factual-consistency detectors for summarization. The models look good on simple benchmarks but fail on harder or more precise tests, and they often give incorrect explanations. The authors introduce SUMMEDITS: a protocol and a 10-domain benchmark built from a small set of verified seed summaries plus many atomic edits. SUMMEDITS is reproducible (high annotator agreement), cheap (~$300/domain), and challenging: GPT-4 scores 82.4% balanced accuracy vs. an estimated human performance of 90.9%. The dataset and code are released.

Problem Statement

Current factual-consistency benchmarks and simple accuracy metrics overestimate LLMs' ability to detect factual errors in summaries. Benchmarks contain label noise and are costly to create, and LLMs often give incorrect or uninformative explanations. We need a cheap, reproducible benchmark and a careful analysis of LLM failure modes.

Main Contribution

Systematic evaluation of many LLMs and specialized detectors on multiple factual-consistency benchmarks, with analysis of explanation quality.

A three-step, low-cost protocol to build factual-consistency detection benchmarks from verified seed summaries and many small edits.
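The seed-plus-edits idea can be pictured as a small data model. This is a minimal sketch, not the authors' released code: the class names, field names, and example sentences are all hypothetical, and the edits are hand-written here, whereas the paper generated them with an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class EditedSummary:
    text: str
    consistent: bool  # annotator label: is the edit still faithful to the document?


@dataclass
class SeedSample:
    document: str
    seed_summary: str  # assumed to be verified as consistent before any editing
    edits: list[EditedSummary] = field(default_factory=list)

    def to_examples(self):
        """Flatten the edits into (document, summary, label) triples."""
        for edit in self.edits:
            yield (self.document, edit.text, edit.consistent)


# Hypothetical example: one seed summary with two small, atomic edits.
sample = SeedSample(
    document="The company reported revenue of $5 million in 2022, up from $4 million in 2021.",
    seed_summary="Revenue rose to $5 million in 2022.",
)
sample.edits.append(EditedSummary("Revenue grew to $5 million in 2022.", True))
sample.edits.append(EditedSummary("Revenue rose to $6 million in 2022.", False))

examples = list(sample.to_examples())
```

Because every edit changes only a small part of a verified seed, an annotator can label each one by comparing it against the seed rather than rereading the whole document, which is one plausible reason the per-domain cost stays low.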

Key Findings

LLMs match or beat specialized methods on simple benchmarks but degrade on harder settings.

Numbers: On FactCC, GPT-4 reaches 91.3% balanced accuracy (Table 1); on SUMMEDITS overall, GPT-4 scores 82.4% vs. 65.7% for QAFactEval (Table 9).

Practical Use: Do not trust single-benchmark accuracy. Validate models on harder, domain-specific tests before deployment.

Evidence Ref: Tables 1, 9

LLM explanations often fail to pinpoint the factual error.

Numbers: 9 of 16 models gave correct explanations less than 10% of the time; only GPT-4, Claude V1.3, and Bard exceeded 50% (Section 3.5).

Practical Use: When using LLMs as judges, manually sample explanations; prefer models that reliably explain, or abstain when unsure.

Evidence Ref: Section 3.5, Figure 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Balanced accuracy	GPT-4 82.4%	Human perf. 90.9%	−8.5 pp vs. human	SUMMEDITS (10 domains)	Table 9 reports GPT-4 82.4 and Human 90.9 overall	Table 9
Balanced accuracy (specialized baseline)	QAFactEval 65.7%	n/a	n/a	SUMMEDITS overall	Table 9 shows QAFactEval 65.7%	Table 9
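For readers unfamiliar with the metric, balanced accuracy is the mean of per-class recall, so a detector cannot score well by always predicting the majority label. The sketch below is illustrative only: the toy labels are invented, and only the two Table 9 figures in the last line come from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (True = consistent)."""
    recalls = []
    for cls in (True, False):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if not idx:
            raise ValueError(f"no examples with label {cls}")
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy data (invented): 8 inconsistent summaries, 2 consistent ones,
# and a detector that always answers "inconsistent".
y_true = [False] * 8 + [True] * 2
y_pred = [False] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.8
bal = balanced_accuracy(y_true, y_pred)                            # 0.5

# Delta reported in the table: GPT-4 vs. estimated human performance.
gap_pp = round(82.4 - 90.9, 1)  # -8.5 percentage points
```

The always-"inconsistent" detector gets 80% plain accuracy but only 50% balanced accuracy, which is why the latter is the more informative number on an imbalanced benchmark.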

What To Try In 7 Days

Run SUMMEDITS (or a domain subset) on your model to spot factual failure modes.

Create a small in-domain benchmark using the seed+edits protocol (~$300/domain).

Manually review a sample of LLM explanations; prefer models that abstain or explain correctly over plausible-sounding but wrong answers.
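The manual explanation review in the last step can be kept reproducible with a fixed random seed and a simple tally. This is a rough sketch under stated assumptions: the record layout, the function names, and the three verdict labels (`correct`, `incorrect`, `none`) are hypothetical and are not the paper's annotation scheme.

```python
import random

# Hypothetical verdict labels a human reviewer assigns to each explanation.
VERDICTS = ("correct", "incorrect", "none")


def sample_for_review(records, k, seed=0):
    """Pick up to k records uniformly at random, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def explanation_shares(verdicts):
    """Fraction of reviewed explanations that fall under each verdict."""
    counts = {label: 0 for label in VERDICTS}
    for v in verdicts:
        if v not in counts:
            raise ValueError(f"unknown verdict: {v!r}")
        counts[v] += 1
    total = len(verdicts)
    return {label: (n / total if total else 0.0) for label, n in counts.items()}


# Invented model outputs, standing in for real detector explanations.
records = [{"id": i, "explanation": f"explanation {i}"} for i in range(20)]
picked = sample_for_review(records, k=5)

# After a human has reviewed the picked records:
shares = explanation_shares(["correct", "none", "incorrect", "correct"])
```

Tracking the `none` share separately from `incorrect` keeps abstentions distinguishable from plausible-sounding but wrong explanations.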

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/salesforce/factualNLG

Data URLs

https://github.com/salesforce/factualNLG

Risks & Boundaries

Limitations

Edits were generated with ChatGPT, so the benchmark distribution may favor models similar to that LLM (Section 7).

SUMMEDITS tests detection on edited summaries, not summarizer generation quality or real-world model outputs (Section 7).

When Not To Use

To evaluate a summarizer's native tendency to hallucinate (SUMMEDITS uses synthetic edits).

As the sole metric for production readiness; human review still needed where accuracy matters.

Failure Modes

Model abstains or omits explanation despite prompt (no explanation).

Model gives plausible-sounding but incorrect explanations (misleading confidence).

Core Entities

Models

GPT-4, GPT3.5-turbo, ChatGPT, PaLM2-Bison, text-davinci-003, davinci-002, davinci-001, SummaC, DAE, QAFactEval, Claude V1.3, Bard, Vicuna-13b, LLaMa-13b, Alpaca-13b, MPT-7B-Chat, Cohere-CMD-XL, Dolly-v2-12B

Metrics

Accuracy, correlation, Cohen's kappa, Krippendorff's alpha, precision, recall, F1

Datasets

SUMMEDITS, FactCC, AggreFact, DialSummEval, SamSum, SciTLDR, BillSum, QMSum, ECTSum, News (recent feeds), Podcast

Benchmarks

SUMMEDITS, FactCC, AggreFact, DialSummEval

Context Entities

Models

text-ada-001, text-babbage-001, text-curie-001, text-davinci-001, text-davinci-002, text-davinci-003, PaLM-v2-bison, Claude (Anthropic), Cohere command-xlarge
