SUMMEDITS: a low-cost, reproducible benchmark showing most LLMs still fail at fine-grained factual consistency

May 23, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The protocol and benchmark are practical and low-cost; evidence includes multi-domain data, high inter-annotator agreement, and public release, but models still lag humans so further method work is needed.

Citations: 12

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Philippe Laban, Wojciech Kryściński, Divyansh Agarwal, Alexander R. Fabbri, Caiming Xiong, Shafiq Joty, Chien-Sheng Wu

Why It Matters For Business

Before relying on LLMs to flag factual errors, validate them on tough, domain-specific tests. Cheap in-house benchmarks built with the SUMMEDITS protocol (~$300/domain) catch real gaps while keeping annotation cost low.

Who Should Care

Summary TLDR

The paper tests many LLMs as factuality detectors for summarization and finds that they look good on simple benchmarks but fail on harder or more precise tests, often giving incorrect or uninformative explanations. The authors introduce SUMMEDITS: a protocol and a 10-domain benchmark built from a small set of verified seed summaries plus many atomic edits. SUMMEDITS is reproducible (high annotator agreement), cheap (~$300/domain), and challenging: GPT-4 scores 82.4% balanced accuracy vs. ~90.9% estimated human performance. The dataset and code are released.

Problem Statement

Current factual-consistency benchmarks and simple accuracy metrics overestimate LLMs' ability to detect factual errors in summaries. Benchmarks contain label noise and are costly to create, and LLMs often give incorrect or uninformative explanations. We need a cheap, reproducible benchmark and a careful analysis of LLM failure modes.

Main Contribution

Systematic evaluation of many LLMs and specialized detectors on multiple factual-consistency benchmarks, with analysis of explanation quality.

A three-step, low-cost protocol to build factual-consistency detection benchmarks from verified seed summaries and many small edits.
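As a rough illustration of the seed-plus-edits idea, the sketch below pairs one verified seed summary with a few small edits, each carrying a consistent/inconsistent label. The class names, field names, and example sentences are hypothetical. The summary here does not enumerate the protocol's three steps, so this models only the shape of the resulting data, not the authors' pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class EditedSummary:
    text: str
    consistent: bool  # annotator label: is the edit still faithful to the document?


@dataclass
class SeedEntry:
    document: str
    seed_summary: str  # assumed to have been manually verified as consistent
    edits: list[EditedSummary] = field(default_factory=list)


def to_samples(entry: SeedEntry) -> list[tuple[str, str, bool]]:
    """Flatten one seed entry into (document, summary, label) samples."""
    samples = [(entry.document, entry.seed_summary, True)]
    samples += [(entry.document, e.text, e.consistent) for e in entry.edits]
    return samples


# Hypothetical example: one seed, one harmless paraphrase, one factual change.
entry = SeedEntry(
    document="The council approved the budget on Monday by a 7-2 vote.",
    seed_summary="The council approved the budget by a 7-2 vote.",
    edits=[
        EditedSummary("The council passed the budget by a 7-2 vote.", True),
        EditedSummary("The council approved the budget by a 5-4 vote.", False),
    ],
)
samples = to_samples(entry)
```

Because every edit differs from the verified seed by a small change, an annotator only has to judge that one change, which is plausibly where the low per-domain cost comes from.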

Key Findings

LLMs match or beat specialized methods on simple benchmarks but degrade on harder settings.

Numbers: FactCC GPT-4 balanced acc. 91.3% (Table 1); SUMMEDITS overall GPT-4 82.4% vs QAFactEval 65.7% (Table 9)

Practical Use: Do not trust single-benchmark accuracy. Validate models on harder, domain-specific tests before deployment.

Evidence Ref: Tables 1, 9
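The figures above are balanced accuracies. As a minimal sketch, assuming the standard definition (the mean of recall on each class), the function below shows why this metric is harder to game than plain accuracy on imbalanced data. The function name and the toy labels are illustrative, not taken from the paper's code.

```python
def balanced_accuracy(labels: list[bool], predictions: list[bool]) -> float:
    """Mean of recall on the consistent class and recall on the inconsistent class."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    recalls = []
    for cls in (True, False):
        indices = [i for i, y in enumerate(labels) if y == cls]
        if not indices:
            raise ValueError("both classes must appear in labels")
        correct = sum(1 for i in indices if predictions[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Toy data: 8 consistent summaries, 2 inconsistent ones.
labels = [True] * 8 + [False] * 2
# A detector that always answers "consistent".
always_consistent = [True] * 10

plain_accuracy = sum(p == y for p, y in zip(always_consistent, labels)) / len(labels)
balanced = balanced_accuracy(labels, always_consistent)
```

On this toy data the always-consistent detector reaches 0.8 plain accuracy but only 0.5 balanced accuracy, the same as chance.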

LLM explanations often fail to pinpoint the factual error.

Numbers: 9/16 models gave <10% correct explanations; only GPT-4, Claude v1.3, and Bard exceeded 50% (Section 3.5)

Practical Use: When using LLMs as judges, manually sample explanations; prefer models that reliably explain, or abstain when unsure.

Evidence Ref: Section 3.5, Figure 2
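One way to act on the sampling advice is sketched below: draw a reproducible random sample of model explanations for a human to grade, then report the fraction judged correct. The record fields (`id`, `explanation`) and both function names are hypothetical, and the paper's own grading procedure may differ.

```python
import random


def sample_for_review(records: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw up to k records without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def explanation_accuracy(judgments: list[bool]) -> float:
    """Fraction of reviewed explanations a human marked as correct."""
    if not judgments:
        raise ValueError("no judgments to score")
    return sum(judgments) / len(judgments)


# Hypothetical model outputs awaiting manual review.
records = [{"id": i, "explanation": f"explanation {i}"} for i in range(20)]
to_review = sample_for_review(records, k=5)

# After a reviewer grades the sample (made-up grades for illustration only):
score = explanation_accuracy([True, False, False, False])
```

Fixing the seed lets a second reviewer grade exactly the same sample, which helps when comparing judgments.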

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | GPT-4 82.4% | Human perf. 90.9% | −8.5 pp vs human | SUMMEDITS (10 domains) | Table 9 reports GPT-4 82.4 and Human 90.9 overall | Table 9 |
| SUMMEDITS specialized baseline | QAFactEval 65.7% balanced accuracy | | | SUMMEDITS overall | Table 9 shows QAFactEval 65.7% | Table 9 |
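The delta column is a plain difference in percentage points. The short sketch below reproduces it from the reported scores. The helper name is illustrative, and the GPT-4 vs QAFactEval gap is not stated in the results above; it is derived here from the two reported numbers.

```python
# Overall SUMMEDITS balanced accuracies as reported above (percent).
reported = {"GPT-4": 82.4, "Human": 90.9, "QAFactEval": 65.7}


def delta_pp(value: float, baseline: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(value - baseline, 1)


gpt4_vs_human = delta_pp(reported["GPT-4"], reported["Human"])
gpt4_vs_qafacteval = delta_pp(reported["GPT-4"], reported["QAFactEval"])
```

Here `gpt4_vs_human` is -8.5 and `gpt4_vs_qafacteval` is 16.7, so GPT-4 sits well above the specialized baseline while still trailing the human estimate.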

What To Try In 7 Days

Run SUMMEDITS (or a domain subset) on your model to spot factual failure modes.

Create a small in-domain benchmark using the seed+edits protocol (~$300/domain).

Manually review a sample of LLM explanations; prefer models that abstain or explain correctly over plausible-sounding but wrong answers.

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Edit-generation used ChatGPT, so benchmark distribution may favor models similar to that LLM (Section 7).

SUMMEDITS tests detection on edited summaries, not summarizer generation quality or real-world model outputs (Section 7).

When Not To Use

To evaluate a summarizer's native tendency to hallucinate (SUMMEDITS uses synthetic edits).

As the sole metric for production readiness; human review still needed where accuracy matters.

Failure Modes

Model abstains or omits explanation despite prompt (no explanation).

Model gives plausible-sounding but incorrect explanations (misleading confidence).

Core Entities

Models

GPT-4, GPT3.5-turbo, ChatGPT, PaLM2-Bison, text-davinci-003, davinci-002, davinci-001, SummaC, DAE, QAFactEval, Claude V1.3, Bard, Vicuna-13b, LLaMa-13b, Alpaca-13b, MPT-7B-Chat, Cohere-CMD-XL, Dolly-v2-12B

Metrics

Accuracy, correlation, Cohen's kappa, Krippendorff's alpha, precision, recall, F1

Datasets

SUMMEDITS, FactCC, AggreFact, DialSummEval, SamSum, SciTLDR, BillSum, QMSum, ECTSum, News (recent feeds), Podcast

Benchmarks

SUMMEDITS, FactCC, AggreFact, DialSummEval

Context Entities

Models

text-ada-001, text-babbage-001, text-curie-001, text-davinci-001, text-davinci-002, text-davinci-003, PaLM-v2-bison, Claude (Anthropic), Cohere command-xlarge