Overview
The protocol and benchmark are practical and low-cost. Supporting evidence includes multi-domain data, high inter-annotator agreement, and a public release, but models still lag humans, so further method work is needed.
Citations: 12
Evidence Strength: 0.80
Confidence: 0.88
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Before relying on LLMs to flag factual errors, validate them on tough, domain-specific tests; low-cost in-house benchmarks built with the SUMMEDITS protocol expose real gaps at about $300 per domain.
Who Should Care
Summary TLDR
The paper tests many LLMs as factuality detectors for summarization and finds that they look strong on simple benchmarks but fail on harder or more precise tests, and that they often give incorrect or uninformative explanations. The authors introduce SUMMEDITS: a protocol and a 10-domain benchmark built from a small set of verified seed summaries, each expanded into many atomic edits. SUMMEDITS is reproducible (high annotator agreement), cheap (~$300/domain), and challenging: GPT-4 scores 82.4% balanced accuracy against an estimated human performance of ~90.9%. The dataset and code are released.
Problem Statement
Current factual-consistency benchmarks and simple accuracy metrics overestimate LLMs' ability to detect factual errors in summaries. Benchmarks contain label noise and are costly to create, and LLMs often give incorrect or uninformative explanations. We need a cheap, reproducible benchmark and a careful analysis of LLM failure modes.
Main Contribution
Systematic evaluation of many LLMs and specialized detectors on multiple factual-consistency benchmarks, with analysis of explanation quality.
A three-step, low-cost protocol to build factual-consistency detection benchmarks from verified seed summaries and many small edits.
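The seed-plus-edits idea can be pictured as a small data structure. The sketch below is illustrative only: the class names, field names, and the `to_samples` helper are hypothetical and are not taken from the released code, and it does not model the protocol's individual steps.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EditedSummary:
    """One atomic edit of a seed summary (hypothetical shape)."""
    text: str
    consistent: bool  # True if the edit is still supported by the document


@dataclass
class SeedEntry:
    """A source document, its verified seed summary, and the edits derived from it."""
    document: str
    seed_summary: str
    edits: List[EditedSummary] = field(default_factory=list)


def to_samples(entry: SeedEntry) -> List[Tuple[str, str, bool]]:
    """Flatten one entry into (document, summary, label) triples.

    Assumption: only the edited summaries become samples; whether the
    seed itself is also scored is left open here.
    """
    return [(entry.document, e.text, e.consistent) for e in entry.edits]
```

Because every edit is tied to one verified seed, each label is a judgment about a single small change, which is plausibly what keeps annotation cheap and agreement high.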
Key Findings
LLMs match or beat specialized methods on simple benchmarks but degrade on harder settings.
LLM explanations often fail to pinpoint the factual error.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Balanced accuracy | GPT-4 82.4% | Human perf. 90.9% | −8.5 pp vs human | SUMMEDITS (10 domains) | Table 9 reports GPT-4 82.4 and Human 90.9 overall | Table 9 |
| Balanced accuracy | QAFactEval 65.7% (specialized baseline) | — | — | SUMMEDITS overall | Table 9 shows QAFactEval 65.7% | Table 9 |
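Balanced accuracy is the mean of the recall on each class, so a detector that labels every summary "consistent" scores 50% no matter how skewed the labels are. A minimal sketch of the metric and of the gap reported above follows; the function name and the boolean label convention are assumptions for illustration, not the paper's code.

```python
def balanced_accuracy(labels, predictions):
    """Mean of per-class recall for binary labels (True = consistent)."""
    pos = [p for l, p in zip(labels, predictions) if l]
    neg = [p for l, p in zip(labels, predictions) if not l]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    recall_consistent = sum(1 for p in pos if p) / len(pos)
    recall_inconsistent = sum(1 for p in neg if not p) / len(neg)
    return (recall_consistent + recall_inconsistent) / 2


# A detector that always answers "consistent" on a skewed toy set:
labels = [True, True, False, False, False, False]
always_consistent = [True] * len(labels)
print(balanced_accuracy(labels, always_consistent))  # 0.5

# Gap between the reported human and GPT-4 scores, in percentage points:
print(round(90.9 - 82.4, 1))  # 8.5
```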
What To Try In 7 Days
Run SUMMEDITS (or a domain subset) on your model to spot factual failure modes.
Create a small in-domain benchmark using the seed+edits protocol (~$300/domain).
Manually review a sample of LLM explanations; prefer models that abstain or explain correctly over plausible-sounding but wrong answers.
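For the manual review step, one possible approach is to draw a fixed-seed random sample of explanations so that reviewers see the same items on every run, and to set aside outputs with no explanation at all. The record shape (a dict with an `explanation` key) and the function name are hypothetical.

```python
import random


def sample_for_review(records, k=20, seed=0):
    """Split records into a reproducible review sample and a missing-explanation list.

    Each record is assumed to be a dict with an optional "explanation" string.
    """
    def has_explanation(record):
        return bool((record.get("explanation") or "").strip())

    explained = [r for r in records if has_explanation(r)]
    missing = [r for r in records if not has_explanation(r)]
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    picked = rng.sample(explained, min(k, len(explained)))
    return picked, missing
```

The `missing` list gives a direct count of the "no explanation" failure mode, while `picked` is the set a human would read to judge whether each explanation actually pinpoints the error.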
Risks & Boundaries
Limitations
Edit-generation used ChatGPT, so benchmark distribution may favor models similar to that LLM (Section 7).
SUMMEDITS tests detection on edited summaries, not summarizer generation quality or real-world model outputs (Section 7).
When Not To Use
To evaluate a summarizer's native tendency to hallucinate (SUMMEDITS uses synthetic edits).
As the sole metric for production readiness; human review still needed where accuracy matters.
Failure Modes
Model abstains or omits explanation despite prompt (no explanation).
Model gives plausible-sounding but incorrect explanations (misleading confidence).

