Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
12
Why It Matters For Business
Before relying on LLMs to flag factual errors in summaries, validate them on hard, domain-specific tests. Benchmarks built with the SUMMEDITS protocol are cheap to create in-house (about $300 per domain) and expose gaps that simpler benchmarks miss.
Summary TLDR
The paper evaluates many LLMs as factual-consistency detectors for summarization. They look strong on simple benchmarks but fail on harder or more fine-grained tests, and they often give incorrect explanations. The authors introduce SUMMEDITS: a protocol and a 10-domain benchmark built from a small set of verified seed summaries plus many atomic edits. SUMMEDITS is reproducible (high inter-annotator agreement), cheap (~$300 per domain), and challenging: GPT-4 reaches 82.4% balanced accuracy against an estimated human performance of ~90.9%. The dataset and code are released.
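The scores quoted above are balanced accuracy, not plain accuracy. As a minimal sketch (the function name and toy labels are illustrative assumptions, not taken from the released code), balanced accuracy for a binary consistent/inconsistent task can be computed as the mean of per-class recall:

```python
def balanced_accuracy(labels, predictions):
    """Mean of per-class recall for binary labels (1 = consistent, 0 = inconsistent)."""
    recalls = []
    for cls in (0, 1):
        total = sum(1 for y in labels if y == cls)
        if total == 0:
            raise ValueError(f"no examples with label {cls}")
        correct = sum(
            1 for y, p in zip(labels, predictions) if y == cls and p == cls
        )
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


# Toy data (made up): 2 consistent and 6 inconsistent summaries.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
# A degenerate detector that calls everything inconsistent.
always_inconsistent = [0] * len(labels)

plain_accuracy = sum(
    y == p for y, p in zip(labels, always_inconsistent)
) / len(labels)
print(plain_accuracy)                                   # 0.75
print(balanced_accuracy(labels, always_inconsistent))   # 0.5
```

On imbalanced data the degenerate detector gets 75% plain accuracy but only 50% balanced accuracy, which is why balanced accuracy is the more informative number for this kind of benchmark.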
Problem Statement
Current factual-consistency benchmarks and simple accuracy metrics overestimate LLMs' ability to detect factual errors in summaries. Benchmarks contain label noise and are costly to create, and LLMs often give incorrect or uninformative explanations. We need a cheap, reproducible benchmark and a careful analysis of LLM failure modes.
Main Contribution
Systematic evaluation of many LLMs and specialized detectors on multiple factual-consistency benchmarks, with analysis of explanation quality.
A three-step, low-cost protocol to build factual-consistency detection benchmarks from verified seed summaries and many small edits.
SUMMEDITS: a 10-domain benchmark (6,348 edited summaries) built with the protocol, with high inter-annotator agreement and public release.
Empirical findings: LLMs often fail on fine-grained edits and produce poor explanations; GPT-4 remains roughly 8 percentage points of balanced accuracy below estimated human performance on SUMMEDITS.
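The seed-plus-edits protocol can be pictured as a small data structure. The sketch below is an assumed reading of the three steps (verify a seed summary, generate small edits of it, have an annotator label each edit); the class names, field names, and the `propose_edits` and `annotate` callbacks are hypothetical and do not reflect the released SUMMEDITS code or schema.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EditedSummary:
    text: str
    consistent: Optional[bool] = None  # filled in by an annotator


@dataclass
class SeedEntry:
    document: str
    seed_summary: str  # assumed to be already verified as consistent
    edits: List[EditedSummary] = field(default_factory=list)


def build_entry(
    document: str,
    seed_summary: str,
    propose_edits: Callable[[str], List[str]],
    annotate: Callable[[str, str], bool],
) -> SeedEntry:
    """Attach labeled edits to one verified seed summary."""
    entry = SeedEntry(document, seed_summary)
    for text in propose_edits(seed_summary):
        if text == seed_summary:
            continue  # skip no-op edits
        entry.edits.append(EditedSummary(text, annotate(document, text)))
    return entry


# Toy stand-ins (hypothetical): a fixed edit list and a trivial "annotator".
doc = "The company reported revenue of 5 million dollars in 2022."
seed = "Revenue was 5 million dollars in 2022."
toy_edits = lambda s: [
    "Revenue was 5 million dollars in 2021.",      # changes a fact
    "Revenue reached 5 million dollars in 2022.",  # paraphrase only
]
toy_annotator = lambda document, summary: "2022" in summary

entry = build_entry(doc, seed, toy_edits, toy_annotator)
for e in entry.edits:
    print(e.consistent, "|", e.text)
```

In a real run the edit generator would be an LLM and the annotator a human, so the two callbacks are the natural seams for swapping those in.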
Key Findings
LLMs match or beat specialized methods on simple benchmarks but degrade on harder settings.
LLM explanations often fail to pinpoint the factual error.
Crowd-based benchmarks contain label noise detectable by LLM explanations.
SUMMEDITS is highly reproducible and cost-effective.
Most LLMs struggle on SUMMEDITS; GPT-4 remains below human level.
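Reproducibility here means that independent annotators agree on the labels. Cohen's kappa is one of the agreement metrics listed for this paper; which metric backs the reproducibility claim is not stated in this summary, so the following is only an illustrative sketch with made-up labels:

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels (made up): 1 = consistent, 0 = inconsistent.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

The two toy annotators agree on 3 of 4 items (75%), but half of that agreement is expected by chance, which leaves a kappa of 0.5.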
Results
Accuracy
SUMMEDITS specialized baseline
Accuracy
AggreFact label noise found
SUMMEDITS reproducibility
Who Should Care
What To Try In 7 Days
Run SUMMEDITS (or a domain subset) on your model to spot factual failure modes.
Create a small in-domain benchmark using the seed+edits protocol (~$300/domain).
Manually review a sample of LLM explanations; prefer models that abstain or explain correctly over plausible-sounding but wrong answers.
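A first pass at the steps above could group a detector's misses by domain so that they can be reviewed by hand. This is a hedged sketch: the sample fields (`domain`, `document`, `summary`, `consistent`) and the `detector` callback are assumptions made for illustration and are not the released SUMMEDITS schema or API.

```python
from collections import defaultdict


def missed_errors_by_domain(samples, detector):
    """Collect inconsistent summaries that the detector accepted, per domain.

    samples:  iterable of dicts with keys domain, document, summary, consistent
    detector: callable (document, summary) -> True if judged consistent
    """
    missed = defaultdict(list)
    for sample in samples:
        judged_consistent = detector(sample["document"], sample["summary"])
        if not sample["consistent"] and judged_consistent:
            missed[sample["domain"]].append(sample["summary"])
    return dict(missed)


# Toy samples (made up) and a deliberately lenient stand-in detector.
samples = [
    {"domain": "news", "document": "Rates rose in May.",
     "summary": "Rates fell in May.", "consistent": False},
    {"domain": "news", "document": "Rates rose in May.",
     "summary": "Rates went up in May.", "consistent": True},
    {"domain": "podcast", "document": "The guest is a chemist.",
     "summary": "The guest is a physicist.", "consistent": False},
]
lenient_detector = lambda document, summary: True  # accepts everything

result = missed_errors_by_domain(samples, lenient_detector)
for domain, summaries in result.items():
    print(domain, len(summaries), summaries)
```

Replacing `lenient_detector` with a call to the model under test yields the list of misses per domain, which is a reasonable starting set for the manual explanation review.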
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Edit-generation used ChatGPT, so benchmark distribution may favor models similar to that LLM (Section 7).
- SUMMEDITS tests detection on edited summaries, not summarizer generation quality or real-world model outputs (Section 7).
- The oracle setting (seed summary available to the model) improves scores, which suggests the task remains hard without that scaffolding (Section 6.3).
When Not To Use
- To evaluate a summarizer's native tendency to hallucinate (SUMMEDITS uses synthetic edits).
- As the sole metric for production readiness; human review still needed where accuracy matters.
Failure Modes
- Model abstains or omits explanation despite prompt (no explanation).
- Model gives plausible-sounding but incorrect explanations (misleading confidence).
- Models detect general inconsistency but fail fine-grained error-type discrimination.
Core Entities
Models
- GPT-4
- GPT3.5-turbo
- ChatGPT
- PaLM2-Bison
- text-davinci-003
- text-davinci-002
- text-davinci-001
- SummaC
- DAE
- QAFactEval
- Claude V1.3
- Bard
- Vicuna-13b
- LLaMa-13b
- Alpaca-13b
- MPT-7B-Chat
- Cohere-CMD-XL
- Dolly-v2-12B
Metrics
- Accuracy
- correlation
- Cohen's kappa
- Krippendorff's alpha
- precision
- recall
- F1
Datasets
- SUMMEDITS
- FactCC
- AggreFact
- DialSummEval
- SamSum
- SciTLDR
- BillSum
- QMSum
- ECTSum
- News (recent feeds)
- Podcast
Benchmarks
- SUMMEDITS
- FactCC
- AggreFact
- DialSummEval
Context Entities
Models
- text-ada-001
- text-babbage-001
- text-curie-001
- text-davinci-001
- text-davinci-002
- text-davinci-003
- PaLM-v2-bison
- Claude (Anthropic)
- Cohere command-xlarge

