SUMMEDITS: a low-cost, reproducible benchmark showing most LLMs still fail at fine-grained factual consistency

May 23, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

12

Authors

Philippe Laban, Wojciech Kryściński, Divyansh Agarwal, Alexander R. Fabbri, Caiming Xiong, Shafiq Joty, Chien-Sheng Wu

Links

Abstract / PDF

Why It Matters For Business

Before relying on LLMs to flag factual errors, validate them on tough, domain-specific tests; cheap in-house benchmarks like SUMMEDITS catch real gaps and cut annotation cost dramatically.

Summary TLDR

The paper evaluates many LLMs as factual-inconsistency detectors for summarization. The models look strong on simple benchmarks but fail on harder, more fine-grained tests, and they often give incorrect explanations. The authors introduce SUMMEDITS: a protocol and a 10-domain benchmark built from a small set of verified seed summaries plus many atomic edits. SUMMEDITS is reproducible (high annotator agreement), cheap (~$300/domain), and challenging: GPT-4 scores 82.4% balanced accuracy vs. an estimated 90.9% for humans. The dataset and code are released.

Problem Statement

Current factual-consistency benchmarks and simple accuracy metrics overestimate LLMs' ability to detect factual errors in summaries. Benchmarks contain label noise and are costly to create, and LLMs often give incorrect or uninformative explanations. We need a cheap, reproducible benchmark and a careful analysis of LLM failure modes.

Main Contribution

Systematic evaluation of many LLMs and specialized detectors on multiple factual-consistency benchmarks, with analysis of explanation quality.

A three-step, low-cost protocol to build factual-consistency detection benchmarks from verified seed summaries and many small edits.

SUMMEDITS: a 10-domain benchmark (6,348 edited summaries) built with the protocol, with high inter-annotator agreement and public release.

Empirical findings: LLMs often fail on fine-grained edits and produce poor explanations; GPT-4 remains about 8.5 percentage points below estimated human performance on SUMMEDITS (82.4% vs. 90.9% balanced accuracy).
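
The seed-plus-edits idea can be pictured as a small data structure. What follows is a minimal sketch under stated assumptions, not the authors' released code: the class names, field names, and the example text are all hypothetical, and the binary label is a simplification chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EditedSummary:
    text: str         # seed summary with one small (atomic) edit applied
    consistent: bool  # annotator label: is the edit still supported by the document?


@dataclass
class SeedEntry:
    domain: str
    document: str
    seed_summary: str  # assumed to be manually verified as consistent
    edits: list[EditedSummary] = field(default_factory=list)

    def samples(self):
        """Yield (document, summary, label) triples for a detector to classify."""
        for edit in self.edits:
            yield self.document, edit.text, edit.consistent


# Invented example: one seed, two atomic edits.
entry = SeedEntry(
    domain="news",
    document="The council approved the budget on Tuesday by a 5-2 vote.",
    seed_summary="The council passed the budget 5-2 on Tuesday.",
)
entry.edits.append(EditedSummary("The council passed the budget 5-2 on Monday.", False))
entry.edits.append(EditedSummary("The council approved the budget 5-2 on Tuesday.", True))
triples = list(entry.samples())
```

Because every edit differs from a verified seed by a small change, an annotator only has to judge that one change, which is plausibly where the low annotation cost and high agreement come from.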

Key Findings

LLMs match or beat specialized methods on simple benchmarks but degrade on harder settings.

Numbers: FactCC GPT-4 balanced acc. 91.3% (Table 1); SUMMEDITS overall GPT-4 82.4% vs QAFactEval 65.7% (Table 9)

LLM explanations often fail to pinpoint the factual error.

Numbers: 9/16 models gave <10% correct explanations; only GPT-4, Claude v1.3, Bard >50% (Section 3.5)

Crowd-based benchmarks contain label noise detectable by LLM explanations.

Numbers: Manual check: ≥80 of 101 GPT-4 conflicts in AggreFact were correct → ≥6% mislabeled samples (Section 4.2)

SUMMEDITS is highly reproducible and cost-effective.

Numbers: Overall IAA Cohen's kappa ≈0.92 (filtered); annotation cost ≈USD 3,000 total, ≈USD 300 per domain (Table 8, Section 6.2)

Most LLMs struggle on SUMMEDITS; GPT-4 remains below human level.

Numbers: SUMMEDITS overall: GPT-4 82.4% balanced accuracy; estimated human 90.9% (Table 9)
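
The figures above are balanced accuracy, which averages recall over the consistent and inconsistent classes so that a detector cannot score well by always predicting the majority label. A minimal sketch of the metric, with toy labels invented for illustration (the function name and data are not from the paper):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the consistent class and recall on the inconsistent class."""
    recalls = []
    for cls in (True, False):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if not idx:
            raise ValueError(f"no samples with label {cls}")
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy data: 2 consistent and 8 inconsistent summaries.
labels = [True, True] + [False] * 8
always_inconsistent = [False] * 10

plain_accuracy = sum(p == y for p, y in zip(always_inconsistent, labels)) / len(labels)
print(plain_accuracy)                                  # 0.8
print(balanced_accuracy(labels, always_inconsistent))  # 0.5
```

The trivial detector reaches 80% plain accuracy on this toy set but only 50% balanced accuracy, the same as chance.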

Results

Balanced accuracy (SUMMEDITS)

Value: GPT-4 82.4%

Baseline: Estimated human performance 90.9%

SUMMEDITS specialized baseline

Value: QAFactEval 65.7% balanced accuracy

Balanced accuracy (FactCC)

Value: GPT-4 91.3%

Baseline: SummaC 96.8%

AggreFact label noise found

Value: ≥6% mislabeled inconsistent samples

SUMMEDITS reproducibility

Value: Overall IAA Cohen's kappa ≈0.92 (filtered)
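
Cohen's kappa, the agreement statistic reported above, corrects the raw agreement rate between two annotators for the agreement expected by chance. A minimal sketch for checking agreement on an in-house annotation pass; the function name and the toy labels are illustrative assumptions, not data from the paper:

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both annotators use one identical label")
    return (p_o - p_e) / (1 - p_e)


# Toy labels: 1 = consistent, 0 = inconsistent; the annotators disagree on one item.
annotator_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
annotator_2 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))  # 0.78
```

Here the raw agreement is 90%, but kappa is roughly 0.78 once chance agreement (0.54) is discounted, which shows why a kappa near 0.92 indicates very consistent labeling.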

Who Should Care

What To Try In 7 Days

Run SUMMEDITS (or a domain subset) on your model to spot factual failure modes.

Create a small in-domain benchmark using the seed+edits protocol (~$300/domain).

Manually review a sample of LLM explanations; prefer models that abstain or explain correctly over plausible-sounding but wrong answers.
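
For the manual review step, a fixed-seed random sample per model keeps the review set reproducible and avoids cherry-picking. A minimal sketch, assuming each record is a dict with hypothetical `model` and `explanation` keys (this record layout is an assumption, not part of the SUMMEDITS release):

```python
import random


def sample_for_review(records, per_model=20, seed=0):
    """Pick up to `per_model` records for each model, reproducibly."""
    rng = random.Random(seed)
    by_model = {}
    for rec in records:
        by_model.setdefault(rec["model"], []).append(rec)
    picked = []
    for model in sorted(by_model):  # sorted so the draw order is stable
        pool = by_model[model]
        picked.extend(rng.sample(pool, min(per_model, len(pool))))
    return picked


# Invented records: 50 explanations from one model, 5 from another.
records = [{"model": "model-a", "explanation": f"a{i}"} for i in range(50)]
records += [{"model": "model-b", "explanation": f"b{i}"} for i in range(5)]
review_set = sample_for_review(records, per_model=20, seed=0)
```

Each sampled explanation can then be marked by a reviewer as correct, incorrect, or missing, mirroring the failure modes listed further down.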

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Edit-generation used ChatGPT, so benchmark distribution may favor models similar to that LLM (Section 7).
  • SUMMEDITS tests detection on edited summaries, not summarizer generation quality or real-world model outputs (Section 7).
  • Model scores improve in the oracle setting (seed summary provided), which indicates the task remains challenging without that scaffolding (Section 6.3).

When Not To Use

  • To evaluate a summarizer's native tendency to hallucinate (SUMMEDITS uses synthetic edits).
  • As the sole metric for production readiness; human review still needed where accuracy matters.

Failure Modes

  • Model abstains or omits explanation despite prompt (no explanation).
  • Model gives plausible-sounding but incorrect explanations (misleading confidence).
  • Models detect general inconsistency but fail fine-grained error-type discrimination.

Core Entities

Models

  • GPT-4
  • GPT3.5-turbo
  • ChatGPT
  • PaLM2-Bison
  • text-davinci-003
  • davinci-002
  • davinci-001
  • SummaC
  • DAE
  • QAFactEval
  • Claude v1.3
  • Bard
  • Vicuna-13b
  • LLaMa-13b
  • Alpaca-13b
  • MPT-7B-Chat
  • Cohere-CMD-XL
  • Dolly-v2-12B

Metrics

  • Accuracy
  • correlation
  • Cohen's kappa
  • Krippendorff's alpha
  • precision
  • recall
  • F1

Datasets

  • SUMMEDITS
  • FactCC
  • AggreFact
  • DialSummEval
  • SamSum
  • SciTLDR
  • BillSum
  • QMSum
  • ECTSum
  • News (recent feeds)
  • Podcast

Benchmarks

  • SUMMEDITS
  • FactCC
  • AggreFact
  • DialSummEval

Context Entities

Models

  • text-ada-001
  • text-babbage-001
  • text-curie-001
  • text-davinci-001
  • text-davinci-002
  • text-davinci-003
  • PaLM-v2-bison
  • Claude (Anthropic)
  • Cohere command-xlarge