Human judges prefer LLM summaries; reference summaries often contain more hallucinations.

September 18, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The evidence is human evaluation across five small, post-cutoff datasets and shows clear LLM preference and factuality advantages, but sample sizes are small (50 examples per task) and only a few LLM families were tested.

Citations: 32

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 40%

Authors

Xiao Pu, Mingqi Gao, Xiaojun Wan

Links

Abstract / PDF

Why It Matters For Business

Zero-shot LLMs can produce higher-quality, more factual summaries than many human references and fine-tuned models, so businesses can often deploy LLM summarization directly and shift effort to dataset curation and verification.

Who Should Care

Summary TLDR

The authors built fresh, small evaluation sets (50 examples per task) and ran human pairwise comparisons across five summarization tasks. GPT-series LLMs (text-davinci-003, GPT-3.5, GPT-4) were consistently preferred over human-written references and fine-tuned models. GPT-4 produced fewer sentence-level hallucinations than human references on several tasks, and had lower rates of extrinsic hallucination (40% vs human 62% average). The paper argues that standard summarization research focused on squeezing metrics on old datasets needs rethinking; future work should target higher-quality test sets, application-driven tasks, and better evaluation.

Problem Statement

Do modern LLMs already match or beat human and fine-tuned summarizers on real summarization tasks? The paper tests zero-shot LLM generation across five tasks using newly collected post-cutoff data and human pairwise judgments to measure overall quality and factual consistency.

Main Contribution

New human evaluation datasets for five summarization tasks (50 samples each) constructed after common LLM training cutoffs.

Human pairwise comparisons showing that GPT-series LLMs are preferred over human references and fine-tuned models across tasks.

Key Findings

Human judges prefer LLM summaries over human-written and fine-tuned model summaries in pairwise comparisons.

Numbers: Human preference scores for LLMs exceed 50% across tasks (Figure 4).

Practical Use: For immediate summarization use, try zero-shot LLM prompts before investing in task-specific fine-tuning or chasing small metric gains on old datasets.

Evidence Ref: Figure 1, Figure 4 (pairwise win rates and preference scores)

GPT-4 produced fewer sentence-level hallucinations than human references on several tasks.

Numbers: Sentence-level hallucination counts (GPT-4 vs Human): single 8 vs 13; multi 5 vs 62; cross-lingual 16 vs 15; dialogue 5 vs 15; code 9 vs 46

Practical Use: Use modern LLMs to reduce some factual errors in summaries, but still verify critical facts with source checks.

Evidence Ref: Table 1 (hallucination counts per task)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Human preference (pairwise win rate) | LLMs preferred over human and fine-tuned systems across tasks | | | Five tasks (single-news, multi-news, cross-lingual, dialogue, code) | Figure 1 and Figure 4 show LLM win rates and preference scores >50% | Figure 1, Figure 4 |
| Sentence-level hallucination counts (GPT-4) | single 8, multi 5, cross-lingual 16, dialogue 5, code 9 | Human: single 13, multi 62, cross-lingual 15, dialogue 15, code 46 | GPT-4 has fewer hallucinations in several tasks, large gap on multi-news and code | Table 1 per-task counts | Table 1 lists counts for GPT-4 and human references | Table 1 |
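
The Delta column for the hallucination counts is given only in words. As a rough worked example, the per-task differences can be computed directly from the counts listed above. The helper name `hallucination_deltas` is an illustrative assumption, and these are raw count differences: the summary does not report how many sentences each system produced, so they should not be read as rates.

```python
# Sentence-level hallucination counts as listed in the Results table (Table 1).
gpt4 = {"single": 8, "multi": 5, "cross-lingual": 16, "dialogue": 5, "code": 9}
human = {"single": 13, "multi": 62, "cross-lingual": 15, "dialogue": 15, "code": 46}


def hallucination_deltas(model_counts, reference_counts):
    """Return reference count minus model count per task.

    A positive value means the model produced fewer hallucinated sentences
    than the human reference for that task. Hypothetical helper, not from the paper.
    """
    return {task: reference_counts[task] - model_counts[task] for task in model_counts}


deltas = hallucination_deltas(gpt4, human)
for task, delta in deltas.items():
    print(f"{task}: {delta:+d}")
```

On these counts GPT-4 is ahead on four of the five tasks; cross-lingual is the exception, where the human references have one fewer hallucinated sentence.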

What To Try In 7 Days

Run zero-shot prompts on your summaries and compare against existing pipeline outputs using quick pairwise human checks.

Audit your reference summaries for extrinsic facts not present in sources and correct them.

Replace ROUGE-only checks with a small human evaluation focusing on factuality and usefulness.
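
A quick pairwise check like the one suggested above can be tallied with a few lines of Python. This is a minimal sketch, not the paper's evaluation code: the function name `win_rates`, the system labels, and the choice to count a tie as half a win for each side are all assumptions made for illustration.

```python
from collections import Counter


def win_rates(judgments):
    """Compute a per-system win rate from pairwise human judgments.

    judgments: iterable of (system_a, system_b, winner) tuples, where winner is
    system_a, system_b, or the string "tie". Ties count as half a win for each
    side (an assumption; other conventions drop ties entirely).
    """
    wins = Counter()
    comparisons = Counter()
    for system_a, system_b, winner in judgments:
        if winner not in (system_a, system_b, "tie"):
            raise ValueError(f"unexpected winner label: {winner!r}")
        comparisons[system_a] += 1
        comparisons[system_b] += 1
        if winner == "tie":
            wins[system_a] += 0.5
            wins[system_b] += 0.5
        else:
            wins[winner] += 1
    return {system: wins[system] / comparisons[system] for system in comparisons}


# Hypothetical judgments: a zero-shot LLM vs. an existing pipeline on four documents.
example_judgments = [
    ("llm", "pipeline", "llm"),
    ("llm", "pipeline", "llm"),
    ("llm", "pipeline", "pipeline"),
    ("llm", "pipeline", "tie"),
]
rates = win_rates(example_judgments)
print(rates)
```

Randomizing which summary appears first for each judge is a sensible precaution against position bias, though it is not shown here.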

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Each evaluation dataset contains only 50 samples per task, limiting statistical power.

Only GPT-family models were tested; LLaMA and Vicuna were excluded because their training cutoffs are unknown.
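
To make the sample-size limitation concrete, the sketch below computes a 95% Wilson score interval for a win rate measured on 50 examples. The 30-out-of-50 figure is a hypothetical input, not a number from the paper, and the Wilson interval is one common choice for a binomial proportion rather than the analysis the authors used.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical: the LLM summary is preferred in 30 of 50 pairwise comparisons.
low, high = wilson_interval(30, 50)
print(f"win rate 0.60, 95% interval roughly {low:.2f} to {high:.2f}")
```

With these inputs the interval runs from roughly 0.46 to 0.72, so a 60% win rate on 50 examples does not by itself rule out an even split. Larger observed margins, or more examples, narrow the interval.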

When Not To Use

Don't generalize these results to LLMs with an unknown training-data cutoff or a different architecture.

Don't assume the same gains hold on very long documents or niche domains not covered by the five tasks.

Failure Modes

Reference summaries contain extrinsic facts and can mislead both training and evaluation.

Small evaluation sets may overstate LLM advantages on broader data distributions.
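
For a first pass at spotting extrinsic facts in reference summaries, a crude lexical check can flag numbers and capitalized words that appear in the summary but nowhere in the source. This is a heuristic sketch under stated assumptions, not the human annotation procedure used in the paper: the function name `unsupported_tokens` is hypothetical, it will miss paraphrased fabrications, and it will raise false alarms on legitimate rewording, so flagged items still need manual review.

```python
import re


def unsupported_tokens(source, summary):
    """Return numbers and capitalized words in the summary that never occur in the source.

    Matching is case-insensitive and purely lexical; results are candidates
    for manual review, not confirmed hallucinations.
    """
    source_vocab = {token.lower() for token in re.findall(r"\w+", source)}
    flagged = []
    for token in re.findall(r"\w+", summary):
        is_candidate = token[0].isdigit() or token[0].isupper()
        if is_candidate and token.lower() not in source_vocab and token not in flagged:
            flagged.append(token)
    return flagged


# Hypothetical source/summary pair for illustration.
source_text = "The company reported revenue of 12 million dollars in March."
reference_summary = "Acme reported 12 million dollars in revenue in March 2023."
flagged_tokens = unsupported_tokens(source_text, reference_summary)
print(flagged_tokens)
```

Here the company name and the year are flagged because neither appears in the source text, which is the pattern an audit of extrinsic facts is looking for.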

Core Entities

Models

GPT-3 (text-davinci-003), GPT-3.5, GPT-4, BART, T5, Pegasus, mT5, mBART, CodeT5

Metrics

Pairwise win rate (human preference), Human preference score, Sentence-level hallucination counts, Proportion of extrinsic hallucinations, Cohen's kappa (0.558)
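
Cohen's kappa, listed above, measures agreement between two annotators after correcting for the agreement expected by chance. The sketch below shows the standard two-rater formula on made-up labels; how the paper paired its annotators to arrive at 0.558 is not stated here, so the function name `cohens_kappa` and the example data are illustrative assumptions only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical preferences from two judges over six summary pairs ("A" or "B").
judge_1 = ["A", "A", "B", "B", "A", "B"]
judge_2 = ["A", "B", "B", "B", "A", "A"]
kappa = cohens_kappa(judge_1, judge_2)
print(round(kappa, 3))
```

In this toy case the judges agree on four of six items and chance agreement is 0.5, giving a kappa of about 0.333.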

Datasets

New single-news (post-2021, 50 samples); New multi-news (post-2021, 50 samples); New dialogue (post-2021, 50 samples); New cross-lingual (translated single-news, 50 samples); New code (Go programs, 50 samples)