Instruction tuning, not model size, drives LLM zero-shot news summarization; benchmark references are often worse than generated summaries.

January 31, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.35

Cost Impact Score

0.4

Citation Count

64

Authors

Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, Tatsunori B. Hashimoto

Why It Matters For Business

If you want usable zero-shot news summaries, choose instruction-tuned LLMs over the models with the most parameters, and validate or replace public benchmark references before trusting automatic metrics.

Summary TLDR

Human evaluation of ten large language models (LLMs) on CNN/DailyMail (CNN/DM) and XSUM shows that instruction-tuned models (the InstructGPT family) deliver strong zero-shot summarization and often beat larger models that lack instruction tuning. Both benchmarks contain low-quality reference summaries, which makes reference-based automatic metrics unreliable and understates the performance of supervised finetuning. When compared against high-quality summaries from freelance writers, the best LLM (Instruct Davinci) is rated comparable to the human writers, though the styles differ: LLM outputs are far more extractive.

Problem Statement

It's unclear which design choices (model scale, in-context examples, instruction tuning) drive LLM summarization success, and standard news benchmarks use low-quality reference summaries that can mislead metric-based evaluation.

Main Contribution

A human evaluation benchmark of ten diverse LLMs on CNN/DailyMail and XSUM, isolating zero-shot and five-shot settings.

Empirical finding that instruction tuning, not model size, is the primary factor for strong zero-shot summarization.

Evidence that existing reference summaries (CNN/DM, XSUM) are low quality; release of higher-quality freelance-written summaries and evaluation data.

Key Findings

Instruction tuning yields much stronger zero-shot summarization than model scale.

Numbers: Zero-shot Instruct Davinci faithfulness 0.99 vs GPT-3 175B faithfulness 0.76 on CNN/DM (Table 2)

Benchmark reference summaries are often lower quality than model outputs.

Numbers: Freelance reference faithfulness 0.93, Instruct Davinci 0.98, original references 0.64 (Table 4)

Best LLM is judged comparable to freelance human writers in blind pairwise tests.

Numbers: Annotators show no aggregate preference; more-abstractive summaries judged more informative 51.1% (Section 4.2, Fig. 5)

Reference-based automatic metrics can be misleading when references are poor.

Numbers: ROUGE-L vs human faithfulness on XSUM, Kendall's tau = -0.27 with original refs (Table 3); correlation becomes positive with freelance-

LLM summaries are more extractive than freelance writer summaries.

Numbers: Coverage/density: Instruct Davinci 0.92/12.1 vs writers 0.81/2.07 (Section 4.2)
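The coverage and density figures above can be computed on your own outputs. The following is a minimal sketch, assuming the figures follow the extractive-fragment definitions of Grusky et al. (2018): coverage is the fraction of summary tokens that fall inside a span copied from the article, and density is the average squared length of those spans. Whitespace tokenization and all function names are illustrative assumptions, not the paper's code.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily match each summary position to its longest copied span."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(article_tokens)):
            # Length of the shared span starting at summary[i] and article[j].
            k = 0
            while (i + k < len(summary_tokens)
                   and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best
        else:
            i += 1  # token does not appear in the article
    return fragments


def coverage_and_density(article, summary):
    """Return (coverage, density) for whitespace-tokenized, lowercased text."""
    article_tokens = article.lower().split()
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0, 0.0
    fragments = extractive_fragments(article_tokens, summary_tokens)
    n = len(summary_tokens)
    coverage = sum(len(f) for f in fragments) / n
    density = sum(len(f) ** 2 for f in fragments) / n
    return coverage, density


if __name__ == "__main__":
    # Toy strings for illustration only, not data from the paper.
    print(coverage_and_density("the cat sat on the mat today",
                               "the cat sat quietly on the mat"))
```

Higher density means longer copied spans: a summary copied verbatim from one contiguous passage scores coverage 1.0 and density equal to its token count.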

Results

Zero-shot faithfulness (CNN/DailyMail)

Value: Instruct Davinci 0.99

Baseline: GPT-3 175B 0.76

Freelance vs Instruct Davinci quality (human control)

Value: Freelance writer faithfulness 0.93; coherence 4.39; relevance 4.26

Baseline: Instruct Davinci faithfulness 0.98; coherence 4.26; relevance 4.40

ROUGE-L vs human (XSUM)

Value: ROUGE-L Kendall's tau = -0.27 (faithfulness)
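To check whether an automatic metric tracks human judgment on your own data, you can compute the same kind of rank correlation. The following is a sketch of Kendall's tau-b in plain Python. The tau variant and the level of aggregation used in the paper are not stated here, so tau-b is an assumption, chosen because binary faithfulness labels produce many ties. The sample scores are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length score lists (handles ties)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both: counts toward neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical metric scores and binary human faithfulness labels.
    metric_scores = [0.1, 0.2, 0.3, 0.4]
    human_faithful = [1, 1, 0, 0]
    print(kendall_tau_b(metric_scores, human_faithful))
```

A negative value, like the one reported for XSUM, means the metric tends to rank summaries in the opposite order from human raters.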

What To Try In 7 Days

Run a quick zero-shot test with an instruction-tuned API model (Instruct Davinci/Curie) on 50 articles and inspect outputs.

Manually re-evaluate 20 benchmark references; if many are low quality, compute metrics against writer-quality references instead.

Add a short prompt to reduce extractiveness (ask for paraphrase) and compare coverage/density on a small sample.
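For the second step above, scoring the same outputs against two reference sets, a simplified ROUGE-L can be sketched as an F1 over the longest common subsequence. This version lowercases, splits on whitespace, and skips stemming, so its scores will differ from those of standard ROUGE packages. All function names and sample strings are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 between a candidate and one reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_rouge_l(candidates, references):
    """Average ROUGE-L F1 over paired candidates and references."""
    scores = [rouge_l_f1(c, r) for c, r in zip(candidates, references)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Made-up strings; swap in model outputs and both reference sets.
    outputs = ["the cat sat on the mat"]
    original_refs = ["a feline rested"]
    writer_refs = ["the cat is on the mat"]
    print("vs original refs:", mean_rouge_l(outputs, original_refs))
    print("vs writer refs:  ", mean_rouge_l(outputs, writer_refs))
```

If the ranking of systems changes between the two reference sets, that is a sign the original references are distorting the metric.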

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only 100 examples per dataset were sampled, so results may not generalize to all news articles.
  • Model access was limited: only some models were evaluated in zero-shot and five-shot settings.
  • Instruction-tuning training details (datasets/algorithms) are not fully known, limiting causal claims.
  • Annotator preferences showed high variability, lowering statistical power for some comparisons.
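To get a feel for how much uncertainty a 100-example sample carries, a Wilson score interval can be put around a binary faithfulness rate. This is a rough sketch, assuming each rate is a proportion over 100 independent judgments. The paper's actual annotation setup may differ, so the intervals are illustrative only.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Counts implied by the reported rates under the 100-judgment assumption.
    for name, faithful in [("Instruct Davinci", 99), ("GPT-3 175B", 76)]:
        low, high = wilson_interval(faithful, 100)
        print(f"{name}: {low:.3f} to {high:.3f}")
```

Under these assumptions the two intervals do not overlap, so a gap as large as 0.99 vs 0.76 is unlikely to be a sampling artifact, while differences of a few points between models could be.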

When Not To Use

  • For multimodal or multi-document summarization tasks not covered here.
  • When you rely solely on automatic, reference-based metrics without auditing reference quality.
  • If you need tight control of abstractive style (the LLMs studied are highly extractive by default).

Failure Modes

  • Models may ignore or fail to follow instructions and produce irrelevant text (observed for GPT-3 models without instruction tuning).
  • Hallucinations or factual errors remain possible despite high human-rated faithfulness on average.
  • Automatic metric scores can be misleading when reference summaries are low quality.

Core Entities

Models

  • GPT-3 (Ada/Curie/Davinci)
  • InstructGPT (Ada/Curie/Davinci)
  • OPT 175B
  • GLM 130B
  • Cohere XL
  • Anthropic-LM v4-s3
  • Pegasus
  • BRIO

Metrics

  • Faithfulness (binary human)
  • Coherence (1-5 human)
  • Relevance (1-5 human)
  • ROUGE-L
  • METEOR
  • BERTScore
  • BLEURT
  • BARTScore

Datasets

  • CNN/DailyMail
  • XSUM
  • Freelance-writer summaries (collected)

Benchmarks

  • CNN/DailyMail evaluation
  • XSUM evaluation
  • Human evaluation benchmark (this paper)