Overview
The paper relies on targeted human evaluation of 200 examples (100 per dataset) plus a paired human study. Its conclusions about instruction tuning and reference quality are well supported for single-document news summarization, but they do not extend beyond the sampled datasets and models.
Citations: 64
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 35%
Why It Matters For Business
If you want usable zero-shot news summaries, choose an instruction-tuned LLM rather than simply the model with the most parameters, and validate or replace public benchmark references before trusting automatic metrics.
Summary TLDR
Human evaluation of ten large language models (LLMs) on CNN/DailyMail and XSUM shows that instruction-tuned models (the Instruct GPT-3 family) deliver strong zero-shot summarization and often beat larger models without instruction tuning. The common benchmarks (CNN/DM, XSUM) contain low-quality reference summaries, which weaken automatic metrics and understate what supervised finetuning can achieve. When high-quality summaries from freelance writers are used instead, the best LLM (Instruct Davinci) is rated comparable to the human writers, though the styles differ: LLM outputs are far more extractive.
Problem Statement
It's unclear which design choices (model scale, in-context examples, instruction tuning) drive LLM summarization success, and standard news benchmarks use low-quality reference summaries that can mislead metric-based evaluation.
Main Contribution
A human evaluation benchmark of ten diverse LLMs on CNN/DailyMail and XSUM, isolating zero-shot and five-shot settings.
Empirical finding that instruction tuning, not model size, is the primary factor for strong zero-shot summarization.
Key Findings
Instruction tuning yields much stronger zero-shot summarization than model scale.
Benchmark reference summaries are often lower quality than model outputs.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Zero-shot faithfulness (CNN/DailyMail) | Instruct Davinci 0.99 | GPT-3 175B 0.76 | +0.23 | CNN/DailyMail validation (n=100) | Table 2 (zero-shot rows) | Table 2 |
| Freelance writer vs Instruct Davinci quality (human control) | Freelance writer: faithfulness 0.93; coherence 4.39; relevance 4.26 | Instruct Davinci: faithfulness 0.98; coherence 4.26; relevance 4.40 | Small differences; not statistically significant in aggregate | Freelance-collected summaries (n=100 per dataset) | Table 4 and paired comparison in Section 4.2 | Table 4; Figure 5 |
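One way to check a claim like "not statistically significant" on your own paired ratings is a sign-flip permutation test. The sketch below is an assumption for illustration only: the summary does not state which test the paper used, and the function name, parameters, and sample scores are hypothetical.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-article scores.

    Returns (mean difference a - b, approximate p-value).
    Hypothetical helper; not taken from the paper.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n

    rng = random.Random(seed)  # fixed seed so repeated runs agree
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the two systems are exchangeable,
        # so each paired difference is equally likely to have either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(flipped) >= abs(observed) - 1e-12:
            extreme += 1

    # Add-one smoothing keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Made-up 1-5 coherence ratings for the same ten articles.
writer = [4, 5, 4, 4, 5, 4, 3, 5, 4, 4]
model = [4, 4, 5, 4, 4, 5, 4, 4, 4, 5]
mean_diff, p = paired_permutation_test(writer, model)
print(f"mean difference = {mean_diff:+.2f}, p = {p:.3f}")
```

A large p-value here only says the sample cannot separate the two systems; with 100 articles per dataset, small real differences can go undetected.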
What To Try In 7 Days
Run a quick zero-shot test with an instruction-tuned API model (Instruct Davinci/Curie) on 50 articles and inspect outputs.
Manually re-evaluate 20 benchmark references; if many are low quality, compute metrics against writer-quality references instead.
Add a short prompt to reduce extractiveness (ask for paraphrase) and compare coverage/density on a small sample.
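The coverage/density comparison in the last step can be sketched as follows. This assumes "coverage" and "density" mean the extractive-fragment statistics commonly used in summarization work (greedy matching of token spans shared between article and summary) and uses lowercase whitespace tokenization. Both choices are assumptions, and the function names are hypothetical.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily find the longest article spans copied into the summary."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        j = 0
        while j < len(article_tokens):
            if summary_tokens[i] == article_tokens[j]:
                # Extend the match as far as both sequences agree.
                k = 0
                while (i + k < len(summary_tokens)
                       and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
        # Skip the matched span, or a single unmatched token.
        i += max(best, 1)
    return fragments


def coverage_and_density(article, summary):
    """Return (coverage, density) for one article/summary pair.

    coverage: share of summary tokens that lie inside a copied fragment.
    density: mean squared fragment length per summary token, so long
    verbatim copies score much higher than scattered single words.
    """
    article_tokens = article.lower().split()  # assumed tokenization
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0, 0.0
    fragments = extractive_fragments(article_tokens, summary_tokens)
    n = len(summary_tokens)
    coverage = sum(len(f) for f in fragments) / n
    density = sum(len(f) ** 2 for f in fragments) / n
    return coverage, density


article = "the cat sat on the mat today"
summary = "the cat sat quietly on the mat"
cov, dens = coverage_and_density(article, summary)
print(f"coverage = {cov:.2f}, density = {dens:.2f}")
```

Averaging both numbers over the same small sample before and after adding the paraphrase instruction gives a rough view of whether the prompt actually reduced copying.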
Risks & Boundaries
Limitations
Only 100 examples per dataset were sampled, so results may not generalize to all news articles.
Model access was limited: only some models were evaluated in zero-shot and five-shot settings.
When Not To Use
For multimodal or multi-document summarization tasks not covered here.
When you rely solely on automatic, reference-based metrics without auditing reference quality.
Failure Modes
Models may ignore or misinterpret instructions and produce irrelevant text (observed for non-instruction-tuned GPT-3).
Hallucinations or factual errors remain possible despite high human-rated faithfulness on average.

