Overview
Production Readiness
0.6
Novelty Score
0.35
Cost Impact Score
0.4
Citation Count
64
Why It Matters For Business
If you want usable zero-shot news summaries, choose an instruction-tuned LLM rather than simply the model with the most parameters; audit or replace public benchmark references before trusting automatic metrics.
Summary TLDR
Human evaluation of ten large language models (LLMs) on CNN/DailyMail and XSUM shows that instruction-tuned models (the InstructGPT family) deliver strong zero-shot summarization and often beat larger models without instruction tuning. The common benchmarks (CNN/DM, XSUM) contain low-quality reference summaries, which make reference-based automatic metrics less reliable and understate the performance of supervised fine-tuning. When compared against high-quality summaries from freelance writers, the best LLM (Instruct Davinci) is rated comparable to the human writers, though the styles differ: LLM outputs are far more extractive.
Problem Statement
It's unclear which design choices (model scale, in-context examples, instruction tuning) drive LLM summarization success, and standard news benchmarks use low-quality reference summaries that can mislead metric-based evaluation.
Main Contribution
A human evaluation benchmark of ten diverse LLMs on CNN/DailyMail and XSUM, isolating zero-shot and five-shot settings.
Empirical finding that instruction tuning, not model size, is the primary factor for strong zero-shot summarization.
Evidence that existing reference summaries (CNN/DM, XSUM) are low quality; release of higher-quality freelance-written summaries and evaluation data.
Key Findings
Instruction tuning yields much stronger zero-shot summarization than model scale.
Benchmark reference summaries are often lower quality than model outputs.
Best LLM is judged comparable to freelance human writers in blind pairwise tests.
Reference-based automatic metrics can be misleading when references are poor.
LLM summaries are more extractive than freelance writer summaries.
Results
Zero-shot faithfulness (CNN/DailyMail)
Freelance vs Instruct Davinci quality (human control)
ROUGE-L vs human (XSUM)
Who Should Care
What To Try In 7 Days
Run a quick zero-shot test with an instruction-tuned API model (Instruct Davinci/Curie) on 50 articles and inspect outputs.
Manually re-evaluate 20 benchmark references; if many are low quality, compute metrics against writer-quality references instead.
Add a short prompt to reduce extractiveness (ask for paraphrase) and compare coverage/density on a small sample.
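The coverage/density comparison in the last step can be sketched as follows. This is a minimal sketch, assuming the extractive-fragment definitions commonly used for these two statistics (coverage is the fraction of summary tokens that fall inside spans copied from the article; density is the average squared length of those spans) and naive lowercase whitespace tokenization. The function names are illustrative, not taken from the paper's released code.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily collect the longest spans of the summary copied from the article."""
    fragments = []
    i = 0
    n_s, n_a = len(summary_tokens), len(article_tokens)
    while i < n_s:
        best = 0  # longest article match starting at summary position i
        j = 0
        while j < n_a:
            if summary_tokens[i] == article_tokens[j]:
                k = 0
                while (i + k < n_s and j + k < n_a
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
        i += max(best, 1)  # skip past the fragment, or one unmatched token
    return fragments


def coverage_and_density(article, summary):
    """Return (coverage, density); tokenization is an assumption of this sketch."""
    a = article.lower().split()
    s = summary.lower().split()
    if not s:
        return 0.0, 0.0
    frags = extractive_fragments(a, s)
    coverage = sum(len(f) for f in frags) / len(s)
    density = sum(len(f) ** 2 for f in frags) / len(s)
    return coverage, density


if __name__ == "__main__":
    article = "the cat sat on the mat today"
    for summary in ("the cat sat on the mat", "the cat slept on the mat", "a feline rested"):
        print(summary, coverage_and_density(article, summary))
```

Running this on outputs from the default prompt and from the paraphrase prompt gives a rough view of whether the prompt change actually reduced copying. A fully copied summary scores coverage 1.0, while a fully paraphrased one scores 0.0 on both statistics.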
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only 100 examples per dataset were sampled, so results may not generalize to all news articles.
- Model access was limited: only some models were evaluated in zero-shot and five-shot settings.
- Instruction-tuning training details (datasets/algorithms) are not fully known, limiting causal claims.
- Annotator preferences showed high variability, lowering statistical power for some comparisons.
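To get a feel for how annotator variability limits what a pairwise comparison can show, an exact two-sided sign test on the preference counts is one simple check. This is a sketch under stated assumptions: ties are dropped, each judgment is treated as independent, and the statistical procedure the paper actually used is not stated in this summary. The function name and the counts in the usage lines are hypothetical.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided sign test: is the preference split different from 50/50?"""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction under p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    print(sign_test_p_value(27, 23))
    print(sign_test_p_value(40, 10))
```

With a near-even split and a small sample the p-value stays large, so "comparable" here means no detectable preference rather than proven equivalence.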
When Not To Use
- For multimodal or multi-document summarization tasks not covered here.
- When you rely solely on automatic, reference-based metrics without auditing reference quality.
- If you need tight control of abstractive style, since the LLMs studied are highly extractive by default.
Failure Modes
- Models may ignore or misinterpret instructions, producing irrelevant text (observed for non-instruction GPT-3).
- Hallucinations or factual errors remain possible despite high human-rated faithfulness on average.
- Automatic metric scores can be misleading when reference summaries are low quality.
Core Entities
Models
- GPT-3 (Ada/Curie/Davinci)
- InstructGPT (Ada/Curie/Davinci)
- OPT 175B
- GLM 130B
- Cohere XL
- Anthropic-LM v4-s3
- Pegasus
- BRIO
Metrics
- Faithfulness (binary human)
- Coherence (1-5 human)
- Relevance (1-5 human)
- ROUGE-L
- METEOR
- BERTScore
- BLEURT
- BARTScore
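To make concrete why a reference-based metric depends on reference quality, here is a minimal sketch of ROUGE-L computed from the longest common subsequence (LCS) between a candidate summary and a single reference. It assumes lowercase whitespace tokenization and a balanced F1 score; published ROUGE implementations differ in tokenization, stemming, and F-measure weighting, so the numbers will not match official tooling exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 against one reference (simplified, see assumptions above)."""
    c = candidate.lower().split()
    r = reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("the cat sat", "the cat sat"))
    print(rouge_l_f1("a b c d", "a c"))
```

Because the score only measures token overlap with the reference, a good summary scored against a poor reference can receive a low value, which is the failure mode this card warns about.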
Datasets
- CNN/DailyMail
- XSUM
- Freelance-writer summaries (collected)
Benchmarks
- CNN/DailyMail evaluation
- XSUM evaluation
- Human evaluation benchmark (this paper)

