Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Simple two-stage prompts can make off-the-shelf LLMs produce summaries that include the key facts businesses care about (who/when/what/result) without extra training, and cleaner, element-focused references make that improvement measurable.
Summary TLDR
The authors create expert-written, element-aware test sets for CNN/DailyMail and BBC XSum that emphasize Who/When/What/Result. Using these test sets, they show that GPT-3 is much stronger at zero-shot summarization than the standard test sets suggest. They introduce SumCoT, a two-stage chain-of-thought prompt that first extracts core elements (Entity, Date, Event, Result) and then integrates them into a summary. SumCoT raises ROUGE and human-quality scores over zero-shot GPT-3 and outperforms fine-tuned baselines on the new test sets.
Problem Statement
Standard news summarization test sets contain noisy or incomplete reference summaries (redundancy, hallucinations). That noise hides or mis-measures how well large LLMs can write summaries zero-shot. We need cleaner, element-focused references and prompts that make LLMs include fine-grained facts.
Main Contribution
Released expert-written, element-aware test sets of 200 examples each from CNN/DailyMail and BBC XSum, focused on four core elements: Entity, Date, Event, Result.
Showed that GPT-3 zero-shot summaries score much higher against element-aware references than against original references, revealing evaluation blind spots.
Proposed SumCoT: a two-stage chain-of-thought prompt that extracts core elements then integrates them to generate more complete summaries, improving automatic metrics and human judgments.
Key Findings
Expert element-aware references strongly improve element coverage vs original references.
GPT-3 zero-shot looks systematically stronger when evaluated against element-aware references.
SumCoT (element-extract then summarize) meaningfully improves GPT-3 summaries.
Final summaries preserve the extracted elements at high rates, especially on CNN/DailyMail.
Results
ROUGE-L (GPT-3 element-aware vs dataset-specific)
ROUGE-L (SumCoT vs GPT-3 baseline)
Element extraction F1 (GPT-3, element extraction stage)
Element coverage (fraction of extracted elements appearing in final summary)
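The two element-level metrics named above can be sketched as follows. This is a minimal illustration, not the paper's scoring code: matching by case-insensitive exact or substring comparison is an assumption, and the function names element_coverage and extraction_f1 are hypothetical.

```python
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def element_coverage(elements: List[str], summary: str) -> float:
    """Fraction of extracted elements that appear verbatim (after normalization) in the summary.

    Assumption: substring matching stands in for whatever matching the paper used.
    """
    if not elements:
        return 0.0
    normalized_summary = normalize(summary)
    hits = sum(1 for element in elements if normalize(element) in normalized_summary)
    return hits / len(elements)


def extraction_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Set-based F1 between predicted and gold elements, using exact match after normalization."""
    predicted_set = {normalize(item) for item in predicted}
    gold_set = {normalize(item) for item in gold}
    true_positives = len(predicted_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

Exact matching is strict: a paraphrased element counts as a miss, so scores from this sketch are a lower bound on what a human annotator would credit.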
Who Should Care
What To Try In 7 Days
Run your LLM in two stages: first prompt for core facts (entity/date/event/result), then ask it to integrate them into a summary.
Validate automatic metric changes using a small element-aware reference set you create for your domain.
Use SumCoT-style prompts to increase factual coverage in news-like summarization before investing in fine-tuning.
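The two-stage flow described in these steps can be sketched as below. This is a rough sketch under stated assumptions: the prompt wording is illustrative rather than the authors' exact prompts, and complete is a hypothetical callable that wraps whichever LLM API you use and returns the model's text.

```python
from typing import Callable, Tuple

# Illustrative guiding questions, one per core element (Entity, Date, Event, Result).
ELEMENT_QUESTIONS = (
    "What are the important entities in this article?",
    "What are the important dates in this article?",
    "What events are happening in this article?",
    "What is the result of these events?",
)


def build_extraction_prompt(article: str) -> str:
    """Stage 1 prompt: ask the model to list the core elements."""
    questions = " ".join(ELEMENT_QUESTIONS)
    return f"Article: {article}\n\n{questions}\nAnswer each question in turn:"


def build_summary_prompt(article: str, elements: str) -> str:
    """Stage 2 prompt: ask the model to integrate the extracted elements into a summary."""
    return (
        f"Article: {article}\n\n"
        f"Extracted elements:\n{elements}\n\n"
        "Integrate the elements above and summarize the article in a few sentences:"
    )


def two_stage_summary(article: str, complete: Callable[[str], str]) -> Tuple[str, str]:
    """Run extraction, then summarization. `complete` is a hypothetical LLM wrapper."""
    elements = complete(build_extraction_prompt(article)).strip()
    summary = complete(build_summary_prompt(article, elements)).strip()
    return elements, summary
```

Returning the intermediate elements alongside the summary makes it possible to audit the extraction stage separately, which is where the date and redundancy errors listed under Failure Modes arise.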
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Element-aware test sets are small (200 examples per dataset) and not fully domain-balanced, so results may not generalize to other news types.
- SumCoT extraction has notable errors: date hallucination and redundant or non-important element extraction.
- Effectiveness depends on large model scale; smaller GPT-3 variants performed poorly on element extraction.
When Not To Use
- If you only have small LLMs (models smaller than ~175B), because smaller GPT-3 variants performed poorly on element extraction.
- When you need strict date accuracy: SumCoT shows date hallucination risk during extraction.
- If you require large, domain-diverse test sets for benchmark-level claims; this paper's test sets are small.
Failure Modes
- Date hallucination: model invents dates when none exist or miscomputes relative dates.
- Element redundancy: extraction stage returns many faithful but low-importance facts that bloat summaries.
- Prompt sensitivity: SumCoT results vary with prompt wording and may be brittle across domains.
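One way to catch the date hallucination failure mode is to check each extracted date against the source article. The sketch below is a heuristic that is not part of the paper: the function names are hypothetical, the token pattern only covers plain numbers, English month names, and weekday names, and it cannot tell whether a relative date was computed correctly.

```python
import re
from typing import List, Set

_MONTHS = (
    "january|february|march|april|may|june|july|"
    "august|september|october|november|december"
)
_DAYS = "monday|tuesday|wednesday|thursday|friday|saturday|sunday"
# Date-like tokens: 1-4 digit numbers, month names, weekday names.
_DATE_TOKEN = re.compile(rf"\b(?:\d{{1,4}}|{_MONTHS}|{_DAYS})\b", re.IGNORECASE)


def date_tokens(text: str) -> Set[str]:
    """Return the lowercase date-like tokens found in the text."""
    return {match.group(0).lower() for match in _DATE_TOKEN.finditer(text)}


def unsupported_dates(extracted_dates: List[str], article: str) -> List[str]:
    """Return extracted dates whose date-like tokens are not all present in the article.

    Dates with no recognizable tokens (e.g. "last week") fall back to a
    case-insensitive substring check against the article.
    """
    article_tokens = date_tokens(article)
    article_lower = article.lower()
    flagged = []
    for date in extracted_dates:
        tokens = date_tokens(date)
        if tokens:
            supported = tokens <= article_tokens
        else:
            supported = date.lower() in article_lower
        if not supported:
            flagged.append(date)
    return flagged
```

Flagged dates are candidates for review rather than confirmed errors. The check also has false negatives, since the word "may" matches both the month and the modal verb.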
Core Entities
Models
- GPT-3 (text-davinci-002, 175B)
- BART (base, large)
- T5-LARGE
- PEGASUS-LARGE
Metrics
- ROUGE-1
- ROUGE-2
- ROUGE-L
- BERTScore
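For readers building a small element-aware reference set of their own, ROUGE-L can be approximated with a longest-common-subsequence computation. The following is a simplified sketch: it assumes whitespace tokenization, no stemming, and a balanced F1, so its numbers will not exactly match those from a standard ROUGE toolkit. The function names are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists (row-by-row DP)."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 over lowercase whitespace tokens."""
    candidate_tokens = candidate.lower().split()
    reference_tokens = reference.lower().split()
    if not candidate_tokens or not reference_tokens:
        return 0.0
    lcs = lcs_length(candidate_tokens, reference_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate_tokens)
    recall = lcs / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)
```

For reported numbers, an established ROUGE implementation is the safer choice; this sketch is mainly useful for quick sanity checks while iterating on prompts.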
Datasets
- CNN/DailyMail
- BBC XSum
- Element-aware test sets (this paper)
Benchmarks
- Element-aware summarization test sets (200 samples each)

