Overview
The idea is simple and low-cost: prompt an LLM to first extract core facts, then summarize from them. Evidence is solid but limited to small expert test sets and GPT-3; extraction errors (hallucinated dates, redundant elements) and dependence on model size limit out-of-the-box production use.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/6
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Cleaner, element-focused references and simple two-stage prompts can make off-the-shelf LLMs produce summaries that include the key facts businesses care about (who/when/what/result) without extra training.
Who Should Care
Summary TLDR
The authors create expert-written, element-aware test sets for CNN/DailyMail and BBC XSum that emphasize Who/When/What/Result. Using these tests, they show large LLMs (GPT-3) are much stronger at zero-shot summarization than standard test sets suggest. They introduce SumCoT, a two-stage chain-of-thought prompt that first extracts core elements (Entity, Date, Event, Result) and then integrates them into a summary. SumCoT raises ROUGE and human-quality scores vs. GPT-3 zero-shot and outperforms fine-tuned baselines on the new test sets.
Problem Statement
Standard news summarization test sets contain noisy or incomplete reference summaries (redundancy, hallucinations). That noise hides or mis-measures how well large LLMs can write summaries zero-shot. We need cleaner, element-focused references and prompts that make LLMs include fine-grained facts.
Main Contribution
Released expert-written, element-aware test sets of 200 examples each from CNN/DailyMail and BBC XSum, focused on four core elements: Entity, Date, Event, Result.
Showed that GPT-3 zero-shot summaries score much higher against element-aware references than against original references, revealing evaluation blind spots.
Key Findings
Expert element-aware references strongly improve element coverage vs original references.
GPT-3 zero-shot looks systematically stronger when evaluated against element-aware references.
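Element coverage can be approximated with a simple string check. The sketch below is an illustrative proxy, not the authors' metric: it assumes each reference comes with a hand-annotated dictionary of elements (a hypothetical format) and counts how many of them appear verbatim in a summary.

```python
from typing import Dict, List

# The four element types named in the summary above.
ELEMENT_TYPES = ("entity", "date", "event", "result")


def element_coverage(summary: str, elements: Dict[str, List[str]]) -> Dict[str, float]:
    """Return, per element type, the fraction of annotated items found in the summary.

    Matching is case-insensitive substring matching, which is a crude
    assumption: paraphrased elements will be counted as missing.
    """
    text = summary.lower()
    coverage: Dict[str, float] = {}
    for kind in ELEMENT_TYPES:
        items = elements.get(kind, [])
        if not items:
            continue  # no annotation for this type, so nothing to score
        hits = sum(1 for item in items if item.lower() in text)
        coverage[kind] = hits / len(items)
    return coverage
```

Because the match is verbatim, this check tends to under-count coverage; it is best used to compare two systems on the same annotations rather than as an absolute score.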
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ROUGE-L (GPT-3 element-aware vs dataset-specific) | 34.25 (element-aware CNN) vs 27.51 (dataset-specific CNN) | GPT-3 on dataset-specific CNN | +6.74 | CNN/DailyMail | Table 3: GPT-3 ROUGE-L 34.25 (element-aware) vs 27.51 (dataset-specific) | Table 3 |
| ROUGE-L (GPT-3 element-aware vs dataset-specific) | 25.42 (element-aware BBC) vs 15.86 (dataset-specific BBC) | GPT-3 on dataset-specific BBC | +9.56 | BBC XSum | Table 3: GPT-3 ROUGE-L 25.42 (element-aware) vs 15.86 (dataset-specific) | Table 3 |
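The Delta column can be rechecked directly from the two ROUGE-L values in each row. The short sketch below uses only the numbers from the table above and also derives the relative gain, which the table does not report; the function name is illustrative.

```python
# (dataset, ROUGE-L vs element-aware refs, ROUGE-L vs dataset-specific refs)
rows = [
    ("CNN/DailyMail", 34.25, 27.51),
    ("BBC XSum", 25.42, 15.86),
]


def deltas(element_aware: float, dataset_specific: float):
    """Return (absolute gain in ROUGE-L points, gain relative to the baseline)."""
    absolute = round(element_aware - dataset_specific, 2)
    relative = absolute / dataset_specific
    return absolute, relative


for name, ea, ds in rows:
    abs_d, rel_d = deltas(ea, ds)
    print(f"{name}: +{abs_d:.2f} ROUGE-L ({rel_d:.1%} relative)")
```

The relative view makes the gap between the two datasets clearer: the same GPT-3 outputs gain roughly a quarter on CNN/DailyMail but well over half on BBC XSum when scored against element-aware references.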
What To Try In 7 Days
Run your LLM in two stages: first prompt for core facts (entity/date/event/result), then ask it to integrate them into a summary.
Validate automatic metric changes using a small element-aware reference set you create for your domain.
Use SumCoT-style prompts to increase factual coverage in news-like summarization before investing in fine-tuning.
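The two-stage flow in the first item can be sketched as follows. This is a minimal illustration, not the paper's exact prompts: the template wording is an assumption, and `llm` is a hypothetical callable that takes a prompt string and returns the model's text, so it can wrap whichever client you already use.

```python
from typing import Callable

# Stage 1: ask for the four core elements (wording is illustrative).
EXTRACT_TEMPLATE = (
    "Article:\n{article}\n\n"
    "Answer the following questions about the article:\n"
    "1. What are the important entities?\n"
    "2. What are the important dates?\n"
    "3. What events are happening?\n"
    "4. What is the result of these events?\n"
)

# Stage 2: ask for a summary that integrates the extracted elements.
SUMMARIZE_TEMPLATE = (
    "Article:\n{article}\n\n"
    "Extracted elements:\n{elements}\n\n"
    "Using the extracted elements, write a concise summary of the article.\n"
)


def two_stage_summary(article: str, llm: Callable[[str], str]) -> str:
    """Extract core elements first, then summarize with them in the prompt."""
    elements = llm(EXTRACT_TEMPLATE.format(article=article))
    return llm(SUMMARIZE_TEMPLATE.format(article=article, elements=elements))
```

Keeping the model call behind a plain callable makes it easy to log the intermediate extraction, which is where the date and redundancy errors described under Failure Modes would show up.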
Risks & Boundaries
Limitations
Element-aware test sets are small (200 examples per dataset) and not fully domain-balanced, so results may not generalize to other news types.
SumCoT extraction makes notable errors: hallucinated dates and extraction of redundant or unimportant elements.
When Not To Use
When you only have small LLMs (models smaller than ~175B): element extraction fails at small model sizes.
When you need strict date accuracy: SumCoT shows a risk of date hallucination during extraction.
Failure Modes
Date hallucination: model invents dates when none exist or miscomputes relative dates.
Element redundancy: extraction stage returns many faithful but low-importance facts that bloat summaries.
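A cheap guard against the first failure mode is to flag any date-like string in the extraction output that never appears in the source article. The sketch below is a heuristic under stated assumptions, not part of SumCoT: the regular expression only covers four-digit years starting with 19 or 20, English month names (optionally followed by a day number), and weekday names, and it will miss miscomputed relative dates that happen to reuse a string from the source.

```python
import re
from typing import List

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Crude date-like pattern: years, "Month" or "Month D", and weekday names.
DATE_PATTERN = re.compile(
    r"\b(?:(?:19|20)\d{2}"
    r"|(?:" + MONTHS + r")(?:\s+\d{1,2})?"
    r"|(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day)\b"
)


def unsupported_dates(source: str, extracted: str) -> List[str]:
    """Return date-like strings in `extracted` that do not occur in `source`."""
    source_lower = source.lower()
    found = DATE_PATTERN.findall(extracted)
    return [d for d in found if d.lower() not in source_lower]
```

A non-empty result does not prove a hallucination (the model may have legitimately normalized a date), but it is a useful signal for routing a summary to human review.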

