Build expert element-based test sets and use a chain-of-thought prompt (SumCoT) to get LLMs to write more complete news summaries

Overview

Decision Snapshot: Needs Validation

The idea is simple and low-cost: prompt an LLM to extract core facts, then summarize them. Evidence is solid on small expert test sets with GPT-3, but extraction errors (date hallucination, redundancy) and dependence on model size limit out-of-the-box production use.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Yiming Wang, Zhuosheng Zhang, Rui Wang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Cleaner, element-focused references and simple two-stage prompts can make off-the-shelf LLMs produce summaries that include the key facts businesses care about (who/when/what/result) without extra training.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors create expert-written, element-aware test sets for CNN/DailyMail and BBC XSum that emphasize Who/When/What/Result. Using these tests, they show large LLMs (GPT-3) are much stronger at zero-shot summarization than standard test sets suggest. They introduce SumCoT, a two-stage chain-of-thought prompt that first extracts core elements (Entity, Date, Event, Result) and then integrates them into a summary. SumCoT raises ROUGE and human-quality scores vs. GPT-3 zero-shot and outperforms fine-tuned baselines on the new test sets.

Problem Statement

Standard news summarization test sets contain noisy or incomplete reference summaries (redundancy, hallucinations). That noise hides or mis-measures how well large LLMs can write summaries zero-shot. We need cleaner, element-focused references and prompts that make LLMs include fine-grained facts.

Main Contribution

Released expert-written, element-aware test sets of 200 examples each from CNN/DailyMail and BBC XSum, focused on four core elements: Entity, Date, Event, Result.

Showed that GPT-3 zero-shot summaries score much higher against element-aware references than against original references, revealing evaluation blind spots.

Key Findings

Expert element-aware references strongly improve element coverage vs original references.

Numbers: CNN Entity F1 0.98 vs 0.68; Date F1 0.90 vs 0.69 (Table 2)

Practical Use: Use element-aware references to measure whether summaries include key facts (who/when/what/result) rather than just lexical overlap.

Evidence Ref: Table 2
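To make the element-coverage idea concrete, here is a minimal sketch of a set-based precision/recall/F1 over element strings. It assumes elements have already been extracted as plain strings and that exact match after lowercasing is good enough; the function name `element_f1` is hypothetical, and this is not necessarily the scoring protocol behind Table 2.

```python
def element_f1(predicted, reference):
    """Return (precision, recall, F1) over two collections of element strings.

    Assumption: elements match only if they are identical after
    trimming whitespace and lowercasing. Real evaluations may need
    fuzzier matching (aliases, date formats).
    """
    pred = {p.strip().lower() for p in predicted if p.strip()}
    ref = {r.strip().lower() for r in reference if r.strip()}
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Example: the summary mentions two of the three reference entities.
p, r, f1 = element_f1(["Alice", "Paris"], ["alice", "paris", "Bob"])
```

Scoring each element type (Entity, Date, Event, Result) separately shows which kind of fact a summary tends to drop.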

GPT-3 zero-shot looks systematically stronger when evaluated against element-aware references.

Numbers: ROUGE-L gain vs dataset-specific: +6.74 (CNN), +9.56 (BBC) for GPT-3 (Table 3)

Practical Use: Don't trust low automatic scores on noisy references; re-evaluate LLM summaries with cleaner, element-focused references to get a clearer picture.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ROUGE-L (GPT-3)	34.25 (element-aware references)	27.51 (dataset-specific references)	+6.74	CNN/DailyMail	Table 3: GPT-3 ROUGE-L 34.25 (element-aware) vs 27.51 (dataset-specific)	Table 3
ROUGE-L (GPT-3)	25.42 (element-aware references)	15.86 (dataset-specific references)	+9.56	BBC XSum	Table 3: GPT-3 ROUGE-L 25.42 (element-aware) vs 15.86 (dataset-specific)	Table 3
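For readers unfamiliar with the metric, the sketch below shows the core of ROUGE-L: an F-measure over the longest common subsequence (LCS) of candidate and reference tokens. It assumes whitespace tokenization and lowercasing with no stemming, so it will not reproduce the Table 3 numbers exactly; the function names are illustrative. The last lines only recompute the reported deltas from the reported scores.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F1 between two strings (simplified ROUGE-L)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Reported GPT-3 ROUGE-L scores: (element-aware, dataset-specific).
reported = {"CNN/DailyMail": (34.25, 27.51), "BBC XSum": (25.42, 15.86)}
deltas = {name: round(new - old, 2) for name, (new, old) in reported.items()}
```

The same summary can score very differently depending on which reference it is compared against, which is the effect the table reports.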

What To Try In 7 Days

Run your LLM in two stages: first prompt for core facts (entity/date/event/result), then ask it to integrate them into a summary.

Validate automatic metric changes using a small element-aware reference set you create for your domain.

Use SumCoT-style prompts to increase factual coverage in news-like summarization before investing in fine-tuning.
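The two-stage flow above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording is an approximation in the spirit of SumCoT, and `complete` is a hypothetical callable standing in for whatever LLM client you use (it takes a prompt string and returns the model's text).

```python
from typing import Callable

# Illustrative guiding questions covering Entity, Date, Event, Result.
ELEMENT_QUESTIONS = (
    "What are the important entities in this document? "
    "What are the important dates in this document? "
    "What events are happening in this document? "
    "What is the result of these events?"
)


def build_extraction_prompt(article: str) -> str:
    """Stage 1 prompt: ask the model to list the core elements."""
    return (
        f"Article: {article}\n\n"
        f"{ELEMENT_QUESTIONS}\n"
        "Please answer the above questions:"
    )


def build_summary_prompt(article: str, elements: str) -> str:
    """Stage 2 prompt: feed the extracted elements back and ask for a summary."""
    return (
        f"{build_extraction_prompt(article)}\n{elements}\n\n"
        "Let's integrate the above information and summarize the article:"
    )


def sumcot_summarize(article: str, complete: Callable[[str], str]) -> str:
    """Run both stages with a caller-supplied completion function."""
    elements = complete(build_extraction_prompt(article)).strip()
    return complete(build_summary_prompt(article, elements)).strip()
```

Keeping the stage 1 output as a separate value makes it easy to log and inspect the extracted elements, which is where the date and redundancy errors described under Risks & Boundaries arise.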

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/Alsace08/SumCoT

Data URLs

https://github.com/Alsace08/SumCoT

Risks & Boundaries

Limitations

Element-aware test sets are small (200 examples per dataset) and not fully domain-balanced, so results may not generalize to other news types.

SumCoT extraction has notable errors: date hallucination and redundant or non-important element extraction.

When Not To Use

When you only have small LLMs (smaller than ~175B parameters): element extraction fails at small model sizes.

When you need strict date accuracy: SumCoT shows date hallucination risk during extraction.

Failure Modes

Date hallucination: model invents dates when none exist or miscomputes relative dates.

Element redundancy: extraction stage returns many faithful but low-importance facts that bloat summaries.
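One cheap guard against the date failure mode is to flag date-like strings in the extraction output that never appear in the source article. The sketch below is an assumption-laden heuristic, not part of SumCoT: the regex covers only four-digit years, English month names (optionally with a day), and weekday names, and the helper name `unsupported_dates` is hypothetical. It cannot catch relative dates that are miscomputed into a plausible value that happens to appear in the source.

```python
import re

# Heuristic pattern: 4-digit numbers, capitalized month names with an
# optional day number, and weekday names. Case-sensitive on purpose so the
# modal verb "may" is not flagged; "May" at a sentence start still matches.
DATE_PATTERN = re.compile(
    r"\b(?:\d{4}"
    r"|(?:January|February|March|April|May|June|July|August|September"
    r"|October|November|December)(?:\s+\d{1,2})?"
    r"|(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day)\b"
)


def unsupported_dates(source: str, extracted: str) -> list:
    """Return date-like strings in `extracted` that do not occur in `source`."""
    source_lower = source.lower()
    return [
        match
        for match in DATE_PATTERN.findall(extracted)
        if match.lower() not in source_lower
    ]


source = "The verdict was announced on Tuesday in London."
flagged = unsupported_dates(source, "Date: Tuesday, March 3, 2015")
```

A non-empty result is a signal to review or drop the extracted date before the summary stage, not proof of a hallucination.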

Core Entities

Models

GPT-3 (text-davinci-002, 175B), BART (base, large), T5-LARGE, PEGASUS-LARGE

Metrics

ROUGE-1, ROUGE-2, ROUGE-L, BERTScore

Datasets

CNN/DailyMail, BBC XSum, Element-aware test sets (this paper)

Benchmarks

Element-aware summarization test sets (200 samples each)

