Build expert element-based test sets and use a chain-of-thought prompt (SumCoT) to get LLMs to write more complete news summaries

May 22, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Yiming Wang, Zhuosheng Zhang, Rui Wang

Links

Abstract / PDF

Why It Matters For Business

Cleaner, element-focused references and simple two-stage prompts can make off-the-shelf LLMs produce summaries that include the key facts businesses care about (who/when/what/result) without extra training.

Summary TLDR

The authors create expert-written, element-aware test sets for CNN/DailyMail and BBC XSum that emphasize Who/When/What/Result. Using these tests, they show large LLMs (GPT-3) are much stronger at zero-shot summarization than standard test sets suggest. They introduce SumCoT, a two-stage chain-of-thought prompt that first extracts core elements (Entity, Date, Event, Result) and then integrates them into a summary. SumCoT raises ROUGE and human-quality scores vs. GPT-3 zero-shot and outperforms fine-tuned baselines on the new test sets.

Problem Statement

Standard news summarization test sets contain noisy or incomplete reference summaries (redundancy, hallucinations). That noise hides or mis-measures how well large LLMs can write summaries zero-shot. We need cleaner, element-focused references and prompts that make LLMs include fine-grained facts.

Main Contribution

Released expert-written, element-aware test sets of 200 examples each from CNN/DailyMail and BBC XSum, focused on four core elements: Entity, Date, Event, Result.

Showed that GPT-3 zero-shot summaries score much higher against element-aware references than against original references, revealing evaluation blind spots.

Proposed SumCoT: a two-stage chain-of-thought prompt that extracts core elements then integrates them to generate more complete summaries, improving automatic metrics and human judgments.
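The two-stage flow described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the guiding questions and the integration instruction are paraphrased assumptions, and `complete` is a hypothetical callable standing in for whatever LLM client you use.

```python
from typing import Callable, Dict

# Assumed, paraphrased guiding questions (one per core element).
ELEMENT_QUESTIONS = (
    "1. What are the important entities in this document?\n"
    "2. What are the important dates in this document?\n"
    "3. What events are happening in this document?\n"
    "4. What is the result of these events?"
)


def sumcot_summarize(article: str, complete: Callable[[str], str]) -> Dict[str, str]:
    """Run a SumCoT-style two-stage prompt chain.

    `complete` is a hypothetical function: it takes a prompt string and
    returns the model's text completion.
    """
    # Stage 1: ask the model to extract the four core elements.
    extract_prompt = (
        f"Article: {article}\n\n"
        f"{ELEMENT_QUESTIONS}\n\n"
        "Please answer the above questions:"
    )
    elements = complete(extract_prompt).strip()

    # Stage 2: feed the article, questions, and extracted elements back in,
    # then ask for a summary that integrates them.
    summary_prompt = (
        f"{extract_prompt}\n{elements}\n\n"
        "Let's integrate the above information and summarize the article:"
    )
    summary = complete(summary_prompt).strip()
    return {"elements": elements, "summary": summary}
```

Returning the intermediate `elements` alongside the summary makes it possible to inspect extraction errors separately from integration errors.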

Key Findings

Expert element-aware references cover the core elements far better than the original references.

Numbers: CNN Entity F1 0.98 vs 0.68; Date F1 0.90 vs 0.69 (Table 2)

GPT-3 zero-shot looks systematically stronger when evaluated against element-aware references.

Numbers: ROUGE-L gain vs dataset-specific: +6.74 (CNN), +9.56 (BBC) for GPT-3 (Table 3)

SumCoT (element-extract then summarize) meaningfully improves GPT-3 summaries.

Numbers: ROUGE-L increases by +4.33 (CNN) and +4.77 (BBC) over GPT-3 baseline (Table 5)

Final summaries preserve the extracted elements at high rates, especially on CNN/DailyMail.

Numbers: Coverage (fraction of extracted elements appearing in final summary): Entity 0.89, Event 0.93, Result 0.95 on CNN (Table

Results

ROUGE-L (GPT-3 element-aware vs dataset-specific)

Value: 34.25 (element-aware CNN) vs 27.51 (dataset-specific CNN)

Baseline: GPT-3 on dataset-specific CNN

ROUGE-L (GPT-3 element-aware vs dataset-specific)

Value: 25.42 (element-aware BBC) vs 15.86 (dataset-specific BBC)

Baseline: GPT-3 on dataset-specific BBC

ROUGE-L (SumCoT vs GPT-3 baseline)

Value: 38.67 (GPT-3 with SumCoT on CNN)

Baseline: GPT-3 zero-shot

ROUGE-L (SumCoT vs GPT-3 baseline)

Value: 30.19 (GPT-3 with SumCoT on BBC)

Baseline: GPT-3 zero-shot

Element extraction F1 (GPT-3, element extraction stage)

Value: Entity F1 0.83; Date F1 0.55; Event F1 0.83; Result F1 0.76 (CNN)

Element coverage (fraction of extracted elements appearing in final summary)

Value: Entity 0.89; Date 0.55; Event 0.93; Result 0.95 (CNN)
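To make the two element metrics above concrete, here is a rough sketch of how they could be computed. The matching rules are assumptions: the paper's exact matching procedure is not described in this summary, so the sketch uses exact set overlap for F1 and case-insensitive substring matching for coverage.

```python
from typing import Iterable, Set


def element_f1(predicted: Set[str], gold: Set[str]) -> float:
    """F1 between extracted elements and reference elements (exact match; an assumption)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def element_coverage(elements: Iterable[str], summary: str) -> float:
    """Fraction of extracted elements that appear in the final summary.

    Uses case-insensitive substring matching, which is an assumption and
    will miss paraphrased elements.
    """
    items = list(elements)
    if not items:
        return 0.0
    text = summary.lower()
    hits = sum(1 for item in items if item.lower() in text)
    return hits / len(items)
```

For example, `element_coverage(["Obama", "Tuesday"], "Obama spoke to reporters.")` gives 0.5, because only one of the two extracted elements survives into the summary.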

Who Should Care

What To Try In 7 Days

Run your LLM in two stages: first prompt for core facts (entity/date/event/result), then ask it to integrate them into a summary.

Validate automatic metric changes using a small element-aware reference set you create for your domain.

Use SumCoT-style prompts to increase factual coverage in news-like summarization before investing in fine-tuning.
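For the validation step, a dependency-free ROUGE-L sketch can be enough to compare single-stage and two-stage outputs against your own small reference set. This is a simplified illustration: it tokenizes on whitespace and applies no stemming, so its scores will not match an official ROUGE implementation or the numbers reported in the paper.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1: lowercase, whitespace tokens, no stemming."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging `rouge_l_f1` over the same articles for both prompting setups gives a like-for-like comparison, which matters more here than the absolute values.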

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Element-aware test sets are small (200 examples per dataset) and not fully domain-balanced, so results may not generalize to other news types.
  • SumCoT extraction has notable errors: date hallucination and redundant or non-important element extraction.
  • Effectiveness depends on large model scale; smaller GPT-3 variants performed poorly on element extraction.

When Not To Use

  • If you only have access to smaller LLMs: the reported gains use the 175B GPT-3, and smaller GPT-3 variants performed poorly on element extraction.
  • When you need strict date accuracy: SumCoT shows a risk of date hallucination during extraction.
  • If you require large, domain-diverse test sets for benchmark-level claims; this paper's test sets are small.

Failure Modes

  • Date hallucination: model invents dates when none exist or miscomputes relative dates.
  • Element redundancy: extraction stage returns many faithful but low-importance facts that bloat summaries.
  • Prompt sensitivity: SumCoT results vary with prompt wording and may be brittle across domains.
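One inexpensive guard against the date hallucination failure mode is to flag date strings in the extraction output that never appear in the source article. The sketch below is an assumption-laden heuristic, not something from the paper: the regular expression only recognizes English "Month day" strings and four-digit years, and a flagged date may be a legitimately resolved relative date (for example, "last Tuesday"), so treat flags as items for review rather than proof of error.

```python
import re
from typing import List

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Matches "March 3"-style dates or standalone years from 1900 to 2099 (heuristic).
DATE_RE = re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}}\b|\b(?:19|20)\d{{2}}\b")


def unsupported_dates(extracted: str, source: str) -> List[str]:
    """Return date strings found in `extracted` that do not occur verbatim in `source`."""
    source_lower = source.lower()
    return [
        date for date in DATE_RE.findall(extracted)
        if date.lower() not in source_lower
    ]
```

For instance, if the extraction stage returns "The vote took place on March 3, 2021." but the article only says "The vote took place on Wednesday in 2021.", the function flags "March 3" and leaves "2021" alone.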

Core Entities

Models

  • GPT-3 (text-davinci-002, 175B)
  • BART (base, large)
  • T5-LARGE
  • PEGASUS-LARGE

Metrics

  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • BERTSCORE

Datasets

  • CNN/DailyMail
  • BBC XSum
  • Element-aware test sets (this paper)

Benchmarks

  • Element-aware summarization test sets (200 samples each)