Overview
Production Readiness
0.6
Novelty Score
0.3
Cost Impact Score
0.7
Citation Count
89
Why It Matters For Business
You can often skip costly fine-tuning and get usable aspect- or query-based summaries by prompting ChatGPT zero-shot. Expect problems with very short target summaries, and with long documents unless you add retrieval or truncation.
Summary TLDR
The authors test ChatGPT (web interface) on four query- or aspect-based summarization datasets (QMSum, SQuaLITY, CovidET, NEWTS). Using zero-shot prompts (one-shot for CovidET), ChatGPT achieves ROUGE scores comparable to standard fine-tuned systems on most datasets, even exceeding baselines when given focused (golden) input spans. ChatGPT struggles on very short, single-sentence aspect outputs (CovidET) and is limited by input length, so truncation or retrieval is needed for long documents. The paper reports automatic metrics and surface analyses but no human evaluation yet.
Problem Statement
Can ChatGPT, used with simple prompts and no fine-tuning, produce accurate aspect- or query-focused summaries across diverse domains (meetings, stories, news, Reddit)? The paper tests whether zero-shot ChatGPT matches or beats traditional fine-tuned models on standard metrics, and identifies where it fails.
Main Contribution
First systematic evaluation of ChatGPT on aspect- and query-based summarization across four public datasets.
Shows zero-shot ChatGPT attains ROUGE scores comparable to or better than fine-tuned baselines on many tasks, especially with focused input.
Analyzes differences in style and extractiveness (compression, coverage, n-grams) and surfaces limitations (input length, short-aspect tasks).
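The extractiveness measures named above can be sketched as follows. This is a minimal illustration that assumes the coverage, density, and compression definitions from the Newsroom dataset work (Grusky et al., 2018); the summary here does not confirm that the paper uses exactly those definitions. Whitespace tokenization and the function names are assumptions made for the example.

```python
from typing import Dict, List


def extractive_fragments(article: List[str], summary: List[str]) -> List[List[str]]:
    """Greedily collect the longest token runs of the summary that also occur in the article."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0
        # Find the longest run starting at summary[i] that appears anywhere in the article.
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary[i:i + best])
            i += best
        else:
            i += 1  # token not found in the article: treat it as novel
    return fragments


def extractiveness(article_text: str, summary_text: str) -> Dict[str, float]:
    """Coverage, density, and compression under the assumed definitions."""
    article = article_text.lower().split()  # assumption: naive whitespace tokens
    summary = summary_text.lower().split()
    if not summary:
        raise ValueError("summary must contain at least one token")
    frags = extractive_fragments(article, summary)
    n = len(summary)
    return {
        "coverage": sum(len(f) for f in frags) / n,      # share of summary tokens copied
        "density": sum(len(f) ** 2 for f in frags) / n,  # rewards long copied runs
        "compression": len(article) / n,                 # article length over summary length
    }
```

Low coverage and density indicate a more abstractive summary, and a low compression ratio indicates a longer summary relative to its source.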
Key Findings
Zero-shot ChatGPT achieves comparable ROUGE scores to fine-tuned models on several aspect/query datasets.
Supplying focused input spans (golden annotations) improves ChatGPT's scores; with them, it can outperform fine-tuned baselines on QMSum.
ChatGPT performs poorly on very short aspect summaries (CovidET) compared to fine-tuned models.
ChatGPT tends to produce longer, more abstractive outputs and different phrasing than references.
Results
ROUGE-1 (NEWTS)
ROUGE-1 (QMSum, golden spans)
ROUGE-1 (SQuaLITY)
ROUGE-1 (CovidET)
Compression ratio (NEWTS)
Who Should Care
What To Try In 7 Days
Run a 100-example pilot: compare ChatGPT zero-shot vs your current fine-tuned model on your target aspects.
If docs are long, add a lightweight retrieval step to supply relevant spans before prompting ChatGPT.
For short, single-sentence aspects, test one-shot in-context examples or keep fine-tuning as fallback.
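A lightweight retrieval step and the zero-shot versus one-shot prompt variants could be prototyped along these lines. This is a sketch only: the lexical-overlap scoring, the prompt wording, and the names `retrieve_spans` and `build_prompt` are assumptions made for illustration, not the paper's setup. Sending the prompt to a model is left out.

```python
import re
from typing import List, Optional, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve_spans(query: str, spans: List[str], top_k: int = 3) -> List[str]:
    """Keep the top_k spans sharing the most distinct terms with the query."""
    query_terms = set(tokenize(query))
    scored = [(len(query_terms & set(tokenize(s))), idx) for idx, s in enumerate(spans)]
    # Highest overlap first; ties broken by original position.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_k]
    # Return the chosen spans in document order so the context stays readable.
    return [spans[idx] for idx in sorted(idx for _, idx in best)]


def build_prompt(query: str, spans: List[str],
                 example: Optional[Tuple[str, str]] = None) -> str:
    """Zero-shot prompt by default; pass (text, summary) for a one-shot variant."""
    parts = []
    if example is not None:
        ex_text, ex_summary = example
        parts.append(f"Example text:\n{ex_text}\nExample summary:\n{ex_summary}\n")
    parts.append("Text:\n" + "\n".join(spans))
    parts.append(f"Summarize the text above with respect to: {query}")
    return "\n".join(parts)
```

For a real pilot, a stronger retriever (for example BM25 or embedding similarity) would likely select better spans; term overlap is used here only to keep the example dependency-free.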
Reproducibility
Data Urls
- QMSum
- SQuaLITY
- CovidET
- NEWTS
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- No human evaluation yet; conclusions rely on automatic metrics.
- Input length limits forced truncation or extraction for long documents.
- Some datasets (CovidET) require very short outputs where ChatGPT underperforms.
- Small sample size: only 100 randomly selected examples per test set, run manually through the web interface.
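A comparable pilot subset can be drawn reproducibly with a fixed seed, and over-long inputs cut to a word budget, as in the sketch below. The seed, the budget `max_words`, and both function names are hypothetical; the summary does not state how the paper sampled or where it truncated.

```python
import random
from typing import List, Sequence


def sample_pilot(examples: Sequence[str], k: int = 100, seed: int = 0) -> List[str]:
    """Draw k examples without replacement using a fixed seed for repeatability."""
    if len(examples) <= k:
        return list(examples)
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(list(examples), k)


def truncate_words(text: str, max_words: int) -> str:
    """Keep only the first max_words whitespace-separated words."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words])
```

Head truncation is the simplest option, but it drops anything relevant that appears late in the document, which is the situation where a retrieval step helps.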
When Not To Use
- When you need concise, one-sentence aspect summaries without extra tuning.
- When strict input-length guarantees or deterministic outputs are required.
- When you need audited, reproducible model runs via an API (the paper used the web UI).
Failure Modes
- Verbose or overly formal summaries that lower ROUGE-L for dialogues.
- Missed answers when relevant content is truncated and ChatGPT returns 'cannot answer'.
- Non-factual or biased statements not covered by automatic metrics.
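The 'cannot answer' failure mode is easy to monitor with a simple phrase check over model outputs. The marker phrases below are assumptions chosen for illustration; the exact wording ChatGPT uses will vary, so the list should be tuned on real outputs.

```python
from typing import Iterable

# Hypothetical refusal phrases; extend after inspecting actual model responses.
REFUSAL_MARKERS = (
    "cannot answer",
    "can't answer",
    "unable to answer",
    "not enough information",
)


def is_refusal(output: str) -> bool:
    """True if the output contains any known refusal phrase (case-insensitive)."""
    text = output.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(outputs: Iterable[str]) -> float:
    """Fraction of outputs flagged as refusals; 0.0 for an empty collection."""
    flags = [is_refusal(o) for o in outputs]
    return sum(flags) / len(flags) if flags else 0.0
```

A refusal rate that rises as the input budget shrinks suggests that truncation is removing the content the query depends on.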
Core Entities
Models
- ChatGPT
Metrics
- ROUGE-1
- ROUGE-2
- ROUGE-L
- Coverage
- Density
- Compression
- Unique n-grams
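For quick sanity checks, ROUGE-1 and ROUGE-2 reduce to clipped n-gram overlap between a candidate and a reference. The sketch below is a simplified version with no stemming or other preprocessing, so its scores will not match an official ROUGE package or the paper's numbers; the function name `rouge_n` is an assumption.

```python
import re
from collections import Counter
from typing import Dict


def rouge_n(candidate: str, reference: str, n: int = 1) -> Dict[str, float]:
    """Simplified ROUGE-N: clipped n-gram overlap as precision, recall, and F1."""

    def ngrams(text: str) -> Counter:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum count per n-gram
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Because longer outputs add n-grams that the reference lacks, verbose summaries tend to lose precision even when recall is high.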
Datasets
- QMSum
- SQuaLITY
- CovidET
- NEWTS
Benchmarks
- query/aspect-based summarization (QMSum, SQuaLITY, CovidET, NEWTS)
Context Entities
Models
- Fine-tuned baselines (unspecified models)

