Overview
ChatGPT is a practical zero-shot option for many aspect- and query-based summarization tasks, but validate it on your own short-aspect cases and long-document workflows before relying on it.
Citations: 89
Evidence Strength: 0.60
Confidence: 0.75
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 30%
Why It Matters For Business
You can often skip costly fine-tuning and get usable aspect/query summaries by prompting ChatGPT zero-shot. Expect weaker results when the target summary is very short, and plan for retrieval or truncation when documents exceed the input limit.
Summary TLDR
The authors test ChatGPT (web interface) on four query- or aspect-based summarization datasets (QMSum, SQuaLITY, CovidET, NEWTS). Using zero-shot prompts (one-shot for CovidET), ChatGPT achieves ROUGE scores comparable to standard fine-tuned systems on most datasets, even exceeding baselines when given focused (golden) input spans. ChatGPT struggles on very short, single-sentence aspect outputs (CovidET) and is limited by input length, so truncation or retrieval is needed for long documents. The paper reports automatic metrics and surface analyses but no human evaluation yet.
Problem Statement
Can ChatGPT, used with simple prompts and no fine-tuning, produce accurate aspect- or query-focused summaries across diverse domains (meetings, stories, news, Reddit)? The paper tests whether zero-shot ChatGPT matches or beats traditional fine-tuned models on standard metrics and where it fails.
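A minimal sketch of what a "simple prompt" for this setting might look like, covering both the zero-shot case and the one-shot case used for CovidET. The template wording and the `build_prompt` helper are assumptions for illustration; the paper's exact prompts are not reproduced here.

```python
from typing import Optional, Tuple


def build_prompt(
    document: str,
    query: str,
    example: Optional[Tuple[str, str, str]] = None,
) -> str:
    """Build a query-focused summarization prompt.

    `example` is an optional (document, query, summary) triple that turns
    the zero-shot prompt into a one-shot prompt. The template text is a
    hypothetical choice, not the wording used in the paper.
    """
    parts = []
    if example is not None:
        ex_doc, ex_query, ex_summary = example
        # One worked example placed before the real input.
        parts.append(
            f"Document:\n{ex_doc}\n\nQuery: {ex_query}\nSummary: {ex_summary}\n"
        )
    # The real input ends with an open "Summary:" slot for the model to fill.
    parts.append(f"Document:\n{document}\n\nQuery: {query}\nSummary:")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("Meeting transcript text.", "What was decided about the budget?"))
```

Passing `example=None` gives the zero-shot form; supplying one triple gives the one-shot form.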
Main Contribution
First systematic evaluation of ChatGPT on aspect- and query-based summarization across four public datasets.
Shows zero-shot ChatGPT attains ROUGE scores comparable to or better than fine-tuned baselines on many tasks, especially with focused input.
Key Findings
Zero-shot ChatGPT achieves ROUGE scores comparable to fine-tuned models on several aspect/query datasets.
Giving ChatGPT focused input spans (golden annotations) improves its scores and lets it outperform fine-tuning on QMSum (ROUGE-1 36.83 vs 36.06).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ROUGE-1 (NEWTS) | 32.54 | 31.78 (Fine-tuning) | +0.76 | NEWTS (news topic-focused) | Table 2 shows ChatGPT 32.54 vs FT 31.78 | Table 2 |
| ROUGE-1 (QMSum, golden spans) | 36.83 | 36.06 (Fine-tuning on same spans) | +0.77 | QMSum (meeting) with golden spans | Table 2 reports higher R-1 for ChatGPT when given golden spans | Table 2 |
What To Try In 7 Days
Run a 100-example pilot: compare ChatGPT zero-shot vs your current fine-tuned model on your target aspects.
If docs are long, add a lightweight retrieval step to supply relevant spans before prompting ChatGPT.
For short, single-sentence aspects, test one-shot in-context examples or keep fine-tuning as fallback.
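The pilot comparison above could be scored with something like the following sketch. It uses a simplified unigram-overlap F1 as a stand-in for ROUGE-1: whitespace tokenization, lowercasing, no stemming. That is an assumption made to keep the example self-contained, so its numbers will not match an official ROUGE implementation exactly. The function names are illustrative.

```python
from collections import Counter
from typing import List


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mean_rouge1(candidates: List[str], references: List[str]) -> float:
    """Average the simplified ROUGE-1 F1 over paired summaries."""
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    if not candidates:
        return 0.0
    scores = [rouge1_f1(c, r) for c, r in zip(candidates, references)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    refs = ["the team agreed to cut the budget"]
    chatgpt_out = ["the team decided to cut the budget"]
    finetuned_out = ["budget was cut"]
    print("ChatGPT   :", round(mean_rouge1(chatgpt_out, refs), 3))
    print("Fine-tuned:", round(mean_rouge1(finetuned_out, refs), 3))
```

Run both systems on the same 100 examples and compare the two means. For a result you intend to report, swap in an established ROUGE package.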
Risks & Boundaries
Limitations
No human evaluation yet; conclusions rely on automatic metrics.
Input length limits forced truncation or extraction for long documents.
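One way to work within the input-length limit is to select the chunks most relevant to the query rather than truncating blindly. The sketch below is a hypothetical lexical-overlap selector, not the extraction method used in the paper. It assumes a word budget as a rough proxy for the model's token limit, and `select_spans`, `max_words`, and `chunk_words` are illustrative names.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_spans(
    document: str,
    query: str,
    max_words: int = 1500,
    chunk_words: int = 100,
) -> str:
    """Keep the chunks that share the most terms with the query.

    Chunks are fixed-size word windows. Selected chunks are returned in
    their original document order so the text stays readable.
    """
    words = document.split()
    chunks = [words[i:i + chunk_words] for i in range(0, len(words), chunk_words)]
    query_terms = set(tokenize(query))

    def score(chunk: List[str]) -> int:
        # Count chunk tokens that also appear in the query.
        return sum(1 for tok in tokenize(" ".join(chunk)) if tok in query_terms)

    # Stable sort: ties keep document order, so earlier chunks win.
    ranked = sorted(range(len(chunks)), key=lambda i: score(chunks[i]), reverse=True)

    chosen, budget = [], max_words
    for i in ranked:
        if len(chunks[i]) <= budget:
            chosen.append(i)
            budget -= len(chunks[i])

    chosen.sort()  # restore original order
    return " ".join(" ".join(chunks[i]) for i in chosen)
```

If no chunk matches the query, the selector falls back to the leading chunks, which is equivalent to plain truncation.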
When Not To Use
When you need concise, one-sentence aspect summaries without extra tuning.
When strict input-length guarantees or deterministic outputs are required.
Failure Modes
Verbose or overly formal summaries that lower ROUGE-L for dialogues.
Missed answers when relevant content is truncated and ChatGPT returns 'cannot answer'.
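Both failure modes can be screened for automatically before summaries reach users. The check below is a rough sketch: the refusal marker list and the sentence-count threshold are assumptions to tune on your own outputs, and the sentence splitter is a simple punctuation heuristic.

```python
import re
from typing import List

# Hypothetical marker list; extend it with refusal phrasings you observe.
REFUSAL_MARKERS = ("cannot answer",)


def flag_summary(summary: str, max_sentences: int = 1) -> List[str]:
    """Return failure-mode flags for one generated summary."""
    flags = []
    lowered = summary.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        flags.append("refusal")
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    if len(sentences) > max_sentences:
        flags.append("too_long")
    return flags
```

A "refusal" flag on a truncated input suggests the relevant content was cut, so that example is a candidate for retrying with retrieved spans.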

