ChatGPT often matches fine-tuned models on query/aspect summarization using zero-shot prompts

February 16, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

ChatGPT is a practical zero-shot option for many aspect/query summarization tasks, but validate on your short-aspect cases and long-document workflow.

Citations: 89

Evidence Strength: 0.60

Confidence: 0.75

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 30%

Authors

Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen, Wei Cheng

Links

Abstract / PDF / Data

Why It Matters For Business

You can often skip costly fine-tuning and get usable aspect/query summaries by prompting ChatGPT zero-shot, but expect issues with very short target summaries and long documents unless you add retrieval or truncation.

Who Should Care

Summary TLDR

The authors test ChatGPT (web interface) on four query- or aspect-based summarization datasets (QMSum, SQuaLITY, CovidET, NEWTS). Using zero-shot prompts (one-shot for CovidET), ChatGPT achieves ROUGE scores comparable to standard fine-tuned systems on most datasets, even exceeding baselines when given focused (golden) input spans. ChatGPT struggles on very short, single-sentence aspect outputs (CovidET) and is limited by input length, so truncation or retrieval is needed for long documents. The paper reports automatic metrics and surface analyses but no human evaluation yet.

Problem Statement

Can ChatGPT, used with simple prompts and no fine-tuning, produce accurate aspect- or query-focused summaries across diverse domains (meetings, stories, news, Reddit)? The paper tests whether zero-shot ChatGPT matches or beats traditional fine-tuned models on standard metrics and where it fails.

Main Contribution

First systematic evaluation of ChatGPT on aspect- and query-based summarization across four public datasets.

Shows zero-shot ChatGPT attains ROUGE scores comparable to or better than fine-tuned baselines on many tasks, especially with focused input.

Key Findings

Zero-shot ChatGPT achieves comparable ROUGE scores to fine-tuned models on several aspect/query datasets.

Numbers: NEWTS R-1: ChatGPT 32.54 vs FT 31.78 (Table 2)

Practical Use: Try ChatGPT zero-shot first for aspect/query summarization to avoid fine-tuning costs; validate with a small held-out set.

Evidence Ref: Table 2
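For the held-out validation suggested above, a unigram-overlap score is a quick first check. The sketch below is a simplified ROUGE-1 F1 (lowercased whitespace tokens, no stemming), so its numbers will not exactly match an official ROUGE toolkit or the paper's Table 2. The names `rouge1_f1` and `mean_rouge1` are illustrative and not taken from the paper.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def mean_rouge1(pairs) -> float:
    """Average the score over (reference, candidate) pairs from a held-out set."""
    scores = [rouge1_f1(ref, cand) for ref, cand in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Running `mean_rouge1` once on ChatGPT outputs and once on your fine-tuned model's outputs, against the same references, gives a rough side-by-side comparison.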

Giving focused input spans (golden annotations) improves ChatGPT and can outperform fine-tuning on QMSum.

Numbers: QMSum (golden) R-1: ChatGPT 36.83 vs FT 36.06 (Table 2)

Practical Use: Use a retrieval step or supply relevant spans before prompting ChatGPT to boost summary quality.

Evidence Ref: Table 2
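A retrieval step of this kind can be very lightweight. The sketch below ranks candidate spans by how many query words they share and keeps the top `k` in document order. It is a stand-in under stated assumptions, not the paper's method: the golden spans in QMSum are human annotations, and `select_spans` is a hypothetical helper with no stopword handling or semantic matching.

```python
import re


def tokenize(text: str) -> set:
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_spans(query: str, spans: list, k: int = 3) -> list:
    """Keep up to k spans with the highest query-word overlap, in original order."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(span)), i) for i, span in enumerate(spans)]
    # Highest overlap first; ties go to the earlier span.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    # Drop spans with no overlap and restore document order.
    keep = sorted(i for score, i in top if score > 0)
    return [spans[i] for i in keep]
```

The selected spans would then replace the full document in the prompt. A stronger retriever (BM25 or embeddings) can be swapped in behind the same interface.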

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ROUGE-1 (NEWTS) | 32.54 | 31.78 (Fine-tuning) | +0.76 | NEWTS (news topic-focused) | Table 2 shows ChatGPT 32.54 vs FT 31.78 | Table 2
ROUGE-1 (QMSum, golden spans) | 36.83 | 36.06 (Fine-tuning on same spans) | +0.77 | QMSum (meeting) with golden spans | Table 2 reports higher R-1 for ChatGPT when given golden spans | Table 2

What To Try In 7 Days

Run a 100-example pilot: compare ChatGPT zero-shot vs your current fine-tuned model on your target aspects.

If docs are long, add a lightweight retrieval step to supply relevant spans before prompting ChatGPT.

For short, single-sentence aspects, test one-shot in-context examples or keep fine-tuning as fallback.
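For the one-shot suggestion in the last item, a prompt can be assembled from a single worked example followed by the target document. The wording of the template below is an assumption made for illustration; the source does not give the paper's exact prompts, and `build_one_shot_prompt` is a hypothetical helper.

```python
def build_one_shot_prompt(aspect: str, example_doc: str,
                          example_summary: str, document: str) -> str:
    """Assemble a one-shot prompt: instruction, one solved example, then the target."""
    return (
        f"Summarize the text in one sentence, focusing on: {aspect}.\n\n"
        f"Text: {example_doc}\n"
        f"Summary: {example_summary}\n\n"
        f"Text: {document}\n"
        "Summary:"
    )
```

Choosing an example whose summary has the length and tone you want is the main lever here, since the model tends to imitate the demonstrated output.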

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

QMSum, SQuaLITY, CovidET, NEWTS

Risks & Boundaries

Limitations

No human evaluation yet; conclusions rely on automatic metrics.

Input length limits forced truncation or extraction for long documents.
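The simplest form of truncation keeps the head of the document up to a fixed budget. The sketch below counts whitespace-separated words as a rough proxy for model tokens, which is an assumption: real token counts depend on the model's tokenizer, so the budget should be set conservatively. `truncate_words` is an illustrative name.

```python
def truncate_words(document: str, max_words: int) -> str:
    """Keep the first max_words whitespace-separated words (whitespace is normalized)."""
    if max_words < 0:
        raise ValueError("max_words must be non-negative")
    words = document.split()
    return " ".join(words[:max_words])
```

Head truncation silently drops anything relevant that appears late in the document, which is why extraction or retrieval is the alternative for long inputs.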

When Not To Use

When you need concise, one-sentence aspect summaries without extra tuning.

When strict input-length guarantees or deterministic outputs are required.

Failure Modes

Verbose or overly formal summaries that lower ROUGE-L for dialogues.

Missed answers when relevant content is truncated and ChatGPT returns 'cannot answer'.
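Responses of the 'cannot answer' kind can be flagged automatically so they are retried with different input spans instead of being scored as summaries. The marker phrases below are assumptions chosen for illustration; check them against the refusals you actually observe. `looks_like_refusal` is a hypothetical helper.

```python
# Assumed refusal phrases; extend this tuple from your own logged outputs.
REFUSAL_MARKERS = ("cannot answer", "can't answer", "unable to answer")


def looks_like_refusal(response: str) -> bool:
    """Return True if the response contains any known refusal phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```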

Core Entities

Models

ChatGPT

Metrics

ROUGE-1, ROUGE-2, ROUGE-L, Coverage, Density, Compression, Unique n-grams
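The source does not define Coverage, Density, and Compression. A common definition, assumed here, is the extractive-fragment statistics introduced with the Newsroom dataset (Grusky et al., 2018): coverage is the fraction of summary words that lie in fragments copied from the article, density is the average squared fragment length per summary word, and compression is the article-to-summary length ratio. The sketch below follows that greedy fragment matching on pre-tokenized word lists; function names are illustrative.

```python
def extractive_fragments(article: list, summary: list) -> list:
    """Greedily find lengths of the longest shared word runs, scanning the summary."""
    lengths = []
    i = 0
    while i < len(summary):
        best = 0
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                k = 0
                # Extend the match as far as both sequences agree.
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            lengths.append(best)
            i += best
        else:
            i += 1
    return lengths


def surface_stats(article: list, summary: list) -> dict:
    """Coverage, density, and compression for one article/summary pair."""
    if not summary:
        raise ValueError("summary must contain at least one token")
    frags = extractive_fragments(article, summary)
    n = len(summary)
    return {
        "coverage": sum(frags) / n,
        "density": sum(f * f for f in frags) / n,
        "compression": len(article) / n,
    }
```

High coverage with low density suggests a summary that reuses article vocabulary without copying long passages.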

Datasets

QMSum, SQuaLITY, CovidET, NEWTS

Benchmarks

query/aspect-based summarization (QMSum, SQuaLITY, CovidET, NEWTS)

Context Entities

Models

Fine-tuned baselines (unspecified models)