ChatGPT often matches fine-tuned models on query/aspect summarization using zero-shot prompts

Overview

Decision Snapshot: Needs Validation

ChatGPT is a practical zero-shot option for many aspect/query summarization tasks, but validate on your short-aspect cases and long-document workflow.

Citations: 89

Evidence Strength: 0.60

Confidence: 0.75

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 30%

Authors

Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen, Wei Cheng

Links

Abstract / PDF / Data

Why It Matters For Business

You can often skip costly fine-tuning and get usable aspect/query summaries by prompting ChatGPT zero-shot, but expect issues with very short target summaries and long documents unless you add retrieval or truncation.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors test ChatGPT (web interface) on four query- or aspect-based summarization datasets (QMSum, SQuaLITY, CovidET, NEWTS). Using zero-shot prompts (one-shot for CovidET), ChatGPT achieves ROUGE scores comparable to standard fine-tuned systems on most datasets, even exceeding baselines when given focused (golden) input spans. ChatGPT struggles on very short, single-sentence aspect outputs (CovidET) and is limited by input length, so truncation or retrieval is needed for long documents. The paper reports automatic metrics and surface analyses but no human evaluation yet.
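The zero-shot versus one-shot distinction above can be illustrated with a small prompt builder. This is a minimal sketch, not the authors' prompt: the instruction wording, the `build_prompt` name, and the `Document:/Query:/Summary:` layout are all assumptions made for illustration.

```python
from typing import Optional, Tuple

# Hypothetical instruction wording; the prompts used in the paper are not given here.
INSTRUCTION = "Summarize the document below with respect to the query."


def build_prompt(
    document: str,
    query: str,
    example: Optional[Tuple[str, str, str]] = None,
) -> str:
    """Build a zero-shot prompt, or a one-shot prompt when `example` is given.

    `example` is an illustrative (document, query, summary) triple that is
    shown to the model before the real input.
    """
    parts = [INSTRUCTION]
    if example is not None:
        ex_doc, ex_query, ex_summary = example
        parts.append(f"Document: {ex_doc}\nQuery: {ex_query}\nSummary: {ex_summary}")
    # The final block ends with an open "Summary:" slot for the model to fill.
    parts.append(f"Document: {document}\nQuery: {query}\nSummary:")
    return "\n\n".join(parts)
```

Passing `example=None` gives the zero-shot form; supplying one triple gives the one-shot form, which is the setting reported for CovidET.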

Problem Statement

Can ChatGPT, used with simple prompts and no fine-tuning, produce accurate aspect- or query-focused summaries across diverse domains (meetings, stories, news, Reddit)? The paper tests whether zero-shot ChatGPT matches or beats traditional fine-tuned models on standard metrics and where it fails.

Main Contribution

First systematic evaluation of ChatGPT on aspect- and query-based summarization across four public datasets.

Shows zero-shot ChatGPT attains ROUGE scores comparable to or better than fine-tuned baselines on many tasks, especially with focused input.

Key Findings

Zero-shot ChatGPT achieves comparable ROUGE scores to fine-tuned models on several aspect/query datasets.

Numbers: NEWTS R-1: ChatGPT 32.54 vs FT 31.78 (Table 2)

Practical Use: Try ChatGPT zero-shot first for aspect/query summarization to avoid fine-tuning costs; validate with a small held-out set.

Evidence Ref: Table 2

Giving focused input spans (golden annotations) improves ChatGPT and can outperform fine-tuning on QMSum.

Numbers: QMSum (golden) R-1: ChatGPT 36.83 vs FT 36.06 (Table 2)

Practical Use: Use a retrieval step or supply relevant spans before prompting ChatGPT to boost summary quality.

Evidence Ref: Table 2
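One way to approximate "supplying relevant spans" when no golden annotations exist is a simple lexical retrieval step. The sketch below is an assumption-laden stand-in, not the paper's method: it ranks passages by word overlap with the query, and the names `select_spans` and `top_k` are hypothetical.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_spans(passages: List[str], query: str, top_k: int = 3) -> List[str]:
    """Return the `top_k` passages sharing the most tokens with the query.

    Selected passages are returned in their original document order so the
    prompt still reads coherently.
    """
    query_terms = set(tokenize(query))
    scored = []
    for idx, passage in enumerate(passages):
        overlap = sum(1 for tok in tokenize(passage) if tok in query_terms)
        scored.append((overlap, idx))
    # Highest overlap first; ties are broken by earlier position.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_k]
    return [passages[idx] for _, idx in sorted(best, key=lambda pair: pair[1])]
```

Raw overlap counts common words such as "the" as matches, so a real pipeline would likely drop stop words or use a stronger retriever; the point here is only the shape of the step that runs before prompting.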

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ROUGE-1 (NEWTS)	32.54	31.78 (Fine-tuning)	+0.76	NEWTS (news topic-focused)	Table 2 shows ChatGPT 32.54 vs FT 31.78	Table 2
ROUGE-1 (QMSum, golden spans)	36.83	36.06 (Fine-tuning on same spans)	+0.77	QMSum (meeting) with golden spans	Table 2 reports higher R-1 for ChatGPT when given golden spans	Table 2

What To Try In 7 Days

Run a 100-example pilot: compare ChatGPT zero-shot vs your current fine-tuned model on your target aspects.

If docs are long, add a lightweight retrieval step to supply relevant spans before prompting ChatGPT.

For short, single-sentence aspects, test one-shot in-context examples or keep fine-tuning as fallback.
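For the pilot in the first step, a rough scoring harness might look like the following. It is a simplified sketch: `rouge1_f1` computes plain unigram-overlap F1 without stemming or the other options of official ROUGE tooling, so its numbers will not match published scores exactly, and the function names are illustrative.

```python
import re
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mean_rouge1(candidates: List[str], references: List[str]) -> float:
    """Average the per-example score over a paired set of summaries."""
    if not references or len(candidates) != len(references):
        raise ValueError("need equally sized, non-empty lists")
    pairs = zip(candidates, references)
    return sum(rouge1_f1(c, r) for c, r in pairs) / len(references)
```

Calling `mean_rouge1` once on the zero-shot outputs and once on the fine-tuned model's outputs, against the same references, gives the two numbers to compare. With only 100 examples, small gaps may be noise, so differences of under a point deserve a closer look before acting on them.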

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

QMSum, SQuaLITY, CovidET, NEWTS

Risks & Boundaries

Limitations

No human evaluation yet; conclusions rely on automatic metrics.

Input length limits forced truncation or extraction for long documents.

When Not To Use

When you need concise, one-sentence aspect summaries without extra tuning.

When strict input-length guarantees or deterministic outputs are required.

Failure Modes

Verbose or overly formal summaries that lower ROUGE-L for dialogues.

Missed answers when relevant content is truncated and ChatGPT returns 'cannot answer'.

Core Entities

Models

ChatGPT

Metrics

ROUGE-1, ROUGE-2, ROUGE-L, Coverage, Density, Compression, Unique n-grams
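The surface metrics listed here are not defined on this page. The sketch below assumes that Coverage, Density, and Compression are the extractive-fragment statistics commonly used in summarization work (greedy longest shared token runs between source and summary), and that "Unique n-grams" means the share of summary n-grams absent from the source. Both readings, and all function names, are assumptions.

```python
from typing import Dict, List, Sequence, Set, Tuple


def fragment_lengths(article: Sequence[str], summary: Sequence[str]) -> List[int]:
    """Greedily match each summary position to its longest shared run in the article."""
    lengths = []
    i = 0
    while i < len(summary):
        best = 0
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best:
            lengths.append(best)
            i += best  # skip past the copied fragment
        else:
            i += 1  # token does not occur in the article
    return lengths


def ngrams(toks: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of distinct n-grams in a token sequence."""
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def extractive_stats(article: Sequence[str], summary: Sequence[str], n: int = 2) -> Dict[str, float]:
    """Coverage, density, compression, and novel n-gram ratio for one pair."""
    if not summary:
        raise ValueError("summary must contain at least one token")
    frags = fragment_lengths(article, summary)
    summ_ngrams = ngrams(summary, n)
    novel = summ_ngrams - ngrams(article, n)
    return {
        "coverage": sum(frags) / len(summary),
        "density": sum(f * f for f in frags) / len(summary),
        "compression": len(article) / len(summary),
        "novel_ngram_ratio": len(novel) / len(summ_ngrams) if summ_ngrams else 0.0,
    }
```

Under these definitions, high coverage with low density suggests a summary that reuses source words but in short pieces, while a high novel n-gram ratio points to more abstractive wording.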

Datasets

QMSum, SQuaLITY, CovidET, NEWTS

Benchmarks

query/aspect-based summarization (QMSum, SQuaLITY, CovidET, NEWTS)

Context Entities

Models

Fine-tuned baselines (unspecified models)
