ChatGPT often matches fine-tuned models on query/aspect summarization using zero-shot prompts

February 16, 2023 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.3

Cost Impact Score

0.7

Citation Count

89

Authors

Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen, Wei Cheng

Why It Matters For Business

You can often skip costly fine-tuning and get usable aspect/query summaries by prompting ChatGPT zero-shot, but expect issues with very short target summaries and long documents unless you add retrieval or truncation.

Summary TLDR

The authors test ChatGPT (web interface) on four query- or aspect-based summarization datasets (QMSum, SQuaLITY, CovidET, NEWTS). Using zero-shot prompts (one-shot for CovidET), ChatGPT achieves ROUGE scores comparable to standard fine-tuned systems on most datasets, even exceeding baselines when given focused (golden) input spans. ChatGPT struggles on very short, single-sentence aspect outputs (CovidET) and is limited by input length, so truncation or retrieval is needed for long documents. The paper reports automatic metrics and surface analyses but no human evaluation yet.

Problem Statement

Can ChatGPT, used with simple prompts and no fine-tuning, produce accurate aspect- or query-focused summaries across diverse domains (meetings, stories, news, Reddit)? The paper tests whether zero-shot ChatGPT matches or beats traditional fine-tuned models on standard metrics and where it fails.
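As a rough illustration of what "simple prompts and no fine-tuning" can look like, here is a minimal sketch of a zero-shot prompt builder. The wording of the instruction and the function name `build_zero_shot_prompt` are assumptions made for this example; the paper's exact prompts are not reproduced here.

```python
def build_zero_shot_prompt(document: str, query: str) -> str:
    """Assemble one instruction-style prompt (hypothetical wording, not the paper's)."""
    return (
        "Summarize the following text with respect to the question.\n\n"
        f"Question: {query.strip()}\n\n"
        f"Text:\n{document.strip()}\n\n"
        "Summary:"
    )


# Toy meeting snippet, invented for illustration only.
prompt = build_zero_shot_prompt(
    "Alice proposed moving the launch to May. Bob agreed.",
    "What was decided about the launch date?",
)
print(prompt)
```

No labeled examples appear in the prompt, which is what makes it zero-shot: the model sees only the instruction, the query or aspect, and the source text.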

Main Contribution

First systematic evaluation of ChatGPT on aspect- and query-based summarization across four public datasets.

Shows zero-shot ChatGPT attains ROUGE scores comparable to or better than fine-tuned baselines on many tasks, especially with focused input.

Analyzes differences in style and extractiveness (compression, coverage, n-grams) and surfaces limitations (input length, short-aspect tasks).

Key Findings

Zero-shot ChatGPT achieves ROUGE scores comparable to fine-tuned models on several aspect/query datasets.

Numbers: NEWTS R-1: ChatGPT 32.54 vs FT 31.78 (Table 2)

Giving ChatGPT focused input spans (golden annotations) improves its scores and can let it outperform fine-tuning on QMSum.

Numbers: QMSum (golden) R-1: ChatGPT 36.83 vs FT 36.06 (Table 2)

ChatGPT performs poorly on very short aspect summaries (CovidET) compared to fine-tuned models.

Numbers: CovidET R-1: ChatGPT 20.81 vs FT 26.19 (Table 2)

ChatGPT tends to produce longer, more abstractive outputs than the references, with different phrasing.

Numbers: NEWTS compression: ChatGPT 4.03 vs Reference 9.66 (Table 3)

Results

ROUGE-1 (NEWTS)

Value: 32.54

Baseline: 31.78 (Fine-tuning)

ROUGE-1 (QMSum, golden spans)

Value: 36.83

Baseline: 36.06 (Fine-tuning on same spans)

ROUGE-1 (SQuaLITY)

Value: 37.02

Baseline: ≈38.0 (Fine-tuning reported nearby)

ROUGE-1 (CovidET)

Value: 20.81

Baseline: 26.19 (Fine-tuning)

Compression ratio (NEWTS)

Value: 4.03 (ChatGPT)

Baseline: 9.66 (Reference)
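To make the two kinds of numbers above concrete, the sketch below computes a simplified ROUGE-1 F1 (unigram overlap) and a compression ratio. It is an approximation under stated assumptions: tokenization is plain lowercased whitespace splitting, there is no stemming, and compression is taken to be article length divided by summary length. Official ROUGE tooling will give different values, and scores like 32.54 appear to be on a 0 to 100 scale, whereas `rouge1_f1` returns a fraction between 0 and 1.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokens; real ROUGE tooling is more careful.
    return text.lower().split()


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def compression_ratio(article: str, summary: str) -> float:
    """Article length over summary length, in tokens; lower means a longer summary."""
    return len(tokenize(article)) / len(tokenize(summary))


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
ratio = compression_ratio("a b c d e f g h", "a b")
```

Under this definition, a compression of 4.03 for ChatGPT against 9.66 for the references means ChatGPT's summaries are longer relative to the source article.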

Who Should Care

What To Try In 7 Days

Run a 100-example pilot: compare ChatGPT zero-shot vs your current fine-tuned model on your target aspects.
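One way to organize such a pilot is to fix a reproducible random sample and compare per-example scores for the two systems. The sketch below assumes you already have a score per example for each system (for instance a ROUGE value); the function names `sample_ids` and `compare` are hypothetical.

```python
import random
from statistics import mean


def sample_ids(all_ids, k: int = 100, seed: int = 0) -> list:
    """Draw a reproducible random sample of example ids."""
    ids = list(all_ids)
    rng = random.Random(seed)
    return sorted(rng.sample(ids, min(k, len(ids))))


def compare(scores_a: dict, scores_b: dict, ids) -> dict:
    """Paired comparison of two systems on the same examples."""
    diffs = [scores_a[i] - scores_b[i] for i in ids]
    return {
        "mean_a": mean([scores_a[i] for i in ids]),
        "mean_b": mean([scores_b[i] for i in ids]),
        "mean_diff": mean(diffs),
        "a_wins": sum(d > 0 for d in diffs),
        "b_wins": sum(d < 0 for d in diffs),
    }


pilot_ids = sample_ids(range(500), k=100, seed=0)
```

Looking at wins and losses alongside the mean helps show whether a small average gap comes from a few outliers or a consistent difference.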

If docs are long, add a lightweight retrieval step to supply relevant spans before prompting ChatGPT.
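A lightweight retrieval step can be as simple as splitting the document into chunks and keeping the ones that share the most words with the query. The sketch below is one such heuristic, not the paper's method; chunk size, `top_k`, and the overlap scoring are all assumptions, and a real setup would likely filter stopwords or use a proper retriever.

```python
import re


def split_chunks(text: str, chunk_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def retrieve(text: str, query: str, top_k: int = 3, chunk_words: int = 50) -> list[str]:
    """Return up to top_k chunks with the most query-word overlap, in document order."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    chunks = split_chunks(text, chunk_words)
    scored = []
    for pos, chunk in enumerate(chunks):
        terms = set(re.findall(r"\w+", chunk.lower()))
        scored.append((len(query_terms & terms), pos))
    # Highest overlap first; ties go to the earlier chunk.
    best = sorted(scored, key=lambda s: (-s[0], s[1]))[:top_k]
    keep = sorted(pos for score, pos in best if score > 0)
    return [chunks[pos] for pos in keep]


doc = ("we discussed the budget lunch was pizza today "
       "remote control needs rubber see you next week")
spans = retrieve(doc, "What rubber remote was proposed?", top_k=1, chunk_words=4)
```

The retrieved spans are returned in their original order so the prompt still reads as a coherent excerpt of the document.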

For short, single-sentence aspects, test one-shot in-context examples or keep fine-tuning as fallback.
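For the one-shot variant, a single worked demonstration is placed before the target input, and the instruction asks for one sentence. The sketch below uses hypothetical field labels and wording; it is meant only to show the shape of such a prompt.

```python
def build_one_shot_prompt(example_doc: str, example_aspect: str, example_summary: str,
                          document: str, aspect: str) -> str:
    """Prompt with one demonstration followed by the target (hypothetical wording)."""
    demo = (
        f"Text: {example_doc.strip()}\n"
        f"Aspect: {example_aspect.strip()}\n"
        f"One-sentence summary: {example_summary.strip()}\n"
    )
    target = (
        f"Text: {document.strip()}\n"
        f"Aspect: {aspect.strip()}\n"
        "One-sentence summary:"
    )
    return (
        "Summarize the text with respect to the aspect in one sentence.\n\n"
        + demo + "\n" + target
    )


# Invented demonstration and target, for illustration only.
one_shot = build_one_shot_prompt(
    "I have not slept well since the lockdown began.",
    "fear",
    "The author is anxious and losing sleep over the lockdown.",
    "My shop reopened today and customers came back.",
    "joy",
)
```

The demonstration signals the expected length and style, which is the part zero-shot prompting tends to get wrong on single-sentence targets.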

Reproducibility

Data Urls

  • QMSum
  • SQuaLITY
  • CovidET
  • NEWTS

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • No human evaluation yet; conclusions rely on automatic metrics.
  • Input length limits forced truncation or extraction for long documents.
  • Some datasets (CovidET) require very short outputs where ChatGPT underperforms.
  • Small manual sample size: 100 random examples per test set on the web interface.
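The truncation workaround for the input-length limit can be sketched as follows. This version keeps whole sentences up to a word budget; the budget is an assumption standing in for the model's real limit, which is measured in tokens rather than words, and the sentence splitter is a simple regular expression.

```python
import re


def truncate_to_budget(text: str, max_words: int) -> str:
    """Keep leading whole sentences while the total stays within max_words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept, used = [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if used + n > max_words:
            break
        kept.append(sentence)
        used += n
    if not kept:
        # The first sentence alone exceeds the budget: fall back to a hard word cut.
        return " ".join(text.split()[:max_words])
    return " ".join(kept)


short = truncate_to_budget("One two three. Four five six seven. Eight.", 4)
```

Truncation from the front discards anything that appears late in the document, which is one reason relevant content can go missing for long inputs.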

When Not To Use

  • When you need concise, one-sentence aspect summaries without extra tuning.
  • When strict input-length guarantees or deterministic outputs are required.
  • When you need audited, reproducible model runs via API (paper used web UI).

Failure Modes

  • Verbose or overly formal summaries that lower ROUGE-L for dialogues.
  • Missed answers when relevant content is truncated and ChatGPT returns 'cannot answer'.
  • Non-factual or biased statements not covered by automatic metrics.
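The "cannot answer" failure mode can be flagged automatically so that those examples are retried with different input spans. The marker phrases below are assumptions chosen for illustration; in practice they would be collected from the outputs actually observed.

```python
# Hypothetical marker phrases; extend from real outputs.
REFUSAL_MARKERS = ("cannot answer", "can't answer", "not enough information")


def looks_like_refusal(output: str) -> bool:
    """Return True if the model output contains a known refusal phrase."""
    lowered = output.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


flagged = looks_like_refusal("I cannot answer this based on the provided text.")
```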

Core Entities

Models

  • ChatGPT

Metrics

  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • Coverage
  • Density
  • Compression
  • Unique n-grams
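Coverage and density are commonly computed from "extractive fragments", the shared token runs between a summary and its source, as defined by Grusky et al. (2018). The sketch below assumes the paper follows those definitions and uses plain lowercased whitespace tokens; it is a straightforward quadratic-time version for illustration.

```python
def extractive_fragments(article: list[str], summary: list[str]) -> list[int]:
    """Greedy lengths of the longest shared token runs, scanning the summary left to right."""
    lengths, i = [], 0
    while i < len(summary):
        best = 0
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            lengths.append(best)
            i += best  # skip past the matched run
        else:
            i += 1  # novel token, not copied from the article
    return lengths


def coverage_and_density(article_text: str, summary_text: str) -> tuple[float, float]:
    """Coverage: share of summary tokens inside fragments. Density: mean squared fragment length."""
    article = article_text.lower().split()
    summary = summary_text.lower().split()
    frags = extractive_fragments(article, summary)
    coverage = sum(frags) / len(summary)
    density = sum(f * f for f in frags) / len(summary)
    return coverage, density


cov, dens = coverage_and_density(
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox sleeps",
)
```

Lower coverage and density indicate a more abstractive summary, since fewer and shorter spans are copied from the source.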

Datasets

  • QMSum
  • SQuaLITY
  • CovidET
  • NEWTS

Benchmarks

  • query/aspect-based summarization (QMSum, SQuaLITY, CovidET, NEWTS)

Context Entities

Models

  • Fine-tuned baselines (unspecified models)