ChatGPT/GPT-4 beat classic metrics but are unstable evaluators for abstractive summarization

May 22, 2023 · 8 min

Overview

Production Readiness

0.35

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

5

Authors

Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, Lidong Bing

Why It Matters For Business

LLMs offer a fast, cheap proxy for human evaluation and outperform classical automatic metrics on many signals. They can still mislead product decisions when the compared models are close in quality or when the systems are very strong. Use LLM-based scores for rough triage and keep humans in the loop for final judgments.

Summary TLDR

The authors test ChatGPT and GPT-4 as zero-shot graders for abstractive summarization, using two Likert-style scoring prompts, reason-then-score (RTS) and multiple-choice (MCQ), plus head-to-head (H2H) comparison. LLM-based scores correlate better with humans than many classic metrics and pick the correct winner in most coarse comparisons (ChatGPT-RTS: 58.5/66 pairs, 88.6%). But LLMs are unstable: their alignment with humans varies by evaluated system and by evaluation dimension, they struggle on closely matched systems (7/11 pairs, 63.6%), and they become less aligned with humans for very high-quality summaries. The paper proposes using RTS–MCQ agreement as a cheap reliability check and releases code and generations.

Problem Statement

Can off-the-shelf LLMs reliably replace human judges for abstractive summarization? The paper tests ChatGPT and GPT-4 as zero-shot evaluators across coherence, consistency (factuality), fluency, and relevance, and quantifies stability, bias across candidate systems, and failure modes.

Main Contribution

Evaluate ChatGPT and GPT-4 comprehensively as zero-shot summarization evaluators across four human-rated dimensions (coherence, consistency, fluency, relevance).

Introduce and use a meta-correlation metric that measures whether an evaluator's human-alignment varies with candidate quality.

Show concrete failure modes: candidate-dependence, dimension-dependence, poor discrimination on close systems, and worse alignment on high-quality summaries.

Propose a practical, low-cost reliability check: compute correlation between RTS and MCQ scores per candidate and trigger human review if agreement is low.

Release code and LLM generations to reproduce experiments.
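The meta-correlation idea listed above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: it assumes meta-correlation is the Spearman correlation, across systems, between each system's average human score (quality) and the per-system Spearman correlation between LLM and human scores (alignment). The function names and the toy scores are hypothetical.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: treat an undefined correlation as zero
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def meta_correlation(llm_scores, human_scores):
    """Both arguments map system name -> list of per-summary scores."""
    systems = sorted(llm_scores)
    # Alignment: how well the LLM tracks humans within each system.
    alignment = [spearman(llm_scores[s], human_scores[s]) for s in systems]
    # Quality: the system's average human score.
    quality = [mean(human_scores[s]) for s in systems]
    return spearman(quality, alignment)

# Toy data (hypothetical): alignment drops as system quality rises.
human = {"weak": [1, 2, 3, 4], "mid": [2, 3, 4, 5], "strong": [3, 4, 5, 6]}
llm = {"weak": [1, 2, 3, 4], "mid": [2, 3, 5, 4], "strong": [4, 3, 2, 1]}
meta = meta_correlation(llm, human)
print(meta)  # -1.0
```

A negative value on real data would indicate the failure mode described in this digest: the evaluator tracks humans less well on stronger systems.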

Key Findings

LLM evaluators correlate better with humans than many automatic metrics.

Numbers: ChatGPT-RTS Spearman up to 0.448 (relevance); fluency gains vs baselines up to +0.2

ChatGPT-RTS picks the human-preferred system in most coarse comparisons but fails on close pairs.

Numbers: 58.5/66 correct pairs (88.6%) on full set; 7/11 (63.6%) on close challenge pairs

LLM evaluation alignment varies strongly by evaluated system and dimension.

Numbers: Per-candidate correlation spread up to ~0.5 (consistency)

LLMs become less human-aligned as candidate summary quality rises for some dimensions.

Numbers: Negative meta-correlation (significant) for consistency and fluency; LLMs align better for systems with avg human score<

RTS produces lower absolute scores and sometimes unrelated/false reasoning; MCQ is more optimistic but prevents some bad penalization.

Numbers: Average ChatGPT scores: RTS avg 2.52 vs human 3.35; MCQ avg 3.68 (Table 25); non-trivial fraction of RTS responses had '

Results

Correct preferences (#CP) - ChatGPT-RTS

Value: 58.5/66 (88.6%)

Baseline: random (~33%)

Correct preferences (#CP) - ChatGPT-RTS on close pairs

Value: 7/11 (63.6%)

Baseline: best baselines on same set

Spearman correlation (ChatGPT-RTS)

Value: coherence 0.388, consistency 0.423, fluency 0.285, relevance 0.448

Baseline: BARTScore-CNN coherence 0.461 (best neural baseline)

Spearman correlation (GPT-4-RTS)

Value: coherence 0.427, consistency 0.556, fluency 0.498, relevance 0.448

Baseline: ChatGPT-RTS above

Meta-correlation (LLM)

Value: significant negative meta-correlation for consistency and fluency

Baseline: ROUGE metrics show no significant negative meta-correlation

Average scores (ChatGPT)

Value: RTS avg 2.52, MCQ avg 3.68, human avg 3.35 (across dims)

Baseline: human scores
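The correct-preferences figures above are counts over system pairs: 12 systems give 66 unordered pairs, which matches the 66 in the denominator. The sketch below shows one way such a count could be computed. The scoring rule is an assumption: a pair earns 1 when the metric orders the two systems the same way humans do and 0.5 when exactly one side reports a tie. The digest reports a fractional total (58.5) but does not state the tie rule, and the system names and scores here are hypothetical.

```python
from itertools import combinations
from math import comb

def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def correct_preferences(metric_avg, human_avg):
    """Both arguments map system name -> average score.

    Returns (score, number_of_pairs).
    """
    pairs = list(combinations(sorted(human_avg), 2))
    score = 0.0
    for a, b in pairs:
        human = sign(human_avg[a] - human_avg[b])
        metric = sign(metric_avg[a] - metric_avg[b])
        if metric == human:
            score += 1.0   # same ordering (or both tie)
        elif metric == 0 or human == 0:
            score += 0.5   # assumption: half credit when one side ties
    return score, len(pairs)

# Toy averages for four hypothetical systems.
human_avg = {"A": 3.9, "B": 3.5, "C": 3.1, "D": 2.0}
metric_avg = {"A": 3.0, "B": 3.0, "C": 3.2, "D": 1.5}

score, n_pairs = correct_preferences(metric_avg, human_avg)
print(f"{score}/{n_pairs} correct preferences")  # 3.5/6 correct preferences
print(comb(12, 2))  # 66
```

In the toy data the metric ties A with B and wrongly ranks C above both, so only the comparisons against the clearly weaker system D are fully correct.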

Who Should Care

What To Try In 7 Days

Run ChatGPT (RTS and MCQ) to rank candidate summarizers and compute per-candidate RTS–MCQ correlation R_i as a reliability check.

Flag candidates with low RTS–MCQ agreement (R_i below chosen tolerance) and run targeted human evaluation only on those.

Use H2H LLM comparisons only for large gaps; avoid LLM-only decisions for near-equal systems and high-quality models.
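The first two steps above can be sketched as follows. This is a rough illustration under stated assumptions: R_i is computed here as a Pearson correlation between the RTS and MCQ scores of the same summaries for system i (the digest does not say which correlation to use), and the tolerance of 0.3, the function names, and the toy scores are all hypothetical.

```python
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: constant scores count as no agreement
    return cov / (vx * vy) ** 0.5

def rts_mcq_agreement(rts, mcq):
    """rts / mcq map system name -> per-summary scores from each prompt style.

    Returns a dict of system name -> R_i.
    """
    return {system: pearson(rts[system], mcq[system]) for system in sorted(rts)}

TOLERANCE = 0.3  # hypothetical threshold; tune on data with human labels

# Toy scores: the two prompt styles agree on sys_a but not on sys_b.
rts = {"sys_a": [1, 2, 3, 4, 5], "sys_b": [2, 2, 3, 1, 4]}
mcq = {"sys_a": [2, 3, 3, 4, 5], "sys_b": [4, 3, 3, 5, 4]}

agreement = rts_mcq_agreement(rts, mcq)
flagged = [system for system, r_i in agreement.items() if r_i < TOLERANCE]
print(flagged)  # ['sys_b']
```

Only the flagged systems would then go to targeted human evaluation, which keeps the human budget focused where the two prompt styles disagree.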

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation uses a single benchmark (SummEval) with 12 systems and 100 summaries each; per-system meta-correlation may shift with larger datasets.
  • Human reference is the average of three experts; human bias may propagate to measured alignment.
  • Prompt design affects outcomes; paper did not exhaustively optimize prompts.
  • Dependence on commercial LLM snapshots (API versions) limits exact reproducibility over time.

When Not To Use

  • To decide between closely matched systems (small performance gap).
  • To fully replace humans when evaluating very high-quality summarizers.
  • When RTS and MCQ disagree substantially for a candidate (low R_i).
  • For absolute scoring without calibration (RTS scores are conservative).

Failure Modes

  • Candidate-dependence: evaluator aligns unevenly across systems.
  • Dimension-dependence: different accuracy across coherence/consistency/fluency/relevance.
  • False or unrelated reasoning from RTS leading to unjustified penalization.
  • Negative meta-correlation: poorer alignment on higher-quality candidates.
  • Overly generous scoring in GPT-4 on some dimensions, causing ceiling effects.

Core Entities

Models

  • ChatGPT (gpt-3.5-turbo-0301)
  • GPT-4 (gpt-4-0314)
  • Llama 2 (7B, 13B, 70B)

Metrics

  • ROUGE-1/2/L
  • BERTScore
  • BARTScore
  • BARTScore-CNN
  • BARTScore-CNN-PARA
  • ChatGPT-RTS (reason-then-score)
  • ChatGPT-MCQ (multiple-choice)
  • H2H (head-to-head)
  • meta-correlation (new)

Datasets

  • SummEval
  • CNN/DM

Benchmarks

  • SummEval (1200 summaries from 12 systems)