ChatGPT/GPT-4 beat classic metrics but are unstable evaluators for abstractive summarization

May 22, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The experiments use public SummEval data and fixed LLM snapshots; results are robust across ChatGPT/GPT-4 but limited to 12 candidate systems and 100 summaries each, so findings are well supported but not universal.

Citations: 5

Evidence Strength: 0.85

Confidence: 0.86

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 35%

Novelty: 50%

Authors

Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, Lidong Bing

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs offer a fast, cheap proxy to human evaluation and outperform classical automatic metrics on many signals, but they can mislead product decisions when models are close in quality or when systems are very strong; use LLM-based scores for rough triage and keep humans in the loop for final judgments.

Summary TLDR

The authors test ChatGPT and GPT-4 as zero-shot graders for abstractive summarization, using Likert-style reason-then-score (RTS) and multiple-choice (MCQ) scoring plus head-to-head (H2H) comparison. LLM-based scores correlate better with humans than many classic metrics and pick the correct winner in most coarse comparisons (ChatGPT-RTS: 58.5/66 pairs, 88.6%). But LLMs are unstable: their alignment with humans varies by evaluated system and by evaluation dimension, they struggle on closely matched systems (63.6% on hard pairs), and they become less aligned with humans for very high-quality summaries. The paper proposes using the RTS–MCQ agreement as a cheap reliability check and releases code and generations.

Problem Statement

Can off-the-shelf LLMs reliably replace human judges for abstractive summarization? The paper tests ChatGPT and GPT-4 as zero-shot evaluators across coherence, consistency (factuality), fluency, and relevance, and quantifies stability, bias across candidate systems, and failure modes.

Main Contribution

Comprehensive evaluation of ChatGPT and GPT-4 as zero-shot summarization evaluators across four human dimensions (coherence, consistency, fluency, relevance).

Introduce and use a meta-correlation metric that measures whether an evaluator's human-alignment varies with candidate quality.
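The meta-correlation idea can be illustrated with a minimal sketch. It assumes one plausible reading of the metric: for each candidate system, compute the Spearman correlation between LLM and human scores, then correlate those per-system values with each system's mean human score. The function names, the use of Spearman at both levels, and the toy scores are illustrative assumptions, not the authors' code.

```python
from statistics import mean


def average_ranks(values):
    """Rank values 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def meta_correlation(human, llm):
    """human, llm: dict mapping system name -> per-summary scores (same order)."""
    systems = sorted(human)
    # Per-system alignment between the LLM evaluator and the human scores.
    alignment = [spearman(human[s], llm[s]) for s in systems]
    # Per-system quality, taken here as the mean human score (an assumption).
    quality = [mean(human[s]) for s in systems]
    return spearman(alignment, quality)


# Toy data (hypothetical): alignment drops as system quality rises.
human_scores = {"sys_a": [1, 2, 3, 4], "sys_b": [2, 3, 4, 5], "sys_c": [3, 4, 5, 6]}
llm_scores = {"sys_a": [1, 2, 3, 4], "sys_b": [2, 3, 5, 4], "sys_c": [4, 3, 6, 5]}
print(meta_correlation(human_scores, llm_scores))
```

Under this reading, a strongly negative value would indicate that the evaluator agrees less with humans on higher-quality systems, which is the instability the summary describes.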

Key Findings

LLM evaluators correlate better with humans than many automatic metrics.

Numbers: ChatGPT-RTS Spearman up to 0.448 (relevance); fluency gains vs baselines up to +0.2

Practical Use: If you must use an automatic metric, ChatGPT/GPT-4 give stronger human correlation than ROUGE/BERTScore on these dimensions, which makes them useful for coarse, faster comparisons.

Evidence Ref: Table 4; §4.2.2

ChatGPT-RTS picks the human-preferred system in most coarse comparisons but fails on close pairs.

Numbers: 58.5/66 correct pairs (88.6%) on full set; 7/11 (63.6%) on close challenge pairs

Practical Use: Use LLM evaluators to detect large quality gaps, not to decide between near-equal systems; add humans for tight comparisons.

Evidence Ref: Table 5; §4.2.1
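The correct-preference counts above can be illustrated with a short sketch. It assumes the count is taken over all unordered system pairs (12 systems give 66 pairs, matching the full set) and that a tie in the evaluator's scores earns half credit, which would explain a fractional total such as 58.5. Both the tie rule and the toy scores are assumptions for illustration, not details confirmed by the summary.

```python
from itertools import combinations
from math import comb


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def correct_preferences(human, metric):
    """Count pairs where the metric prefers the same system as humans.

    human, metric: dict mapping system name -> system-level score.
    Returns (credit, number_of_pairs). A metric tie on a pair that humans
    separate earns 0.5 credit (an assumed rule, see the lead-in).
    """
    pairs = list(combinations(sorted(human), 2))
    credit = 0.0
    for a, b in pairs:
        h = sign(human[a] - human[b])
        m = sign(metric[a] - metric[b])
        if h == m:
            credit += 1.0
        elif m == 0:
            credit += 0.5
    return credit, len(pairs)


# Toy system-level scores (hypothetical).
human = {"A": 4.0, "B": 3.5, "C": 3.0, "D": 2.0}
metric = {"A": 4.2, "B": 4.2, "C": 3.9, "D": 4.0}

credit, n_pairs = correct_preferences(human, metric)
print(f"{credit}/{n_pairs} = {credit / n_pairs:.1%}")
print(comb(12, 2))  # number of unordered pairs among 12 systems
```

In the toy data the metric ties on A vs B (half credit) and inverts C vs D (no credit), so close pairs are exactly where credit is lost.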

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
--- | --- | --- | --- | --- | --- | ---
Correct preferences (#CP) - ChatGPT-RTS | 58.5/66 (88.6%) | random (~33%) | ≈+55 percentage points vs random | 66-pair full set on SummEval | ChatGPT-RTS obtains largest #CP across dimensions | Table 5
Correct preferences (#CP) - ChatGPT-RTS on close pairs | 7/11 (63.6%) | best baselines on same set | performance drops vs full set | 11-pair challenge set | LLMs struggle to differentiate closely matched systems | §4.2.1; Table 5

What To Try In 7 Days

Run ChatGPT (RTS and MCQ) to rank candidate summarizers and compute per-candidate RTS–MCQ correlation R_i as a reliability check.

Flag candidates with low RTS–MCQ agreement (R_i below chosen tolerance) and run targeted human evaluation only on those.

Use H2H LLM comparisons only for large gaps; avoid LLM-only decisions for near-equal systems and high-quality models.
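The first two steps above can be sketched as follows. The sketch assumes R_i is a per-candidate correlation between the RTS and MCQ scores given to the same summaries; Pearson is used here for brevity, and the `tolerance` value, function names, and toy scores are hypothetical choices, not values from the paper.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def flag_unreliable(rts, mcq, tolerance):
    """Compute per-candidate RTS-MCQ agreement and flag low-agreement candidates.

    rts, mcq: dict mapping candidate name -> per-summary scores (same order).
    Returns (agreement_by_candidate, sorted list of flagged candidates).
    """
    agreement = {c: pearson(rts[c], mcq[c]) for c in rts}
    flagged = sorted(c for c, r in agreement.items() if r < tolerance)
    return agreement, flagged


# Toy scores (hypothetical): the two protocols agree on sys_a but not on sys_b.
rts_scores = {"sys_a": [1, 2, 3, 4, 5], "sys_b": [3, 4, 2, 5, 1]}
mcq_scores = {"sys_a": [1, 2, 3, 5, 4], "sys_b": [3, 2, 4, 1, 5]}

agreement, flagged = flag_unreliable(rts_scores, mcq_scores, tolerance=0.5)
print(agreement)
print(flagged)  # candidates to send for targeted human evaluation
```

The tolerance is a budget decision: a higher value sends more candidates to human review, a lower one trusts the LLM scores more often.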

Risks & Boundaries

Limitations

Evaluation uses a single benchmark (SummEval) with 12 systems and 100 summaries each; per-system meta-correlation may shift with larger datasets.

Human reference is the average of three experts; human bias may propagate to measured alignment.

When Not To Use

To decide between closely matched systems (small performance gap).

To fully replace humans when evaluating very high-quality summarizers.

Failure Modes

Candidate-dependence: evaluator aligns unevenly across systems.

Dimension-dependence: different accuracy across coherence/consistency/fluency/relevance.

Core Entities

Models

ChatGPT (gpt-3.5-turbo-0301)

GPT-4 (gpt-4-0314)

Llama 2 (7B, 13B, 70B)

Metrics

ROUGE-1/2/L

BERTScore

BARTScore

BARTScore-CNN

BARTScore-CNN-PARA

ChatGPT-RTS (reason-then-score)

ChatGPT-MCQ (multiple-choice)

H2H (head-to-head)

meta-correlation (new)

Datasets

SummEval

CNN/DM

Benchmarks

SummEval (1200 summaries from 12 systems)