ChatGPT/GPT-4 beat classic metrics but are unstable evaluators for abstractive summarization

Overview

Decision Snapshot: Needs Validation

The experiments use public SummEval data and fixed LLM snapshots; results are robust across ChatGPT/GPT-4 but limited to 12 candidate systems and 100 summaries each, so findings are well supported but not universal.

Citations: 5

Evidence Strength: 0.85

Confidence: 0.86

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 35%

Novelty: 50%

Authors

Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, Lidong Bing

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs offer a fast, cheap proxy to human evaluation and outperform classical automatic metrics on many signals, but they can mislead product decisions when models are close in quality or when systems are very strong; use LLM-based scores for rough triage and keep humans in the loop for final judgments.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

The authors test ChatGPT and GPT-4 as zero-shot graders for abstractive summarization, using two Likert-style scoring prompts, reason-then-score (RTS) and multiple-choice (MCQ), plus head-to-head (H2H) comparison. LLM-based scores correlate better with humans than many classic metrics and pick the correct winner in most coarse comparisons (ChatGPT-RTS: 58.5/66 pairs, 88.6%). But LLMs are unstable: their alignment with humans varies by evaluated system and by evaluation dimension, they struggle on closely matched systems (63.6% on hard pairs), and they become less aligned with humans for very high-quality summaries. The paper proposes using RTS–MCQ agreement as a cheap reliability check and releases code and generations.

Problem Statement

Can off-the-shelf LLMs reliably replace human judges for abstractive summarization? The paper tests ChatGPT and GPT-4 as zero-shot evaluators across coherence, consistency (factuality), fluency, and relevance, and quantifies stability, bias across candidate systems, and failure modes.

Main Contribution

Comprehensive evaluation of ChatGPT and GPT-4 as zero-shot summarization evaluators across four human dimensions (coherence, consistency, fluency, relevance).

Introduce and use a meta-correlation metric that measures whether an evaluator's human-alignment varies with candidate quality.
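One way to read the meta-correlation idea is as a two-level computation: first measure how well the evaluator tracks humans within each candidate system, then correlate that per-system alignment with the system's human-rated quality. The sketch below follows that reading. The function names, the dictionary layout, and the choice of Pearson correlation are illustrative assumptions; this summary does not state which correlation coefficient the paper uses.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def meta_correlation(human, llm):
    """human, llm: dict mapping system name -> per-summary scores (same order).

    Assumed reading of the metric: correlate each system's mean human score
    (its quality) with the evaluator's within-system human alignment.
    """
    systems = sorted(human)
    quality = [sum(human[s]) / len(human[s]) for s in systems]
    alignment = [pearson(human[s], llm[s]) for s in systems]
    return pearson(quality, alignment)

# Toy data (made up): alignment gets worse as system quality rises.
human = {"A": [1, 2, 3], "B": [2, 3, 4], "C": [4, 5, 5]}
llm = {"A": [1, 2, 3], "B": [2, 4, 3], "C": [5, 4, 4]}
print(meta_correlation(human, llm))
```

Under this reading, a clearly negative value would indicate that the evaluator agrees with humans less on stronger systems, which matches the instability the paper reports for high-quality summaries.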

Key Findings

LLM evaluators correlate better with humans than many automatic metrics.

Numbers: ChatGPT-RTS Spearman up to 0.448 (relevance); fluency gains vs baselines up to +0.2

Practical Use: If you must use an automatic metric, ChatGPT/GPT-4 give stronger human correlation than ROUGE/BERTScore on these dimensions, which makes them useful for coarse, faster comparisons.

Evidence Ref: Table 4; §4.2.2

ChatGPT-RTS picks the human-preferred system in most coarse comparisons but fails on close pairs.

Numbers: 58.5/66 correct pairs (88.6%) on full set; 7/11 (63.6%) on close challenge pairs

Practical Use: Use LLM evaluators to detect large quality gaps, not to decide between near-equal systems; add humans for tight comparisons.

Evidence Ref: Table 5; §4.2.1
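The 66 pairs are consistent with comparing every pair among the 12 candidate systems (12 choose 2 = 66). A minimal sketch of such a pairwise "correct preferences" count follows. The half credit for a metric tie is an assumption inferred from the fractional 58.5 figure, not something this summary states, and the function name and input layout are hypothetical.

```python
from itertools import combinations

def correct_preferences(human_scores, metric_scores):
    """Both args: dict mapping system name -> mean score on one dimension.

    Returns (credit, n_pairs). A pair earns 1.0 when the metric orders the
    two systems the same way humans do, and 0.5 when the metric scores them
    equally (assumed tie rule).
    """
    pairs = list(combinations(sorted(human_scores), 2))
    credit = 0.0
    for a, b in pairs:
        h = human_scores[a] - human_scores[b]
        m = metric_scores[a] - metric_scores[b]
        if m == 0:
            credit += 0.5  # assumption: metric tie earns half credit
        elif h * m > 0:
            credit += 1.0  # same sign: metric agrees with humans
    return credit, len(pairs)

# Toy example with three made-up systems.
human = {"A": 1.0, "B": 2.0, "C": 3.0}
metric = {"A": 1.0, "B": 3.0, "C": 3.0}
print(correct_preferences(human, metric))
```

Restricting the same count to pairs whose human scores are close would give a "challenge set" figure comparable in spirit to the 7/11 result above.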

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Correct preferences (#CP) - ChatGPT-RTS	58.5/66 (88.6%)	random (~33%)	≈+55 percentage points vs random	66-pair full set on SummEval	ChatGPT-RTS obtains largest #CP across dimensions	Table 5
Correct preferences (#CP) - ChatGPT-RTS on close pairs	7/11 (63.6%)	best baselines on same set	performance drops vs full set	11-pair challenge set	LLMs struggle to differentiate closely matched systems	§4.2.1; Table 5

What To Try In 7 Days

Run ChatGPT (RTS and MCQ) to rank candidate summarizers and compute per-candidate RTS–MCQ correlation R_i as a reliability check.

Flag candidates with low RTS–MCQ agreement (R_i below chosen tolerance) and run targeted human evaluation only on those.

Use H2H LLM comparisons only for large gaps; avoid LLM-only decisions for near-equal systems and high-quality models.
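The first two steps above can be sketched as a small script: compute a per-candidate RTS–MCQ correlation R_i and flag the candidates that fall below a tolerance. The names, the Pearson correlation, and the default tolerance of 0.3 are placeholders chosen for illustration; the summary leaves the tolerance to the reader and does not specify the correlation coefficient.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def flag_unreliable(rts, mcq, tolerance=0.3):
    """rts, mcq: dict mapping candidate system -> per-summary LLM scores.

    Returns (r_by_system, flagged), where flagged lists the systems whose
    RTS-MCQ agreement R_i is below the tolerance (hypothetical default).
    """
    r_by_system = {s: pearson(rts[s], mcq[s]) for s in sorted(rts)}
    flagged = [s for s, r in r_by_system.items() if r < tolerance]
    return r_by_system, flagged

# Toy scores (made up): sysA's two prompts agree, sysB's do not.
rts = {"sysA": [1, 2, 3, 4], "sysB": [1, 2, 3, 4]}
mcq = {"sysA": [1, 2, 3, 5], "sysB": [4, 1, 3, 2]}
r_by_system, flagged = flag_unreliable(rts, mcq)
print(flagged)
```

Only the flagged candidates would then go to targeted human evaluation, which keeps annotation cost proportional to how often the two prompts disagree.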

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/DAMO-NLP-SG/LLM_summeval

Data URLs

https://github.com/DAMO-NLP-SG/LLM_summeval

https://github.com/tingofurro/SummEval (SummEval dataset link referenced)

Risks & Boundaries

Limitations

Evaluation uses a single benchmark (SummEval) with 12 systems and 100 summaries each; per-system meta-correlation may shift with larger datasets.

Human reference is the average of three experts; human bias may propagate to measured alignment.

When Not To Use

To decide between closely matched systems (small performance gap).

To fully replace humans when evaluating very high-quality summarizers.

Failure Modes

Candidate-dependence: evaluator aligns unevenly across systems.

Dimension-dependence: different accuracy across coherence/consistency/fluency/relevance.

Core Entities

Models

ChatGPT (gpt-3.5-turbo-0301); GPT-4 (gpt-4-0314); Llama 2 (7B, 13B, 70B)

Metrics

ROUGE-1/2/L; BERTScore; BARTScore; BARTScore-CNN; BARTScore-CNN-PARA; ChatGPT-RTS (reason-then-score); ChatGPT-MCQ (multiple-choice); H2H (head-to-head); meta-correlation (new)

Datasets

SummEval; CNN/DM

Benchmarks

SummEval (1200 summaries from 12 systems)

