Overview
Production Readiness
0.35
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
5
Why It Matters For Business
LLMs offer a fast, cheap proxy for human evaluation and correlate with human judgments better than many classical automatic metrics. They can still mislead product decisions when the compared models are close in quality or when the systems are very strong. Use LLM-based scores for rough triage and keep humans in the loop for final judgments.
Summary TLDR
The authors test ChatGPT and GPT-4 as zero-shot graders for abstractive summarization, using Likert-style scoring via reason-then-score (RTS) and multiple-choice (MCQ) prompts plus head-to-head (H2H) comparison. LLM-based scores correlate better with humans than many classic metrics and pick the correct winner in most coarse comparisons (ChatGPT-RTS: 58.5/66 pairs, 88.6%). But LLM evaluators are unstable: their agreement with humans varies by evaluated system and by evaluation dimension, they struggle on closely matched systems (63.6% on hard pairs), and they become less aligned with humans for very high-quality summaries. The paper proposes using the RTS–MCQ agreement as a cheap reliability check and releases code and generations.
Problem Statement
Can off-the-shelf LLMs reliably replace human judges for abstractive summarization? The paper tests ChatGPT and GPT-4 as zero-shot evaluators across coherence, consistency (factuality), fluency, and relevance, and quantifies stability, bias across candidate systems, and failure modes.
Main Contribution
Evaluate ChatGPT and GPT-4 comprehensively as zero-shot summarization evaluators across four human-rated dimensions (coherence, consistency, fluency, relevance).
Introduce a meta-correlation metric that measures whether an evaluator's alignment with humans varies with candidate quality.
Show concrete failure modes: candidate-dependence, dimension-dependence, poor discrimination on close systems, and worse alignment on high-quality summaries.
Propose a practical, low-cost reliability check: compute correlation between RTS and MCQ scores per candidate and trigger human review if agreement is low.
Release code and LLM generations to reproduce experiments.
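The meta-correlation idea can be illustrated with a minimal sketch. It assumes that per-system alignment is the Spearman correlation between LLM and human scores, and that candidate quality is the system's mean human score. Both choices, along with the function names and toy data, are assumptions made for illustration and are not taken from the paper's released code.

```python
from math import sqrt
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # constant scores carry no ranking signal
    return cov / sqrt(vx * vy)

def meta_correlation(llm_scores, human_scores):
    """Correlate per-system alignment with per-system quality.

    llm_scores / human_scores: dict mapping system name to a list of
    per-summary scores (hypothetical layout, same order in both dicts).
    """
    alignments, qualities = [], []
    for system, human in human_scores.items():
        alignments.append(spearman(llm_scores[system], human))
        qualities.append(mean(human))  # assumed quality proxy
    return spearman(alignments, qualities)

# Toy data: alignment drops as the human-rated quality rises.
human = {"weak": [1, 2, 3, 4], "mid": [2, 3, 4, 5], "strong": [4.0, 4.3, 4.6, 5.0]}
llm = {"weak": [1, 2, 3, 4], "mid": [2, 3, 5, 4], "strong": [4, 3, 5, 2]}
print(meta_correlation(llm, human))
```

A negative value on real data would correspond to the failure mode described in this summary: the evaluator agrees less with humans on the stronger systems.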
Key Findings
LLM evaluators correlate better with humans than many automatic metrics.
ChatGPT-RTS picks the human-preferred system in most coarse comparisons but fails on close pairs.
LLM evaluation alignment varies strongly by evaluated system and dimension.
LLMs become less human-aligned as candidate summary quality rises for some dimensions.
RTS produces lower absolute scores and sometimes gives unrelated or false reasoning; MCQ scores more generously but avoids some of those unjustified penalties.
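The pairwise "correct preference" finding can be made concrete with a short sketch. With 12 systems there are 12 × 11 / 2 = 66 unordered pairs, which matches the denominator quoted in the summary. The function below counts pairs where the evaluator and humans strictly agree on which system is better, based on system-level mean scores. The function name, the `max_human_gap` filter for "close" pairs, and the toy data are hypothetical; the paper's exact definition of hard pairs is not reproduced here.

```python
from itertools import combinations

def count_correct_preferences(human_means, evaluator_means, max_human_gap=None):
    """Count system pairs where the evaluator prefers the same system as humans.

    human_means / evaluator_means: dict mapping system name to its mean score.
    max_human_gap: if set, only pairs whose human score gap is at most this
    value are considered (an assumed way to isolate closely matched pairs).
    Returns (number of correct pairs, number of pairs considered).
    """
    correct = 0
    considered = 0
    for a, b in combinations(sorted(human_means), 2):
        human_diff = human_means[a] - human_means[b]
        if max_human_gap is not None and abs(human_diff) > max_human_gap:
            continue
        considered += 1
        eval_diff = evaluator_means[a] - evaluator_means[b]
        # Same sign means the same preferred system; ties count as incorrect.
        if human_diff * eval_diff > 0:
            correct += 1
    return correct, considered

# Toy data: the evaluator swaps the two middle systems.
human_means = {"a": 1, "b": 2, "c": 3, "d": 4}
evaluator_means = {"a": 1, "b": 3, "c": 2, "d": 4}
print(count_correct_preferences(human_means, evaluator_means))
print(count_correct_preferences(human_means, evaluator_means, max_human_gap=1))
```

In the toy data the single swap costs one pair overall but a larger share of the close pairs, which mirrors the pattern reported above: coarse comparisons look fine while closely matched systems are harder to separate.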
Results
Correct preferences (#CP) - ChatGPT-RTS
Correct preferences (#CP) - ChatGPT-RTS on close pairs
Spearman correlation (ChatGPT-RTS)
Spearman correlation (GPT-4-RTS)
Meta-correlation (LLM)
Average scores (ChatGPT)
Who Should Care
What To Try In 7 Days
Run ChatGPT (RTS and MCQ) to rank candidate summarizers and compute per-candidate RTS–MCQ correlation R_i as a reliability check.
Flag candidates with low RTS–MCQ agreement (R_i below chosen tolerance) and run targeted human evaluation only on those.
Use head-to-head (H2H) LLM comparisons only when the quality gap between systems is large; avoid LLM-only decisions for near-equal systems and for high-quality models.
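The RTS–MCQ reliability check from the steps above could be sketched as follows. This is a minimal illustration, assuming R_i is the Spearman correlation between a system's per-summary RTS and MCQ scores; the function names, the data layout, and the 0.5 tolerance are hypothetical and should be tuned for your own setup.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # constant scores carry no ranking signal
    return cov / sqrt(vx * vy)

def flag_for_human_review(rts_scores, mcq_scores, tolerance=0.5):
    """Return {system: R_i} for systems whose RTS-MCQ agreement is low.

    rts_scores / mcq_scores: dict mapping system name to per-summary
    scores, listed in the same summary order (hypothetical layout).
    tolerance: assumed cut-off below which a system goes to human review.
    """
    flagged = {}
    for system, rts in rts_scores.items():
        r_i = spearman(rts, mcq_scores[system])
        if r_i < tolerance:
            flagged[system] = r_i
    return flagged

# Toy data: the two prompts agree on sys_a and disagree on sys_b.
rts = {"sys_a": [1, 2, 3, 4, 5], "sys_b": [1, 2, 3, 4, 5]}
mcq = {"sys_a": [2, 3, 3, 4, 5], "sys_b": [5, 4, 3, 2, 1]}
print(flag_for_human_review(rts, mcq))
```

Only the flagged systems would then be sent for targeted human evaluation, which keeps annotation cost proportional to how often the two prompting methods disagree.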
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation uses a single benchmark (SummEval) with 12 systems and 100 summaries each; per-system meta-correlation may shift with larger datasets.
- Human reference is the average of three experts; human bias may propagate to measured alignment.
- Prompt design affects outcomes; paper did not exhaustively optimize prompts.
- Dependence on commercial LLM snapshots (API versions) limits exact reproducibility over time.
When Not To Use
- To decide between closely matched systems (small performance gap).
- To fully replace humans when evaluating very high-quality summarizers.
- When RTS and MCQ disagree substantially for a candidate (low R_i).
- For absolute scoring without calibration (RTS scores are conservative).
Failure Modes
- Candidate-dependence: evaluator aligns unevenly across systems.
- Dimension-dependence: different accuracy across coherence/consistency/fluency/relevance.
- False or unrelated reasoning from RTS leading to unjustified penalization.
- Negative meta-correlation: poorer alignment on higher-quality candidates.
- Overly generous scoring in GPT-4 on some dimensions, causing ceiling effects.
Core Entities
Models
- ChatGPT (gpt-3.5-turbo-0301)
- GPT-4 (gpt-4-0314)
- Llama 2 (7B, 13B, 70B)
Metrics
- ROUGE-1/2/L
- BERTScore
- BARTScore
- BARTScore-CNN
- BARTScore-CNN-PARA
- ChatGPT-RTS (reason-then-score)
- ChatGPT-MCQ (multiple-choice)
- H2H (head-to-head)
- meta-correlation (new)
Datasets
- SummEval
- CNN/DM
Benchmarks
- SummEval (1200 summaries from 12 systems)

