ChatGPT can judge summary factuality zero‑shot but shows lexical bias, false reasoning, and prompt sensitivity

March 27, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

50

Authors

Zheheng Luo, Qianqian Xie, Sophia Ananiadou

Links

Abstract / PDF

Why It Matters For Business

ChatGPT offers a ready-to-use, zero-shot factuality evaluator that can reduce annotation and training costs and often aligns better with human judgments, but it needs calibration for paraphrase-heavy or domain-specific text.

Summary TLDR

This study tests ChatGPT (gpt-3.5-turbo-0301) as a zero-shot factuality judge for summarization across three tasks: binary entailment (is the summary entailed?), pairwise ranking (which of two is faithful?), and numeric consistency rating (1–10). ChatGPT with a chain-of-thought (CoT) prompt often matches or beats specialized metrics on public benchmarks (SUMMAC, FRANK, SummEval, CNN/DM ranking). Strengths: strong correlation with human ratings and high ranking accuracy. Weaknesses: a strong bias toward lexical overlap, missed subtle paraphrase errors, occasional false inferences, and unstable adherence to prompt instructions. Use CoT-style prompts and domain calibration before trusting in‑s​

Problem Statement

Existing factuality metrics either need lots of annotated data or heavy pipelines and often disagree with humans. The paper asks: can ChatGPT serve as a ready, zero-shot evaluator for factual inconsistency in summaries, and what are its limits?

Main Contribution

Systematic zero-shot evaluation of ChatGPT on three factuality tasks: entailment inference, summary ranking, and consistency rating.

Shows chain-of-thought prompts improve ChatGPT's factuality judgments.

Compares ChatGPT to standard factuality metrics on SUMMAC, FRANK, SummEval and ranking data.

Analyzes failure modes: lexical bias, false reasoning, and prompt-following gaps.
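As an illustration of the zero-shot CoT setup for the entailment task, here is a minimal Python sketch. The prompt wording, the `Answer: yes/no` convention, and the names `COT_TEMPLATE`, `build_prompt`, and `parse_verdict` are assumptions made for this example; they are not the prompts used in the paper. The call to the chat model is deliberately left out.

```python
import re

# Hypothetical prompt wording: NOT the paper's prompt, only an illustration
# of asking for step-by-step reasoning followed by a fixed-format verdict.
COT_TEMPLATE = (
    "Decide whether the summary is consistent with the article.\n\n"
    "Article: {article}\n"
    "Summary: {summary}\n\n"
    "Explain your reasoning step by step, then finish with a line "
    "'Answer: yes' or 'Answer: no'."
)


def build_prompt(article: str, summary: str) -> str:
    """Fill the (assumed) CoT template with one article/summary pair."""
    return COT_TEMPLATE.format(article=article.strip(), summary=summary.strip())


def parse_verdict(reply: str):
    """Return True (consistent), False (inconsistent), or None if no verdict.

    The last 'Answer: ...' occurrence wins, since the reasoning text may
    mention the answer format before the final verdict.
    """
    matches = re.findall(r"answer\s*:\s*(yes|no)\b", reply, flags=re.IGNORECASE)
    if not matches:
        return None
    return matches[-1].lower() == "yes"
```

Returning `None` for replies without a verdict keeps format failures visible instead of silently counting them as one class.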

Key Findings

ChatGPT (zero-shot + CoT) often matches or beats prior factuality metrics on multiple benchmarks.

Numbers: CoGenSumm balanced accuracy (BA) 74.3% vs SummaC ZS 70.4%; SummEval 83.3% vs 78.7% (Table 2)

ChatGPT favors lexical overlap and misses abstractive/paraphrase inconsistencies.

Numbers: XSumFaith BA 63.1% (ChatGPT ZS-COT) vs SummaC Conv 66.4%; specificity >95% but low sensitivity on 5/6 datasets (Fig. 1, §

ChatGPT scores align strongly with human consistency ratings on FRANK.

Numbers: Pearson ρ=0.70, Spearman r=0.69 on FRANK (Table 4)

Chain-of-thought prompting substantially boosts performance over direct prompts.

Numbers: ChatGPT ZS-COT improves over ChatGPT ZS by +11.0% on CoGenSumm; deltas for the other datasets are in Table 2

ChatGPT sometimes produces false or post-hoc reasoning and can ignore prompt definitions.
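Because the model does not always follow rating instructions, it can help to validate each reply rather than coerce it into a number. The sketch below assumes a hypothetical convention in which the prompt asks for a final `Score: <integer>` on a 1 to 10 scale; the convention and the function names are illustrative and not taken from the paper.

```python
import re


def parse_rating(reply: str, low: int = 1, high: int = 10):
    """Extract an integer rating from a 'Score: N' line, or None if invalid.

    Non-integer scores (e.g. 8.5) and out-of-range scores are rejected
    instead of being rounded or clipped, so non-compliance stays measurable.
    """
    matches = re.findall(r"score\s*:\s*(\d+)(?!\.?\d)", reply, flags=re.IGNORECASE)
    if not matches:
        return None
    value = int(matches[-1])
    return value if low <= value <= high else None


def compliance_rate(replies):
    """Fraction of replies that contain a valid rating."""
    if not replies:
        return 0.0
    return sum(parse_rating(r) is not None for r in replies) / len(replies)
```

Tracking the compliance rate alongside the scores shows how often the rating definition in the prompt was ignored.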

Results

Accuracy

Value: 74.3% (ChatGPT ZS-COT on CoGenSumm)

Baseline: SummaC ZS 70.4%

Accuracy

Value: 83.3% (ChatGPT ZS-COT on SummEval)

Baseline: SummaC ZS 78.7%

Accuracy

Value: 63.1% (ChatGPT ZS-COT on XSumFaith)

Baseline: SummaC Conv 66.4%

Accuracy

Value: 82.6% (ChatGPT ZS-COT on FRANK)

Baseline: SummaC ZS 79.0%

Accuracy

Value: 85.2% (ChatGPT)

Baseline: Human 83.9%; DAE 83.6%

Correlation with human ratings (consistency score)

Value: Pearson ρ=0.70, Spearman r=0.69 on FRANK

Baseline: FactCC Pearson ρ=0.20

Specificity vs Sensitivity (entailment inference)

Value: Specificity >95% on 5/6 datasets; sensitivity much lower
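To make the specificity/sensitivity gap concrete, the following sketch computes both rates and balanced accuracy (their mean) from gold and predicted labels. It assumes the positive class is "inconsistent", in line with the failure-mode description of poor sensitivity to inconsistent examples; the toy labels are invented for illustration and are not results from the paper.

```python
def confusion_rates(gold, pred, positive="inconsistent"):
    """Sensitivity, specificity, balanced accuracy, and plain accuracy."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    tn = sum(1 for g, p in pairs if g != positive and p != positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


# Toy data (made up): the judge catches 1 of 4 inconsistent summaries
# and labels all 6 consistent summaries correctly.
gold = ["inconsistent"] * 4 + ["consistent"] * 6
pred = ["inconsistent"] + ["consistent"] * 9
toy_rates = confusion_rates(gold, pred)
```

In this toy case plain accuracy is 0.7 while balanced accuracy is 0.625, which shows why balanced accuracy is the more informative number when a judge rarely flags errors.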

Who Should Care

What To Try In 7 Days

Run ChatGPT ZS-COT prompts on a sample of your summaries and compare to your current metric on balanced accuracy and correlation with human labels.

Test sensitivity by creating paraphrase-based errors (abstractive changes) to see if ChatGPT misses them.

Use ChatGPT ranking prompts to prioritize summaries and inspect top disagreements manually for patterns of failure.
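For the correlation check against human labels, a dependency-free sketch of Pearson and Spearman correlation is given below. It is a generic textbook implementation (Spearman as Pearson on average ranks), not code from the paper, and it assumes neither input is constant. Library routines such as those in SciPy would normally be preferable in practice.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)  # undefined (division by zero) for constant input


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Typical usage would pass the model's 1 to 10 ratings as `x` and the human consistency scores as `y` for the same summaries.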

Reproducibility

Data Urls

  • SUMMAC datasets referenced (FactCC, CoGenSumm, XSumFaith, SummEval, FRANK, Polytope)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only zero-shot prompts on the ChatGPT API were tested; there are no few-shot experiments and no comparison with GPT-4.
  • Limited scope due to API cost; some prompt variations not explored.
  • ChatGPT shows lexical bias and misses paraphrase-based inconsistencies.
  • ChatGPT can produce false or post-hoc explanations and may not follow prompt definitions strictly.

When Not To Use

  • When summaries are highly abstractive or paraphrased with low lexical overlap
  • For high-stakes factual verification without human review
  • If you require deterministic, explainable rule-based checks
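One rough way to spot summaries in the low-overlap regime is to measure how many summary tokens also appear in the source. The sketch below is a simple unigram heuristic chosen for this example, not a method from the paper; the tokenizer and any review threshold are assumptions to tune on your own data.

```python
import re


def tokens(text: str):
    """Lowercased alphanumeric tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def summary_overlap(source: str, summary: str) -> float:
    """Fraction of summary tokens that also occur in the source text."""
    source_vocab = set(tokens(source))
    summary_tokens = tokens(summary)
    if not summary_tokens:
        return 0.0
    return sum(t in source_vocab for t in summary_tokens) / len(summary_tokens)
```

Summaries with a low ratio are candidates for human review, since that is where missed inconsistencies are most likely; a high ratio does not guarantee faithfulness, because meaning can change while the wording stays close.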

Failure Modes

  • False positive consistency when lexical overlap is high but meaning changed
  • False reasoning: justifying incorrect conclusions post-hoc
  • Poor sensitivity to inconsistent examples (misses many errors)
  • Inability to consistently follow rating definitions in prompts

Core Entities

Models

  • ChatGPT (gpt-3.5-turbo-0301)

Metrics

  • Accuracy
  • Sensitivity (recall+)
  • Specificity (recall-)
  • Pearson correlation
  • Spearman correlation
  • Kendall Tau

Datasets

  • SUMMAC (CoGenSumm, XSumFaith, Polytope, FactCC, SummEval, FRANK)
  • Falke ranking dataset (CNN/DM samples)
  • SummEval
  • FRANK

Benchmarks

  • SUMMAC

Context Entities

Models

  • SummaC (ZS and Conv)
  • QuestEval
  • FactCC
  • FEQA
  • DAE
  • MNLI-doc
  • NER Overlap
  • QAGS
  • BARTScore (mentioned)

Metrics

  • ROUGE (mentioned as baseline for quality)
  • BERTScore (mentioned)

Datasets

  • CNN/DM (used for ranking)
  • XSum (as origin of abstractive outputs)

Benchmarks

  • SUMMAC (Laban et al., 2022)