Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.7
Citation Count
50
Why It Matters For Business
ChatGPT offers a ready-to-use, zero-shot factuality evaluator that can reduce annotation and training costs and often aligns better with human judgments than existing factuality metrics, but it needs calibration for paraphrase-heavy or domain-specific text.
Summary TLDR
This study tests ChatGPT (gpt-3.5-turbo-0301) as a zero-shot factuality judge for summarization across three tasks: binary entailment (is the summary entailed?), pairwise ranking (which of two is faithful?), and numeric consistency rating (1–10). ChatGPT with a chain-of-thought (CoT) prompt often matches or beats specialized metrics on public benchmarks (SUMMAC, FRANK, SummEval, CNN/DM ranking). Strengths: strong correlation with human ratings and high ranking accuracy. Weaknesses: a strong bias toward lexical overlap, missed subtle paraphrase errors, occasional false inferences, and unstable adherence to prompt instructions. Use CoT-style prompts and domain calibration before trusting in‑s
Problem Statement
Existing factuality metrics either require large amounts of annotated data or rely on heavy multi-stage pipelines, and they often disagree with human judgments. The paper asks: can ChatGPT serve as a ready-made, zero-shot evaluator for factual inconsistency in summaries, and what are its limits?
Main Contribution
Systematic zero-shot evaluation of ChatGPT on three factuality tasks: entailment inference, summary ranking, and consistency rating.
Shows chain-of-thought prompts improve ChatGPT's factuality judgments.
Compares ChatGPT to standard factuality metrics on SUMMAC, FRANK, SummEval and ranking data.
Analyzes failure modes: lexical bias, false reasoning, and prompt-following gaps.
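The three tasks above can be sketched as prompt templates plus small reply parsers. This is a minimal illustration only: the template wording, the `build_prompt`, `parse_yes_no`, and `parse_rating` helpers, and the "Score:" output convention are all assumptions made here, not the paper's actual prompts. No API call is included; the reply string is whatever your chat client returns.

```python
import re

# Hypothetical prompt wording (assumption): the paper's exact prompts are not reproduced here.
ENTAILMENT_COT = (
    "Decide if the following summary is consistent with the article.\n"
    "Article: {article}\n"
    "Summary: {summary}\n"
    "Explain your reasoning step by step, then answer (yes or no) on the last line."
)
RANKING = (
    "Article: {article}\n"
    "Summary A: {summary_a}\n"
    "Summary B: {summary_b}\n"
    "Which summary is more consistent with the article? Answer (A or B)."
)
RATING = (
    "Score the following summary for consistency with the article, from 1 to 10.\n"
    "Article: {article}\n"
    "Summary: {summary}\n"
    "End your reply with a line of the form 'Score: <number>'."
)


def build_prompt(template, **fields):
    """Fill a template with the article and summary text."""
    return template.format(**fields)


def parse_yes_no(reply):
    """Return True for yes, False for no, None if neither word appears.

    With chain-of-thought replies the verdict comes last, so the final
    yes/no token is used rather than the first one.
    """
    matches = re.findall(r"\b(yes|no)\b", reply.lower())
    if not matches:
        return None
    return matches[-1] == "yes"


def parse_rating(reply, low=1, high=10):
    """Return the last 'Score: n' value within [low, high], else None."""
    matches = re.findall(r"score\s*:\s*(\d+)", reply.lower())
    if not matches:
        return None
    value = int(matches[-1])
    return value if low <= value <= high else None
```

Returning `None` for unparseable replies is deliberate: since the model does not always follow prompt instructions, counting these cases separately is more informative than coercing them into a label.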
Key Findings
ChatGPT (zero-shot + CoT) often matches or beats prior factuality metrics on multiple benchmarks.
ChatGPT favors lexical overlap and misses abstractive/paraphrase inconsistencies.
ChatGPT scores align strongly with human consistency ratings on FRANK.
Chain-of-thought prompting substantially boosts performance over direct prompts.
ChatGPT sometimes produces false or post-hoc reasoning and can ignore prompt definitions.
Results
Accuracy
Correlation with human ratings (consistency score)
Specificity vs Sensitivity (entailment inference)
Who Should Care
What To Try In 7 Days
Run ChatGPT zero-shot chain-of-thought (ZS-COT) prompts on a sample of your summaries, then compare the results with your current metric on balanced accuracy and correlation with human labels.
Test sensitivity by creating paraphrase-based errors (abstractive changes) to see if ChatGPT misses them.
Use ChatGPT ranking prompts to prioritize summaries and inspect top disagreements manually for patterns of failure.
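For the comparison and disagreement steps above, a minimal sketch follows. It assumes binary labels where `True` means "consistent", and the function names are hypothetical. Which per-class recall is called sensitivity and which specificity depends on the class you treat as positive, so the code reports both recalls by class name instead.

```python
def per_class_recall(labels, preds):
    """Return (recall on consistent examples, recall on inconsistent examples).

    labels and preds are equal-length sequences of booleans,
    with True meaning the summary is judged consistent.
    """
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    on_consistent = [p for y, p in zip(labels, preds) if y]
    on_inconsistent = [p for y, p in zip(labels, preds) if not y]
    if not on_consistent or not on_inconsistent:
        raise ValueError("both classes must be present in labels")
    recall_consistent = sum(1 for p in on_consistent if p) / len(on_consistent)
    recall_inconsistent = sum(1 for p in on_inconsistent if not p) / len(on_inconsistent)
    return recall_consistent, recall_inconsistent


def balanced_accuracy(labels, preds):
    """Mean of the two per-class recalls; robust to class imbalance."""
    recall_consistent, recall_inconsistent = per_class_recall(labels, preds)
    return (recall_consistent + recall_inconsistent) / 2


def disagreements(preds_a, preds_b):
    """Indices where two evaluators give different verdicts, for manual review."""
    return [i for i, (a, b) in enumerate(zip(preds_a, preds_b)) if a != b]
```

Two evaluators can share the same balanced accuracy while failing on different classes, so inspect the per-class recalls and the disagreement indices rather than the single number alone.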
Reproducibility
Data Urls
- SUMMAC datasets referenced (FactCC, CoGenSumm, XSumFaith, SummEval, FRANK, Polytope)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only zero-shot prompts on ChatGPT API were tested; no few-shot or comparison with GPT-4.
- Limited scope due to API cost; some prompt variations not explored.
- ChatGPT shows lexical bias and misses paraphrase-based inconsistencies.
- ChatGPT can produce false or post-hoc explanations and may not follow prompt definitions strictly.
When Not To Use
- When summaries are highly abstractive or paraphrased with low lexical overlap
- For high-stakes factual verification without human review
- If you require deterministic, explainable rule-based checks
Failure Modes
- False positive consistency when lexical overlap is high but meaning changed
- False reasoning: justifying incorrect conclusions post-hoc
- Poor sensitivity to inconsistent examples (misses many errors)
- Inability to consistently follow rating definitions in prompts
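One way to probe the lexical-overlap failure mode on your own data is to split human-labelled inconsistent summaries by how many of their tokens appear in the source, then check how often the evaluator wrongly accepts them in each group. The sketch below is an assumption-laden diagnostic, not the paper's analysis: whitespace tokenization, the 0.8 threshold, and both function names are illustrative choices.

```python
def unigram_overlap(source, summary):
    """Fraction of summary tokens that also occur in the source (case-insensitive)."""
    source_tokens = set(source.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    hits = sum(1 for token in summary_tokens if token in source_tokens)
    return hits / len(summary_tokens)


def miss_rate_by_overlap(examples, threshold=0.8):
    """Rate at which inconsistent summaries are accepted, per overlap bucket.

    examples: iterable of (source, summary, human_consistent, evaluator_consistent).
    Only examples that humans marked inconsistent are counted.
    Returns {"high": rate or None, "low": rate or None}; None means an empty bucket.
    """
    counts = {"high": [0, 0], "low": [0, 0]}  # [accepted by evaluator, total]
    for source, summary, human_consistent, evaluator_consistent in examples:
        if human_consistent:
            continue
        bucket = "high" if unigram_overlap(source, summary) >= threshold else "low"
        counts[bucket][1] += 1
        if evaluator_consistent:
            counts[bucket][0] += 1
    return {
        name: (accepted / total if total else None)
        for name, (accepted, total) in counts.items()
    }
```

A large gap between the two buckets suggests the evaluator's verdicts are tracking surface overlap on your data rather than meaning.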
Core Entities
Models
- ChatGPT (gpt-3.5-turbo-0301)
Metrics
- Accuracy
- Sensitivity (recall+)
- Specificity (recall-)
- Pearson correlation
- Spearman correlation
- Kendall Tau
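The three correlation metrics listed above can be computed without external libraries. The following is a minimal pure-Python sketch under stated assumptions: ties receive average ranks for Spearman, Kendall's tau is the tau-b variant, and inputs are equal-length numeric lists with non-zero variance. The function names are illustrative; a statistics library would normally be preferable in practice.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)  # raises ZeroDivisionError on constant input


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b, computed over all pairs in O(n^2) time."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both; excluded from both denominator terms
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom
```

Here `x` would hold human consistency ratings and `y` the evaluator's 1–10 scores for the same summaries.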
Datasets
- SUMMAC (CoGenSumm, XSumFaith, Polytope, FactCC, SummEval, FRANK)
- Falke ranking dataset (CNN/DM samples)
- SummEval
- FRANK
Benchmarks
- SUMMAC
Context Entities
Models
- SummaC (ZS and Conv)
- QuestEval
- FactCC
- FEQA
- DAE
- MNLI-doc
- NER Overlap
- QAGS
- BARTScore (mentioned)
Metrics
- ROUGE (mentioned as baseline for quality)
- BERTScore (mentioned)
Datasets
- CNN/DM (used for ranking)
- XSum (as origin of abstractive outputs)
Benchmarks
- SUMMAC (Laban et al., 2022)

