Overview
Empirical results on public benchmarks show practical promise for zero-shot evaluation, but lexical bias, false reasoning, and prompt instability lower reliability for high-stakes or abstract summaries; validate per use case.
Citations: 50
Evidence Strength: 0.70
Confidence: 0.78
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 6/7
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
ChatGPT offers a ready-to-use, zero-shot factuality evaluator that can reduce annotation and training costs and often aligns better with human judgments than prior factuality metrics, but it needs calibration for paraphrase-heavy or domain-specific text.
Summary TLDR
This study tests ChatGPT (gpt-3.5-turbo-0301) as a zero-shot factuality judge for summarization across three tasks: binary entailment (is the summary entailed?), pairwise ranking (which of two is faithful?), and numeric consistency rating (1–10). ChatGPT with a chain-of-thought (CoT) prompt often matches or beats specialized metrics on public benchmarks (SUMMAC, FRANK, SummEval, CNN/DM ranking). Strengths: strong correlation with human ratings and high ranking accuracy. Weaknesses: a strong bias toward lexical overlap, missed subtle paraphrase errors, occasional false inferences, and unstable adherence to prompt instructions. Use CoT-style prompts and domain calibration before trusting in‑s
Problem Statement
Existing factuality metrics either require large amounts of annotated data or rely on heavy pipelines, and they often disagree with human judgments. The paper asks: can ChatGPT serve as a ready-to-use, zero-shot evaluator of factual inconsistency in summaries, and what are its limits?
Main Contribution
Systematic zero-shot evaluation of ChatGPT on three factuality tasks: entailment inference, summary ranking, and consistency rating.
Shows chain-of-thought prompts improve ChatGPT's factuality judgments.
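To make the zero-shot versus chain-of-thought distinction concrete, here is a minimal sketch of how an entailment prompt might be built and its answer parsed. The prompt wording is an illustrative assumption, not the paper's exact prompt, and `build_entailment_prompt` and `parse_verdict` are hypothetical helpers. No API call is made; the returned string would be sent to whatever chat client you use.

```python
import re


def build_entailment_prompt(article: str, summary: str, use_cot: bool = True) -> str:
    """Build a zero-shot prompt asking whether a summary is consistent with its article.

    The wording is an illustrative assumption, not the paper's exact prompt.
    """
    question = (
        "Decide if the following summary is consistent with the corresponding article. "
        "Consistency means all information in the summary is supported by the article.\n"
        f"Article: {article}\n"
        f"Summary: {summary}\n"
    )
    if use_cot:
        # Chain-of-thought variant: ask for reasoning first, then the verdict.
        return question + "Explain your reasoning step by step, then answer (yes or no):"
    return question + "Answer (yes or no):"


def parse_verdict(response: str):
    """Return True/False for the last standalone yes/no in the response, or None if absent."""
    matches = re.findall(r"\b(yes|no)\b", response.lower())
    if not matches:
        return None
    return matches[-1] == "yes"
```

Returning `None` when no verdict is found lets a caller count and retry responses that ignore the requested format, which is one way to track the prompt-adherence instability noted in this summary.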
Key Findings
ChatGPT (zero-shot + CoT) often matches or beats prior factuality metrics on multiple benchmarks.
ChatGPT favors lexical overlap and misses abstractive/paraphrase inconsistencies.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 74.3% (ChatGPT ZS-COT on CoGenSum) | SummaC ZS 70.4% | +3.9 pp | CoGenSum (SUMMAC) | Table 2: ChatGPT ZS-COT 74.3 vs SummaC ZS 70.4 | Table 2 |
| Accuracy | 83.3% (ChatGPT ZS-COT on SummEval) | SummaC ZS 78.7% | +4.6 pp | SummEval (SUMMAC) | Table 2: ChatGPT ZS-COT 83.3 vs SummaC ZS 78.7 | Table 2 |
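The Delta column is an absolute difference between two percentages, that is, percentage points rather than a relative gain. A short sketch of the arithmetic, using only the four scores from the rows above (the function names are hypothetical):

```python
# Scores copied from the two table rows above, in percent: (ChatGPT ZS-COT, SummaC ZS).
ROWS = {
    "CoGenSum": (74.3, 70.4),
    "SummEval": (83.3, 78.7),
}


def delta_pp(score: float, baseline: float) -> float:
    """Absolute difference in percentage points."""
    return round(score - baseline, 1)


def relative_gain_pct(score: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return round(100 * (score - baseline) / baseline, 1)


for name, (score, baseline) in ROWS.items():
    print(name, delta_pp(score, baseline), relative_gain_pct(score, baseline))
```

The two quantities differ: a gain of 3.9 percentage points over a 70.4% baseline is a relative improvement of about 5.5%.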
What To Try In 7 Days
Run ChatGPT ZS-COT prompts on a sample of your summaries and compare to your current metric on balanced accuracy and correlation with human labels.
Test sensitivity by creating paraphrase-based errors (abstractive changes) to see if ChatGPT misses them.
Use ChatGPT ranking prompts to prioritize summaries and inspect top disagreements manually for patterns of failure.
Risks & Boundaries
Limitations
Only zero-shot prompts on the ChatGPT API were tested; few-shot prompting and GPT-4 were not evaluated.
Limited scope due to API cost; some prompt variations not explored.
When Not To Use
When summaries are highly abstractive or paraphrased with low lexical overlap
For high-stakes factual verification without human review
Failure Modes
False positive consistency when lexical overlap is high but meaning changed
False reasoning: the model justifies incorrect conclusions post hoc
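One way to act on the lexical-overlap failure mode is to route "consistent" verdicts on high-overlap summaries to human review. The sketch below is a heuristic under stated assumptions, not a method from the paper: `unigram_overlap` and `needs_human_review` are hypothetical helpers, and the 0.9 threshold is an arbitrary starting point to tune on your own data.

```python
import re


def unigram_overlap(source: str, summary: str) -> float:
    """Fraction of summary word tokens that also appear in the source."""
    source_tokens = set(re.findall(r"\w+", source.lower()))
    summary_tokens = re.findall(r"\w+", summary.lower())
    if not summary_tokens:
        return 0.0
    return sum(1 for tok in summary_tokens if tok in source_tokens) / len(summary_tokens)


def needs_human_review(source: str, summary: str, judged_consistent: bool,
                       threshold: float = 0.9) -> bool:
    """Flag 'consistent' verdicts on high-overlap summaries.

    The threshold is an assumed default, not a value from the paper.
    """
    return judged_consistent and unigram_overlap(source, summary) >= threshold


source = "The company reported a profit in 2020 and a loss in 2021."
summary = "The company reported a loss in 2020."  # every word is in the source, but the meaning changed
print(needs_human_review(source, summary, judged_consistent=True))
```

In this example every summary token occurs in the source, so overlap is maximal even though the summary swaps the years; a "consistent" verdict here is exactly the kind of case worth a manual check.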

