Overview
Production Readiness
0.2
Novelty Score
0.3
Cost Impact Score
0.2
Citation Count
1
Why It Matters For Business
LLMs are not yet reliable solo tools for automatic fact-checking of summaries; using them as the sole grader can miss or invert errors and risk incorrect decisions.
Summary TLDR
The paper tests whether GPT-3.5, GPT-4 and PaLM-2 can judge the factual consistency of summaries. It tries two modes: (1) a single-LLM QA pipeline in which one model selects answers, generates questions, answers them, and checks answer overlap, and (2) direct 1–5 faithfulness scoring. Using the FRANK benchmark and partial correlations to control for confounders, the authors find near-zero correlations with human judgments for most models and error types. GPT-3.5 shows modest positive signals for predicate and circumstance errors (Spearman ≈ 0.33–0.37). Overall, current LLMs should not be treated as reliable automatic factuality judges without domain validation.
Problem Statement
Can current LLMs (GPT-3.5, GPT-4, PaLM-2) reliably evaluate the factual consistency of model-generated summaries? Prior automatic metrics show poor agreement with humans; this study tests single-LLM QA evaluation and direct scoring on the FRANK benchmark while controlling for confounders with partial correlation.
Main Contribution
Propose a single-LLM QA pipeline that uses one model to do answer selection, question generation, and question answering for factuality checks.
Evaluate GPT-3.5, GPT-4, and PaLM-2 as (a) QA-based factuality evaluators and (b) direct 1–5 faithfulness scorers on the FRANK benchmark.
Use partial correlation, which controls for dataset and system confounders, and report that correlations with human judgments are near zero for most models and error types; GPT-3.5 shows limited signals in two subcategories.
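The three pipeline steps named above (answer selection, question generation, question answering) followed by a word F1 overlap check can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: the `llm` callable, the prompt wording, and the averaging of per-answer F1 into one score are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable


def token_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between two answer strings (lowercased, whitespace-split)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def qa_factuality_score(source: str, summary: str, llm: Callable[[str], str]) -> float:
    """Score a summary by asking one model to do all three QA steps.

    `llm` is a hypothetical text-in, text-out callable; the prompts are
    illustrative placeholders, not the prompts used in the paper.
    """
    # Step 1: answer selection from the summary, one candidate per line.
    raw = llm(f"List the key answer spans in this summary, one per line.\n\nSummary: {summary}")
    answers = [line.strip() for line in raw.splitlines() if line.strip()]
    if not answers:
        return 0.0

    scores = []
    for answer in answers:
        # Step 2: question generation conditioned on the summary and the answer.
        question = llm(f"Write a question whose answer is '{answer}'.\n\nSummary: {summary}")
        # Step 3: question answering against the source document only.
        source_answer = llm(f"Answer using only this text.\n\nText: {source}\n\nQuestion: {question}")
        # Overlap check: compare the source-grounded answer with the summary answer.
        scores.append(token_f1(source_answer, answer))
    # Assumption: aggregate by simple mean over all selected answers.
    return sum(scores) / len(scores)
```

Passing the model as a callable keeps the sketch independent of any particular vendor SDK, so the same function can be run against a stub during testing.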
Key Findings
Across the FRANK benchmark, LLM-based factuality metrics mostly show near-zero correlation with human judgments.
GPT-3.5 shows modest positive Spearman correlation for two error types: predicate and circumstance errors.
Some coefficients reached statistical significance despite being too small to be practically useful.
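The gap between statistical significance and practical usefulness can be illustrated with a short calculation. The sketch below uses a normal approximation to the usual t-test for a correlation coefficient; the values of `r` and `n` are hypothetical and are not figures from the paper.

```python
import math


def correlation_p_value(r: float, n: int) -> float:
    """Two-sided p-value for a correlation r over n samples.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and a normal approximation to
    the t distribution, which is reasonable when n is large.
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))


# Hypothetical example: a tiny coefficient on a large sample.
r, n = 0.05, 2000
p = correlation_p_value(r, n)
variance_explained = r * r  # share of variance accounted for by the metric

print(f"r={r}, n={n}, p={p:.4f}, variance explained={variance_explained:.2%}")
```

With a sample this large the p-value falls below 0.05 even though the metric accounts for well under 1% of the variance in the human labels, which is why a significance star alone says little about whether a metric is usable.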
Results
QA-based evaluator: PaLM-2 Spearman (overall factuality)
Direct scoring: GPT-3.5 Spearman (Predicate Errors)
Direct scoring: GPT-3.5 Spearman (Circumstance Errors)
Who Should Care
What To Try In 7 Days
Run a small validation: compare your LLM evaluator scores to human labels on 100 domain examples before automating.
If using LLM scores, treat them as flags for human review rather than final verdicts.
Measure partial correlations (control for dataset/system) to spot spurious high correlations.
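One simple way to run the partial-correlation check above is to remove each group's mean from both the metric scores and the human labels, then correlate what is left. For a single categorical confounder this is equivalent to regressing both variables on group indicators and correlating the residuals. The sketch below is an assumption-laden illustration, not the paper's code: the function names and the toy data are made up, and it computes a Pearson-style coefficient (rank-transform the inputs first for a Spearman-style variant).

```python
import math
from collections import defaultdict
from typing import Hashable, List, Sequence


def demean_within_groups(values: Sequence[float], groups: Sequence[Hashable]) -> List[float]:
    """Subtract each group's mean from its members (residuals after controlling for group)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        sums[group] += value
        counts[group] += 1
    return [value - sums[group] / counts[group] for value, group in zip(values, groups)]


def partial_correlation(metric: Sequence[float], human: Sequence[float],
                        groups: Sequence[Hashable]) -> float:
    """Correlation between metric and human scores after removing group-level offsets."""
    metric_res = demean_within_groups(metric, groups)
    human_res = demean_within_groups(human, groups)
    numerator = sum(a * b for a, b in zip(metric_res, human_res))
    denominator = math.sqrt(sum(a * a for a in metric_res) * sum(b * b for b in human_res))
    return numerator / denominator if denominator > 0 else 0.0


# Toy data (hypothetical): system "b" scores higher on both axes, but within
# each system the metric moves against the human labels.
systems = ["a"] * 4 + ["b"] * 4
metric_scores = [1, 2, 3, 4, 11, 12, 13, 14]
human_scores = [4, 3, 2, 1, 14, 13, 12, 11]

raw = partial_correlation(metric_scores, human_scores, ["all"] * 8)  # no control
controlled = partial_correlation(metric_scores, human_scores, systems)
print(f"raw={raw:.3f}, controlled={controlled:.3f}")
```

In the toy data the uncontrolled correlation is strongly positive only because one system scores higher on both axes; once the system is controlled for, the sign flips. That is the kind of spurious agreement the check is meant to expose.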
Reproducibility
Data Urls
- FRANK benchmark (Pagnoni et al., 2021) as used in paper
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation limited to three closed-source LLMs and a single benchmark (FRANK).
- The models are closed-source and change over time, so results may differ with newer model versions.
- Study focuses on summarization outputs (CNN-DM, XSUM); conclusions may not generalize to other domains.
When Not To Use
- Do not use these LLM scores as the sole factuality gate for high-stakes content.
- Avoid deploying direct LLM judgment in unfamiliar domains without labeled validation data.
Failure Modes
- Tiny but statistically significant coefficients that have no practical value.
- Negative correlations for some error types (the model's faithfulness score rises as the number of human-annotated errors rises).
- High variance across error categories: some error types show signal while most do not.
Core Entities
Models
- gpt-3.5-turbo-0613
- gpt-4-0613
- PaLM-2 (text-bison@001)
Metrics
- partial correlation
- Pearson
- Spearman
- word F1
Datasets
- FRANK
- CNN-DM
- XSUM
Benchmarks
- FRANK

