Overview
The study uses a public benchmark (FRANK) and partial correlations. Effect sizes are small but clearly measured, and the evidence is solid for the claim that off-the-shelf LLMs are not reliable factuality judges on this benchmark.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 0/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 20%
Production readiness: 20%
Novelty: 30%
Why It Matters For Business
LLMs are not yet reliable solo tools for automatic fact-checking of summaries; using them as the sole grader can miss or invert errors and risk incorrect decisions.
Summary TLDR
The paper tests whether GPT-3.5, GPT-4 and PaLM-2 can judge the factual consistency of summaries. The authors test two modes: (1) a single-LLM QA pipeline in which one model selects answers, generates questions, answers them, and checks answer overlap, and (2) direct 1–5 faithfulness scoring. Using the FRANK benchmark and partial correlations to control for confounders, they find near-zero correlations with human judgments for most models and error types. GPT-3.5 shows modest positive signals for predicate and circumstance errors (Spearman ≈ 0.33–0.37). Overall, current LLMs should not be treated as reliable automatic factuality judges without domain validation.
Problem Statement
Can current LLMs (GPT-3.5, GPT-4, PaLM-2) reliably evaluate the factual consistency of model-generated summaries? Prior automatic metrics show poor agreement with humans; this study tests single-LLM QA evaluation and direct scoring on the FRANK benchmark while controlling for confounders with partial correlation.
Main Contribution
Propose a single-LLM QA pipeline that uses one model to do answer selection, question generation, and question answering for factuality checks.
Evaluate GPT-3.5, GPT-4, and PaLM-2 as (a) QA-based factuality evaluators and (b) direct 1–5 faithfulness scorers on the FRANK benchmark.
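The three pipeline steps named above (answer selection, question generation, question answering) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `llm` callable, the prompt wording, and the use of token-level F1 as the overlap measure are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between two answer strings (assumed overlap measure)."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return float(p == g)
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def qa_factuality_score(source: str, summary: str,
                        llm: Callable[[str], str]) -> float:
    """Score a summary with one LLM playing all three roles.

    `llm` is a hypothetical prompt -> text function; the prompts below
    are placeholders, not the prompts used in the study.
    """
    # Step 1: answer selection from the summary.
    raw = llm(f"List key answer spans from this summary, one per line:\n{summary}")
    answers = [a.strip() for a in raw.splitlines() if a.strip()]

    scores = []
    for answer in answers:
        # Step 2: question generation conditioned on the selected answer.
        question = llm(f"Write a question answered by '{answer}' in:\n{summary}")
        # Step 3: answer the question from the source document only.
        source_answer = llm(f"Answer using only this text:\n{source}\nQuestion: {question}")
        # Compare the source-grounded answer with the summary's answer.
        scores.append(token_f1(source_answer, answer))

    # Mean overlap; 0.0 when the model selected no answers.
    return sum(scores) / len(scores) if scores else 0.0
```

Passing the model as a plain callable keeps the sketch independent of any vendor SDK, so the same function can wrap GPT-3.5, GPT-4, or PaLM-2 clients.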
Key Findings
Across the FRANK benchmark, LLM-based factuality metrics mostly show near-zero correlation with human judgments.
GPT-3.5 shows modest positive Spearman correlation for two error types: predicate and circumstance errors.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| QA-based evaluator: PaLM-2 Spearman (overall factuality) | -0.0632 (p=0.0121) | — | — | FRANK (overall) | Table 2 shows a small negative Spearman correlation that is statistically significant; the effect size is tiny. | Table 2 |
| Direct scoring: GPT-3.5 Spearman (Predicate Errors) | 0.3337 (p<0.0001) | — | — | FRANK (PredE) | Table 3 reports a modest positive Spearman correlation for GPT-3.5 on PredE. | Table 3 |
What To Try In 7 Days
Run a small validation: compare your LLM evaluator scores to human labels on 100 domain examples before automating.
If using LLM scores, treat them as flags for human review rather than final verdicts.
Measure partial correlations (control for dataset/system) to spot spurious high correlations.
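The partial-correlation check can be sketched in a few lines. The version below is an assumption-laden illustration rather than the study's procedure: it handles a single categorical control (for example the summarization system), rank-transforms both variables, subtracts each group's mean rank, and correlates the residuals. The function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from typing import Hashable, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def center_within_groups(values: Sequence[float],
                         groups: Sequence[Hashable]) -> List[float]:
    """Subtract each group's mean (same as regressing out group dummies)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def partial_spearman(metric: Sequence[float], human: Sequence[float],
                     groups: Sequence[Hashable]) -> float:
    """Spearman correlation of metric vs. human labels, controlling for group."""
    rx = center_within_groups(average_ranks(metric), groups)
    ry = center_within_groups(average_ranks(human), groups)
    return pearson(rx, ry)
```

If the pooled Spearman coefficient is clearly positive but `partial_spearman` drops toward zero or flips sign, the pooled agreement is probably driven by differences between systems or datasets rather than by the evaluator tracking human judgments within them.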
Risks & Boundaries
Limitations
Evaluation limited to three closed-source LLMs and a single benchmark (FRANK).
Models are closed-source and change over time; results may vary with newer model versions.
When Not To Use
Do not use these LLM scores as the sole factuality gate for high-stakes content.
Avoid deploying direct LLM judgment in unfamiliar domains without labeled validation data.
Failure Modes
Tiny but statistically significant coefficients that have no practical value.
Negative correlations for some error types: the model's faithfulness score rises as the number of human-annotated errors rises.
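To see why a statistically significant coefficient can still be practically useless, it helps to square it. Since Spearman's coefficient is a Pearson correlation on ranks, its square gives the share of rank variance the two rankings have in common. The snippet below applies this to the two coefficients reported in the Results table; the helper name is hypothetical.

```python
def shared_rank_variance(rho: float) -> float:
    """Share of rank variance in common between two rankings (rho squared)."""
    return rho ** 2


# Coefficients taken from the Results table above.
palm2_qa_overall = shared_rank_variance(-0.0632)   # about 0.004, i.e. 0.4%
gpt35_direct_prede = shared_rank_variance(0.3337)  # about 0.111, i.e. 11%

print(f"PaLM-2 QA (overall): {palm2_qa_overall:.4f}")
print(f"GPT-3.5 direct (PredE): {gpt35_direct_prede:.4f}")
```

Even the strongest signal reported here leaves roughly nine tenths of the rank variance in human judgments unaccounted for, which is why the scores are better treated as review flags than as verdicts.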

