LLMs (GPT-3.5, GPT-4, PaLM-2) do not reliably judge factuality on the FRANK benchmark

Overview

Decision Snapshot: Needs Validation

The study uses a public benchmark (FRANK) and partial correlations, producing clear but small effect sizes; evidence is solid for the claim that off-the-shelf LLMs are not reliable factuality judges on this benchmark.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 20%

Novelty: 30%

Authors

Xue-Yong Fu, Md Tahmid Rahman Laskar, Cheng Chen, Shashi Bhushan TN

Links

Abstract / PDF / Data

Why It Matters For Business

LLMs are not yet reliable as standalone tools for automatic fact-checking of summaries. Used as the sole grader, they can miss errors or, where correlations with human judgments are negative, score flawed summaries higher, which risks incorrect decisions.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

The paper tests whether GPT-3.5, GPT-4, and PaLM-2 can judge the factual consistency of summaries. The authors try two modes: (1) a single-LLM QA pipeline in which one model selects answers, generates questions, and answers them, with answer overlap used as the score, and (2) direct 1–5 faithfulness scoring. Using the FRANK benchmark and partial correlations to control for confounders, they find near-zero correlations with human judgments for most models and error types. GPT-3.5 shows modest positive signals for predicate and circumstance errors (Spearman ≈ 0.33–0.37). Overall, current LLMs should not be treated as reliable automatic factuality judges without domain validation.

Problem Statement

Can current LLMs (GPT-3.5, GPT-4, PaLM-2) reliably evaluate the factual consistency of model-generated summaries? Prior automatic metrics show poor agreement with humans; this study tests single-LLM QA evaluation and direct scoring on the FRANK benchmark while controlling for confounders with partial correlation.

Main Contribution

Propose a single-LLM QA pipeline that uses one model to do answer selection, question generation, and question answering for factuality checks.

Evaluate GPT-3.5, GPT-4, and PaLM-2 as (a) QA-based factuality evaluators and (b) direct 1–5 faithfulness scorers on the FRANK benchmark.
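The QA-based mode can be pictured with a short sketch. This digest does not give the paper's prompts or aggregation rule, so everything below is an assumption for illustration: the three callables (`select_answers`, `generate_question`, `answer_question`) stand in for calls to a single LLM, the per-question scores are simply averaged, and word F1 is computed on lowercased, whitespace-split tokens.

```python
from collections import Counter
from typing import Callable, List


def word_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two answer strings (lowercased, whitespace split)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def qa_factuality_score(
    source: str,
    summary: str,
    select_answers: Callable[[str], List[str]],    # hypothetical LLM call
    generate_question: Callable[[str, str], str],  # hypothetical LLM call
    answer_question: Callable[[str, str], str],    # hypothetical LLM call
) -> float:
    """Average word F1 between summary answers and source-grounded answers."""
    answers = select_answers(summary)
    if not answers:
        return 0.0  # arbitrary choice for this sketch
    scores = []
    for expected in answers:
        question = generate_question(summary, expected)
        found = answer_question(source, question)
        scores.append(word_f1(found, expected))
    return sum(scores) / len(scores)
```

Passing the model calls in as functions keeps the scoring logic testable with stubs before any API is wired up.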

Key Findings

Across the FRANK benchmark, LLM-based factuality metrics mostly show near-zero correlation with human judgments.

Numbers: Most Pearson/Spearman coefficients ≈ 0; none > 0.3 for GPT-4 and PaLM-2 on evaluated splits

Practical Use: Do not replace human factuality checks with these LLM scores; validate any automatic judge on your domain before trusting it.

Evidence Ref: Tables 2–3 (partial Pearson/Spearman on FRANK)

GPT-3.5 shows modest positive Spearman correlation for two error types: predicate and circumstance errors.

Numbers: Spearman r ≈ 0.3337 (PredE) and 0.3702 (CircE), p ≈ 0.0000 on FRANK

Practical Use: GPT-3.5 may help detect some specific error classes, but the signal is narrow; use it only as an additional triage signal, not as a final judge.

Evidence Ref: Table 3 (GPT-3.5 Spearman for PredE and CircE)
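The Spearman figures in these findings are rank correlations between metric scores and human judgments. A minimal pure-Python sketch of that calculation follows, using average ranks for ties, which matter when scores sit on a coarse 1–5 scale. The function names and sample values are illustrative assumptions, not the paper's code or data, and the sketch omits the partial-correlation control the paper applies.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: LLM 1-5 scores vs. human faithfulness labels.
llm_scores = [5, 4, 4, 2, 1]
human_labels = [1.0, 0.8, 0.9, 0.3, 0.2]
print(spearman(llm_scores, human_labels))
```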

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
QA-based evaluator: PaLM-2 Spearman (overall factuality)	-0.0632 (p=0.0121)	—	—	FRANK (overall)	Table 2 shows small negative Spearman but statistically significant p-value; effect size tiny.	Table 2
Direct scoring: GPT-3.5 Spearman (Predicate Errors)	0.3337 (p≈0.0000)	—	—	FRANK (PredE)	Table 3 reports a moderate Spearman correlation for GPT-3.5 on PredE.	Table 3

What To Try In 7 Days

Run a small validation: compare your LLM evaluator scores to human labels on 100 domain examples before automating.

If using LLM scores, treat them as flags for human review rather than final verdicts.

Measure partial correlations (control for dataset/system) to spot spurious high correlations.
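The last step can be sketched as follows, assuming the confounder is a single categorical label such as the summarization system that produced each summary. Regressing a variable on group indicators leaves residuals equal to the group-demeaned values, so for this case the partial Pearson correlation is the plain correlation of the demeaned scores. All names and numbers here are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from typing import Hashable, List, Sequence


def demean_by_group(values: Sequence[float], groups: Sequence[Hashable]) -> List[float]:
    """Subtract each group's mean, removing between-group differences."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        totals[group] += value
        counts[group] += 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def partial_pearson(metric, human, groups) -> float:
    """Correlation of metric and human scores after controlling for group."""
    return pearson(demean_by_group(metric, groups), demean_by_group(human, groups))


# Made-up data: system B is better overall, but within each system the
# metric ranks summaries in the opposite order to the human labels.
systems = ["A", "A", "A", "B", "B", "B"]
metric = [1, 2, 3, 11, 12, 13]
human = [3, 2, 1, 13, 12, 11]

print(round(pearson(metric, human), 3))   # 0.948 (looks strong)
print(partial_pearson(metric, human, systems))  # -1.0 (agreement vanishes)
```

A partial Spearman variant is commonly obtained by rank-transforming both score lists before applying the same procedure.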

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

FRANK benchmark (Pagnoni et al., 2021) as used in paper

Risks & Boundaries

Limitations

Evaluation limited to three closed-source LLMs and a single benchmark (FRANK).

Models are closed-source and change over time; results may vary with newer model versions.

When Not To Use

Do not use these LLM scores as the sole factuality gate for high-stakes content.

Avoid deploying direct LLM judgment in unfamiliar domains without labeled validation data.

Failure Modes

Tiny but statistically significant coefficients that have no practical value.

Negative correlations for some error types (the model's score rises as the number of human-annotated errors rises).
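A rough way to see why a tiny coefficient has little practical value is to square it, which approximates the share of variance in ranks that the metric accounts for. For the PaLM-2 QA-based Spearman of -0.0632 (p = 0.0121) listed in the Results table:

```latex
r_s^2 = (-0.0632)^2 \approx 0.0040
```

That is about 0.4% of rank variance, despite the p-value falling below 0.05.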

Core Entities

Models

gpt-3.5-turbo-0613, gpt-4-0613, PaLM-2 (text-bison@001)

Metrics

partial correlation, Pearson, Spearman, word F1

Datasets

FRANK, CNN-DM, XSUM

Benchmarks

FRANK
