LLMs (GPT-3.5, GPT-4, PaLM-2) do not reliably judge factuality on the FRANK benchmark

November 1, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

The study uses a public benchmark (FRANK) and partial correlations, producing clear but small effect sizes; evidence is solid for the claim that off-the-shelf LLMs are not reliable factuality judges on this benchmark.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 20%

Novelty: 30%

Authors

Xue-Yong Fu, Md Tahmid Rahman Laskar, Cheng Chen, Shashi Bhushan TN

Links

Abstract / PDF / Data

Why It Matters For Business

LLMs are not yet reliable solo tools for automatic fact-checking of summaries; using them as the sole grader can miss or invert errors and risk incorrect decisions.

Who Should Care

Summary TLDR

The paper tests whether GPT-3.5, GPT-4 and PaLM-2 can judge the factual consistency of summaries. The authors try two modes: (1) a single-LLM QA pipeline that generates answers and questions and checks answer overlap, and (2) direct 1–5 faithfulness scoring. Using the FRANK benchmark and partial correlations to control for confounders, they find near-zero correlations with human judgments for most models and error types. GPT-3.5 shows modest positive signals for predicate and circumstance errors (Spearman ≈ 0.33–0.37). Overall, current LLMs should not be treated as reliable automatic factuality judges without domain validation.

Problem Statement

Can current LLMs (GPT-3.5, GPT-4, PaLM-2) reliably evaluate the factual consistency of model-generated summaries? Prior automatic metrics show poor agreement with humans; this study tests single-LLM QA evaluation and direct scoring on the FRANK benchmark while controlling for confounders with partial correlation.

Main Contribution

Propose a single-LLM QA pipeline that uses one model to do answer selection, question generation, and question answering for factuality checks.

Evaluate GPT-3.5, GPT-4, and PaLM-2 as (a) QA-based factuality evaluators and (b) direct 1–5 faithfulness scorers on the FRANK benchmark.
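The overlap check in the QA pipeline can be illustrated with token-level F1, which appears as "word F1" among the metrics listed for this paper. The following is a minimal sketch, assuming lowercasing and whitespace tokenization; the paper's exact normalization is not stated here, and the function name `word_f1` is illustrative rather than taken from the authors' code.

```python
from collections import Counter


def word_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between two answer strings (assumed normalization)."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()

    # If either answer is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

In a QA-based check of this kind, one answer would come from the source document and the other from the summary; a low F1 for the same question suggests the summary and the source disagree.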

Key Findings

Across the FRANK benchmark, LLM-based factuality metrics mostly show near-zero correlation with human judgments.

Numbers: Most Pearson/Spearman coefficients ≈ 0; none > 0.3 for GPT-4 and PaLM-2 on evaluated splits

Practical Use: Do not replace human factuality checks with these LLM scores; validate any automatic judge on your domain before trusting it.

Evidence Ref: Tables 2–3 (partial Pearson/Spearman on FRANK)

GPT-3.5 shows modest positive Spearman correlation for two error types: predicate and circumstance errors.

Numbers: Spearman r ≈ 0.3337 (PredE) and 0.3702 (CircE), p≈0.0000 on FRANK

Practical Use: GPT-3.5 may help detect some specific error classes, but signal is narrow; use it only as an additional triage signal, not as a final judge.

Evidence Ref: Table 3 (GPT-3.5 Spearman for PredE and CircE)
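Spearman correlation, the statistic quoted in these findings, is the Pearson correlation of the ranks. Because direct 1–5 scoring produces many tied values, ties need average ranks. The sketch below is a plain-Python illustration under that assumption; the helper names are hypothetical, and the paper's own implementation is not shown in this summary.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / sqrt(vx * vy)
```

Passing LLM scores as `x` and human labels as `y` gives a single coefficient per error type, which is how a value such as 0.33 should be read: a rank agreement, not an accuracy.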

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
QA-based evaluator: PaLM-2 Spearman (overall factuality) | -0.0632 (p=0.0121) | (none) | (none) | FRANK (overall) | Table 2 shows small negative Spearman but statistically significant p-value; effect size tiny. | Table 2
Direct scoring: GPT-3.5 Spearman (Predicate Errors) | 0.3337 (p≈0.0000) | (none) | (none) | FRANK (PredE) | Table 3 reports a moderate Spearman correlation for GPT-3.5 on PredE. | Table 3

What To Try In 7 Days

Run a small validation: compare your LLM evaluator scores to human labels on 100 domain examples before automating.

If using LLM scores, treat them as flags for human review rather than final verdicts.

Measure partial correlations (control for dataset/system) to spot spurious high correlations.
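The partial-correlation step in the last item can be sketched as follows. This is a minimal illustration, assuming the confounder is a single categorical variable (for example, which summarization system or dataset produced each summary) and that it is controlled by subtracting per-group means before correlating. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def residualize(values, groups):
    """Subtract each group's mean, removing the categorical confounder."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def partial_pearson(x, y, groups):
    """Correlation between x and y after controlling for group membership."""
    return pearson(residualize(x, groups), residualize(y, groups))


# Toy data (invented): system A gets high scores from both the metric and
# humans, system B gets low scores from both, but within each system the
# metric ranks summaries in the opposite order to humans.
metric = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
human = [0.7, 0.8, 0.9, 0.1, 0.2, 0.3]
system = ["A", "A", "A", "B", "B", "B"]

raw = pearson(metric, human)
controlled = partial_pearson(metric, human, system)
```

On this toy data the raw correlation is strongly positive while the controlled one is strongly negative: the metric only separates the two systems and says nothing useful about individual summaries. That gap is the kind of spurious agreement the partial correlation is meant to expose.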

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

FRANK benchmark (Pagnoni et al., 2021) as used in paper

Risks & Boundaries

Limitations

Evaluation limited to three closed-source LLMs and a single benchmark (FRANK).

Models are closed-source and change over time; results may vary with newer model versions.

When Not To Use

Do not use these LLM scores as the sole factuality gate for high-stakes content.

Avoid deploying direct LLM judgment in unfamiliar domains without labeled validation data.

Failure Modes

Tiny but statistically significant coefficients that have no practical value.

Negative correlations for some error types (the LLM's faithfulness score goes up as human-annotated errors increase).
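The first failure mode can be made concrete: a p-value depends on sample size as well as effect size, so a very weak correlation becomes "significant" once enough examples are scored. The sketch below uses the Fisher z-transformation with a normal approximation, which is a standard approximation and not necessarily the test used in the paper. The coefficient is PaLM-2's reported -0.0632; the sample sizes 100 and 2000 are hypothetical and do not represent the size of FRANK.

```python
from math import atanh, erfc, sqrt


def approx_two_sided_p(r: float, n: int) -> float:
    """Approximate two-sided p-value for a correlation r on n samples.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is treated
    as a standard normal deviate under the null of zero correlation.
    """
    z = atanh(r) * sqrt(n - 3)
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    return erfc(abs(z) / sqrt(2))


r = -0.0632                      # reported PaLM-2 Spearman (overall)
p_small = approx_two_sided_p(r, 100)    # hypothetical sample size
p_large = approx_two_sided_p(r, 2000)   # hypothetical sample size
variance_share = r ** 2          # fraction of rank variance "explained"
```

With the smaller sample the same coefficient is not significant at the 0.05 level, while with the larger one it is, yet in both cases it accounts for well under 1% of the variance. Reading the coefficient first and the p-value second avoids mistaking this for a usable signal.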

Core Entities

Models

gpt-3.5-turbo-0613, gpt-4-0613, PaLM-2 (text-bison@001)

Metrics

partial correlation, Pearson, Spearman, word F1

Datasets

FRANK, CNN-DM, XSUM

Benchmarks

FRANK