LLMs (GPT-3.5, GPT-4, PaLM-2) do not reliably judge factuality on the FRANK benchmark

November 1, 2023 · 6 min read

Overview

Production Readiness

0.2

Novelty Score

0.3

Cost Impact Score

0.2

Citation Count

1

Authors

Xue-Yong Fu, Md Tahmid Rahman Laskar, Cheng Chen, Shashi Bhushan TN

Why It Matters For Business

LLMs are not yet reliable solo tools for automatic fact-checking of summaries; using them as the sole grader can miss or invert errors and risk incorrect decisions.

Summary TLDR

The paper tests whether GPT-3.5, GPT-4, and PaLM-2 can judge the factual consistency of summaries. It tries two modes: (1) a single-LLM QA pipeline in which one model selects answers, generates questions, answers them, and checks answer overlap, and (2) direct 1–5 faithfulness scoring. Using the FRANK benchmark and partial correlations to control for confounders, the authors find near-zero correlations with human judgments for most models and error types. GPT-3.5 shows modest positive signals for predicate and circumstance errors (Spearman ≈ 0.33–0.37). Overall, current LLMs should not be treated as reliable automatic factuality judges without domain validation.
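The overlap check in the QA mode can be illustrated with a word-level F1 between two answers to the same question, one answered from the summary and one from the source article. This is a minimal sketch, assuming a SQuAD-style token overlap with lowercasing and simple word tokenization; the paper's exact normalization is not given here, and the names `word_f1` and `qa_consistency` are hypothetical.

```python
import re
from collections import Counter


def word_f1(predicted: str, reference: str) -> float:
    """Word-level F1 between two answer strings (assumed SQuAD-style overlap)."""
    pred_tokens = re.findall(r"\w+", predicted.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared word at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def qa_consistency(answer_pairs):
    """Mean word F1 over (answer_from_summary, answer_from_source) pairs."""
    scores = [word_f1(a, b) for a, b in answer_pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

A summary whose answers mostly match the source-derived answers would score close to 1 under this sketch; how the answers are produced (the LLM prompts) is outside its scope.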

Problem Statement

Can current LLMs (GPT-3.5, GPT-4, PaLM-2) reliably evaluate the factual consistency of model-generated summaries? Prior automatic metrics show poor agreement with humans; this study tests single-LLM QA evaluation and direct scoring on the FRANK benchmark while controlling for confounders with partial correlation.

Main Contribution

Propose a single-LLM QA pipeline that uses one model to do answer selection, question generation, and question answering for factuality checks.

Evaluate GPT-3.5, GPT-4, and PaLM-2 as (a) QA-based factuality evaluators and (b) direct 1–5 faithfulness scorers on the FRANK benchmark.

Use partial correlation (controls for dataset/system confounders) and report that correlations with human judgments are near zero for most models and error types; GPT-3.5 shows limited signals in two subcategories.
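To make the partial-correlation idea concrete, the sketch below removes a categorical confounder (for example, which summarization system produced each summary) from both the metric scores and the human scores, then correlates what is left. It is an illustration under stated assumptions, not the paper's code: it handles a single categorical confounder by subtracting group means, which is equivalent to regressing on group indicators, and the names `residualize` and `partial_pearson` are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def residualize(values, groups):
    """Subtract each group's mean (same as regressing on group indicators)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def partial_pearson(metric_scores, human_scores, groups):
    """Pearson correlation of metric vs. human scores with group effects removed."""
    x = residualize(metric_scores, groups)
    y = residualize(human_scores, groups)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # no within-group variation to correlate
    return sxy / sqrt(sxx * syy)


# Toy data: system B scores higher than system A on both axes, but within
# each system the metric moves against the human score.
groups = ["A"] * 4 + ["B"] * 4
metric = [1, 2, 3, 4, 11, 12, 13, 14]
human = [4, 3, 2, 1, 14, 13, 12, 11]

raw = partial_pearson(metric, human, ["all"] * len(metric))  # plain Pearson
controlled = partial_pearson(metric, human, groups)
print(f"raw={raw:.3f} controlled={controlled:.3f}")
```

In the toy data the raw correlation is strongly positive only because one system is better overall; once the system effect is removed, the within-system relationship is negative. That is the kind of spurious agreement partial correlation is meant to expose.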

Key Findings

Across the FRANK benchmark, LLM-based factuality metrics mostly show near-zero correlation with human judgments.

Numbers: Most Pearson/Spearman coefficients ≈ 0; none > 0.3 for GPT-4 and PaLM-2 on evaluated splits

GPT-3.5 shows modest positive Spearman correlation for two error types: predicate and circumstance errors.

Numbers: Spearman r ≈ 0.3337 (PredE) and 0.3702 (CircE), p ≈ 0.0000 on FRANK

Some coefficients reach statistical significance even though they are too small to be practically useful.

Numbers: PaLM-2 Spearman for overall factuality = -0.0632 with p = 0.0121
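One rough way to see why a significant p-value does not imply a useful metric is to square the reported coefficients. This reading (the squared rank correlation as the approximate share of variance in human ranks that the metric's ranks account for) is an interpretive assumption added here, not an analysis from the paper; ρ denotes the Spearman coefficient.

```latex
\begin{align*}
\rho_{\text{PaLM-2, overall}}^{2} &= (-0.0632)^{2} \approx 0.0040 \\
\rho_{\text{GPT-3.5, CircE}}^{2}  &= (0.3702)^{2} \approx 0.137
\end{align*}
```

Under that reading, the significant PaLM-2 coefficient corresponds to well under 1% of shared rank variance, and even the strongest GPT-3.5 signal corresponds to roughly 14%.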

Results

QA-based evaluator: PaLM-2 Spearman (overall factuality)

Value: -0.0632 (p = 0.0121)

Direct scoring: GPT-3.5 Spearman (Predicate Errors)

Value: 0.3337 (p ≈ 0.0000)

Direct scoring: GPT-3.5 Spearman (Circumstance Errors)

Value: 0.3702 (p ≈ 0.0000)

Who Should Care

What To Try In 7 Days

Run a small validation: compare your LLM evaluator scores to human labels on 100 domain examples before automating.

If using LLM scores, treat them as flags for human review rather than final verdicts.

Measure partial correlations (control for dataset/system) to spot spurious high correlations.
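The first two steps above can be sketched in plain Python: compute a Spearman correlation between LLM scores and human labels on a small labeled sample, and route low-scoring summaries to human review instead of auto-rejecting them. This is a sketch under assumptions: the function names are hypothetical, the review threshold of 3 on the 1–5 scale is an arbitrary placeholder to tune on your own data, and ties receive average ranks.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return float("nan")  # one side is constant, correlation undefined
    return sxy / sqrt(sxx * syy)


def flag_for_review(llm_scores, threshold=3):
    """Indices of summaries whose LLM score is at or below the threshold."""
    return [i for i, s in enumerate(llm_scores) if s <= threshold]
```

If the correlation on your labeled sample is near zero, as the paper reports for most settings on FRANK, the flags carry little information and the human review step is doing the real work.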

Reproducibility

Data URLs

  • FRANK benchmark (Pagnoni et al., 2021) as used in paper

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation limited to three closed-source LLMs and a single benchmark (FRANK).
  • Models are closed-source and change over time; results may vary with newer model versions.
  • Study focuses on summarization outputs (CNN-DM, XSUM); conclusions may not generalize to other domains.

When Not To Use

  • Do not use these LLM scores as the sole factuality gate for high-stakes content.
  • Avoid deploying direct LLM judgment in unfamiliar domains without labeled validation data.

Failure Modes

  • Tiny but statistically significant coefficients that have no practical value.
  • Negative correlations for some error types (the model's score rises as the number of human-annotated errors increases).
  • High variance across error categories: some error types show signal while most do not.

Core Entities

Models

  • gpt-3.5-turbo-0613
  • gpt-4-0613
  • PaLM-2 (text-bison@001)

Metrics

  • partial correlation
  • Pearson
  • Spearman
  • word F1

Datasets

  • FRANK
  • CNN-DM
  • XSUM

Benchmarks

  • FRANK