Head-to-head fact-check: GPT-4 tops GPT-3.5, Bard, and Bing, but average accuracy is only ~65%

June 18, 2023 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.3

Cost Impact Score

0.25

Citation Count

13

Authors

Kevin Matthe Caramancion

Links

Abstract / PDF

Why It Matters For Business

Off-the-shelf LLMs can flag likely false claims, but on datasets like this one they label only about two-thirds of claims correctly, so firms should pair models with human review to avoid costly mistakes.

Summary TLDR

The author tested four popular LLMs (GPT-3.5, GPT-4, Bard/LaMDA, Bing AI) on 100 fact-checked news items. Models were asked to label items True, False, or Partially True/False. Average accuracy was 65.25/100; GPT-4 scored highest at 71, GPT-3.5 62, Bard 64, Bing 64. Data and results are available on Kaggle. The study shows LLMs can help triage claims but are still far from replacing human fact-checkers, especially on nuance and context.
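
As a quick check on the reported figures, the average can be recomputed from the four per-model scores. This is a minimal sketch; the variable and function names are illustrative and not from the paper.

```python
# Per-model correct counts out of 100 items, as reported in the summary above.
scores = {"GPT-4": 71, "GPT-3.5": 62, "Bard": 64, "Bing AI": 64}


def mean_accuracy(model_scores: dict) -> float:
    """Unweighted mean of the per-model scores."""
    return sum(model_scores.values()) / len(model_scores)


average = mean_accuracy(scores)
best_model = max(scores, key=scores.get)
print(f"average = {average:.2f} / 100, best = {best_model}")
# prints: average = 65.25 / 100, best = GPT-4
```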

Problem Statement

Can current mainstream LLMs correctly classify real news claims as true, false, or partially true/false when judged against independent fact-checkers? The paper measures accuracy under controlled black-box conditions using 100 fact-checked items.

Main Contribution

Controlled black-box comparison of four LLMs on fact-checking.

Evaluation on 100 real news items sourced from independent fact-checkers.

Open release of the experiment data on Kaggle for reproducibility.

Key Findings

Average accuracy across models is moderate.

Numbers: 65.25 / 100 average accuracy

GPT-4 outperformed other tested models.

Numbers: GPT-4 = 71 / 100

Other models had similar, lower scores.

Numbers: GPT-3.5 = 62; Bard = 64; Bing = 64 (each out of 100)

Test items were limited to those published before September 2021.

Numbers: 100 items; items up to Sep 2021

Evaluation used forced three-way label prompting.

Numbers: Choice enforced among three labels: True, False, or Partially True/False
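
Forced three-way labeling can be sketched as a prompt that lists the allowed labels plus a parser that maps the reply back onto them. The prompt wording and the `build_prompt` and `parse_label` helpers below are hypothetical; the paper's exact prompt is not reproduced here.

```python
from typing import Optional

# The three labels named in the study.
LABELS = ("True", "False", "Partially True/False")


def build_prompt(claim: str) -> str:
    """Hypothetical prompt wording that restricts the answer to the three labels."""
    options = ", ".join(LABELS)
    return f"Claim: {claim}\nAnswer with exactly one of: {options}."


def parse_label(reply: str) -> Optional[str]:
    """Map a free-text reply to one of the three labels, or None if unclear."""
    text = reply.strip().lower()
    # Check the compound label first so "partially true" is not read as "true".
    if text.startswith("partially") or text.startswith("partly"):
        return "Partially True/False"
    if text.startswith("true"):
        return "True"
    if text.startswith("false"):
        return "False"
    return None
```

Returning `None` for an unparseable reply keeps off-format answers visible instead of silently counting them as one of the labels.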

Results

Accuracy (average across the four models)

Value: 65.25 / 100

Accuracy (GPT-4)

Value: 71 / 100

Accuracy (GPT-3.5)

Value: 62 / 100

Accuracy (Bard)

Value: 64 / 100

Accuracy (Bing AI)

Value: 64 / 100

Dataset size

Value: 100 items

What To Try In 7 Days

Run a quick in-house evaluation: sample 100 domain-relevant claims and measure accuracy.

Prefer GPT-4 for higher accuracy, but validate outputs with a human fact-checker.

Use the Kaggle dataset as a baseline and adapt prompts to your domain and cutoff dates.
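
An in-house evaluation of this kind reduces to comparing predicted labels with fact-checker labels. The sketch below assumes claims are stored as (claim, gold label) pairs and that `classify` wraps whatever model is being tested; both are illustrative choices, and the sample data are placeholders rather than real claims.

```python
from typing import Callable, Iterable, Tuple


def accuracy(items: Iterable[Tuple[str, str]], classify: Callable[[str], str]) -> float:
    """Share of claims whose predicted label equals the fact-checker label."""
    pairs = list(items)
    if not pairs:
        raise ValueError("need at least one labeled claim")
    correct = sum(1 for claim, gold in pairs if classify(claim) == gold)
    return correct / len(pairs)


# Placeholder examples; replace with about 100 claims from your own domain.
sample = [
    ("placeholder claim A", "True"),
    ("placeholder claim B", "False"),
    ("placeholder claim C", "Partially True/False"),
    ("placeholder claim D", "False"),
]


def always_false(claim: str) -> str:
    """Stub standing in for a real model call."""
    return "False"


print(accuracy(sample, always_false))  # prints 0.5
```

A constant-answer stub like `always_false` is also a useful baseline: a model that cannot beat it on your sample is not adding value.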

Reproducibility

Data Urls

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Small dataset: only 100 news items, which limits statistical power.
  • Items restricted to pre-Sep 2021 to match the models' knowledge cutoff, so results don't cover newer claims.
  • Evaluation metric limited to forced-choice accuracy; no analysis of explanations or confidence.
  • Fact-checker labels are treated as ground truth, though agencies may err.
  • No code release; only the dataset is released, which complicates full reproducibility.
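
To make the statistical-power limitation concrete, a confidence interval can be put around each score. The sketch below uses the Wilson score interval at 95%, which is a choice made here for illustration and not an analysis from the paper; `wilson_interval` is a hypothetical helper.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


gpt4 = wilson_interval(71, 100)
bard = wilson_interval(64, 100)
# Two intervals overlap when each starts before the other ends.
overlap = gpt4[0] <= bard[1] and bard[0] <= gpt4[1]
print(f"GPT-4: {gpt4[0]:.3f} to {gpt4[1]:.3f}")
print(f"Bard:  {bard[0]:.3f} to {bard[1]:.3f}")
print("intervals overlap:", overlap)
```

With 100 items, GPT-4's interval spans roughly 61% to 79% and overlaps the interval for a score of 64, so the ranking should be read with caution. Because all models saw the same items, a paired comparison on per-item results would be more informative than comparing intervals.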

When Not To Use

  • Do not use these raw LLM outputs as sole verification in high-stakes or legal contexts.
  • Avoid relying on these results for post-2021 or time-sensitive claims without web-enabled models.
  • Not suitable where nuanced, multi-source corroboration is required.
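
One way to respect the time boundary is to split claims by publication date before evaluating. The sketch below reads "pre-September 2021" as "before 1 September 2021", which is an assumption; the item layout and the `split_by_cutoff` helper are also illustrative.

```python
from datetime import date

# Assumption: "pre-September 2021" is read as strictly before 1 Sep 2021.
CUTOFF = date(2021, 9, 1)


def split_by_cutoff(items, cutoff=CUTOFF):
    """Split (claim, published_date) pairs into pre-cutoff items and the rest."""
    before = [item for item in items if item[1] < cutoff]
    after = [item for item in items if item[1] >= cutoff]
    return before, after


# Placeholder items, not real claims.
items = [
    ("placeholder claim A", date(2020, 5, 17)),
    ("placeholder claim B", date(2021, 8, 31)),
    ("placeholder claim C", date(2022, 1, 3)),
]
in_scope, out_of_scope = split_by_cutoff(items)
print(len(in_scope), len(out_of_scope))  # prints 2 1
```

Items in `out_of_scope` would need a web-enabled model or direct human checking.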

Failure Modes

  • Hallucinations: confident but incorrect assertions.
  • Misclassification of partially true/false nuance due to forced-choice labeling.
  • Knowledge cutoff blind spots for facts emerging after Sep 2021.
  • Overreliance on single-model outputs leading to missed context or source checks.
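
To reduce reliance on a single model, labels from several models can be compared and disagreements sent to human reviewers first. This is a minimal sketch of that idea, not a procedure from the paper; the `triage` helper and the agreement threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def triage(labels_by_model: Dict[str, str], min_agreement: int = 3) -> Tuple[str, bool]:
    """Return (most common label, priority_review flag).

    priority_review is True when fewer than min_agreement models share
    the most common label.
    """
    counts = Counter(labels_by_model.values())
    label, votes = counts.most_common(1)[0]
    return label, votes < min_agreement


# Placeholder model outputs for a single claim.
outputs = {"GPT-4": "False", "GPT-3.5": "True", "Bard": "False", "Bing AI": "Partially True/False"}
print(triage(outputs))  # prints ('False', True)
```

Agreement among models only lowers review priority. Given accuracy around 65%, agreed labels still need a human check.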

Core Entities

Models

  • GPT-3.5
  • GPT-4.0
  • Bard/LaMDA
  • Bing AI (Prometheus/Sydney)

Metrics

  • Accuracy

Datasets

  • 100 fact-checked news items (independent fact-checkers: PolitiFact, Snopes)
  • Kaggle dataset: LLM Comparative Performance Fake News Detection_v1 (DOI provided)

Context Entities

Models

  • Knowledge cutoff: Sep 2021 (affects GPT-3.5 and GPT-4 behavior)

Metrics

  • Accuracy

Datasets

  • Items limited to pre-September 2021 to match knowledge cutoff