Head-to-head fact-check: GPT-4 tops GPT-3.5, Bard, and Bing, but the four average only ~65%

June 18, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

The paper provides a simple, reproducible comparison and open dataset, but uses a small 100-item sample and a forced-choice setup, limiting generalization to other domains and time-sensitive claims.

Citations: 13

Evidence Strength: 0.60

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 25%

Production readiness: 40%

Novelty: 30%

Authors

Kevin Matthe Caramancion

Why It Matters For Business

Off-the-shelf LLMs can flag likely false claims, but on datasets like this one they label only about two-thirds of items correctly, so firms should pair models with human review to avoid costly mistakes.

Summary TLDR

The author tested four popular LLMs (GPT-3.5, GPT-4, Bard/LaMDA, Bing AI) on 100 fact-checked news items. Models were asked to label items True, False, or Partially True/False. Average accuracy was 65.25/100; GPT-4 scored highest at 71, GPT-3.5 62, Bard 64, Bing 64. Data and results are available on Kaggle. The study shows LLMs can help triage claims but are still far from replacing human fact-checkers, especially on nuance and context.
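The reported average can be checked directly from the four per-model scores. The snippet below is a minimal sketch of that arithmetic; the dictionary name and keys are illustrative, and the numbers are the ones quoted in the summary above.

```python
# Correct answers out of 100 items per model, as quoted in the summary.
scores = {"GPT-4": 71, "GPT-3.5": 62, "Bard": 64, "Bing AI": 64}

# Unweighted mean across the four models.
mean_score = sum(scores.values()) / len(scores)

# Model with the highest score.
best_model = max(scores, key=scores.get)

print(f"mean = {mean_score:.2f} / 100")  # mean = 65.25 / 100
print(f"best = {best_model} ({scores[best_model]} / 100)")
```

The gap between the best model and the mean is under six points, which is worth keeping in mind given the 100-item sample.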

Problem Statement

Can current mainstream LLMs correctly classify real news claims as true, false, or partially true/false when judged against independent fact-checkers? The paper measures accuracy under controlled black-box conditions using 100 fact-checked items.

Main Contribution

Controlled black-box comparison of four LLMs on fact-checking.

Evaluation on 100 real news items sourced from independent fact-checkers.

Key Findings

Average accuracy across models is moderate.

Numbers: 65.25 / 100 average accuracy

Practical Use: Use LLMs for preliminary triage, not final verification; expect ~65% correct on similar items.

Evidence Ref: Abstract; IV Findings

GPT-4 outperformed other tested models.

Numbers: GPT-4 = 71 / 100

Practical Use: Prefer GPT-4 over GPT-3.5 for fact-flagging tasks when accuracy matters, but still verify outputs.

Evidence Ref: Abstract; V Analyses

Results

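| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy (mean of four models) | 65.25 / 100 | not given | not given | 100 fact-checked items (pre-Sep 2021) | Abstract; V Analyses | IV Findings; Kaggle data |
| Accuracy (GPT-4) | 71 / 100 | not given | not given | 100 items | Abstract; V Analyses | IV Findings |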

What To Try In 7 Days

Run a quick in-house evaluation: sample 100 domain-relevant claims and measure accuracy.

Prefer GPT-4 for higher accuracy, but validate outputs with a human fact-checker.

Use the Kaggle dataset as a baseline and adapt prompts to your domain and cutoff dates.
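The in-house evaluation suggested above could look roughly like the following sketch. It is not the paper's code (none was released): `stub_classifier` is a hypothetical stand-in for whatever model call you use, the toy `sample` list is invented for illustration, and the label-parsing rules in `normalize_label` are an assumption about how free-text replies might be mapped onto the three labels.

```python
from typing import Callable, List, Tuple

def normalize_label(raw: str) -> str:
    """Map a free-text model reply onto one of three labels, or 'unparsed'."""
    text = raw.strip().lower()
    # Check the partial label first: it contains both 'true' and 'false'.
    if "partial" in text:
        return "partially true/false"
    if text.startswith("true"):
        return "true"
    if text.startswith("false"):
        return "false"
    return "unparsed"

def evaluate(items: List[Tuple[str, str]], classify: Callable[[str], str]) -> float:
    """Fraction of items whose normalized prediction equals the gold label."""
    correct = sum(normalize_label(classify(claim)) == gold for claim, gold in items)
    return correct / len(items)

def stub_classifier(claim: str) -> str:
    """Hypothetical placeholder for a real model call; always answers 'False.'"""
    return "False."

# Invented toy items; replace with ~100 fact-checked claims from your own domain.
sample = [
    ("Claim A", "false"),
    ("Claim B", "true"),
    ("Claim C", "partially true/false"),
    ("Claim D", "false"),
]

accuracy = evaluate(sample, stub_classifier)
print(f"accuracy = {accuracy:.2f}")
```

Counting unparsed replies as wrong is a deliberately conservative choice; tracking them separately may be more informative if a model often refuses or hedges.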

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Kaggle: LLM Comparative Performance Fake News Detection_v1 (https://doi.org/10.34740/KAGGLE/DSV/5959587)

Risks & Boundaries

Limitations

Small dataset: only 100 news items, which limits statistical power.

Items are restricted to pre-September 2021 to match the models' knowledge cutoff, so the results do not cover newer claims.
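To make the sample-size limitation concrete, one option is to put a confidence interval around each score. The sketch below uses the Wilson score interval at a 95% level; this is a choice made here for illustration, not an analysis from the paper, and the function name is hypothetical. Because all models answered the same 100 items, a paired test would be the more appropriate comparison, so treat overlapping intervals only as a rough indicator.

```python
import math
from typing import Tuple

def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Scores quoted in this summary: GPT-4 = 71/100, GPT-3.5 = 62/100.
gpt4_lo, gpt4_hi = wilson_interval(71, 100)
gpt35_lo, gpt35_hi = wilson_interval(62, 100)

# True when the lower bound for GPT-4 falls inside the GPT-3.5 interval.
intervals_overlap = gpt4_lo <= gpt35_hi

print(f"GPT-4:   [{gpt4_lo:.3f}, {gpt4_hi:.3f}]")
print(f"GPT-3.5: [{gpt35_lo:.3f}, {gpt35_hi:.3f}]")
print(f"overlap: {intervals_overlap}")
```

With only 100 items the two intervals overlap, so the nine-point lead for GPT-4 should be read as suggestive rather than settled.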

When Not To Use

Do not use these raw LLM outputs as sole verification in high-stakes or legal contexts.

Avoid relying on these results for post-2021 or time-sensitive claims without web-enabled models.

Failure Modes

Hallucinations: confident but incorrect assertions.

Misclassification of partially true/false nuance due to forced-choice labeling.
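A single accuracy number hides where the forced-choice setup hurts most. A per-label breakdown, as in this minimal sketch, can show whether the "partially true/false" class is the main source of errors. The `pairs` list is invented toy data, not drawn from the Kaggle file, whose column layout is not described here.

```python
from collections import Counter

# Invented (gold, predicted) pairs for illustration only.
pairs = [
    ("true", "true"),
    ("false", "false"),
    ("false", "true"),
    ("partially true/false", "false"),
    ("partially true/false", "true"),
    ("partially true/false", "partially true/false"),
]

# Count each (gold, predicted) combination: a sparse confusion matrix.
confusion = Counter(pairs)

# Number of items per gold label.
gold_totals = Counter(gold for gold, _ in pairs)

# Share of each gold label that was predicted correctly.
per_label_accuracy = {
    label: confusion[(label, label)] / total
    for label, total in gold_totals.items()
}

for label, acc in per_label_accuracy.items():
    print(f"{label}: {acc:.2f} ({gold_totals[label]} items)")
```

On real data, the off-diagonal counts in `confusion` show which label a nuanced claim tends to be pushed toward.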

Core Entities

Models

GPT-3.5

GPT-4.0

Bard/LaMDA

Bing AI (Prometheus/Sydney)

Metrics

Accuracy

Datasets

100 fact-checked news items (independent fact-checkers: PolitiFact, Snopes)

Kaggle dataset: LLM Comparative Performance Fake News Detection_v1 (DOI provided)

Context Entities

Models

Knowledge cutoff: Sep 2021 (affects GPT-3.5 and GPT-4 behavior)

Metrics

Accuracy

Datasets

Items limited to pre-September 2021 to match knowledge cutoff