Head-to-head fact-check: GPT-4 tops GPT-3.5, Bard, and Bing, but the four models average only ~65%

Overview

Decision Snapshot: Needs Validation

The paper provides a simple, reproducible comparison and open dataset, but uses a small 100-item sample and a forced-choice setup, limiting generalization to other domains and time-sensitive claims.

Citations: 13

Evidence Strength: 0.60

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 25%

Production readiness: 40%

Novelty: 30%

Authors

Kevin Matthe Caramancion

Links

Abstract / PDF / Data

Why It Matters For Business

Off-the-shelf LLMs can flag likely false claims, but they classify only about two-thirds of claims correctly on datasets like this one, so firms should pair models with human review to avoid costly mistakes.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The author tested four popular LLMs (GPT-3.5, GPT-4, Bard/LaMDA, Bing AI) on 100 fact-checked news items. Models were asked to label items True, False, or Partially True/False. Average accuracy was 65.25/100; GPT-4 scored highest at 71, GPT-3.5 62, Bard 64, Bing 64. Data and results are available on Kaggle. The study shows LLMs can help triage claims but are still far from replacing human fact-checkers, especially on nuance and context.
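The reported average can be checked directly against the four per-model scores. The snippet below is a small arithmetic check using only the numbers quoted above; the variable names are illustrative.

```python
# Per-model scores as reported in the summary (correct answers out of 100 items).
scores = {"GPT-4": 71, "GPT-3.5": 62, "Bard": 64, "Bing AI": 64}

# Unweighted mean across the four models.
mean_score = sum(scores.values()) / len(scores)

# Best-scoring model and its margin over the mean.
best_model = max(scores, key=scores.get)
gap_to_mean = scores[best_model] - mean_score

print(f"mean = {mean_score:.2f} / 100")
print(f"best = {best_model}, {gap_to_mean:.2f} points above the mean")
```

The mean works out to 65.25, matching the reported figure, with GPT-4 sitting 5.75 points above it.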

Problem Statement

Can current mainstream LLMs correctly classify real news claims as true, false, or partially true/false when judged against independent fact-checkers? The paper measures accuracy under controlled black-box conditions using 100 fact-checked items.

Main Contribution

Controlled black-box comparison of four LLMs on fact-checking.

Evaluation on 100 real news items sourced from independent fact-checkers.

Key Findings

Average accuracy across models is moderate.

Numbers: 65.25 / 100 average accuracy

Practical Use: Use LLMs for preliminary triage, not final verification; expect ~65% correct on similar items.

Evidence Ref: Abstract; IV Findings

GPT-4 outperformed other tested models.

Numbers: GPT-4 = 71 / 100

Practical Use: Prefer GPT-4 over GPT-3.5 for fact-flagging tasks when accuracy matters, but still verify outputs.

Evidence Ref: Abstract; V Analyses

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy (mean of four models)	65.25 / 100	n/a	n/a	100 fact-checked items (pre-Sep 2021)	Abstract; V Analyses	IV Findings; Kaggle data
Accuracy (GPT-4)	71 / 100	n/a	n/a	100 fact-checked items (pre-Sep 2021)	Abstract; V Analyses	IV Findings

What To Try In 7 Days

Run a quick in-house evaluation: sample 100 domain-relevant claims and measure accuracy.

Prefer GPT-4 for higher accuracy, but validate outputs with a human fact-checker.

Use the Kaggle dataset as a baseline and adapt prompts to your domain and cutoff dates.
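The in-house evaluation suggested above could be sketched roughly as follows. This is a minimal sketch, not the paper's procedure: `normalize`, `evaluate`, and the stub classifier `always_false` are hypothetical names, the sample claims are placeholders, and in practice `classify` would wrap whatever model you are testing.

```python
from collections import Counter
from typing import Callable, Iterable


def normalize(label: str) -> str:
    """Map a free-text verdict onto one of the three forced-choice labels."""
    text = label.strip().lower()
    if text.startswith("partial"):
        return "partially true/false"
    if text.startswith("true"):
        return "true"
    if text.startswith("false"):
        return "false"
    raise ValueError(f"unrecognized label: {label!r}")


def evaluate(
    claims: Iterable[tuple[str, str]],
    classify: Callable[[str], str],
) -> tuple[float, Counter]:
    """Return (accuracy, confusion counts keyed by (gold, predicted))."""
    confusion: Counter = Counter()
    total = 0
    correct = 0
    for claim, gold in claims:
        gold_label = normalize(gold)
        predicted = normalize(classify(claim))
        confusion[(gold_label, predicted)] += 1
        total += 1
        correct += gold_label == predicted
    if total == 0:
        raise ValueError("no claims to evaluate")
    return correct / total, confusion


# Placeholder data: (claim text, fact-checker verdict). Replace with your own sample.
sample = [
    ("Claim A", "True"),
    ("Claim B", "False"),
    ("Claim C", "Partially True/False"),
    ("Claim D", "False"),
]


def always_false(claim: str) -> str:
    """Stub classifier (assumption): stands in for a real model call."""
    return "False"


accuracy, confusion = evaluate(sample, always_false)
print(f"accuracy = {accuracy:.2%}")
for (gold, predicted), count in sorted(confusion.items()):
    print(f"gold={gold!r} predicted={predicted!r}: {count}")
```

Keeping the confusion counts alongside the headline accuracy makes it easier to see whether errors cluster on the "partially true/false" label, which the summary flags as a weak spot.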

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://doi.org/10.34740/KAGGLE/DSV/5959587

Kaggle: LLM Comparative Performance Fake News Detection_v1 (DOI above)

Risks & Boundaries

Limitations

Small dataset: only 100 news items, limits statistical power.

Items restricted to pre-Sep 2021 to match the models' knowledge cutoff, so results don't cover newer claims.
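To give a feel for how much uncertainty a 100-item sample carries, the sketch below computes a 95% Wilson score interval for two of the reported scores. This calculation is not from the paper; it treats each item as an independent trial, which is a simplifying assumption, and `wilson_interval` is an illustrative helper name.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Scores as reported in the summary, out of 100 items each.
for model, correct in {"GPT-4": 71, "GPT-3.5": 62}.items():
    low, high = wilson_interval(correct, 100)
    print(f"{model}: {correct}/100, 95% interval ~ {low:.1%} to {high:.1%}")
```

Under these assumptions the interval for GPT-4 spans roughly 61% to 79% and the one for GPT-3.5 roughly 52% to 71%. The overlap is not a formal significance test, but it suggests that 100 items may be too few to separate the models sharply.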

When Not To Use

Do not use these raw LLM outputs as sole verification in high-stakes or legal contexts.

Avoid relying on these results for post-2021 or time-sensitive claims without web-enabled models.

Failure Modes

Hallucinations: confident but incorrect assertions.

Misclassification of partially true/false nuance due to forced-choice labeling.

Core Entities

Models

GPT-3.5

GPT-4.0

Bard/LaMDA

Bing AI (Prometheus/Sydney)

Metrics

Accuracy

Datasets

100 fact-checked news items (independent fact-checkers: PolitiFact, Snopes)

Kaggle dataset: LLM Comparative Performance Fake News Detection_v1 (DOI provided)

Context Entities

Models

Knowledge cutoff: Sep 2021 (affects GPT-3.5 and GPT-4 behavior)

Metrics

Accuracy

Datasets

Items limited to pre-September 2021 to match knowledge cutoff
