Overview
The paper provides a simple, reproducible comparison and open dataset, but uses a small 100-item sample and a forced-choice setup, limiting generalization to other domains and time-sensitive claims.
Citations: 13
Evidence Strength: 0.60
Confidence: 0.90
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/6
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 25%
Production readiness: 40%
Novelty: 30%
Why It Matters For Business
Off-the-shelf LLMs can flag likely false claims, but they label only about two-thirds of items correctly on datasets like this one, so firms should pair models with human review to avoid costly mistakes.
Who Should Care
Summary TLDR
The author tested four popular LLMs (GPT-3.5, GPT-4, Bard/LaMDA, Bing AI) on 100 fact-checked news items. Models were asked to label each item True, False, or Partially True/False. Average accuracy was 65.25/100: GPT-4 scored highest at 71, Bard and Bing AI each scored 64, and GPT-3.5 scored 62. Data and results are available on Kaggle. The study shows LLMs can help triage claims but are still far from replacing human fact-checkers, especially on nuance and context.
Problem Statement
Can current mainstream LLMs correctly classify real news claims as true, false, or partially true/false when judged against independent fact-checkers? The paper measures accuracy under controlled black-box conditions using 100 fact-checked items.
Main Contribution
Controlled black-box comparison of four LLMs on fact-checking.
Evaluation on 100 real news items sourced from independent fact-checkers.
Key Findings
Average accuracy across the four models is moderate: 65.25/100.
GPT-4 outperformed the other tested models, scoring 71/100 against 62 to 64 for the rest.
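The headline numbers can be checked with a few lines of arithmetic. The sketch below only re-derives the mean and the top scorer from the four per-model scores reported in the summary; the helper names `mean_score` and `best_model` are illustrative and do not come from the paper.

```python
# Per-model scores out of 100, as reported in the summary.
scores = {"GPT-3.5": 62, "GPT-4": 71, "Bard": 64, "Bing AI": 64}


def mean_score(scores):
    """Arithmetic mean of the per-model scores."""
    return sum(scores.values()) / len(scores)


def best_model(scores):
    """Name of the model with the highest score."""
    return max(scores, key=scores.get)


print(mean_score(scores))  # 65.25
print(best_model(scores))  # GPT-4
```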
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy (mean of four models) | 65.25 / 100 | — | — | 100 fact-checked items (pre-Sep 2021) | Abstract; V Analyses | IV Findings; Kaggle data |
| Accuracy (GPT-4) | 71 / 100 | — | — | 100 items | Abstract; V Analyses | IV Findings |
What To Try In 7 Days
Run a quick in-house evaluation: sample 100 domain-relevant claims and measure accuracy.
Prefer GPT-4 for higher accuracy, but validate outputs with a human fact-checker.
Use the Kaggle dataset as a baseline and adapt prompts to your domain and cutoff dates.
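An in-house evaluation along these lines could be scored with a short script. This is a minimal sketch, assuming gold labels and model outputs have already been collected as parallel lists using the three labels from the study; the function name `evaluate` and the toy data are hypothetical, and the paper's own scoring procedure may differ.

```python
from collections import Counter

# The three labels used in the study's forced-choice setup.
LABELS = ("True", "False", "Partially True/False")


def evaluate(gold, predicted):
    """Return (overall accuracy, per-label accuracy) for parallel label lists."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    for label in list(gold) + list(predicted):
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label!r}")
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)
    overall = sum(hits.values()) / len(gold)
    # Per-label accuracy shows whether errors cluster on the nuanced class.
    per_label = {label: hits[label] / totals[label] for label in totals}
    return overall, per_label


# Toy data, made up for illustration only.
gold = ["True", "False", "Partially True/False", "False"]
predicted = ["True", "False", "False", "False"]
overall, per_label = evaluate(gold, predicted)
print(overall)
print(per_label)
```

Reporting per-label accuracy next to the overall figure helps show whether misses are concentrated on Partially True/False items, which the failure modes below single out.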
Reproducibility
Data URLs
Risks & Boundaries
Limitations
Small dataset: only 100 news items, which limits statistical power.
Items are restricted to pre-Sep 2021 to match the models' knowledge cutoff, so results do not cover newer claims.
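To make the sample-size limitation concrete, one option is to put a confidence interval around each reported accuracy. The sketch below uses the Wilson score interval at a 95% level (z = 1.96); this is an illustrative choice made here, not an analysis taken from the paper, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (default roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# GPT-4's reported score: 71 correct out of 100 items.
low, high = wilson_interval(71, 100)
print(f"{low:.3f} to {high:.3f}")
```

For 71 out of 100 the interval spans roughly 0.61 to 0.79, which is wide enough that the gaps between the four models should be read with caution.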
When Not To Use
Do not use these raw LLM outputs as sole verification in high-stakes or legal contexts.
Avoid relying on these results for post-2021 or time-sensitive claims without web-enabled models.
Failure Modes
Hallucinations: confident but incorrect assertions.
Misclassification of partially true/false nuance due to forced-choice labeling.

