Overview
Production Readiness
0.4
Novelty Score
0.3
Cost Impact Score
0.25
Citation Count
13
Why It Matters For Business
Off-the-shelf LLMs can flag likely false claims, but the models tested here classified only about two-thirds of 100 fact-checked items correctly, so firms should pair models with human review to avoid costly mistakes.
Summary TLDR
The author tested four popular LLMs (GPT-3.5, GPT-4, Bard/LaMDA, Bing AI) on 100 fact-checked news items. Models were asked to label each item True, False, or Partially True/False. Average accuracy was 65.25 out of 100; GPT-4 scored highest at 71, followed by Bard and Bing AI at 64 each and GPT-3.5 at 62. Data and results are available on Kaggle. The study shows LLMs can help triage claims but are still far from replacing human fact-checkers, especially on nuance and context.
Problem Statement
Can current mainstream LLMs correctly classify real news claims as true, false, or partially true/false when judged against independent fact-checkers? The paper measures accuracy under controlled black-box conditions using 100 fact-checked items.
Main Contribution
Controlled black-box comparison of four LLMs on fact-checking.
Evaluation on 100 real news items sourced from independent fact-checkers.
Open release of the experiment data on Kaggle for reproducibility.
Key Findings
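Average accuracy across the four models was moderate: 65.25 out of 100.
GPT-4 outperformed the other tested models, scoring 71.
The other models scored lower and close together: Bard and Bing AI at 64 each, GPT-3.5 at 62.
Test items were limited to those from before September 2021.
Evaluation used forced three-way label prompting (True, False, or Partially True/False).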
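The forced three-way labeling step can be sketched as follows. This is a minimal illustration, not the author's prompt: the prompt wording, the function names `build_prompt` and `parse_label`, and the parsing rule are all assumptions, since the summary does not reproduce the exact text sent to each model.

```python
import re
from typing import Optional

MIXED = "partially true/false"

def build_prompt(claim: str) -> str:
    """Build a forced-choice prompt (hypothetical wording, not the paper's)."""
    return (
        "Classify the following news claim as exactly one of: "
        "True, False, or Partially True/False.\n"
        f"Claim: {claim}\n"
        "Answer with the label only."
    )

def parse_label(reply: str) -> Optional[str]:
    """Map a free-text model reply onto one of the three labels.

    Only the leading word of the reply is inspected. Replies that do not
    start with a recognizable label return None so a human can review them.
    """
    match = re.match(r"\W*(partially|partly|true|false)\b", reply.strip().lower())
    if match is None:
        return None
    word = match.group(1)
    return MIXED if word in ("partially", "partly") else word
```

Returning `None` for unparseable replies, rather than guessing, keeps ambiguous answers out of the accuracy count until someone has looked at them.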
Results
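Accuracy (GPT-4): 71/100
Accuracy (GPT-3.5): 62/100
Accuracy (Bard/LaMDA): 64/100
Accuracy (Bing AI): 64/100
Accuracy (average of the four models): 65.25/100
Dataset size: 100 news items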
Who Should Care
What To Try In 7 Days
Run a quick in-house evaluation: sample 100 domain-relevant claims and measure accuracy.
Prefer GPT-4 for higher accuracy, but validate outputs with a human fact-checker.
Use the Kaggle dataset as a baseline and adapt prompts to your domain and cutoff dates.
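A quick in-house evaluation along these lines could score each model as sketched below. The row layout `(model, predicted_label, gold_label)`, the function name `accuracy_by_model`, and the toy rows are assumptions for illustration; they do not reflect the column names of the Kaggle dataset.

```python
from collections import defaultdict

def accuracy_by_model(rows):
    """Return {model: fraction of items labeled correctly}.

    rows: iterable of (model, predicted_label, gold_label) tuples.
    A predicted_label of None (unparseable reply) counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for model, predicted, gold in rows:
        total[model] += 1
        if predicted is not None and predicted.strip().lower() == gold.strip().lower():
            correct[model] += 1
    return {model: correct[model] / total[model] for model in total}

# Toy rows (made up for illustration, not taken from the study).
rows = [
    ("model_a", "true", "True"),
    ("model_a", "false", "Partially True/False"),
    ("model_b", "false", "False"),
    ("model_b", None, "True"),
]
scores = accuracy_by_model(rows)
```

With 100 domain-relevant claims per model, the same function yields figures directly comparable to the per-model scores reported in the study.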
Reproducibility
Data Urls
- https://doi.org/10.34740/KAGGLE/DSV/5959587
- Kaggle: LLM Comparative Performance Fake News Detection_v1 (DOI above)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Small dataset: only 100 news items, which limits statistical power.
- Items restricted to pre-September 2021 to match the models' knowledge cutoff, so results do not cover newer claims.
- Evaluation metric limited to forced-choice accuracy; no analysis of explanations or confidence.
- Fact-checker labels treated as ground truth though agencies may err.
- No code release; only the dataset was released, which complicates full reproducibility.
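To make the statistical-power limitation concrete, the sketch below computes 95% Wilson score intervals for the two extreme scores reported (71 and 62 correct out of 100). The Wilson interval is a standard choice for binomial proportions and is used here as an assumption; the study itself is not described as reporting confidence intervals.

```python
import math

def wilson_interval(correct: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

gpt4_interval = wilson_interval(71, 100)   # GPT-4: 71 of 100 correct
gpt35_interval = wilson_interval(62, 100)  # GPT-3.5: 62 of 100 correct

# The intervals overlap if the lower bound of the higher score
# falls below the upper bound of the lower score.
intervals_overlap = gpt4_interval[0] < gpt35_interval[1]
```

Under these assumptions the intervals come out at roughly 0.61 to 0.79 for GPT-4 and 0.52 to 0.71 for GPT-3.5, so they overlap: with 100 items, the gap between the best and worst model is suggestive rather than conclusive.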
When Not To Use
- Do not use these raw LLM outputs as sole verification in high-stakes or legal contexts.
- Avoid relying on these results for post-2021 or time-sensitive claims without web-enabled models.
- Not suitable where nuanced, multi-source corroboration is required.
Failure Modes
- Hallucinations: confident but incorrect assertions.
- Misclassification of partially true/false nuance due to forced-choice labeling.
- Knowledge cutoff blind spots for facts emerging after Sep 2021.
- Overreliance on single-model outputs leading to missed context or source checks.
Core Entities
Models
- GPT-3.5
- GPT-4
- Bard/LaMDA
- Bing AI (Prometheus/Sydney)
Metrics
- Accuracy
Datasets
- 100 fact-checked news items (independent fact-checkers: PolitiFact, Snopes)
- Kaggle dataset: LLM Comparative Performance Fake News Detection_v1 (DOI provided)
Context Entities
Models
- Knowledge cutoff: Sep 2021 (affects GPT-3.5 and GPT-4 behavior)
Metrics
- Accuracy
Datasets
- Items limited to pre-September 2021 to match knowledge cutoff

