Overview
The dataset and broad evaluation are well documented and reproducible; however, high refusal rates and language gaps reduce immediate production readiness.
Citations: 1
Evidence Strength: 0.85
Confidence: 0.86
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 0/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 55%
Why It Matters For Business
LLMs can help scale fact-checking, but they often misclassify claims framed as factual statements and decline to judge many others; companies should not rely on LLM-only pipelines for high-stakes verification.
Who Should Care
Summary TLDR
The authors release FactSpan, a 61,514-claim multilingual fact-checking dataset (30 languages, 2007–2024), and use it to test five LLMs (GPT-4o, GPT-3.5 Turbo, LLaMA 3.1 70B and 8B, Mixtral 8x7B). GPT-4o achieves the highest binary accuracy (73.31%) but declines to judge 43% of claims. Across models, claims framed as factual statements are far harder to classify than opinions (for example, GPT-3.5 Turbo has an error rate of 41.2% on facts vs 21.3% on opinions). Performance also varies strongly by language and topic. The paper warns against blind deployment of LLM-only fact-checkers and provides the dataset and code on Zenodo/GitHub.
Problem Statement
Current LLM fact-check evaluations focus narrowly on English and a few topics, leaving open whether models generalize across languages, topics, and claim styles. This paper asks: which claim features (language, topic, factual vs opinion, structure) influence LLM fact-checking accuracy in realistic multilingual data?
Main Contribution
FactSpan: a dynamically extensible multilingual fact-checking dataset with 61,514 verifiable, text-only claims across 30 languages and five topics.
A head-to-head evaluation of five LLMs (GPT-4o, GPT-3.5 Turbo, LLaMA 3.1 70B and 8B, Mixtral 8x7B) on the dataset, reporting accuracy and refusal rates per language and for claims dated before vs after each model's training cutoff.
Key Findings
Large multilingual dataset released: FactSpan contains 61,514 claims in 30 languages.
GPT-4o achieves the highest binary accuracy but often declines to judge claims.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 73.31% (GPT-4o) | — | — | FactSpan (61,514 claims) | GPT-4o had highest binary accuracy across evaluated models | Section 4.1 |
| No-verdict rate (worst coverage) | 43.02% (GPT-4o) | — | — | FactSpan | GPT-4o declined to judge a large share of claims | Abstract; Section 4.1 |
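The two rows above interact: a high accuracy matters less when many claims receive no verdict. The sketch below is illustrative arithmetic only. It assumes the reported accuracy is computed over answered claims and that both figures share the same denominator, neither of which this summary confirms; the function name is hypothetical.

```python
def correct_share_of_all_claims(accuracy_on_answered: float,
                                no_verdict_rate: float) -> float:
    """Fraction of ALL claims that end up with a correct verdict.

    ASSUMPTION: accuracy is measured only on claims the model agreed to
    judge. If the source computes accuracy differently, this product
    does not apply.
    """
    if not (0.0 <= accuracy_on_answered <= 1.0):
        raise ValueError("accuracy_on_answered must be in [0, 1]")
    if not (0.0 <= no_verdict_rate <= 1.0):
        raise ValueError("no_verdict_rate must be in [0, 1]")
    return accuracy_on_answered * (1.0 - no_verdict_rate)


# Figures from the table above (GPT-4o): 73.31% accuracy, 43.02% no-verdict.
share = correct_share_of_all_claims(0.7331, 0.4302)
print(f"{share:.1%} of all claims would receive a correct verdict")
```

Under that assumption, well under half of all claims would receive a correct automated verdict, which is why coverage needs to be tracked alongside accuracy.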
What To Try In 7 Days
Run your fact-checking pipeline on a sampled subset of FactSpan to measure per-language performance.
Mark claims framed as 'factual' and route them for evidence retrieval or human review.
Set a no-verdict threshold (e.g., >10%) to trigger manual escalation and log reasons for refusals.
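The three steps above could be prototyped roughly as follows. This is a minimal sketch, not the authors' code: the record fields (`language`, `label`, `prediction`, `framing`), the use of `None` to mark a refusal, and the function names are all assumptions made for illustration, and the 10% threshold is simply the example value from the list.

```python
from collections import defaultdict

NO_VERDICT = None  # hypothetical marker: the model declined to judge


def per_language_report(records, no_verdict_threshold=0.10):
    """Compute accuracy on answered claims and no-verdict rate per language.

    `records` is an iterable of dicts with hypothetical keys
    'language', 'label', and 'prediction' (None means refusal).
    """
    stats = defaultdict(lambda: {"total": 0, "answered": 0, "correct": 0})
    for r in records:
        s = stats[r["language"]]
        s["total"] += 1
        if r["prediction"] is NO_VERDICT:
            continue  # refusals count toward total but not toward accuracy
        s["answered"] += 1
        if r["prediction"] == r["label"]:
            s["correct"] += 1

    report = {}
    for lang, s in stats.items():
        no_verdict_rate = 1 - s["answered"] / s["total"]
        accuracy = s["correct"] / s["answered"] if s["answered"] else None
        report[lang] = {
            "accuracy_on_answered": accuracy,
            "no_verdict_rate": no_verdict_rate,
            # Flag the language for manual escalation above the threshold.
            "escalate": no_verdict_rate > no_verdict_threshold,
        }
    return report


def route(record):
    """Send refusals and claims framed as factual to evidence/human review."""
    if record["prediction"] is NO_VERDICT or record.get("framing") == "factual":
        return "review"
    return "auto"
```

A usage example on toy data (the values are invented purely to exercise the functions):

```python
records = [
    {"language": "en", "label": True, "prediction": True, "framing": "opinion"},
    {"language": "en", "label": False, "prediction": None, "framing": "factual"},
    {"language": "de", "label": True, "prediction": False, "framing": "factual"},
]
print(per_language_report(records))
print([route(r) for r in records])
```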
Risks & Boundaries
Limitations
Date as proxy for novelty: post-cutoff claims can reference earlier events, complicating interpretation.
Dataset skew: languages and topics biased toward active fact-checking communities.
When Not To Use
As a fully automated, unsupervised fact-checking system for high-stakes decisions.
For claims that require image/video/audio evidence.
Failure Modes
Refusal/coverage gaps: high no-verdict rates reduce automation gains.
Surface heuristics: models may rely on tone or plausibility over evidence.

