Overview
Production Readiness
0.4
Novelty Score
0.55
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
LLMs can help scale fact-checking, but they often misclassify claims framed as facts and decline to judge many others; companies should not rely on LLM-only pipelines for high-stakes verification.
Summary TLDR
The authors release FactSpan, a 61,514-claim multilingual fact-checking dataset (30 languages, 2007–2024), and use it to test five LLMs (GPT-4o, GPT-3.5 Turbo, LLaMA 3.1 70B/8B, Mixtral 8x7B). GPT-4o gets the highest binary accuracy (73.31%) but refuses to judge 43% of claims. Across models, claims framed as factual statements are far harder to classify than opinions; for example, GPT-3.5 has an error rate of 41.2% on facts versus 21.3% on opinions. Performance also varies strongly by language and topic. The paper warns against blind deployment of LLM-only fact-checkers and provides the dataset and code on Zenodo/GitHub.
Problem Statement
Current LLM fact-check evaluations focus narrowly on English and a few topics, leaving open whether models generalize across languages, topics, and claim styles. This paper asks: which claim features (language, topic, factual vs opinion, structure) influence LLM fact-checking accuracy in realistic multilingual data?
Main Contribution
FactSpan: a dynamically extensible multilingual fact-checking dataset with 61,514 verifiable, text-only claims across 30 languages and five topics.
A head-to-head evaluation of five LLMs (GPT-4o, GPT-3.5 Turbo, LLaMA 3.1 70B & 8B, Mixtral 8x7B) on the dataset, reporting accuracy and refusal rates per language and pre/post training cutoff.
A claim-feature analysis showing consistent patterns of misclassification, notably that factual statements are more error-prone than opinion statements, plus a logistic regression identifying language, claim age, and label complexity as predictors of errors.
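The logistic-regression step in the last contribution can be illustrated with a minimal sketch. Everything below is hypothetical: the two binary features, the toy rows, and the `fit_logistic` helper are invented for illustration and are not the paper's data, features, or code. The point is only to show how a positive coefficient marks a feature associated with more misclassifications.

```python
import math

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent.

    Returns [intercept, w_1, ..., w_k] for k features per row.
    """
    n = len(rows)
    weights = [0.0] * (len(rows[0]) + 1)  # index 0 holds the intercept
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
            residual = 1.0 / (1.0 + math.exp(-z)) - y  # predicted minus observed
            grad[0] += residual
            for j, v in enumerate(x):
                grad[j + 1] += residual * v
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights

# Toy rows (invented): (framed_as_fact, complex_label)
rows = [
    (1, 1), (1, 1), (1, 1),
    (1, 0), (1, 0), (1, 0),
    (0, 1), (0, 1),
    (0, 0), (0, 0), (0, 0), (0, 0),
]
# 1 = the model misclassified the claim, 0 = it classified the claim correctly
errors = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]

intercept, w_fact, w_complex = fit_logistic(rows, errors)
print(f"intercept={intercept:.2f} fact={w_fact:.2f} complex_label={w_complex:.2f}")
```

On real data, a library implementation with standard errors and categorical encoding for language would be the more sensible choice; the hand-rolled fit here only keeps the sketch dependency-free.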
Key Findings
Large multilingual dataset released: FactSpan contains 61,514 claims in 30 languages.
GPT-4o achieves the highest binary accuracy but often declines to judge claims.
GPT-3.5 Turbo shows good accuracy with fewer refusals than GPT-4o.
Open-source Mixtral and LLaMA models show lower accuracy and variable refusal behavior.
Claims framed as facts are much harder to classify than opinion claims.
Language and topic strongly influence error rates.
Closed-source models classify some post-cutoff claims correctly, but may be relying on surface heuristics rather than evidence.
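A pre- versus post-cutoff comparison like the one behind the last finding could be sketched as follows. The cutoff date, the record fields (`claim_date`, `gold`, `predicted`), and the sample records are assumptions made for this example, not values from the paper; `predicted` is `None` when the model gave no verdict.

```python
from datetime import date

# Hypothetical cutoff: substitute the documented training cutoff of the model under test.
ASSUMED_CUTOFF = date(2023, 10, 1)

def accuracy_by_cutoff(records, cutoff=ASSUMED_CUTOFF):
    """Return accuracy over judged claims, split by claim date relative to the cutoff."""
    buckets = {"pre_cutoff": [0, 0], "post_cutoff": [0, 0]}  # [correct, judged]
    for r in records:
        if r["predicted"] is None:  # no verdict: excluded from accuracy
            continue
        key = "post_cutoff" if r["claim_date"] > cutoff else "pre_cutoff"
        buckets[key][1] += 1
        buckets[key][0] += int(r["predicted"] == r["gold"])
    return {k: (c / n if n else None) for k, (c, n) in buckets.items()}

# Invented sample records for illustration only.
sample = [
    {"claim_date": date(2021, 5, 3), "gold": True, "predicted": True},
    {"claim_date": date(2022, 11, 20), "gold": False, "predicted": True},
    {"claim_date": date(2024, 2, 14), "gold": False, "predicted": False},
    {"claim_date": date(2024, 6, 1), "gold": True, "predicted": None},
]
split = accuracy_by_cutoff(sample)
print(split)
```

A post-cutoff date does not guarantee that a claim is new to the model, since later claims can reference earlier events, so a split like this is best read as a rough proxy.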
Results
Accuracy: 73.31% binary accuracy for GPT-4o, the highest of the five models
No-verdict rate (worst coverage)
Error rate on factual vs opinion claims (GPT-3.5): 41.2% vs 21.3%
Language worst performers (error rate)
Who Should Care
What To Try In 7 Days
Evaluate your pipeline on a FactSpan-sampled subset to measure per-language performance.
Flag claims framed as factual statements and route them to evidence retrieval or human review.
Set a no-verdict threshold (e.g., escalate to manual review when more than 10% of claims get no verdict) and log the reasons for refusals.
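The first and third steps above can be combined into a small per-language report. This is a minimal sketch under stated assumptions: the record fields (`lang`, `gold`, `predicted`), the helper name `per_language_report`, and the sample records are hypothetical, and the 10% threshold is just the example value from the list.

```python
from collections import defaultdict

NO_VERDICT_THRESHOLD = 0.10  # the ">10%" example value; tune for your own pipeline

def per_language_report(records):
    """Summarise accuracy on judged claims and no-verdict rate per language."""
    stats = defaultdict(lambda: {"total": 0, "no_verdict": 0, "correct": 0})
    for r in records:
        s = stats[r["lang"]]
        s["total"] += 1
        if r["predicted"] is None:  # the model declined to judge
            s["no_verdict"] += 1
        elif r["predicted"] == r["gold"]:
            s["correct"] += 1
    report = {}
    for lang, s in stats.items():
        judged = s["total"] - s["no_verdict"]
        no_verdict_rate = s["no_verdict"] / s["total"]
        report[lang] = {
            "accuracy_on_judged": s["correct"] / judged if judged else None,
            "no_verdict_rate": no_verdict_rate,
            "escalate": no_verdict_rate > NO_VERDICT_THRESHOLD,
        }
    return report

# Invented sample records for illustration only.
records = [
    {"lang": "en", "gold": True, "predicted": True},
    {"lang": "en", "gold": False, "predicted": False},
    {"lang": "en", "gold": False, "predicted": True},
    {"lang": "en", "gold": True, "predicted": True},
    {"lang": "pt", "gold": False, "predicted": None},
    {"lang": "pt", "gold": True, "predicted": True},
]
report = per_language_report(records)
for lang, row in sorted(report.items()):
    print(lang, row)
```

Reporting accuracy only over judged claims, next to the no-verdict rate, keeps a model that refuses often from looking better than it is on coverage.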
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Date as proxy for novelty: post-cutoff claims can reference earlier events, complicating interpretation.
- Dataset skew: languages and topics biased toward active fact-checking communities.
- Text-only focus: multimedia claims (images, video) excluded.
- Annotation risk: some annotations relied on LLMs and a small manual validation sample.
When Not To Use
- As a fully automated, unsupervised fact-checking system for high-stakes decisions.
- For claims that require image/video/audio evidence.
- For low-resource languages without additional language-specific validation.
Failure Modes
- Refusal/coverage gaps: high no-verdict rates reduce automation gains.
- Surface heuristics: models may rely on tone or plausibility over evidence.
- Language bias: error rates are higher in certain languages, which makes reliability unequal across languages.
- Label-complexity confusion: 'partly true/misleading' cases are frequently misclassified.
Core Entities
Models
- GPT-4o
- GPT-3.5 Turbo
- LLaMA 3.1 70B
- LLaMA 3.1 8B
- Mixtral 8x7B
Metrics
- Accuracy
- no-verdict percentage
- precision
- recall
- F1-score
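For reference, the listed metrics can be computed for a binary true/false task roughly as follows. This sketch assumes that no-verdict responses (`None`) are excluded from accuracy, precision, recall, and F1 and are reported separately as a percentage; the source does not state how the paper handles them, so treat that choice and the `binary_metrics` helper as assumptions.

```python
def binary_metrics(gold, predicted, positive=True):
    """Compute accuracy, precision, recall, F1 over judged claims, plus no-verdict %."""
    judged = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    tp = sum(1 for g, p in judged if p == positive and g == positive)
    fp = sum(1 for g, p in judged if p == positive and g != positive)
    fn = sum(1 for g, p in judged if p != positive and g == positive)
    correct = sum(1 for g, p in judged if g == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(judged) if judged else 0.0,
        "no_verdict_pct": 100.0 * (len(gold) - len(judged)) / len(gold) if gold else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Invented labels for illustration: True = claim is true, None = model gave no verdict.
gold = [True, True, False, False, True, False]
predicted = [True, False, False, True, None, False]
metrics = binary_metrics(gold, predicted)
print(metrics)
```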
Datasets
- FactSpan
- X-Fact
- ClaimReview
- Data Commons Feed
- FEVER (cited)
Benchmarks
- FactSpan evaluation (this paper)

