LLMs make more mistakes on factual-sounding claims than on opinions across 61K multilingual fact-checks

June 4, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The dataset and broad evaluation are well documented and reproducible; however, high refusal rates and language gaps reduce immediate production readiness.

Citations: 1

Evidence Strength: 0.85

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 55%

Authors

Lorraine Saju, Arnim Bleier, Jana Lasser, Claudia Wagner

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs can help scale fact-checking but often miss factual claims and skip many judgments; companies should not rely on LLM-only pipelines for high-stakes verification.

Summary TLDR

The authors release FactSpan, a 61,514-claim multilingual fact-checking dataset (30 languages, 2007–2024), and use it to test five LLMs (GPT-4o, GPT-3.5 Turbo, LLaMA 3.1 70B/8B, Mixtral 8x7B). GPT-4o reaches the highest binary accuracy (73.31%) but declines to judge 43.02% of claims. Across models, claims framed as factual statements are far harder to classify than opinions; for example, GPT-3.5 Turbo's error rate is 41.2% on factual-sounding claims versus 21.3% on opinions. Performance also varies strongly by language and topic. The paper warns against blind deployment of LLM-only fact-checkers and provides the dataset and code on Zenodo/GitHub.

Problem Statement

Current LLM fact-check evaluations focus narrowly on English and a few topics, leaving open whether models generalize across languages, topics, and claim styles. This paper asks: which claim features (language, topic, factual vs opinion, structure) influence LLM fact-checking accuracy in realistic multilingual data?

Main Contribution

FactSpan: a dynamically extensible multilingual fact-checking dataset with 61,514 verifiable, text-only claims across 30 languages and five topics.

A head-to-head evaluation of five LLMs (GPT-4o, GPT-3.5 Turbo, LLaMA 3.1 70B & 8B, Mixtral 8x7B) on the dataset, reporting accuracy and refusal rates per language and pre/post training cutoff.

Key Findings

Large multilingual dataset released: FactSpan contains 61,514 claims in 30 languages.

Numbers: Total claims = 61,514; languages = 30

Practical Use: Use this dataset to evaluate or stress-test multilingual fact-checking systems instead of relying on small, English-only benchmarks.

Evidence Ref: Section 3.1, Table 1

GPT-4o achieves the highest binary accuracy but often declines to judge claims.

Numbers: Accuracy 73.31%; No-verdict 43.02%

Practical Use: GPT-4o is relatively accurate when it answers, but expect large coverage gaps; design systems to route no-verdicts to human reviewers or fallback checks.

Evidence Ref: Abstract; Section 4.1
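The two headline numbers can be computed from per-claim outputs, as in the minimal sketch below. It assumes, hypothetically, that each model response has been normalized to "true", "false", or None (no verdict), and that binary accuracy is measured only over the claims the model actually judged; the paper may define the denominator differently. The function name `summarize` and the label strings are illustrative and are not taken from the authors' code.

```python
from typing import Optional, Sequence


def summarize(gold: Sequence[str], predicted: Sequence[Optional[str]]) -> dict:
    """Return accuracy over answered claims plus coverage statistics.

    Assumption: None marks a refusal / no-verdict response.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Keep only the claims where the model committed to a verdict.
    answered = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    no_verdict = len(predicted) - len(answered)
    correct = sum(1 for g, p in answered if g == p)

    total = len(predicted)
    return {
        "accuracy_when_answered": correct / len(answered) if answered else 0.0,
        "no_verdict_rate": no_verdict / total if total else 0.0,
        "coverage": len(answered) / total if total else 0.0,
    }


# Toy data for illustration only; not taken from FactSpan.
gold = ["true", "false", "false", "true", "false"]
predicted = ["true", None, "false", "false", None]
stats = summarize(gold, predicted)
```

Reporting accuracy and coverage side by side keeps a model from looking strong simply because it skips the claims it finds hard.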

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 73.31% (GPT-4o) | - | - | FactSpan (61,514 claims) | GPT-4o had highest binary accuracy across evaluated models | Section 4.1
No-verdict rate (worst coverage) | 43.02% (GPT-4o) | - | - | FactSpan | GPT-4o declined to judge a large share of claims | Abstract; Section 4.1

What To Try In 7 Days

Run your fact-checking pipeline on a sampled subset of FactSpan to measure per-language performance.

Mark claims framed as 'factual' and route them for evidence retrieval or human review.

Set a no-verdict threshold (e.g., >10%) to trigger manual escalation and log reasons for refusals.
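The first and third steps above can be combined into one small check, sketched here under stated assumptions. Records are hypothetical dictionaries with `language`, `gold`, and `predicted` keys (`predicted` is None for a refusal); the 10% threshold is the example value from the list; and `per_language_report` and `languages_to_escalate` are illustrative names rather than part of any released tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

NO_VERDICT_THRESHOLD = 0.10  # example value from the checklist; tune per use case


def per_language_report(records: Iterable[dict]) -> Dict[str, dict]:
    """Aggregate accuracy (over answered claims) and no-verdict rate per language."""
    counts = defaultdict(lambda: {"total": 0, "answered": 0, "correct": 0})
    for rec in records:
        c = counts[rec["language"]]
        c["total"] += 1
        if rec["predicted"] is not None:
            c["answered"] += 1
            c["correct"] += int(rec["predicted"] == rec["gold"])

    report = {}
    for lang, c in counts.items():
        report[lang] = {
            # Accuracy is undefined when the model refused every claim.
            "accuracy": c["correct"] / c["answered"] if c["answered"] else None,
            "no_verdict_rate": (c["total"] - c["answered"]) / c["total"],
        }
    return report


def languages_to_escalate(
    report: Dict[str, dict], threshold: float = NO_VERDICT_THRESHOLD
) -> List[str]:
    """List languages whose no-verdict rate exceeds the threshold."""
    return sorted(
        lang for lang, s in report.items() if s["no_verdict_rate"] > threshold
    )


# Toy records for illustration only; not taken from FactSpan.
records = [
    {"language": "en", "gold": "true", "predicted": "true"},
    {"language": "en", "gold": "false", "predicted": "false"},
    {"language": "en", "gold": "false", "predicted": "true"},
    {"language": "en", "gold": "true", "predicted": "true"},
    {"language": "de", "gold": "false", "predicted": None},
    {"language": "de", "gold": "true", "predicted": "true"},
]
report = per_language_report(records)
escalate = languages_to_escalate(report)
```

Languages returned by `languages_to_escalate` would go to manual review; logging the raw refusal text alongside each record makes the reasons for refusals easier to audit later.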

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Date as proxy for novelty: post-cutoff claims can reference earlier events, complicating interpretation.

Dataset skew: languages and topics biased toward active fact-checking communities.

When Not To Use

As a fully automated, unsupervised fact-checking system for high-stakes decisions.

For claims that require image/video/audio evidence.

Failure Modes

Refusal/coverage gaps: high no-verdict rates reduce automation gains.

Surface heuristics: models may rely on tone or plausibility over evidence.

Core Entities

Models

GPT-4o, GPT-3.5 Turbo, LLaMA 3.1 70B, LLaMA 3.1 8B, Mixtral 8x7B

Metrics

Accuracy, no-verdict percentage, precision, recall, F1-score

Datasets

FactSpan, X-Fact, ClaimReview, Data Commons Feed, FEVER (cited)

Benchmarks

FactSpan evaluation (this paper)