LLMs make more mistakes on factual-sounding claims than on opinions across 61K multilingual fact-checks

Overview

Decision Snapshot: Needs Validation

The dataset and broad evaluation are well documented and reproducible; however, high refusal rates and language gaps reduce immediate production readiness.

Citations: 1

Evidence Strength: 0.85

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 55%

Authors

Lorraine Saju, Arnim Bleier, Jana Lasser, Claudia Wagner

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs can help scale fact-checking but often miss factual claims and skip many judgments; companies should not rely on LLM-only pipelines for high-stakes verification.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead, CEO

Summary TLDR

The authors release FactSpan, a 61,514-claim multilingual fact-checking dataset (30 languages, 2007–2024), and use it to test five LLMs (GPT-4o, GPT-3.5 Turbo, LLaMA 3.1 70B/8B, Mixtral 8x7B). GPT-4o achieves the highest binary accuracy (73.31%) but declines to judge 43% of claims. Across models, claims framed as factual statements are far harder to classify than opinions (for example, GPT-3.5 has a 41.2% error rate on facts versus 21.3% on opinions). Performance also varies strongly by language and topic. The paper warns against blindly deploying LLM-only fact-checkers and provides the dataset and code on Zenodo and GitHub.

Problem Statement

Current LLM fact-check evaluations focus narrowly on English and a few topics, leaving open whether models generalize across languages, topics, and claim styles. This paper asks: which claim features (language, topic, factual vs opinion, structure) influence LLM fact-checking accuracy in realistic multilingual data?

Main Contribution

FactSpan: a dynamically extensible multilingual fact-checking dataset with 61,514 verifiable, text-only claims across 30 languages and five topics.

A head-to-head evaluation of five LLMs (GPT-4o, GPT-3.5 Turbo, LLaMA 3.1 70B & 8B, Mixtral 8x7B) on the dataset, reporting accuracy and refusal rates per language and pre/post training cutoff.

Key Findings

Large multilingual dataset released: FactSpan contains 61,514 claims in 30 languages.

Numbers: Total claims = 61,514; languages = 30

Practical Use: Use this dataset to evaluate or stress-test multilingual fact-checking systems instead of relying on small, English-only benchmarks.

Evidence Ref: Section 3.1, Table 1

GPT-4o achieves the highest binary accuracy but often declines to judge claims.

Numbers: Accuracy 73.31%; No-verdict 43.02%

Practical Use: GPT-4o is relatively accurate when it answers, but expect large coverage gaps; design systems to route no-verdicts to human reviewers or fallback checks.

Evidence Ref: Abstract; Section 4.1
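
The routing idea in the practical-use note can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the label set, the `Routed` record, and the `route_claim` helper are all hypothetical, and a real system would need to parse whatever output format its prompt actually produces.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label set; the paper's exact prompt and output format are not shown here.
VERDICTS = {"true", "false"}


@dataclass
class Routed:
    claim: str
    verdict: Optional[str]  # "true" / "false", or None when the model gave no verdict
    route: str              # "auto" or "human_review"


def route_claim(claim: str, model_output: str) -> Routed:
    """Accept a clear binary verdict; send everything else to human review."""
    label = model_output.strip().lower()
    if label in VERDICTS:
        return Routed(claim, label, "auto")
    # Refusals, hedges, and unparseable answers all count as no-verdict.
    return Routed(claim, None, "human_review")
```

Treating anything that is not a clean binary label as a no-verdict is a conservative choice: it trades automation for fewer silently wrong answers.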

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	73.31% (GPT-4o)	—	—	FactSpan (61,514 claims)	GPT-4o had highest binary accuracy across evaluated models	Section 4.1
No-verdict rate (worst coverage)	43.02% (GPT-4o)	—	—	FactSpan	GPT-4o declined to judge a large share of claims	Abstract; Section 4.1
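
To see how the two numbers interact, the back-of-the-envelope calculation below may help. It assumes that the reported accuracy is computed only over claims that received a verdict; this summary does not state that explicitly, so check Section 4.1 of the paper before relying on the result.

```python
# Reported GPT-4o figures from the table above.
accuracy_when_answered = 0.7331  # binary accuracy (assumed: over answered claims only)
no_verdict_rate = 0.4302         # share of claims the model declined to judge

coverage = 1.0 - no_verdict_rate                     # share of claims that get a verdict
correct_overall = accuracy_when_answered * coverage  # share of ALL claims judged correctly

print(f"coverage: {coverage:.2%}")
print(f"correct over all claims: {correct_overall:.2%}")
```

Under that assumption, roughly 57% of claims receive a verdict and about 42% of all claims end up with a correct automated verdict, which is why the no-verdict path needs a fallback.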

What To Try In 7 Days

Run a sampled subset of FactSpan through your pipeline to measure per-language performance.

Mark claims framed as 'factual' and route them for evidence retrieval or human review.

Set a no-verdict threshold (e.g., >10%) to trigger manual escalation and log reasons for refusals.
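
The per-language measurement and the no-verdict threshold from the list above could be prototyped along these lines. This is a sketch under stated assumptions: the record keys (`language`, `gold`, `pred`) and the `per_language_report` helper are hypothetical and do not reflect the FactSpan file schema, and the 10% threshold is simply the example value mentioned above.

```python
from collections import defaultdict

NO_VERDICT_THRESHOLD = 0.10  # example value from the list above; tune for your use case


def per_language_report(records, threshold=NO_VERDICT_THRESHOLD):
    """Summarize accuracy and no-verdict rate per language.

    records: iterable of dicts with hypothetical keys
      'language', 'gold' ('true'/'false'), 'pred' ('true'/'false' or None).
    """
    counts = defaultdict(lambda: {"n": 0, "answered": 0, "correct": 0})
    for r in records:
        c = counts[r["language"]]
        c["n"] += 1
        if r["pred"] is not None:
            c["answered"] += 1
            c["correct"] += int(r["pred"] == r["gold"])

    report = {}
    for lang, c in counts.items():
        no_verdict = 1 - c["answered"] / c["n"]
        # Accuracy is computed over answered claims only; None if nothing was answered.
        accuracy = c["correct"] / c["answered"] if c["answered"] else None
        report[lang] = {
            "accuracy": accuracy,
            "no_verdict_rate": no_verdict,
            "escalate": no_verdict > threshold,  # flag language for manual escalation
        }
    return report
```

Calling `per_language_report(records)` on a list of such dicts returns one entry per language; languages whose `escalate` flag is set are candidates for human review or a fallback checker.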

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/lorraine-dev/FactSpan

Data URLs

https://zenodo.org/records/15084388

https://doi.org/10.5281/zenodo.15084388

Risks & Boundaries

Limitations

Date as proxy for novelty: post-cutoff claims can reference earlier events, complicating interpretation.

Dataset skew: languages and topics biased toward active fact-checking communities.

When Not To Use

As a fully automated, unsupervised fact-checking system for high-stakes decisions.

For claims that require image/video/audio evidence.

Failure Modes

Refusal/coverage gaps: high no-verdict rates reduce automation gains.

Surface heuristics: models may rely on tone or plausibility over evidence.

Core Entities

Models

GPT-4o, GPT-3.5 Turbo, LLaMA 3.1 70B, LLaMA 3.1 8B, Mixtral 8x7B

Metrics

Accuracy, no-verdict percentage, precision, recall, F1-score

Datasets

FactSpan, X-Fact, ClaimReview, Data Commons Feed, FEVER (cited)

Benchmarks

FactSpan evaluation (this paper)
