LLMs make more mistakes on factual-sounding claims than on opinions across 61K multilingual fact-checks

June 4, 2025 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.55

Cost Impact Score

0.6

Citation Count

1

Authors

Lorraine Saju, Arnim Bleier, Jana Lasser, Claudia Wagner

Links

Abstract / PDF

Why It Matters For Business

LLMs can help scale fact-checking but often miss factual claims and skip many judgments; companies should not rely on LLM-only pipelines for high-stakes verification.

Summary TLDR

The authors release FactSpan, a multilingual fact-checking dataset of 61,514 claims (30 languages, 2007–2024), and use it to test five LLMs (GPT-4o, GPT-3.5 Turbo, LLaMA 3.1 70B and 8B, Mixtral 8x7B). GPT-4o reaches the highest binary accuracy (73.31%) but declines to judge 43% of claims. Across models, claims framed as factual statements are far harder to classify than opinions (for example, GPT-3.5 Turbo's error rate is 41.2% on facts versus 21.3% on opinions). Performance also varies strongly by language and topic. The paper warns against blind deployment of LLM-only fact-checkers and provides the dataset and code on Zenodo/GitHub.

Problem Statement

Current LLM fact-check evaluations focus narrowly on English and a few topics, leaving open whether models generalize across languages, topics, and claim styles. This paper asks: which claim features (language, topic, factual vs opinion, structure) influence LLM fact-checking accuracy in realistic multilingual data?

Main Contribution

FactSpan: a dynamically extensible multilingual fact-checking dataset with 61,514 verifiable, text-only claims across 30 languages and five topics.

A head-to-head evaluation of five LLMs (GPT-4o, GPT-3.5 Turbo, LLaMA 3.1 70B & 8B, Mixtral 8x7B) on the dataset, reporting accuracy and refusal rates per language and pre/post training cutoff.

A claim-feature analysis showing consistent patterns of misclassification, notably that factual statements are more error-prone than opinion statements, plus logistic regression identifying language, age, and label complexity as predictors of errors.
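To make the regression step concrete, here is a minimal sketch of fitting a logistic regression that predicts whether a model misclassifies a claim. It is not the authors' code: the feature names (`complex_label`, `claim_age_years`), the toy rows, and the plain gradient-descent fit are illustrative assumptions, and language is left out to keep the example short.

```python
import math

# Toy rows invented purely for illustration (NOT FactSpan data).
# Each row: (complex_label, claim_age_years, model_was_wrong)
ROWS = [
    (1, 0.5, 1), (1, 2.0, 1), (1, 1.0, 0), (1, 3.0, 1),
    (0, 0.5, 0), (0, 2.0, 0), (0, 1.0, 0), (0, 3.0, 1),
    (1, 4.0, 1), (0, 4.0, 0), (1, 0.2, 0), (0, 0.2, 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, lr=0.1, epochs=5000):
    """Fit [bias, w_complex, w_age] by batch gradient descent on log-loss."""
    weights = [0.0, 0.0, 0.0]
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for complex_label, age, wrong in rows:
            x = (1.0, float(complex_label), age)
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j in range(3):
                grad[j] += (p - wrong) * x[j]
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights

bias, w_complex, w_age = fit_logistic(ROWS)
# exp(coefficient) is the odds ratio for a one-unit change in that feature.
odds_ratio_complex = math.exp(w_complex)
print(f"w_complex={w_complex:.2f}, w_age={w_age:.2f}, "
      f"odds ratio (complex label)={odds_ratio_complex:.2f}")
```

A positive coefficient means the feature raises the odds of an error. In practice a library such as statsmodels or scikit-learn would also give standard errors and handle categorical features like language.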

Key Findings

Large multilingual dataset released: FactSpan contains 61,514 claims in 30 languages.

Numbers: Total claims = 61,514; languages = 30

GPT-4o achieves the highest binary accuracy but often declines to judge claims.

Numbers: Accuracy 73.31%; No-verdict 43.02%

GPT-3.5 Turbo is slightly less accurate than GPT-4o but declines to judge far fewer claims.

Numbers: Accuracy 69.44%; No-verdict 16.57%

Open-source Mixtral and LLaMA models show lower accuracy and variable refusal behavior.

Numbers: Mixtral accuracy 53.41% (No-verdict 36.68%); LLaMA 3.1 8B accuracy 62.73% (No-verdict 10.36%)

Claims framed as facts are much harder to classify than opinion claims.

Numbers: GPT-3.5 error: facts 41.2% vs opinions 21.3%

Language and topic strongly influence error rates.

Numbers: Highest errors: Serbian 49.9%, Arabic 46%, Polish 41.9%; worst topic: Economy/Environment 37.1%

Closed-source models are more accurate on post-cutoff claims than on pre-cutoff ones, which may reflect reliance on heuristics rather than recalled evidence.

Numbers: GPT-3.5 post-cutoff accuracy 79.75% vs pre-cutoff 66.05%; GPT-4o post-cutoff 80.45% vs pre-cutoff 72.57%
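The pre/post-cutoff comparison can be reproduced on your own data with a simple date split. The sketch below assumes a hypothetical cutoff date, a hypothetical record layout `(claim_date, gold_label, predicted_label)`, and that no-verdict answers are excluded from accuracy; none of these details come from the paper.

```python
from datetime import date

# Hypothetical cutoff; look up the real training cutoff for your model.
ASSUMED_CUTOFF = date(2023, 10, 1)

def accuracy_by_cutoff(records, cutoff=ASSUMED_CUTOFF):
    """records: iterable of (claim_date, gold_label, predicted_label or None)."""
    buckets = {"pre": [0, 0], "post": [0, 0]}  # [correct, judged]
    for claim_date, gold, pred in records:
        if pred is None:
            # No-verdict: excluded from accuracy here (an assumption).
            continue
        key = "post" if claim_date > cutoff else "pre"
        buckets[key][1] += 1
        buckets[key][0] += int(pred == gold)
    return {k: (c / n if n else None) for k, (c, n) in buckets.items()}

# Toy records, invented for illustration only.
toy = [
    (date(2021, 5, 1), "false", "false"),
    (date(2022, 3, 9), "true", "false"),
    (date(2024, 1, 15), "false", "false"),
    (date(2024, 2, 2), "true", None),
]
result = accuracy_by_cutoff(toy)
print(result)
```

Keep in mind that the claim date is only a proxy for novelty: a post-cutoff claim can still refer to an event the model saw during training.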

Results

Accuracy

Value: 73.31% (GPT-4o)

No-verdict rate (worst coverage)

Value: 43.02% (GPT-4o)

Accuracy

Value: 69.44% (GPT-3.5 Turbo)

Error rate on factual vs opinion claims (GPT-3.5)

Value: Facts error 41.2% vs Opinions error 21.3%

Language worst performers (error rate)

Value: Serbian 49.9%, Arabic 46%, Polish 41.9%

Accuracy

Value: 53.41% (Mixtral 8x7B)
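Accuracy and no-verdict rate interact, so it can help to combine them into one number. The sketch below assumes that accuracy is measured only on claims the model actually judged (this summary does not state how the paper handles refusals). Under that assumption, the share of all claims that end up correctly judged is accuracy × (1 − no-verdict rate). The figures are the ones reported above; the function name is illustrative.

```python
# (accuracy, no-verdict rate) as reported in the Key Findings above.
REPORTED = {
    "GPT-4o": (0.7331, 0.4302),
    "GPT-3.5 Turbo": (0.6944, 0.1657),
    "Mixtral 8x7B": (0.5341, 0.3668),
    "LLaMA 3.1 8B": (0.6273, 0.1036),
}

def correctly_judged_share(accuracy, no_verdict_rate):
    """Share of ALL claims judged correctly, assuming accuracy
    is computed over judged claims only (an assumption)."""
    return accuracy * (1.0 - no_verdict_rate)

shares = {name: correctly_judged_share(a, nv) for name, (a, nv) in REPORTED.items()}
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1%} of all claims correctly judged")
```

If the assumption holds, a model with lower accuracy but fewer refusals can resolve more claims overall, which is why both numbers should be tracked together.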

Who Should Care

What To Try In 7 Days

Run your fact-checking pipeline on a sampled subset of FactSpan to measure per-language performance.

Mark claims framed as 'factual' and route them for evidence retrieval or human review.

Set a no-verdict threshold (e.g., >10%) to trigger manual escalation and log reasons for refusals.
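The three steps above can be sketched as a small evaluation helper. This is a minimal illustration, not a production design: the record keys (`language`, `gold`, `pred`, `framing`) and the route names are hypothetical, and how a claim gets its `framing` tag is left to your own classifier or annotators.

```python
from collections import defaultdict

NO_VERDICT_THRESHOLD = 0.10  # the ">10%" example threshold from the text

def per_language_report(results, threshold=NO_VERDICT_THRESHOLD):
    """results: iterable of dicts with hypothetical keys
    'language', 'gold', 'pred' (pred is None for a no-verdict)."""
    stats = defaultdict(lambda: {"total": 0, "judged": 0, "correct": 0})
    for r in results:
        s = stats[r["language"]]
        s["total"] += 1
        if r["pred"] is not None:
            s["judged"] += 1
            s["correct"] += int(r["pred"] == r["gold"])
    report = {}
    for lang, s in stats.items():
        no_verdict = 1 - s["judged"] / s["total"]
        report[lang] = {
            "accuracy": s["correct"] / s["judged"] if s["judged"] else None,
            "no_verdict_rate": no_verdict,
            "escalate": no_verdict > threshold,  # trigger manual review
        }
    return report

def route(claim):
    """Send claims tagged as factual to retrieval or human review."""
    if claim.get("framing") == "factual":
        return "evidence_or_human_review"
    return "llm_only"

# Toy results, invented for illustration only.
toy = [
    {"language": "en", "gold": "false", "pred": "false"},
    {"language": "en", "gold": "true", "pred": "true"},
    {"language": "en", "gold": "false", "pred": "true"},
    {"language": "en", "gold": "true", "pred": "true"},
    {"language": "sr", "gold": "false", "pred": None},
    {"language": "sr", "gold": "false", "pred": "false"},
]
report = per_language_report(toy)
print(report)
print(route({"text": "Unemployment fell by 3% last year.", "framing": "factual"}))
```

Logging the refusal text alongside each no-verdict record makes it easier to see later why a language crossed the threshold.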

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Date as proxy for novelty: post-cutoff claims can reference earlier events, complicating interpretation.
  • Dataset skew: languages and topics biased toward active fact-checking communities.
  • Text-only focus: multimedia claims (images, video) excluded.
  • Annotation risk: some annotations relied on LLMs and a small manual validation sample.

When Not To Use

  • As a fully automated, unsupervised fact-checking system for high-stakes decisions.
  • For claims that require image/video/audio evidence.
  • For low-resource languages without additional language-specific validation.

Failure Modes

  • Refusal/coverage gaps: high no-verdict rates reduce automation gains.
  • Surface heuristics: models may rely on tone or plausibility over evidence.
  • Language bias: higher error rates in certain languages leading to unequal reliability.
  • Label-complexity confusion: 'partly true/misleading' cases are frequently misclassified.

Core Entities

Models

  • GPT-4o
  • GPT-3.5 Turbo
  • LLaMA 3.1 70B
  • LLaMA 3.1 8B
  • Mixtral 8x7B

Metrics

  • Accuracy
  • no-verdict percentage
  • precision
  • recall
  • F1-score
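For reference, the metrics listed above can be computed from parallel lists of gold labels and predictions as follows. This is a sketch under assumptions: labels are the strings "true" and "false", "false" is treated as the positive class, and no-verdict answers (`None`) are excluded from everything except the no-verdict percentage. The paper may define these differently.

```python
def binary_metrics(gold, pred, positive="false"):
    """gold/pred: parallel lists; pred entries are None for a no-verdict."""
    judged = [(g, p) for g, p in zip(gold, pred) if p is not None]
    tp = sum(g == positive and p == positive for g, p in judged)
    fp = sum(g != positive and p == positive for g, p in judged)
    fn = sum(g == positive and p != positive for g, p in judged)
    correct = sum(g == p for g, p in judged)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(judged) if judged else 0.0,
        "no_verdict_pct": 100 * (len(pred) - len(judged)) / len(pred) if pred else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels, invented for illustration only.
gold = ["false", "false", "true", "true", "false"]
pred = ["false", "true", "true", None, "false"]
metrics = binary_metrics(gold, pred)
print(metrics)
```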

Datasets

  • FactSpan
  • X-Fact
  • ClaimReview
  • Data Commons Feed
  • FEVER (cited)

Benchmarks

  • FactSpan evaluation (this paper)