Professional multilingual TruthfulQA shows truth gaps across languages but smaller than expected

February 13, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria de-Dios-Flores, Rodrigo Agerri

Why It Matters For Business

Multilingual truth checks are necessary: models can be less accurate in low-resource languages, and simple automatic scores such as MC2 can mislead decisions. Judge-style evaluation and machine-translated datasets reduce cost and improve cross-lingual assessment.

Summary TLDR

The authors professionally translated the 817-question TruthfulQA benchmark to Basque, Catalan, Galician and Spanish and evaluated 12 open LLMs (Llama 3/3.1 and Gemma 2 families) using human labels, multiple-choice scoring (MC2), and LLM-as-a-Judge. Key findings: English answers are most detailed and often most truthful; Basque (lowest-resource) is worst, but overall language gaps are smaller than expected; LLM-as-a-Judge agrees better with humans than MC2; instruction-tuned models are far more informative; high-quality machine translation produces similar evaluation results to professional translation. Data and code are published.

Problem Statement

TruthfulQA and most truthfulness tests are English-only. We need to know if LLM truthfulness holds across languages, how to evaluate it automatically across languages, and whether machine translation can cheaply scale truthfulness benchmarks.

Main Contribution

Professional translations of TruthfulQA into Basque, Catalan, Galician and Spanish (parallel dataset).

Large-scale evaluation of 12 open LLMs (base and instruct variants across 7B–70B) on multilingual TruthfulQA.

Comparison of three evaluation methods: human annotation, multiple-choice scoring (MC2), and LLM-as-a-Judge, showing that LLM-as-a-Judge agrees with human annotation better than MC2 does.

Empirical comparison of professional vs machine-translated datasets and analysis of universal vs time/context-dependent questions.

Public release of datasets, models and code under open licenses.

Key Findings

LLM-as-a-Judge correlates better with humans than MC2.

Numbers: Cohen Kappa, Judge (Gemma-2-9b-inst) vs human, up to 0.75; MC2 lower
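As a rough illustration of the agreement statistic quoted above, the following sketch computes Cohen's kappa between two sets of labels. The function name `cohen_kappa` and the label lists are hypothetical and invented for this example; they are not taken from the paper's data or code.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels (1 = truthful, 0 = not truthful), for illustration only.
human = [1, 1, 0, 0, 1, 0, 1, 1]
judge = [1, 1, 0, 1, 1, 0, 1, 0]
kappa = cohen_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, which is why it is a stricter measure than simple percent agreement.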

Best instruct model (Gemma-2-27b-it) reached ~79% average truth accuracy by Judge-LLM.

Numbers: Judge-LLM avg = 79.0%

Instruction tuning increases informativeness and truthfulness.

Numbers: Human truthfulness, instruct 62–73% vs base 36–58% on sampled models

Base models often give uninformative replies that inflate MC truth scores.

Numbers: Judge-LLM informativeness for base models averages ≈ 85.4% across non-English languages, but in some languages low informativeness boosts the truth score

Universal (time-independent) questions are much easier than time/context-dependent ones.

Numbers: Instruct models hit near 90% on some universal questions (Judge-LLM)

Machine translation produces statistically similar evaluation outcomes to professional translation for this dataset.

Numbers: Chi-square p-values between MT and human translations range from 0.18 to 0.78 (p > 0.05)
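A minimal sketch of this kind of comparison, assuming it is run on a 2x2 table of truthful vs. non-truthful counts per translation type. The summary does not spell out the table layout the authors used, and the counts and the name `chi_square_2x2` below are hypothetical.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) on a 2x2 table.

    Returns (statistic, p_value).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    if 0 in row or 0 in col:
        raise ValueError("every row and column needs a non-zero total")
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row[i] * col[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: rows = professional vs machine translation,
# columns = answers judged truthful vs not truthful.
counts = [[520, 297], [505, 312]]
stat, p = chi_square_2x2(counts)
similar = p > 0.05  # no significant difference at the 5% level
print(f"chi2 = {stat:.3f}, p = {p:.3f}, similar = {similar}")
```

A p-value above 0.05 means the test finds no significant difference between the two versions; it does not prove they are equivalent.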

Results

Accuracy (Gemma-2-27b-it, Judge-LLM average)

Value: 79.0%

Judge-LLM vs human agreement (Cohen Kappa)

Value: up to 0.75

Baseline: MC2 lower agreement

Accuracy (English instruct average)

Value: ≈57.7%

Informativeness (base models average non-English)

Value: ≈85.4%

Who Should Care

What To Try In 7 Days

Run a judge-style (LLM-as-a-Judge) evaluation on your multilingual QA outputs and compare to MC2.

Measure informativeness separately for base models to detect inflated truth scores.

Generate an MT version of your evaluation set and spot-check a sample vs professional translation.
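The second step above can be sketched as follows, assuming each answer already carries two binary judge labels (truthful, informative). The helper `summarize` and the sample labels are hypothetical and only illustrate how evasive answers can raise a truth score.

```python
def summarize(labels):
    """labels: list of (truthful, informative) boolean pairs, one per answer."""
    n = len(labels)
    if n == 0:
        raise ValueError("no labels given")
    return {
        "truthful": sum(t for t, _ in labels) / n,
        "informative": sum(i for _, i in labels) / n,
        # Answers that are both truthful and informative: the rate that
        # evasive replies such as 'no comment' cannot inflate.
        "truthful_and_informative": sum(t and i for t, i in labels) / n,
    }


# Hypothetical judge outputs for a base model with many evasive answers:
# 5 truthful but uninformative, 3 truthful and informative, 2 false.
labels = [(True, False)] * 5 + [(True, True)] * 3 + [(False, True)] * 2
report = summarize(labels)
print(report)
```

A large gap between the truthful rate and the truthful-and-informative rate is a sign that the truth score is being carried by uninformative replies.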

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Languages cover only English plus four Iberian languages, limiting global generalization.
  • Judge-LLM is strong but does not fully replace nuanced human review.
  • TruthfulQA is static; it lacks regional and temporal updates that matter for many falsehoods.
  • MT results depend on availability of a strong MT model for the target language and the text genre.

When Not To Use

  • Without validation, for languages that lack high-quality MT or are strongly underrepresented in training data.
  • When subtle, localized, or legal time-sensitive facts require human-level adjudication.
  • As the sole evaluation method for deployed, high-stakes systems without a human audit.

Failure Modes

  • Judge-LLM may miss subtle factual nuances that human raters catch.
  • Base models outputting 'no comment' inflate MC-style truth scores.
  • MT can introduce synonyms or small changes that alter truth labels in edge cases.
  • Low-resource languages suffer from comprehension errors and cultural mismatches.

Core Entities

Models

  • Llama-3
  • Llama-3.1
  • Gemma-2
  • Llama-2-7B
  • Gemma-2-9b

Metrics

  • MC2 (multiple-choice scoring)
  • LLM-as-a-Judge
  • Cohen Kappa
  • Informativeness (binary)
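For reference, MC2 in the original TruthfulQA setup is the normalized probability mass a model assigns to the true reference answers. The sketch below assumes per-answer log-probabilities have already been obtained from a model; the function name `mc2_score` and the numbers are hypothetical, and the exact scoring pipeline used in the paper is not described in this summary.

```python
import math


def mc2_score(true_logprobs, false_logprobs):
    """Share of probability mass on true answers among all reference answers."""
    if not true_logprobs or not false_logprobs:
        raise ValueError("need at least one true and one false answer")
    p_true = sum(math.exp(lp) for lp in true_logprobs)
    p_false = sum(math.exp(lp) for lp in false_logprobs)
    return p_true / (p_true + p_false)


# Hypothetical log-probabilities for one question's reference answers.
true_lps = [math.log(0.2), math.log(0.1)]
false_lps = [math.log(0.3), math.log(0.1)]
score = mc2_score(true_lps, false_lps)
print(f"MC2 = {score:.3f}")
```

Because the score depends only on relative likelihoods of fixed reference answers, it says nothing about what the model would actually generate, which is one reason it can diverge from human judgments.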

Datasets

  • TruthfulQA (original English)
  • TruthfulQA (professional translations: Basque, Catalan, Galician, Spanish)
  • TruthfulQA (machine-translated via Claude 3.5 Sonnet)

Benchmarks

  • TruthfulQA
  • VeritasQA
  • SimpleQA