Professionally translated multilingual TruthfulQA shows truthfulness gaps across languages, but smaller than expected

February 13, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The dataset and code are public and experiments are reproducible; main limits are language coverage (Iberian focus) and evaluation subtleties (judge vs human).

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria de-Dios-Flores, Rodrigo Agerri

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Multilingual truthfulness checks are necessary: models can be less accurate in low-resource languages, and simple automatic scores (MC2) can mislead decisions. Judge-style evaluation and machine-translated (MT) datasets reduce cost and improve cross-lingual assessment.

Who Should Care

Summary TLDR

The authors professionally translated the 817-question TruthfulQA benchmark to Basque, Catalan, Galician and Spanish and evaluated 12 open LLMs (Llama 3/3.1 and Gemma 2 families) using human labels, multiple-choice scoring (MC2), and LLM-as-a-Judge. Key findings: English answers are most detailed and often most truthful; Basque (lowest-resource) is worst, but overall language gaps are smaller than expected; LLM-as-a-Judge agrees better with humans than MC2; instruction-tuned models are far more informative; high-quality machine translation produces similar evaluation results to professional translation. Data and code are published.

Problem Statement

TruthfulQA and most truthfulness tests are English-only. We need to know if LLM truthfulness holds across languages, how to evaluate it automatically across languages, and whether machine translation can cheaply scale truthfulness benchmarks.

Main Contribution

Professional translations of TruthfulQA into Basque, Catalan, Galician and Spanish (parallel dataset).

Large-scale evaluation of 12 open LLMs (base and instruct variants across 7B–70B) on multilingual TruthfulQA.

Key Findings

LLM-as-a-Judge correlates better with humans than MC2.

Numbers: Cohen's kappa, Judge (Gemma-2-9b-inst) vs human, up to 0.75; MC2 lower

Practical Use: Prefer judge-style automated scoring over MC2 when evaluating generation truthfulness across languages.

Evidence Ref: Table 3; Figure 1
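The agreement statistic behind this finding can be illustrated with a minimal sketch of Cohen's kappa in plain Python. The labels below are hypothetical toy values, not data from the paper, and the second "metric" rater stands in for any automatic score that has been thresholded into binary labels (an assumption made only for the example).

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical toy labels (1 = truthful, 0 = not truthful).
human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0]
metric = [1, 0, 0, 1, 1, 1, 1, 0, 0, 1]  # thresholded automatic score (assumed)

kappa_judge = cohen_kappa(human, judge)
kappa_metric = cohen_kappa(human, metric)
print(f"judge vs human:  {kappa_judge:.3f}")
print(f"metric vs human: {kappa_metric:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (for example "truthful") dominates the sample.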

The best instruct model (Gemma-2-27b-it) reached about 79% average truthfulness accuracy as scored by the Judge-LLM.

Numbers: Judge-LLM avg = 79.0%

Practical Use: Use top instruct models (e.g., Gemma-2 instruct) for multilingual truth-sensitive tasks when available.

Evidence Ref: Table 4 (Gemma-2-27b-it)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 79.0% | - | - | Professional translations (all 5 languages) | Table 4 reports Judge-LLM per-language and avg 79.0% | Table 4
Judge-LLM vs human agreement (Cohen's kappa) | up to 0.75 | MC2 lower agreement | Judge > MC2 | 400-instance human comparison | Table 3 shows Gemma-2-9b-inst trained on all data reaches kappa 0.74–0.75 | Table 3; Figure 1

What To Try In 7 Days

Run a judge-style (LLM-as-a-Judge) evaluation on your multilingual QA outputs and compare to MC2.

Measure informativeness separately for base models to detect inflated truth scores.

Generate an MT version of your evaluation set and spot-check a sample vs professional translation.
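The MT spot-check in the last step could be organized along these lines. This is a sketch under stated assumptions: each evaluation run is represented as a dict mapping a question ID to a binary truthfulness label, and the function names, IDs and labels are all hypothetical, not part of the published code.

```python
import random

def compare_versions(prof_labels, mt_labels):
    """Compare truthfulness labels on professional vs MT versions of the same items.

    Both arguments map question_id -> 0/1 label (assumed format).
    Only question IDs present in both runs are compared.
    """
    shared = sorted(set(prof_labels) & set(mt_labels))
    if not shared:
        raise ValueError("no shared question IDs to compare")
    prof_acc = sum(prof_labels[q] for q in shared) / len(shared)
    mt_acc = sum(mt_labels[q] for q in shared) / len(shared)
    disagreements = [q for q in shared if prof_labels[q] != mt_labels[q]]
    return {
        "n": len(shared),
        "prof_acc": prof_acc,
        "mt_acc": mt_acc,
        "gap": mt_acc - prof_acc,
        "disagreements": disagreements,
    }

def spot_check_sample(question_ids, k, seed=0):
    """Draw a reproducible random sample of IDs for manual review."""
    ids = sorted(question_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))

# Hypothetical toy labels for illustration only.
prof = {"q1": 1, "q2": 0, "q3": 1, "q4": 1, "q5": 0}
mt = {"q1": 1, "q2": 0, "q3": 0, "q4": 1, "q5": 0, "q6": 1}

report = compare_versions(prof, mt)
to_review = spot_check_sample(report["disagreements"], k=3)
print(report)
print("review manually:", to_review)
```

Reviewing the disagreeing items first is one way to tell translation problems apart from ordinary model variance before trusting the MT set more widely.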

Risks & Boundaries

Limitations

Languages cover only English plus four Iberian languages, limiting global generalization.

Judge-LLM is strong but does not fully replace nuanced human review.

When Not To Use

For languages that lack high-quality MT or are strongly underrepresented in training data, unless the results are validated first.

When subtle, localized, legal, or time-sensitive facts require human-level adjudication.

Failure Modes

Judge-LLM may miss subtle factual nuances that human raters catch.

Base models outputting 'no comment' inflate MC-style truth scores.
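The inflation effect described above can be made visible by reporting truthfulness alongside informativeness. The sketch below assumes each answer already carries binary "truthful" and "informative" labels (from human raters or a judge model); the record format and the example answers are hypothetical.

```python
def truth_info_rates(records):
    """Return truthful, informative, and truthful-and-informative rates.

    Each record is a dict with binary 'truthful' and 'informative' fields
    (assumed format, not taken from the published code).
    """
    if not records:
        raise ValueError("records must be non-empty")
    n = len(records)
    truthful = sum(r["truthful"] for r in records) / n
    informative = sum(r["informative"] for r in records) / n
    # An answer only counts here if it is both truthful and informative.
    both = sum(r["truthful"] * r["informative"] for r in records) / n
    return {"truthful": truthful, "informative": informative, "both": both}

# Hypothetical toy outputs from a base model that often declines to answer.
records = [
    {"answer": "I have no comment.", "truthful": 1, "informative": 0},
    {"answer": "I have no comment.", "truthful": 1, "informative": 0},
    {"answer": "I have no comment.", "truthful": 1, "informative": 0},
    {"answer": "A correct, specific answer.", "truthful": 1, "informative": 1},
    {"answer": "A confident but false answer.", "truthful": 0, "informative": 1},
]

rates = truth_info_rates(records)
print(rates)
```

In this toy case the truthful rate looks high while the joint rate is low, which is the pattern to watch for when comparing base and instruction-tuned models.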

Core Entities

Models

Llama-3, Llama-3.1, Gemma-2, Llama-2-7B, Gemma-2-9b

Metrics

MC2 (multiple-choice scoring), LLM-as-a-Judge, Cohen's kappa, Informativeness (binary)

Datasets

TruthfulQA (original English), TruthfulQA (professional translations: Basque, Catalan, Galician, Spanish), TruthfulQA (machine-translated via Claude 3.5 Sonnet)

Benchmarks

TruthfulQA, VeritasQA, SimpleQA