Professional multilingual TruthfulQA shows truth gaps across languages but smaller than expected

Overview

Decision Snapshot: Needs Validation

The dataset and code are public and experiments are reproducible; main limits are language coverage (Iberian focus) and evaluation subtleties (judge vs human).

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria de-Dios-Flores, Rodrigo Agerri

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Multilingual truth checks are necessary: models can be less accurate in low-resource languages, and simple automatic scores such as MC2 can mislead decisions. Judge-style evaluation improves cross-lingual assessment, and machine-translated datasets reduce its cost.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Engineering Lead

Summary TLDR

The authors had the 817-question TruthfulQA benchmark professionally translated into Basque, Catalan, Galician and Spanish, then evaluated 12 open LLMs (Llama 3/3.1 and Gemma 2 families) using human labels, multiple-choice scoring (MC2), and LLM-as-a-Judge. Key findings: English answers are the most detailed and often the most truthful; Basque, the lowest-resource language, scores worst, but overall language gaps are smaller than expected; LLM-as-a-Judge agrees with humans better than MC2 does; instruction-tuned models are far more informative than base models; and high-quality machine translation yields evaluation results similar to professional translation. Data and code are published.

Problem Statement

TruthfulQA and most truthfulness tests are English-only. We need to know if LLM truthfulness holds across languages, how to evaluate it automatically across languages, and whether machine translation can cheaply scale truthfulness benchmarks.

Main Contribution

Professional translations of TruthfulQA into Basque, Catalan, Galician and Spanish (parallel dataset).

Large-scale evaluation of 12 open LLMs (base and instruct variants across 7B–70B) on multilingual TruthfulQA.

Key Findings

LLM-as-a-Judge correlates better with humans than MC2.

Numbers: Cohen's kappa, Judge (Gemma-2-9b-inst) vs. human, up to 0.75; MC2 is lower

Practical Use: Prefer judge-style automated scoring over MC2 when evaluating generation truthfulness across languages.

Evidence Ref: Table 3; Figure 1
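The agreement figure above is a Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch of that computation follows; the function name and the toy labels are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels (1 = truthful, 0 = not truthful), not data from the paper.
human = [1, 1, 0, 0, 1, 0, 1, 1]
judge = [1, 1, 0, 1, 1, 0, 1, 0]
kappa = cohen_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

On these toy labels the raters agree on 6 of 8 items, yet kappa is well below 0.75 because much of that agreement is expected by chance. This is why kappa, rather than raw agreement, is the more cautious number to compare a judge and MC2 on.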

Best instruct model (Gemma-2-27b-it) reached ~79% average truth accuracy by Judge-LLM.

Numbers: Judge-LLM avg = 79.0%

Practical Use: Use top instruct models (e.g., Gemma-2 instruct) for multilingual truth-sensitive tasks when available.

Evidence Ref: Table 4 (Gemma-2-27b-it)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Judge-LLM truth accuracy (Gemma-2-27b-it, average)	79.0%	n/a	n/a	English original plus professional translations (5 languages)	Table 4 reports Judge-LLM per-language and avg 79.0%	Table 4
Judge-LLM vs human agreement (Cohen Kappa)	up to 0.75	MC2 lower agreement	Judge > MC2	400-instance human comparison	Table 3 shows Gemma-2-9b-inst trained on all data reaches Kappa 0.74–0.75	Table 3; Figure 1

What To Try In 7 Days

Run a judge-style (LLM-as-a-Judge) evaluation on your multilingual QA outputs and compare to MC2.

Measure informativeness separately for base models to detect inflated truth scores.

Generate an MT version of your evaluation set and spot-check a sample vs professional translation.
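As a starting point for the second item, here is a minimal sketch of reporting truthfulness and informativeness side by side. It assumes each answer has already been given two binary labels (by a human or a judge model); the record layout and the function name are hypothetical.

```python
def summarize(judgments):
    """Return truthful, informative, and truthful-and-informative rates."""
    n = len(judgments)
    if n == 0:
        raise ValueError("no judgments to summarize")
    truthful = sum(j["truthful"] for j in judgments) / n
    informative = sum(j["informative"] for j in judgments) / n
    # An answer only helps a user if it is both truthful and informative.
    both = sum(j["truthful"] and j["informative"] for j in judgments) / n
    return {"truthful": truthful, "informative": informative, "both": both}


# Hypothetical toy judgments, not data from the paper.
# An evasive reply such as "I have no comment" is truthful but not informative.
judgments = [
    {"truthful": True, "informative": True},
    {"truthful": True, "informative": False},
    {"truthful": True, "informative": False},
    {"truthful": False, "informative": True},
]
rates = summarize(judgments)
print(rates)
```

In this toy case the truthful rate is 0.75 while the combined rate is 0.25. A gap of that kind is the signal that a truth score is being inflated by evasive answers.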

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/hitz-zentroa/truthfulqa-multi

https://hf.co/collections/HiTZ/multilingual-truthfulqa-682f33d0d1d5a60d13604eb6

Data URLs

https://hf.co/collections/HiTZ/multilingual-truthfulqa-682f33d0d1d5a60d13604eb6

https://github.com/hitz-zentroa/truthfulqa-multi

Risks & Boundaries

Limitations

Languages cover only English plus four Iberian languages, limiting global generalization.

Judge-LLM is strong but does not fully replace nuanced human review.

When Not To Use

For languages that lack high-quality MT or are strongly underrepresented in training data, unless the results are validated first.

When subtle, localized, legal, or time-sensitive facts require human-level adjudication.

Failure Modes

Judge-LLM may miss subtle factual nuances that human raters catch.

Base models outputting 'no comment' inflate MC-style truth scores.

Core Entities

Models

Llama-3, Llama-3.1, Gemma-2, Llama-2-7B, Gemma-2-9b

Metrics

MC2 (multiple-choice scoring), LLM-as-a-Judge, Cohen Kappa, Informativeness (binary)
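For reference, MC2 in the original TruthfulQA formulation is the normalized total probability a model assigns to the true reference answers. The sketch below assumes per-answer log-probabilities have already been obtained from the model; the function name and the example values are hypothetical, and the evaluation code used for this paper may differ in detail.

```python
import math


def mc2_score(true_logprobs, false_logprobs):
    """Share of probability mass placed on the true reference answers."""
    if not true_logprobs or not false_logprobs:
        raise ValueError("need at least one true and one false reference answer")
    # Subtract the largest log-probability before exponentiating to avoid underflow.
    shift = max(max(true_logprobs), max(false_logprobs))
    p_true = sum(math.exp(lp - shift) for lp in true_logprobs)
    p_false = sum(math.exp(lp - shift) for lp in false_logprobs)
    return p_true / (p_true + p_false)


# Hypothetical log-probabilities for one question, not values from the paper.
score = mc2_score(
    true_logprobs=[math.log(0.3), math.log(0.1)],
    false_logprobs=[math.log(0.4)],
)
print(f"MC2 = {score:.2f}")
```

Because MC2 scores fixed reference answers rather than the text a model actually generates, it can diverge from human judgments of generated answers.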

Datasets

TruthfulQA (original English), TruthfulQA (professional translations: Basque, Catalan, Galician, Spanish), TruthfulQA (machine-translated via Claude 3.5 Sonnet)

Benchmarks

TruthfulQA, VeritasQA, SimpleQA

