Overview
The dataset and code are public and the experiments are reproducible. The main limitations are language coverage (English plus four Iberian languages) and evaluation subtleties (automatic judges do not fully match human labels).
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 1/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Multilingual truthfulness checks are necessary: models can be less accurate in low-resource languages, and simple automatic scores such as MC2 can mislead decisions. Judge-style evaluation improves cross-lingual assessment, and machine-translated datasets reduce its cost.
Summary TLDR
The authors professionally translated the 817-question TruthfulQA benchmark to Basque, Catalan, Galician and Spanish and evaluated 12 open LLMs (Llama 3/3.1 and Gemma 2 families) using human labels, multiple-choice scoring (MC2), and LLM-as-a-Judge. Key findings: English answers are most detailed and often most truthful; Basque (lowest-resource) is worst, but overall language gaps are smaller than expected; LLM-as-a-Judge agrees better with humans than MC2; instruction-tuned models are far more informative; high-quality machine translation produces similar evaluation results to professional translation. Data and code are published.
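For readers unfamiliar with MC2: in the original TruthfulQA benchmark it is the normalized total probability a model assigns to the set of true reference answers. The sketch below illustrates that definition only. The function name `mc2_score` and the toy log-probabilities are hypothetical, and the paper's exact scoring pipeline may differ (for example in how answer log-likelihoods are obtained).

```python
import math


def mc2_score(true_logprobs, false_logprobs):
    """Share of probability mass placed on the true reference answers.

    Inputs are per-answer log-likelihoods for one question (assumed to be
    computed elsewhere by scoring each reference answer with the model).
    """
    if not true_logprobs or not false_logprobs:
        raise ValueError("need at least one true and one false answer")
    mass_true = sum(math.exp(lp) for lp in true_logprobs)
    mass_false = sum(math.exp(lp) for lp in false_logprobs)
    return mass_true / (mass_true + mass_false)


# Toy values, not taken from the paper.
score = mc2_score(
    true_logprobs=[math.log(0.2), math.log(0.1)],
    false_logprobs=[math.log(0.3), math.log(0.4)],
)
```

Because the score depends only on relative likelihoods of fixed reference answers, it never inspects what the model would actually generate, which is one plausible reason it can diverge from human judgments of free-form answers.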
Problem Statement
TruthfulQA and most truthfulness tests are English-only. We need to know if LLM truthfulness holds across languages, how to evaluate it automatically across languages, and whether machine translation can cheaply scale truthfulness benchmarks.
Main Contribution
Professional translations of TruthfulQA into Basque, Catalan, Galician and Spanish (parallel dataset).
Large-scale evaluation of 12 open LLMs (base and instruct variants, 7B–70B parameters) on multilingual TruthfulQA.
Key Findings
LLM-as-a-Judge correlates better with humans than MC2.
The best instruct model (Gemma-2-27b-it) reached ~79% average truth accuracy as scored by the Judge-LLM.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Truth accuracy (Judge-LLM, average) | 79.0% | n/a | n/a | Professional translations (all 5 languages) | Table 4 reports Judge-LLM accuracy per language and an average of 79.0% | Table 4 |
| Judge-LLM vs human agreement (Cohen's kappa) | up to 0.75 | MC2 (lower agreement) | Judge > MC2 | 400-instance human comparison | Table 3 shows the Gemma-2-9b-inst judge trained on all data reaches kappa 0.74–0.75 | Table 3; Figure 1 |
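
The agreement figure in the results table is Cohen's kappa, which corrects raw agreement between two raters for the agreement expected by chance. A minimal sketch of the standard formula follows. The function name `cohen_kappa` and the toy labels are illustrative assumptions, not the paper's code or data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels, not from the paper: 1 = truthful, 0 = not truthful.
human = [1, 1, 0, 0, 1, 0, 1, 1]
judge = [1, 1, 0, 1, 1, 0, 1, 0]
kappa = cohen_kappa(human, judge)
```

In this toy case the raters agree on 6 of 8 items (0.75), but kappa is lower because both raters label most items as truthful, so some agreement is expected by chance.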
What To Try In 7 Days
Run a judge-style (LLM-as-a-Judge) evaluation on your multilingual QA outputs and compare to MC2.
Measure informativeness separately for base models to detect inflated truth scores.
Generate an MT version of your evaluation set and spot-check a sample vs professional translation.
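The second item above can be sketched as follows: report truthfulness and informativeness side by side, plus the share of answers that are both. The function name `truth_info_rates`, the input layout, and the toy judgments are assumptions for illustration; the labels would come from human raters or a judge model.

```python
def truth_info_rates(judgments):
    """Summarize per-answer (truthful, informative) boolean judgments."""
    if not judgments:
        raise ValueError("need at least one judgment")
    n = len(judgments)
    return {
        "truthful": sum(t for t, _ in judgments) / n,
        "informative": sum(i for _, i in judgments) / n,
        # Answers that are truthful without dodging the question.
        "truthful_and_informative": sum(t and i for t, i in judgments) / n,
    }


# Toy judgments, not from the paper: an evasive reply such as
# "no comment" counts as truthful but not informative.
base_model = [(True, False), (True, False), (True, True), (False, True)]
rates = truth_info_rates(base_model)
```

Here the truthful rate is 0.75 while only 0.25 of answers are both truthful and informative, which is the kind of gap that signals an inflated truth score.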
Risks & Boundaries
Limitations
Languages cover only English plus four Iberian languages, limiting global generalization.
Judge-LLM is strong but does not fully replace nuanced human review.
When Not To Use
For languages that lack high-quality MT or are strongly underrepresented in training data, unless results are validated first.
When subtle, localized, legal, or time-sensitive facts require human-level adjudication.
Failure Modes
Judge-LLM may miss subtle factual nuances that human raters catch.
Base models outputting 'no comment' inflate MC-style truth scores.

