Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Truthfulness needs to be checked in each language you serve: models can be less accurate in low-resource languages, and simple automatic scores such as MC2 can mislead decisions. LLM-as-a-Judge evaluation and machine-translated benchmark datasets reduce cost and improve cross-lingual assessment.
Summary TLDR
The authors professionally translated the 817-question TruthfulQA benchmark to Basque, Catalan, Galician and Spanish and evaluated 12 open LLMs (Llama 3/3.1 and Gemma 2 families) using human labels, multiple-choice scoring (MC2), and LLM-as-a-Judge. Key findings: English answers are most detailed and often most truthful; Basque (lowest-resource) is worst, but overall language gaps are smaller than expected; LLM-as-a-Judge agrees better with humans than MC2; instruction-tuned models are far more informative; high-quality machine translation produces similar evaluation results to professional translation. Data and code are published.
Problem Statement
TruthfulQA and most truthfulness tests are English-only. We need to know if LLM truthfulness holds across languages, how to evaluate it automatically across languages, and whether machine translation can cheaply scale truthfulness benchmarks.
Main Contribution
Professional translations of TruthfulQA into Basque, Catalan, Galician and Spanish (parallel dataset).
Large-scale evaluation of 12 open LLMs (base and instruct variants spanning 7B–70B parameters) on multilingual TruthfulQA.
Comparison of three evaluation methods (human annotation, multiple-choice MC2, and LLM-as-a-Judge), showing that LLM-as-a-Judge agrees more closely with human judgments than MC2 does.
Empirical comparison of professional vs machine-translated datasets and analysis of universal vs time/context-dependent questions.
Public release of datasets, models and code under open licenses.
Key Findings
LLM-as-a-Judge correlates better with humans than MC2.
The best instruction-tuned model (Gemma-2-27b-it) reached about 79% average truthfulness accuracy as scored by the Judge-LLM.
Instruction tuning increases informativeness and truthfulness.
Base models often give uninformative replies that inflate MC truth scores.
Universal (time-independent) questions are much easier than time/context-dependent ones.
Machine translation produces statistically similar evaluation outcomes to professional translation for this dataset.
Results
Accuracy
Judge-LLM vs human agreement (Cohen Kappa)
Accuracy
Informativeness (base models average non-English)
Who Should Care
What To Try In 7 Days
Run a judge-style (LLM-as-a-Judge) evaluation on your multilingual QA outputs and compare to MC2.
Measure informativeness separately for base models to detect inflated truth scores.
Generate an MT version of your evaluation set and spot-check a sample vs professional translation.
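The comparison in the first step can be sketched as follows: compute Cohen's kappa between human labels and each automatic method, then see which method agrees more with the humans. This is a minimal sketch, not the authors' code. The label lists are hypothetical toy data, and turning MC2 scores into binary labels with a threshold is an assumption made here for illustration only.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical toy labels (1 = truthful, 0 = not truthful); not paper data.
human = [1, 1, 0, 0, 1, 0, 1, 1]
judge = [1, 1, 0, 0, 1, 0, 0, 1]
# Assumption: MC2 scores were binarized with some threshold beforehand.
mc2_thresholded = [1, 0, 1, 0, 1, 1, 0, 1]

kappa_judge = cohen_kappa(human, judge)
kappa_mc2 = cohen_kappa(human, mc2_thresholded)
print(f"judge vs human: {kappa_judge:.2f}, MC2 vs human: {kappa_mc2:.2f}")
```

Running the same function per language makes it easy to check whether agreement with human labels drops for lower-resource languages.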
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Coverage is limited to English and four Iberian languages, which limits global generalization.
- Judge-LLM is strong but does not fully replace nuanced human review.
- TruthfulQA is static; it lacks regional and temporal updates that matter for many falsehoods.
- MT results depend on availability of a strong MT model for the target language and the text genre.
When Not To Use
- For languages that lack high-quality MT or are strongly underrepresented in training data, unless results are validated first.
- When subtle, localized, or legal time-sensitive facts require human-level adjudication.
- As the sole evaluation method for deployed, high-stakes systems without a human audit.
Failure Modes
- Judge-LLM may miss subtle factual nuances that human raters catch.
- Base models outputting 'no comment' inflate MC-style truth scores.
- MT can introduce synonyms or small changes that alter truth labels in edge cases.
- Low-resource languages suffer from comprehension errors and cultural mismatches.
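The 'no comment' failure mode above can be made visible by reporting truthfulness and informativeness side by side, plus the share of answers that are both. The sketch below is illustrative only: the record fields (`truthful`, `informative`) and the toy answers are hypothetical, not taken from the released data.

```python
def summarize(records):
    """Return the rates of truthful, informative, and truthful-and-informative answers."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    truthful = sum(r["truthful"] for r in records) / n
    informative = sum(r["informative"] for r in records) / n
    both = sum(r["truthful"] and r["informative"] for r in records) / n
    return {
        "truthful": truthful,
        "informative": informative,
        "truthful_and_informative": both,
    }

# Hypothetical labels for a base model that often declines to answer.
base_model = [
    {"answer": "I have no comment.", "truthful": True, "informative": False},
    {"answer": "I have no comment.", "truthful": True, "informative": False},
    {"answer": "A correct, specific answer.", "truthful": True, "informative": True},
    {"answer": "A confident misconception.", "truthful": False, "informative": True},
]

report = summarize(base_model)
print(report)
```

In this toy case the truthful rate looks high while the truthful-and-informative rate is low, which is the pattern to watch for when a truth score alone seems too good.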
Core Entities
Models
- Llama-3
- Llama-3.1
- Gemma-2
- Llama-2-7B
- Gemma-2-9b
Metrics
- MC2 (multiple-choice scoring)
- LLM-as-a-Judge
- Cohen Kappa
- Informativeness (binary)
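For reference, MC2 in the original TruthfulQA setup is the normalized total probability a model assigns to the set of true reference answers. The sketch below assumes you already have one log-probability per reference answer; the function name and toy values are hypothetical, and the exact scoring details of any particular evaluation harness may differ.

```python
import math

def mc2_score(true_logprobs, false_logprobs):
    """Share of probability mass placed on the true reference answers."""
    if not true_logprobs or not false_logprobs:
        raise ValueError("need at least one true and one false reference answer")
    true_mass = sum(math.exp(lp) for lp in true_logprobs)
    false_mass = sum(math.exp(lp) for lp in false_logprobs)
    return true_mass / (true_mass + false_mass)

# Hypothetical log-probabilities for one question's reference answers.
true_lps = [math.log(0.2), math.log(0.1)]
false_lps = [math.log(0.3)]

score = mc2_score(true_lps, false_lps)
print(f"MC2 = {score:.2f}")
```

Because the score depends only on relative probabilities over fixed reference answers, it says nothing about whether a free-form generation would be informative, which is one reason to pair it with judge-style evaluation.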
Datasets
- TruthfulQA (original English)
- TruthfulQA (professional translations: Basque, Catalan, Galician, Spanish)
- TruthfulQA (machine-translated via Claude 3.5 Sonnet)
Benchmarks
- TruthfulQA
- VeritasQA
- SimpleQA

