Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Users working in languages other than English face a higher truthfulness risk: Filipino outputs are about 10 to 12 percentage points less accurate than English outputs on TruthfulQA-style checks, so companies should validate models in local languages before deployment.
Summary TLDR
The authors release KatotohananQA, a human-verified Filipino translation of the binary-choice TruthfulQA (790 questions). They evaluate seven free-tier proprietary LLMs and find a consistent drop in truthfulness when moving from English to Filipino: average accuracy is 94.72% in English versus 83.87% in Filipino, a mean gap of about 10.85 percentage points. GPT-5 matched its English performance in Filipino, while some models (DeepSeek V3, Gemini 2.5 Flash) lost more than 25 percentage points. The gaps are larger on adversarial and reasoning-style questions. The dataset and code are publicly available.
Problem Statement
TruthfulQA measures model truthfulness but exists only in English. Low-resource languages such as Filipino are underrepresented in pretraining data, so the authors create a Filipino parallel of the benchmark to measure how well model truthfulness transfers across languages.
Main Contribution
KatotohananQA: human-verified Filipino parallel of TruthfulQA (790 binary-choice items).
Evaluation of seven free-tier proprietary LLMs, comparing their truthfulness in English and in Filipino.
Analysis by question type, category, and topic with statistical tests (McNemar, Cohen's g).
Public release of the dataset and evaluation prompts for reproducibility.
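The paired comparison behind McNemar's test and Cohen's g can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the exact (binomial) form of McNemar's test, which the summary does not specify, and the function name `mcnemar_exact` and the toy data are hypothetical.

```python
from math import comb


def mcnemar_exact(english_correct, filipino_correct):
    """Paired test on per-question correctness for one model.

    Both arguments are equal-length sequences of booleans, one entry per
    question, aligned so index i refers to the same question in each language.
    Returns (b, c, p_value, g).
    """
    pairs = list(zip(english_correct, filipino_correct))
    # Discordant pairs: only these carry information about a language gap.
    b = sum(1 for e, f in pairs if e and not f)  # right in English only
    c = sum(1 for e, f in pairs if f and not e)  # right in Filipino only
    n = b + c
    if n == 0:
        return b, c, 1.0, 0.0  # no disagreement, nothing to test
    # Exact two-sided binomial test with success probability 0.5.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    # Cohen's g: share of discordant pairs in the larger cell, minus 0.5.
    g = max(b, c) / n - 0.5
    return b, c, p_value, g


# Toy data (hypothetical, not from the paper): six paired questions.
english = [True, True, True, True, False, True]
filipino = [True, False, False, True, True, False]
print(mcnemar_exact(english, filipino))  # (3, 1, 0.625, 0.25)
```

Only questions where the two languages disagree affect the result, which is why a paired test suits a translated benchmark better than comparing two independent accuracy figures.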
Key Findings
Models are less truthful in Filipino on average.
Model robustness varies widely by vendor and release.
Adversarial questions show a larger drop from English to Filipino than non-adversarial ones.
Certain categories/topics show much larger gaps.
Differences are statistically significant in most groupings.
Results
Accuracy
Largest per-model drop
Model with no drop
Adversarial vs Non-Adversarial
Categories with largest gap
Statistical significance summary
Who Should Care
What To Try In 7 Days
Run KatotohananQA on your models to measure Filipino truthfulness.
Compare English vs Filipino responses to spot large language gaps.
If gaps appear, prefer newer multilingual models or add local data/filters for high-risk question types.
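The first two steps above might look like the sketch below. It assumes you have already collected one record per question and language from your own evaluation run; the record layout (`qid`, `lang`, `pred`, `gold`), the language codes, and the function name `language_gap` are illustrative assumptions, not the format of the released dataset.

```python
def language_gap(records):
    """Compare accuracy on the same questions answered in English and Filipino.

    records: iterable of dicts with hypothetical keys
      'qid'  - question identifier shared across both languages
      'lang' - 'en' or 'fil'
      'pred' - the model's parsed choice
      'gold' - the correct choice
    """
    correct = {"en": {}, "fil": {}}
    for r in records:
        correct[r["lang"]][r["qid"]] = r["pred"] == r["gold"]

    # Score only questions answered in both languages so the gap is paired.
    shared = sorted(set(correct["en"]) & set(correct["fil"]))
    if not shared:
        raise ValueError("no questions were answered in both languages")

    acc_en = 100 * sum(correct["en"][q] for q in shared) / len(shared)
    acc_fil = 100 * sum(correct["fil"][q] for q in shared) / len(shared)
    # Questions right in English but wrong in Filipino are the ones to review.
    regressions = [q for q in shared if correct["en"][q] and not correct["fil"][q]]
    return {
        "acc_en": acc_en,
        "acc_fil": acc_fil,
        "gap_points": acc_en - acc_fil,  # percentage points
        "regressions": regressions,
    }
```

Reviewing the `regressions` list by question type or topic is one way to find the high-risk areas mentioned in the third step.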
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Binary-choice format limits evaluation of open-ended generation.
- Translation retains Western context; cultural mismatch may affect answers.
- Evaluated only seven proprietary models; not representative of all models.
- Dialectal diversity of Filipino not covered.
When Not To Use
- When you need open-ended hallucination analysis rather than binary judgments.
- When evaluating dialect-specific Filipino behavior.
- When assessing models other than the seven proprietary models tested.
Failure Modes
- Translation artifacts changing question meaning.
- Model skew toward English caused by scarce Filipino pretraining data.
- Binary forced-choice causing format-based heuristics.
- Judge bias from automatic parsing of single-letter answers.
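One way to limit the parsing bias in the last item is to accept only unambiguous single-letter replies and count everything else separately instead of guessing. The sketch below is an assumption about how that could be done, not the authors' parser; the option labels `A` and `B` and the function name `parse_choice` are hypothetical.

```python
import re

# Accept only a bare option letter, optionally wrapped in parentheses
# and optionally followed by a period or colon. Nothing else may appear.
_CHOICE = re.compile(r"\(?([AB])\)?[.:]?", re.IGNORECASE)


def parse_choice(response):
    """Return 'A' or 'B' for an unambiguous reply, otherwise None."""
    match = _CHOICE.fullmatch(response.strip())
    return match.group(1).upper() if match else None


def unparsed_rate(responses):
    """Fraction of replies the strict parser refused to score."""
    if not responses:
        return 0.0
    return sum(parse_choice(r) is None for r in responses) / len(responses)
```

Reporting the unparsed rate per language alongside accuracy makes it visible when a gap comes from output formatting rather than from the model's choice.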
Core Entities
Models
- GPT-5
- GPT-5 Mini
- Gemini 2.5 Pro
- Gemini 2.5 Flash
- DeepSeek R1
- DeepSeek V3
- Claude Sonnet 4
Metrics
- Accuracy
- McNemar's test
- Cohen's g
Datasets
- KatotohananQA
- TruthfulQA
Benchmarks
- TruthfulQA
Context Entities
Models
- OpenAI GPT series
- Google DeepMind Gemini series
- Anthropic Claude family
- DeepSeek models
Metrics
- binary-choice evaluation
Datasets
- Common Crawl (language stats referenced)
Benchmarks
- VeritasQA
- Uhura (related work)

