Overview
The dataset and experiments are concrete and reproducible from the prompts in the appendix, but the results cover only 8–9B models and show a strong bias toward English prompts. The scores below reflect practical usefulness for evaluation and limited generality to larger models.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 25%
Production readiness: 40%
Novelty: 30%
Why It Matters For Business
You cannot assume multilingual models reason equally well in non-English languages. For product features that rely on conceptual reasoning (search, question answering, exam prep), prompt language and translation choices materially change accuracy and safety.
Who Should Care
Summary TLDR
The authors built HATS, a 405-question Hindi analogy benchmark drawn from Indian exams, and tested three open multilingual LLMs (Aya-expanse-8B, Llama-3.1-8B, Gemma-2-9B). Models perform best when prompts are in English. A grounded Chain-of-Thought (CoT) prompt and a translate-then-solve CoT both help, but reasoning directly in Hindi remains weaker and error-prone (mistranslation, phonetic confusions). The authors provide a link to the dataset.
Problem Statement
We lack native-language benchmarks to test whether multilingual LLMs can perform structured reasoning in Indic languages. Without such tests, we don't know if models truly generalize reasoning beyond English.
Main Contribution
HATS: a new Hindi Analogy Test Set of 405 multiple-choice semantic analogies sourced from Indian government exams (UPSC, SSC, PSC, Railway, Banking, etc.).
A benchmark of three multilingual LLMs (Aya-expanse-8B, Llama-3.1-8B, Gemma-2-9B) over multiple prompting styles (Hindi-only, English-only, mixed) and tasks (forced-choice, 0-shot, CoT, grounded CoT, few-shot with translation).
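The prompting styles listed above can be illustrated with a small prompt builder. This is a minimal sketch, not the paper's appendix prompts: the instruction wording, the `AnalogyItem` type, the `build_prompt` function, and the reading of "mixed" as English plus Hindi instructions around a Hindi question are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnalogyItem:
    """One multiple-choice analogy of the form a : b :: c : ? (hypothetical type)."""
    a: str
    b: str
    c: str
    options: List[str]


# Illustrative instruction wording only; the paper's actual prompts are in its appendix.
INSTRUCTIONS = {
    "english": "Complete the analogy. Answer with the letter of the correct option.",
    "hindi": "सादृश्य पूरा कीजिए। सही विकल्प का अक्षर लिखिए।",
}


def build_prompt(item: AnalogyItem, style: str) -> str:
    """Build a prompt whose instruction language depends on `style`.

    Assumption: the analogy itself stays in Hindi in every style, and
    "mixed" simply shows both instruction lines.
    """
    if style == "hindi":
        header = INSTRUCTIONS["hindi"]
    elif style == "english":
        header = INSTRUCTIONS["english"]
    elif style == "mixed":
        header = INSTRUCTIONS["english"] + "\n" + INSTRUCTIONS["hindi"]
    else:
        raise ValueError(f"unknown style: {style}")

    letters = "ABCDEFGH"
    option_lines = [f"{letters[i]}. {opt}" for i, opt in enumerate(item.options)]
    question = f"{item.a} : {item.b} :: {item.c} : ?"
    return "\n".join([header, question, *option_lines])
```

Keeping the question text fixed and varying only the instruction makes it easier to attribute an accuracy change to prompt language rather than to a change in the item itself.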
Key Findings
English-only prompts give the best accuracy across models and settings.
Gemma-2-9B achieved the highest single result: grounded 0-shot CoT in English reached 79.75% accuracy on HATS.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | LLaMA 46.17%; Aya 42.96%; Gemma 43.20% | - | - | HATS (all samples) | Table 1: Task A results | Table 1 |
| Accuracy | Gemma 79.75%; LLaMA 74.56%; Aya highest 65.67% (0-shot) | various 0-shot/CoT baselines (see Table 2) | - | HATS (valid analogies) | Table 2; Sec 3.6 | Table 2 |
What To Try In 7 Days
Run your critical Hindi examples through an English translate-then-solve pipeline and compare accuracy to native-Hindi prompting.
Add a small validation filter for named entities, place names, and other words prone to phonetic mistranslation (e.g., ईंट 'brick' being read as English 'eat').
Benchmark your target models on a held-out subset of HATS or similar in-domain examples before deployment.
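The comparison suggested in the steps above can be sketched as a small accuracy harness. This is an assumption-laden outline: `native` and `translated` stand in for whatever functions call your model with a native-Hindi prompt and with a translate-then-solve prompt, and the names `accuracy` and `compare_pipelines` are hypothetical, not from the paper.

```python
from typing import Callable, Dict, Sequence, Tuple

# (question text, gold option letter); the data format is an assumption.
Example = Tuple[str, str]
Predictor = Callable[[str], str]


def accuracy(predict: Predictor, examples: Sequence[Example]) -> float:
    """Fraction of examples where the predicted letter matches the gold letter."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(
        1
        for question, gold in examples
        if predict(question).strip().upper() == gold.strip().upper()
    )
    return correct / len(examples)


def compare_pipelines(
    native: Predictor, translated: Predictor, examples: Sequence[Example]
) -> Dict[str, float]:
    """Score both pipelines on the same examples and report the difference."""
    acc_native = accuracy(native, examples)
    acc_translated = accuracy(translated, examples)
    return {
        "native_hindi": acc_native,
        "translate_then_solve": acc_translated,
        "delta": acc_translated - acc_native,
    }
```

Scoring both pipelines on the same held-out examples keeps the delta interpretable; with a small sample, treat a difference of a few points as noise rather than evidence.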
Risks & Boundaries
Limitations
Only smaller models (8B–9B) were evaluated; the authors note that larger models may perform differently.
HATS contains exam-style multiple-choice analogies (405 items) and may not cover other reasoning types or informal language.
When Not To Use
Do not generalize HATS results to large-model families (e.g., 70B+) without new tests.
Do not use HATS as a training corpus; it was constructed to evaluate reasoning, not to fine-tune models.
Failure Modes
Mistranslation or phonetic confusion (example: ईंट 'brick' confused with English 'eat') leading to wrong answers.
Models identify A:B pairs but fail to transfer the same relation to C:D (broken relational mapping).
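The first failure mode above suggests a simple check on a model's English rationale. The sketch below assumes a hand-maintained table of Hindi words with their correct gloss and known English lookalikes; the `CONFUSABLES` table and `flag_phonetic_confusion` function are hypothetical, and only the ईंט/eat pair comes from the source.

```python
import re
from typing import List

# Hand-maintained table (an assumption): Hindi word -> correct English gloss
# plus English words it has been phonetically confused with.
CONFUSABLES = {
    "ईंट": {"gloss": "brick", "lookalikes": {"eat"}},
}


def flag_phonetic_confusion(hindi_question: str, english_rationale: str) -> List[str]:
    """Return Hindi words whose lookalike appears in the rationale without the gloss.

    This is a heuristic: it only catches pairs already listed in CONFUSABLES.
    """
    words = set(re.findall(r"[a-z]+", english_rationale.lower()))
    flagged = []
    for hindi_word, entry in CONFUSABLES.items():
        if hindi_word not in hindi_question:
            continue
        uses_gloss = entry["gloss"] in words
        uses_lookalike = bool(words & entry["lookalikes"])
        if uses_lookalike and not uses_gloss:
            flagged.append(hindi_word)
    return flagged
```

A flagged item is a candidate for manual review, not proof of an error: a rationale can legitimately mention a lookalike word for unrelated reasons.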

