A new Hindi analogy test (HATS) shows multilingual LLMs reason better when prompted in English and still make language-specific mistakes.

July 17, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The dataset and experiments are concrete, and the prompts in the appendix make them reproducible, but the results cover only 8–9B models and show a strong bias toward English prompts. The scores are useful for practical evaluation but may not generalize to larger models.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 25%

Production readiness: 40%

Novelty: 30%

Authors

Ashray Gupta, Rohan Joseph, Sunny Rai

Links

Abstract / PDF / Data

Why It Matters For Business

You cannot assume multilingual models reason equally well in non-English languages. For product features that rely on conceptual reasoning (search, question answering, exam prep), prompt language and translation choices materially change accuracy and safety.

Who Should Care

Summary TLDR

The authors built HATS, a 405-question Hindi analogy benchmark drawn from Indian exams, and tested three open multilingual LLMs (Aya-expanse-8B, Llama-3.1-8B, Gemma-2-9B). Models perform best when prompts are in English. A grounded Chain-of-Thought (CoT) prompt and a translate-then-solve CoT both help, but reasoning directly in Hindi remains weaker and error-prone (mistranslation, phonetic confusions). The authors provide a link to the dataset.

Problem Statement

We lack native-language benchmarks to test whether multilingual LLMs can perform structured reasoning in Indic languages. Without such tests, we don't know if models truly generalize reasoning beyond English.

Main Contribution

HATS: a new Hindi Analogy Test Set of 405 multiple-choice semantic analogies sourced from Indian government exams (UPSC, SSC, PSC, Railway, Banking, etc.).

A benchmark of three multilingual LLMs (Aya-expanse-8B, Llama-3.1-8B, Gemma-2-9B) over multiple prompting styles (Hindi-only, English-only, mixed) and tasks (forced-choice, 0-shot, CoT, grounded CoT, few-shot with translation).
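The evaluation setup described above can be pictured as a grid of model, prompt language, and prompting strategy. The sketch below only enumerates that grid. The identifier strings are illustrative labels chosen here, and the full cross product is an assumption: the paper may not have run every combination.

```python
from itertools import product

# Labels are illustrative; only the model names come from the summary above.
MODELS = ["aya-expanse-8B", "llama-3.1-8B", "gemma-2-9B-it"]
PROMPT_LANGUAGES = ["hindi_only", "english_only", "mixed"]
STRATEGIES = [
    "forced_choice",
    "zero_shot",
    "cot",
    "grounded_cot",
    "few_shot_translation",
]


def build_grid():
    """Return one config dict per (model, prompt language, strategy) triple."""
    return [
        {"model": model, "prompt_language": language, "strategy": strategy}
        for model, language, strategy in product(MODELS, PROMPT_LANGUAGES, STRATEGIES)
    ]


grid = build_grid()
print(len(grid))  # 3 models x 3 languages x 5 strategies = 45
```

Keeping the grid explicit makes it easy to record which cells were actually run and which were skipped.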

Key Findings

English-only prompts give the best accuracy across models and settings.

Numbers: Table 2: English-only top scores up to 79.75%

Practical Use: When building NLP systems that rely on reasoning in low-resource languages, try English prompts or a translate-then-solve pipeline to boost correctness quickly.

Evidence Ref: Table 2; Sec 3.6

Gemma-2-9B achieved the highest single result: grounded 0-shot CoT in English reached 79.75% accuracy on HATS.

Numbers: 79.75% (Gemma, Grounded 0-Shot CoT, En+En)

Practical Use: Prefer stronger multilingual models like Gemma and grounded CoT when accuracy on Hindi conceptual reasoning matters; expect up to ~80% on this test set with En prompts.

Evidence Ref: Sec 3.6; Table 2
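The translate-then-solve suggestion in the first finding can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the prompt wording, the `AnalogyItem` fields, and the `translate` and `generate` callables are all hypothetical, and the Hindi example is made up here rather than taken from HATS. The paper's actual prompts are in its appendix.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

LETTERS = "ABCDEFGH"


@dataclass
class AnalogyItem:
    question: str       # Hindi analogy stem
    options: List[str]  # Hindi answer options


def build_english_cot_prompt(question_en: str, options_en: List[str]) -> str:
    """Assemble a hypothetical English chain-of-thought prompt."""
    lines = [
        "Solve the analogy. Think step by step, then finish with 'Answer: <letter>'.",
        "",
        f"Question: {question_en}",
    ]
    for letter, option in zip(LETTERS, options_en):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)


def parse_choice(model_output: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in the output, or None if absent."""
    matches = re.findall(r"Answer:\s*([A-H])", model_output)
    return matches[-1] if matches else None


def translate_then_solve(
    item: AnalogyItem,
    translate: Callable[[str], str],  # assumed: Hindi text -> English text
    generate: Callable[[str], str],   # assumed: prompt -> raw model output
) -> Optional[str]:
    question_en = translate(item.question)
    options_en = [translate(option) for option in item.options]
    prompt = build_english_cot_prompt(question_en, options_en)
    return parse_choice(generate(prompt))


# Toy stand-ins so the sketch runs without a translator or a model.
TOY_GLOSSARY = {"ईंट": "brick", "दीवार": "wall", "शब्द": "word", "वाक्य": "sentence", "पानी": "water"}


def toy_translate(text: str) -> str:
    return " ".join(TOY_GLOSSARY.get(token, token) for token in text.split())


def fake_generate(prompt: str) -> str:
    return "Bricks make up a wall; words make up a sentence. Answer: A"


item = AnalogyItem(question="ईंट : दीवार :: शब्द : ?", options=["वाक्य", "पानी"])
print(translate_then_solve(item, toy_translate, fake_generate))  # A
```

Passing the translator and the model in as callables keeps the pipeline independent of any particular translation service or inference library, so the same function can be reused when swapping models.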

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | LLaMA 46.17%; Aya 42.96%; Gemma 43.20% | | | HATS (all samples) | Table 1: Task A results | Table 1
Accuracy | Gemma 79.75%; LLaMA 74.56%; Aya highest 65.67% (0-shot) | various 0-Shot/CoT baselines (see Table 2) | | HATS (valid analogies) | Table 2; Sec 3.6 | Table 2
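The Delta column carries no values for these rows. As a rough worked example, the gap between each model's Task A score and its best reported Table 2 score follows directly from the numbers above. The two rows use different splits (all samples versus valid analogies), so these gaps are indicative rather than a controlled comparison.

```python
# Scores copied from the two result rows above (percent accuracy).
task_a = {"LLaMA": 46.17, "Aya": 42.96, "Gemma": 43.20}
best_table_2 = {"LLaMA": 74.56, "Aya": 65.67, "Gemma": 79.75}

# Gap in percentage points, rounded to avoid floating-point noise.
gaps = {model: round(best_table_2[model] - task_a[model], 2) for model in task_a}

for model, gap in gaps.items():
    print(f"{model}: +{gap:.2f} points")
# LLaMA: +28.39 points
# Aya: +22.71 points
# Gemma: +36.55 points
```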

What To Try In 7 Days

Run your critical Hindi examples through an English translate-then-solve pipeline and compare accuracy to native-Hindi prompting.

Add a small validation filter for named entities and place names to catch phonetic mistranslation (e.g., ईंट 'brick' rendered as English 'eat').

Benchmark your target models on a held-out subset of HATS or similar in-domain examples before deployment.
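The comparison and held-out benchmarking steps above could be scripted roughly as follows. This is a sketch under stated assumptions: predictions and gold labels are plain lists of option letters, the function names are invented here, and the Wilson score interval is a standard way to show how uncertain an accuracy is on a few hundred items, not something the paper reports.

```python
import math
import random


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def held_out_indices(n_items, k, seed=0):
    """Pick k item indices reproducibly for a held-out subset."""
    return sorted(random.Random(seed).sample(range(n_items), k))


def compare(native_preds, pipeline_preds, gold):
    """Accuracy of both setups plus counts of items only one of them solves."""
    rows = list(zip(native_preds, pipeline_preds, gold))
    return {
        "native_accuracy": accuracy(native_preds, gold),
        "pipeline_accuracy": accuracy(pipeline_preds, gold),
        "native_only_correct": sum(n == g and p != g for n, p, g in rows),
        "pipeline_only_correct": sum(p == g and n != g for n, p, g in rows),
    }


# Toy data for illustration only.
gold = list("ABCDA")
native = list("ABDDB")
pipeline = list("ABCDB")
report = compare(native, pipeline, gold)
print(report)
print(wilson_interval(correct=4, total=5))
```

The items that only one setup solves are worth reading individually, since they show where translation helps and where it introduces errors.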

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Only smaller model sizes (8B–9B) were evaluated; authors note larger models may perform differently.

HATS contains exam-style multiple-choice analogies (405 items) and may not cover other reasoning types or informal language.

When Not To Use

Do not generalize HATS results to large-model families (e.g., 70B+) without new tests.

Do not use HATS as a training corpus; it was constructed to evaluate reasoning, not to fine-tune models.

Failure Modes

Mistranslation or phonetic confusion (example: ईंट 'brick' confused with English 'eat') leading to wrong answers.

Models identify A:B pairs but fail to transfer the same relation to C:D (broken relational mapping).
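For the first failure mode, one inexpensive guard is a glossary check on the translated text before it reaches the solver. The sketch below is an assumption-laden illustration: the glossary, the function name, and the example strings are made up here, and a real filter would need a much larger term list and handling for multi-word glosses.

```python
import re


def flag_suspect_terms(source_hi, translation_en, glossary):
    """Return Hindi terms found in the source whose accepted English
    glosses (single words) are all missing from the translation."""
    translation_words = set(re.findall(r"[a-z]+", translation_en.lower()))
    flagged = []
    for term, glosses in glossary.items():
        if term in source_hi and not any(g.lower() in translation_words for g in glosses):
            flagged.append(term)
    return flagged


# Hypothetical glossary of terms prone to phonetic confusion.
glossary = {"ईंट": ["brick"], "दीवार": ["wall"]}

print(flag_suspect_terms("ईंट : दीवार", "eat : wall", glossary))    # ['ईंट']
print(flag_suspect_terms("ईंट : दीवार", "brick : wall", glossary))  # []
```

Flagged items can be routed to a fallback, such as native-Hindi prompting or manual review, instead of being answered from a corrupted translation.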

Core Entities

Models

aya-expanse-8B, llama-3.1-8B, gemma-2-9B-it

Metrics

Accuracy

Datasets

HATS (Hindi Analogy Test Set, 405 items)

Benchmarks

Accuracy