A new Hindi analogy test (HATS) shows multilingual LLMs reason better when prompted in English and still make language-specific mistakes.

July 17, 2025 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.3

Cost Impact Score

0.25

Citation Count

0

Authors

Ashray Gupta, Rohan Joseph, Sunny Rai

Why It Matters For Business

You cannot assume multilingual models reason equally well in non-English languages. For product features that rely on conceptual reasoning (search, question answering, exam prep), prompt language and translation choices materially change accuracy and safety.

Summary TLDR

The authors built HATS, a 405-question Hindi analogy benchmark drawn from Indian exams, and tested three open multilingual LLMs (Aya-expanse-8B, Llama-3.1-8B, Gemma-2-9B). Models perform best when prompts are in English. A grounded Chain-of-Thought (CoT) prompt and a translate-then-solve CoT both help, but reasoning directly in Hindi remains weaker and error-prone (mistranslation, phonetic confusions). The authors provide a link to the dataset.

Problem Statement

We lack native-language benchmarks to test whether multilingual LLMs can perform structured reasoning in Indic languages. Without such tests, we don't know if models truly generalize reasoning beyond English.

Main Contribution

HATS: a new Hindi Analogy Test Set of 405 multiple-choice semantic analogies sourced from Indian government exams (UPSC, SSC, PSC, Railway, Banking, etc.).

A benchmark of three multilingual LLMs (Aya-expanse-8B, Llama-3.1-8B, Gemma-2-9B) over multiple prompting styles (Hindi-only, English-only, mixed) and tasks (forced-choice, 0-shot, CoT, grounded CoT, few-shot with translation).

A grounded Chain-of-Thought (CoT) prompt inspired by cognitive theories (abduction, inductive mapping, adequacy evaluation) that modestly improves analogy accuracy.

Analysis of failure modes including mistranslation, phonetic confusion, instruction-following gaps in Hindi, and default 'I don't know' outputs.
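
The grounded CoT contribution above names three reasoning stages (abduction, inductive mapping, adequacy evaluation). A minimal sketch of how such a prompt could be assembled follows. The step wording, the `AnalogyItem` type, the `build_grounded_cot_prompt` helper, and the sample analogy are all illustrative assumptions; none of them is the authors' actual prompt text or an item from HATS.

```python
from dataclasses import dataclass


@dataclass
class AnalogyItem:
    """One A : B :: C : ? question with multiple-choice options (hypothetical type)."""
    a: str
    b: str
    c: str
    options: list[str]


# Assumed wording for the three stages; the paper's real prompt is not reproduced here.
GROUNDED_STEPS = [
    "Abduction: state the most plausible relation linking A to B.",
    "Inductive mapping: apply that same relation to C to predict D.",
    "Adequacy evaluation: check each option against the relation and keep the best fit.",
]


def build_grounded_cot_prompt(item: AnalogyItem) -> str:
    """Return an English-instruction prompt that walks the model through the stages."""
    labels = "ABCDEFGH"
    option_lines = [f"({labels[i]}) {opt}" for i, opt in enumerate(item.options)]
    step_lines = [f"{i}. {step}" for i, step in enumerate(GROUNDED_STEPS, start=1)]
    return "\n".join([
        f"Analogy: {item.a} : {item.b} :: {item.c} : ?",
        "Options:",
        *option_lines,
        "Reason step by step:",
        *step_lines,
        "Answer with the option letter only on the final line.",
    ])


# Invented example item (bird : nest :: bee : ?), not taken from HATS.
sample = AnalogyItem("पक्षी", "घोंसला", "मधुमक्खी", ["छत्ता", "गुफा", "बिल", "अस्तबल"])
prompt = build_grounded_cot_prompt(sample)
print(prompt)
```

Keeping the instructions in English while leaving the analogy terms in Hindi mirrors the mixed prompting style the benchmark compares against Hindi-only and English-only prompts.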

Key Findings

English-only prompts give the best accuracy across models and settings.

Numbers: Table 2 reports English-only top scores of up to 79.75%

Gemma-2-9B achieved the highest single result: grounded 0-shot CoT in English reached 79.75% accuracy on HATS.

Numbers: 79.75% (Gemma, Grounded 0-Shot CoT, En+En)

The smaller 8B models score lower, and Hindi prompts reduce performance further; Llama-3.1-8B reached 74.56% at best, while Aya-expanse-8B peaked at 65.67% on English prompts.

Numbers: LLaMA 74.56% (best); Aya 65.67% (best)

Grounded CoT gave only a small average bump over baseline CoT across experiments.

Numbers: +0.27 percentage points average improvement

Language/translation-specific failures exist, including consistent phonetic mistranslation errors.

Numbers: Example phonetic error observed in all 10 sampled failures

Results

Accuracy

Value: LLaMA 46.17% | Aya 42.96% | Gemma 43.20%

Accuracy

Value: Gemma 79.75% | LLaMA 74.56% | Aya 65.67% (its highest, 0-shot)

Baseline: various 0-Shot/CoT baselines (see Table 2)

Average improvement: Grounded CoT vs 0-Shot CoT

Value: +0.27 percentage points

Baseline: 0-Shot CoT

Model performance gap (Gemma vs others)

Value: Gemma average +11.46 points across tasks

Baseline: other models
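
To put the figures above in perspective, it can help to convert percentages into item counts on the 405-question test set and attach a rough uncertainty band. The sketch below does this with plain arithmetic and a standard Wilson score interval. The helper names are illustrative, and the interval is an addition for context, not something the authors report.

```python
import math

N_ITEMS = 405  # HATS size as stated in this summary


def approx_correct(accuracy_pct: float, n: int = N_ITEMS) -> float:
    """Convert an accuracy (or a percentage-point difference) into a number of items."""
    return accuracy_pct / 100 * n


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an accuracy, returned in percent."""
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return 100 * (centre - half), 100 * (centre + half)


best_items = approx_correct(79.75)   # about 323 of 405 items
delta_items = approx_correct(0.27)   # about 1 item
low, high = wilson_interval(round(best_items), N_ITEMS)
print(f"best run: ~{best_items:.0f} items, interval {low:.1f}% to {high:.1f}%")
print(f"+0.27 points is ~{delta_items:.1f} item(s) out of {N_ITEMS}")
```

On a single run of 405 items, a 0.27-point difference amounts to about one question, which is well inside the sampling noise of any one accuracy figure. This is consistent with the summary's description of the grounded CoT gain as modest.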

Who Should Care

What To Try In 7 Days

Run your critical Hindi examples through an English translate-then-solve pipeline and compare accuracy to native-Hindi prompting.

Add a small validation filter for named entities, place names, and other sound-alike words to catch phonetic mistranslation (e.g., ईंट 'brick' being read as English 'eat').

Benchmark your target models on a held-out subset of HATS or similar in-domain examples before deployment.
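
The comparison suggested in the steps above can be sketched as a small harness that scores a native-Hindi solver and a translate-then-solve pipeline on the same items. Everything here is a hypothetical scaffold: `Solver`, `translate_then_solve`, and `evaluate` are illustrative names, and the model and translation calls are left as plain callables that you would back with your own stack.

```python
from typing import Callable

# A solver takes the question text and returns the model's answer (an option letter).
Solver = Callable[[str], str]

# Answers counted as refusals rather than wrong picks (assumed phrasing).
REFUSALS = {"i don't know", "none of the above"}


def translate_then_solve(translate: Callable[[str], str], solve: Solver) -> Solver:
    """Compose a Hindi-to-English translator with an English-prompted solver."""
    return lambda question: solve(translate(question))


def evaluate(solver: Solver, items: list[tuple[str, str]]) -> dict[str, float]:
    """Score a solver on (question, gold_letter) pairs; both rates are in percent."""
    if not items:
        raise ValueError("items must not be empty")
    correct = refused = 0
    for question, gold in items:
        answer = solver(question).strip().lower()
        if answer in REFUSALS:
            refused += 1
        elif answer == gold.strip().lower():
            correct += 1
    n = len(items)
    return {"accuracy": 100 * correct / n, "refusal_rate": 100 * refused / n}


def compare(native: Solver, pipeline: Solver, items: list[tuple[str, str]]) -> float:
    """Return the pipeline's accuracy gain over native prompting, in percentage points."""
    return evaluate(pipeline, items)["accuracy"] - evaluate(native, items)["accuracy"]
```

Tracking the refusal rate separately makes it easier to see whether a low native-Hindi score comes from wrong picks or from the model declining to answer.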

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only smaller model sizes (8B–9B) were evaluated; authors note larger models may perform differently.
  • HATS contains exam-style multiple-choice analogies (405 items) and may not cover other reasoning types or informal language.
  • Some settings relied on English prompts or translation, so the reported accuracy conflates the model's reasoning with translation quality.
  • Authors report a small average improvement from grounded CoT (+0.27 points), so gains are modest in this setup.

When Not To Use

  • Do not generalize HATS results to large-model families (e.g., 70B+) without new tests.
  • Do not use HATS as a training corpus — it was constructed for evaluation of reasoning, not for model fine-tuning.
  • Avoid using grounded CoT alone to guarantee correct Hindi reasoning; combine with translation checks and validation.

Failure Modes

  • Mistranslation or phonetic confusion (example: ईंट 'brick' confused with English 'eat') leading to wrong answers.
  • Models identify A:B pairs but fail to transfer the same relation to C:D (broken relational mapping).
  • Models sometimes reply 'I don't know' or 'None of the above' even when a correct option is present.
  • Instruction-following gaps in Hindi prompts: some models perform worse with complex grounded CoT in Hindi.
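
The phonetic-confusion failure listed above can be screened for with a simple lookup check on a model's intermediate English translation. The sketch below is an assumption-heavy illustration: `CONFUSABLES` is a hypothetical, hand-maintained table seeded only with the ईंट example from this summary, and `flag_phonetic_confusions` is an illustrative helper, not part of the authors' tooling.

```python
import re
import unicodedata

# Hypothetical table: Hindi word -> (expected English gloss, sound-alike English words).
# Only the ईंट / "eat" pair comes from the summary; extend it with your own cases.
CONFUSABLES = {
    "ईंट": ("brick", {"eat"}),
}


def flag_phonetic_confusions(hindi_text: str, english_text: str) -> list[str]:
    """Return Hindi words whose translation looks like a sound-alike, not the gloss."""
    hindi_norm = unicodedata.normalize("NFC", hindi_text)
    english_words = set(re.findall(r"[a-z]+", english_text.lower()))
    flagged = []
    for hindi_word, (gloss, sound_alikes) in CONFUSABLES.items():
        present = unicodedata.normalize("NFC", hindi_word) in hindi_norm
        # Flag only when the gloss is missing and a sound-alike word appears instead.
        if present and gloss not in english_words and english_words & sound_alikes:
            flagged.append(hindi_word)
    return flagged


print(flag_phonetic_confusions("ईंट : दीवार", "eat : wall"))    # flags ईंट
print(flag_phonetic_confusions("ईंट : दीवार", "brick : wall"))  # flags nothing
```

A table-driven check like this only catches confusions you already know about, so it works best as a regression guard fed by manual error analysis.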

Core Entities

Models

  • aya-expanse-8B
  • llama-3.1-8B
  • gemma-2-9B-it

Metrics

  • Accuracy

Datasets

  • HATS (Hindi Analogy Test Set, 405 items)

Benchmarks

  • HATS