A new Hindi analogy test (HATS) shows multilingual LLMs reason better when prompted in English and still make language-specific mistakes.

Overview

Decision Snapshot: Needs Validation

The dataset and experiments are concrete and reproducible from the prompts in the appendix, but the results cover only 8–9B models and show a strong bias toward English prompts. The scores are practically useful for evaluation but may not generalize to larger models.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 25%

Production readiness: 40%

Novelty: 30%

Authors

Ashray Gupta, Rohan Joseph, Sunny Rai

Links

Abstract / PDF / Data

Why It Matters For Business

You cannot assume multilingual models reason equally well in non-English languages. For product features that rely on conceptual reasoning (search, question answering, exam prep), prompt language and translation choices materially change accuracy and safety.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

The authors built HATS, a 405-question Hindi analogy benchmark drawn from Indian exams, and tested three open multilingual LLMs (Aya-expanse-8B, Llama-3.1-8B, Gemma-2-9B). All three models perform best when prompts are in English. A grounded Chain-of-Thought (CoT) prompt and a translate-then-solve CoT both help, but reasoning directly in Hindi remains weaker and error-prone (mistranslation, phonetic confusions). The dataset is linked under Data URLs.

Problem Statement

We lack native-language benchmarks to test whether multilingual LLMs can perform structured reasoning in Indic languages. Without such tests, we don't know if models truly generalize reasoning beyond English.

Main Contribution

HATS: a new Hindi Analogy Test Set of 405 multiple-choice semantic analogies sourced from Indian government exams (UPSC, SSC, PSC, Railway, Banking, etc.).

A benchmark of three multilingual LLMs (Aya-expanse-8B, Llama-3.1-8B, Gemma-2-9B) over multiple prompting styles (Hindi-only, English-only, mixed) and tasks (forced-choice, 0-shot, CoT, grounded CoT, few-shot with translation).

Key Findings

English-only prompts give the best accuracy across models and settings.

Numbers: English-only top scores reach 79.75% (Table 2)

Practical Use: When building NLP systems that rely on reasoning in low-resource languages, try English prompts or a translate-then-solve pipeline to boost correctness quickly.

Evidence Ref: Table 2; Sec 3.6

Gemma-2-9B achieved the highest single result: grounded 0-shot CoT in English reached 79.75% accuracy on HATS.

Numbers: 79.75% (Gemma, Grounded 0-Shot CoT, En+En)

Practical Use: Prefer stronger multilingual models like Gemma and grounded CoT when accuracy on Hindi conceptual reasoning matters; expect up to ~80% on this test set with En prompts.

Evidence Ref: Sec 3.6; Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	LLaMA 46.17%; Aya 42.96%; Gemma 43.20%	n/a	n/a	HATS (all samples)	Table 1: Task A results	Table 1
Accuracy	Gemma 79.75%; LLaMA 74.56%; Aya 65.67% (its highest, 0-shot)	various 0-Shot/CoT baselines (see Table 2)	n/a	HATS (valid analogies)	Table 2; Sec 3.6	Table 2
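
The Delta column is empty for both rows, so the gaps between models have to be read off by hand. A minimal sketch of that arithmetic, using only the figures reported in the rows above; the dictionary and function names are illustrative, and it assumes the scores within a row are comparable even though each model's best score may come from a different prompt setting.

```python
# Accuracies (%) copied from the two Results rows; nothing here is new data.
TASK_A = {"LLaMA": 46.17, "Aya": 42.96, "Gemma": 43.20}        # Table 1, all samples
BEST_PROMPTED = {"Gemma": 79.75, "LLaMA": 74.56, "Aya": 65.67}  # Table 2, valid analogies


def gaps_to_best(scores):
    """Return each model's shortfall, in percentage points, from the row's top score."""
    best = max(scores.values())
    return {model: round(best - score, 2) for model, score in scores.items()}


if __name__ == "__main__":
    # The two rows use different splits, so gaps are computed per row only.
    print("Task A gaps:", gaps_to_best(TASK_A))
    print("Best-prompt gaps:", gaps_to_best(BEST_PROMPTED))
```

The rows are kept separate on purpose: Table 1 covers all samples and Table 2 only valid analogies, so subtracting one row from the other would mix splits.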

What To Try In 7 Days

Run your critical Hindi examples through an English translate-then-solve pipeline and compare accuracy to native-Hindi prompting.

Add a small validation filter for named entities, place names, and other terms prone to phonetic mistranslation (e.g., ईंט 'brick' rendered as English 'eat').

Benchmark your target models on a held-out subset of HATS or similar in-domain examples before deployment.
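
The comparison suggested in the steps above can be sketched as a small harness. This is an illustration under stated assumptions, not the authors' evaluation code: `AnalogyItem`, `Solver`, `Translator`, and `compare` are hypothetical names, and the solver and translator callables stand in for whatever model and translation calls your stack provides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class AnalogyItem:
    question: str            # analogy stem in Hindi
    options: Sequence[str]   # answer choices in Hindi
    answer_index: int        # index of the gold option


# Hypothetical interfaces: a solver returns the index of the option it picks.
Solver = Callable[[str, Sequence[str]], int]
Translator = Callable[[str], str]


def accuracy(items: Sequence[AnalogyItem], solver: Solver) -> float:
    """Percentage of items where the solver picks the gold option."""
    if not items:
        raise ValueError("need at least one item")
    correct = sum(solver(it.question, it.options) == it.answer_index for it in items)
    return 100.0 * correct / len(items)


def translate_then_solve(translate: Translator, solve_en: Solver) -> Solver:
    """Wrap an English solver so it accepts Hindi input via translation."""
    def solver(question: str, options: Sequence[str]) -> int:
        return solve_en(translate(question), [translate(o) for o in options])
    return solver


def compare(items: Sequence[AnalogyItem], solve_native: Solver,
            translate: Translator, solve_en: Solver) -> Dict[str, float]:
    """Score both pipelines on the same items and report the gap in points."""
    native = accuracy(items, solve_native)
    piped = accuracy(items, translate_then_solve(translate, solve_en))
    return {"native_hi": native,
            "translate_then_solve": piped,
            "delta_points": piped - native}
```

Running both pipelines over the same held-out items keeps the comparison paired, so the reported delta reflects the prompting strategy and not a difference in test examples.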

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://github.com/Inequilazitive/HATS-Hindi_Analogy_Test_Set

Risks & Boundaries

Limitations

Only smaller model sizes (8B–9B) were evaluated; authors note larger models may perform differently.

HATS contains exam-style multiple-choice analogies (405 items) and may not cover other reasoning types or informal language.

When Not To Use

Do not generalize HATS results to large-model families (e.g., 70B+) without new tests.

Do not use HATS as a training corpus — it was constructed for evaluation of reasoning, not for model fine-tuning.

Failure Modes

Mistranslation or phonetic confusion (example: ईंट 'brick' confused with English 'eat') leading to wrong answers.

Models identify A:B pairs but fail to transfer the same relation to C:D (broken relational mapping).
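
The first failure mode can be screened for with a glossary check on the translation step. The sketch below is an assumption-laden illustration: the glossary is a hypothetical, hand-maintained mapping seeded with the one example reported here, and `flag_suspect_translations` is an illustrative name, not part of HATS or any library.

```python
from typing import Dict, List, Set

# Hypothetical glossary: Hindi terms known to be mistranslated, mapped to
# acceptable English glosses. Only the ईंट entry comes from the source.
GLOSSARY: Dict[str, Set[str]] = {
    "ईंट": {"brick"},
}


def flag_suspect_translations(source_hi: str, translation_en: str,
                              glossary: Dict[str, Set[str]] = GLOSSARY) -> List[str]:
    """Return glossary terms found in the Hindi source whose expected
    English gloss does not appear anywhere in the translation."""
    lowered = translation_en.lower()
    return [
        term
        for term, glosses in glossary.items()
        if term in source_hi and not any(g in lowered for g in glosses)
    ]


if __name__ == "__main__":
    # A translation that renders ईंट as "eat" is flagged; "Brick" passes.
    print(flag_suspect_translations("ईंट : दीवार", "eat : wall"))
    print(flag_suspect_translations("ईंट : दीवार", "Brick : wall"))
```

Substring matching is deliberately loose so that inflected forms such as "bricks" still pass; a stricter check would need tokenization and is left out of this sketch.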

Core Entities

Models

aya-expanse-8B, llama-3.1-8B, gemma-2-9B-it

Metrics

Accuracy

Datasets

HATS (Hindi Analogy Test Set, 405 items)

Benchmarks

Accuracy
