Airavata: an open-source Hindi instruction-tuned LLM plus datasets and evaluations

January 26, 2024 | 7 min

Overview

Decision Snapshot: Needs Validation

The paper provides hands-on assets (model, data, code) and concrete evaluations; results are solid for NLU but limited for generation and high-stakes use without extra safety work.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.77

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 30%

Authors

Jay Gala, Thanmay Jayakumar, Jaavid Aktar Husain, Aswanth Kumar M, Mohammed Safi Ur Rahman Khan, Diptesh Kanojia, Ratish Puduppully, Mitesh M. Khapra, Raj Dabre, Rudra Murthy, Anoop Kunchukuttan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Airavata lowers the barrier to building Hindi language assistants by providing an open instruction-tuned model and data; use it for classification and assistant prototypes, but avoid high-stakes production without extra safety and factual checks.

Who Should Care

Summary TLDR

Airavata is an open-source, Hindi instruction-tuned model built by LoRA fine-tuning the OpenHathi foundation model on a curated Hindi/English instruction mix (≈385k filtered examples). The team releases the training mixture, evaluation suite, and code. Airavata improves many Hindi NLU tasks versus the OpenHathi base model, shows mixed results on open-ended generation and translation, and trails closed-source models (GPT-4) on instruction-following and factual quality. The work highlights gaps in cross-lingual transfer and reports toxicity and truthfulness tests.

Problem Statement

Existing LLM progress is English-centric. Hindi LLMs lack instruction-tuned models, diverse training data, and robust evaluation. This paper builds and evaluates a Hindi instruction-tuned model and releases datasets and benchmarks to bootstrap research.

Main Contribution

Release Airavata: a Hindi instruction-tuned model built by LoRA fine-tuning OpenHathi.

Publish an instruction-tuning mixture (≈404k raw, 385k filtered Hindi/English examples) and two native Hindi instruction datasets (wikiHow, Anudesh).

Key Findings

Instruction tuning substantially improves Hindi NLU on several benchmarks versus base OpenHathi.

Numbers: IndicXNLI 0-shot: OpenHathi 16.67 → Airavata 73.26 (+56.59) (Table 3)

Practical Use: If you need Hindi NLU accuracy, prefer an instruction-tuned Hindi model like Airavata over the OpenHathi base.

Evidence Ref: Table 3

Airavata improves sentiment and many NLU tasks but not translation or open-ended generation.

Numbers: IndicSentiment 0-shot: 72.89 → 95.81 (+22.92); Flores chrF++: 55.41 → 54.82 (-0.59) (Tables 3, 5)

Practical Use: Use Airavata for classification and comprehension tasks; keep a stronger translation/generation model for high-quality MT or long-form generation.

Evidence Ref: Tables 3 and 5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
IndicSentiment (0-shot) | Airavata 95.81 | OpenHathi 72.89 | +22.92 | IndicSentiment | Table 3: F1 | Table 3
IndicXNLI (0-shot) | Airavata 73.26 | OpenHathi 16.67 | +56.59 | IndicXNLI | Table 3: F1 | Table 3
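
The deltas quoted in this summary are plain point differences (Airavata score minus OpenHathi baseline). A minimal sketch for re-deriving them is shown here; the scores are copied from the figures above, while the `scores` layout and the `delta` helper are illustrative names, not part of the released code.

```python
# Scores copied from the figures quoted in this summary.
# Each entry: (metric, OpenHathi baseline, Airavata score)
scores = [
    ("IndicSentiment (0-shot)", 72.89, 95.81),
    ("IndicXNLI (0-shot)", 16.67, 73.26),
    ("Flores chrF++", 55.41, 54.82),
]


def delta(baseline: float, value: float) -> float:
    """Absolute change in points, rounded to two decimals."""
    return round(value - baseline, 2)


# Map each metric name to its point difference.
deltas = {name: delta(base, new) for name, base, new in scores}

for name, d in deltas.items():
    # The '+' format flag prints an explicit sign for gains and losses.
    print(f"{name}: {d:+.2f}")
```

The same helper can be pointed at your own baseline numbers when comparing a different base model against Airavata.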

What To Try In 7 Days

Download Airavata and run zero-shot classification on Hindi customer intents to compare vs existing pipelines.

Use the released IndicInstruct dataset to fine-tune or LoRA-adapt your base model for specific Hindi tasks.

Run the provided evaluation suite (IndicXTREME + toxicity tests) on your own models to identify Hindi weaknesses.
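
The first experiment above (zero-shot intent classification) could be wired up roughly as follows. This is a sketch under stated assumptions: the prompt wording, the intent labels, the two example messages, and the `generate` callable are all hypothetical. `generate` stands for whatever function sends a prompt to your model and returns its text reply; the stub used here only makes the example runnable without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple


def build_prompt(text: str, labels: Sequence[str]) -> str:
    """Build a zero-shot classification prompt (wording is illustrative)."""
    options = ", ".join(labels)
    return (
        "Classify the customer message into exactly one of these intents: "
        f"{options}.\n"
        f"Message: {text}\n"
        "Intent:"
    )


def classify(text: str, labels: Sequence[str],
             generate: Callable[[str], str]) -> str:
    """Return the first label mentioned in the model reply, else 'unknown'."""
    reply = generate(build_prompt(text, labels)).strip().lower()
    for label in labels:
        if label.lower() in reply:
            return label
    return "unknown"


def accuracy(examples: List[Tuple[str, str]], labels: Sequence[str],
             generate: Callable[[str], str]) -> float:
    """Fraction of (message, gold intent) pairs classified correctly."""
    hits = sum(classify(text, labels, generate) == gold
               for text, gold in examples)
    return hits / len(examples)


def keyword_stub(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; replace with your own."""
    message = prompt.split("Message: ", 1)[1].split("\n", 1)[0]
    return "refund" if "पैसे" in message else "delivery"


LABELS = ["refund", "delivery"]
EXAMPLES = [
    ("मेरे पैसे वापस कर दीजिए", "refund"),
    ("मेरा ऑर्डर कब पहुँचेगा?", "delivery"),
]

stub_accuracy = accuracy(EXAMPLES, LABELS, keyword_stub)
print(f"accuracy with stub: {stub_accuracy:.2f}")
```

Passing the model call in as a function keeps the harness identical across Airavata, the OpenHathi base, and an existing pipeline, so the accuracy numbers are directly comparable.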

Optimization Features

Token Efficiency
Plans to pack multiple dataset examples per fine-tuning example (future work)
Model Optimization
LoRA
Training Optimization
Checkpoint averaging (interpolation factor 0.6 between epochs 3 and 4)
bfloat16 training, batch size 128, 4 epochs, LR 5e-4
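
Checkpoint averaging with an interpolation factor can be illustrated as a weighted sum of two sets of weights. The sketch below uses plain Python lists in place of tensors. One assumption is made loudly: the summary does not say which checkpoint receives the 0.6 weight, so the example applies it to epoch 3 and 0.4 to epoch 4. The parameter names and values are made up for illustration.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def interpolate(ckpt_a: Weights, ckpt_b: Weights, factor: float) -> Weights:
    """Return factor * ckpt_a + (1 - factor) * ckpt_b, parameter by parameter."""
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must be in [0, 1]")
    if ckpt_a.keys() != ckpt_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: Weights = {}
    for name in ckpt_a:
        a, b = ckpt_a[name], ckpt_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [factor * x + (1.0 - factor) * y for x, y in zip(a, b)]
    return merged


# Toy checkpoints (hypothetical names and values, not real model weights).
epoch3 = {"lora_A": [1.0, 2.0], "lora_B": [0.0, -1.0]}
epoch4 = {"lora_A": [2.0, 4.0], "lora_B": [1.0, 1.0]}

# Assumption: 0.6 weights epoch 3; swap the arguments for the other reading.
averaged = interpolate(epoch3, epoch4, factor=0.6)
print(averaged)
```

With real checkpoints the same element-wise formula would be applied to each tensor in the two state dictionaries.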

Risks & Boundaries

Limitations

Prone to hallucinations and factual errors; TruthfulQA scores remain low.

Open-ended generation and some NLG tasks lag behind stronger models.

When Not To Use

High-stakes or production systems without additional verification.

Tasks requiring high-quality machine translation or open-ended creative generation.

Failure Modes

Hallucinated facts or invented details in responses.

Weak or inconsistent open-ended generation quality.

Core Entities

Models

Airavata, OpenHathi, Llama 2 7B Chat, BactrianX-llama-7B, IndicTrans2

Metrics

chrF++, BLEURT, F1, Accuracy, Rouge-L, Likert scores (1-5), IFA/CNS/CQ rubric (0-2 scales)

Datasets

IndicInstruct (ai4bharat/indic-instruct-data-v0.1), FLAN-v2, Anthropic-HHH, Dolly, OpenAssistant, LMSYS-Chat (sampled), wikiHow (native Hindi subset), Anudesh, NMT (BPCC-Human), IndicXTREME, IndicNLG Suite, Flores, MMLU, BoolQ, HellaSwag, ARC, Winogrande, Multilingual HateCheck, Implicit Hate, Toxigen, TruthfulQA

Benchmarks

IndicXTREME, IndicNLG Suite, MMLU (translated), HellaSwag (translated), ARC (translated), BoolQ (translated), Multilingual HateCheck, TruthfulQA (translated subset)

Context Entities

Models

ChatGPT (GPT-3.5), GPT-4