Airavata: an open-source Hindi instruction-tuned LLM plus datasets and evaluations

Overview

Decision Snapshot: Needs Validation

The paper provides hands-on assets (model, data, code) and concrete evaluations; results are solid for NLU but limited for generation and high-stakes use without extra safety work.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.77

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 30%

Authors

Jay Gala, Thanmay Jayakumar, Jaavid Aktar Husain, Aswanth Kumar M, Mohammed Safi Ur Rahman Khan, Diptesh Kanojia, Ratish Puduppully, Mitesh M. Khapra, Raj Dabre, Rudra Murthy, Anoop Kunchukuttan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Airavata lowers the barrier to building Hindi language assistants by providing an open instruction-tuned model and data; use it for classification and assistant prototypes, but avoid high-stakes production without extra safety and factual checks.

Who Should Care

ML Engineer, Product Manager, Founder

Summary TLDR

Airavata is an open-source, Hindi instruction-tuned model built by LoRA fine-tuning the OpenHathi foundation model on a curated Hindi/English instruction mix (≈385k filtered examples). The team releases the training mixture, evaluation suite, and code. Airavata improves many Hindi NLU tasks versus the OpenHathi base model, shows mixed results on open-ended generation and translation, and trails closed-source models such as GPT-4 on instruction-following and factual quality. The work highlights gaps in cross-lingual transfer and reports toxicity and truthfulness tests.

Problem Statement

Existing LLM progress is English-centric. Hindi LLMs lack instruction-tuned models, diverse training data, and robust evaluation. This paper builds and evaluates a Hindi instruction-tuned model and releases datasets and benchmarks to bootstrap research.

Main Contribution

Release Airavata: a Hindi instruction-tuned model built by LoRA fine-tuning OpenHathi.

Publish an instruction-tuning mixture (≈404k raw, 385k filtered Hindi/English examples) and two native Hindi instruction datasets (wikiHow, Anudesh).

Key Findings

Instruction tuning substantially improves Hindi NLU on several benchmarks versus base OpenHathi.

Numbers: IndicXNLI 0-shot: OpenHathi 16.67 → Airavata 73.26 (+56.59) (Table 3)

Practical Use: If you need Hindi NLU accuracy, prefer an instruction-tuned Hindi model like Airavata over the OpenHathi base.

Evidence Ref: Table 3

Airavata improves sentiment and many NLU tasks but not translation or open-ended generation.

Numbers: IndicSentiment 0-shot: 72.89 → 95.81 (+22.92); Flores chrF++: 55.41 → 54.82 (-0.59) (Tables 3 and 5)

Practical Use: Use Airavata for classification and comprehension tasks; keep a stronger translation/generation model for high-quality MT or long-form generation.

Evidence Ref: Tables 3 and 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
IndicSentiment (0-shot)	Airavata 95.81	OpenHathi 72.89	+22.92	IndicSentiment	Table 3: F1	Table 3
IndicXNLI (0-shot)	Airavata 73.26	OpenHathi 16.67	+56.59	IndicXNLI	Table 3: F1	Table 3
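
The Delta values are the Airavata score minus the OpenHathi baseline. A minimal sketch that recomputes them from the figures quoted in this summary (the dictionary layout and function name are illustrative, not taken from the paper):

```python
# Scores quoted in this summary, as (OpenHathi baseline, Airavata).
scores = {
    "IndicSentiment (0-shot)": (72.89, 95.81),
    "IndicXNLI (0-shot)": (16.67, 73.26),
    "Flores chrF++": (55.41, 54.82),
}


def delta(baseline: float, tuned: float) -> float:
    """Absolute change in points, rounded to two decimals."""
    return round(tuned - baseline, 2)


deltas = {name: delta(base, tuned) for name, (base, tuned) in scores.items()}

for name, change in deltas.items():
    # The explicit sign makes regressions such as the Flores result easy to spot.
    print(f"{name}: {change:+.2f}")
```

A negative value, as for Flores chrF++, marks a task where instruction tuning did not help.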

What To Try In 7 Days

Download Airavata and run zero-shot classification on Hindi customer intents to compare vs existing pipelines.

Use the released IndicInstruct dataset to fine-tune or LoRA-adapt your base model for specific Hindi tasks.

Run the provided evaluation suite (IndicXTREME + toxicity tests) on your own models to identify Hindi weaknesses.
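
The first experiment above can be framed as a small, model-agnostic harness. The sketch below is only one way to set it up: `generate` stands in for whatever function calls the model, and the prompt wording, label set, and example messages are all hypothetical. Airavata's own chat template is not reproduced here.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def build_prompt(text: str, labels: List[str]) -> str:
    """Plain zero-shot prompt. The wording is an assumption, not the paper's template."""
    options = ", ".join(labels)
    return (
        f"Classify the customer message into one of: {options}.\n"
        f"Message: {text}\n"
        "Answer with the label only."
    )


def parse_label(output: str, labels: List[str]) -> Optional[str]:
    """Return the label that appears earliest in the model output, or None."""
    lowered = output.lower()
    hits = [(lowered.find(label.lower()), label)
            for label in labels if label.lower() in lowered]
    return min(hits)[1] if hits else None


def accuracy(examples: Iterable[Tuple[str, str]],
             labels: List[str],
             generate: Callable[[str], str]) -> float:
    """Fraction of examples whose parsed prediction equals the gold label."""
    correct = 0
    total = 0
    for text, gold in examples:
        prediction = parse_label(generate(build_prompt(text, labels)), labels)
        correct += int(prediction == gold)
        total += 1
    return correct / total if total else 0.0


# Hypothetical smoke test: a stub replaces the real model call.
intent_labels = ["refund", "delivery", "other"]
intent_examples = [
    ("मुझे अपना पैसा वापस चाहिए", "refund"),
    ("मेरा ऑर्डर कब आएगा?", "delivery"),
]
stub_generate = lambda prompt: "delivery"
print(accuracy(intent_examples, intent_labels, stub_generate))
```

When a real model is plugged in, `generate` should return only the newly generated text; if it echoes the prompt, `parse_label` would match the label list inside the prompt instead of the answer.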

Optimization Features

Token Efficiency

Plans to pack multiple dataset examples per fine-tuning example (future work)
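
The summary only mentions packing as future work, so no method is specified. As an illustration of the general idea, the sketch below uses greedy first-fit packing by token count; the strategy, the function name, and `max_len` are assumptions.

```python
from typing import List


def pack_examples(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedy first-fit: place each example index into the first pack with room."""
    packs: List[List[int]] = []  # example indices per packed sequence
    used: List[int] = []         # tokens already used per packed sequence
    for idx, n_tokens in enumerate(lengths):
        if n_tokens > max_len:
            raise ValueError(
                f"example {idx} has {n_tokens} tokens, more than max_len={max_len}"
            )
        for pack_id, total in enumerate(used):
            if total + n_tokens <= max_len:
                packs[pack_id].append(idx)
                used[pack_id] += n_tokens
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([idx])
            used.append(n_tokens)
    return packs


# Hypothetical token counts for four instruction examples.
print(pack_examples([300, 900, 700, 100], max_len=1024))
```

Packing reduces padding, but a real implementation also has to stop attention and loss from leaking across the examples that share a sequence.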

Model Optimization

LoRA
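
LoRA keeps a weight matrix of shape d×k frozen and trains a low-rank update built from two factors of shape d×r and r×k, so the trainable parameters per adapted matrix fall from d·k to r·(d+k). The sketch below works through that arithmetic; the dimensions and rank are illustrative, since this summary does not state the rank or target modules used for Airavata.

```python
def full_params(d: int, k: int) -> int:
    """Parameters in a dense d x k weight matrix."""
    return d * k


def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in the two LoRA factors: B (d x r) and A (r x k)."""
    return r * (d + k)


# Hypothetical sizes, chosen only to make the ratio concrete.
d = k = 4096
r = 16

dense = full_params(d, k)
adapter = lora_trainable_params(d, k, r)
print(f"dense: {dense:,}  adapter: {adapter:,}  ratio: {adapter / dense:.4%}")
```

With these made-up sizes the adapter holds under 1% of the parameters of the matrix it adapts.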

Training Optimization

Checkpoint averaging (interpolation factor 0.6 between epoch 3 and 4)

bfloat16 training, batch size 128, 4 epochs, LR 5e-4
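
Checkpoint averaging with an interpolation factor can be read as a per-parameter linear blend of two saved checkpoints. The sketch below shows that blend on plain Python lists. The summary does not say which checkpoint receives the 0.6 weight, so applying it to the later (epoch 4) checkpoint is an assumption, as are the function and variable names.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def interpolate_checkpoints(a: Params, b: Params, weight_b: float) -> Params:
    """Return (1 - weight_b) * a + weight_b * b, parameter by parameter."""
    if a.keys() != b.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    merged: Params = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1 - weight_b) * x + weight_b * y for x, y in zip(a[name], b[name])
        ]
    return merged


# Toy checkpoints standing in for the epoch 3 and epoch 4 weights.
epoch3 = {"w": [0.0, 1.0]}
epoch4 = {"w": [1.0, 3.0]}

# Assumption: the 0.6 factor weights the epoch 4 checkpoint.
averaged = interpolate_checkpoints(epoch3, epoch4, weight_b=0.6)
print(averaged)
```

The same blend applies unchanged to real tensors, since it only uses elementwise scaling and addition.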

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/AI4Bharat/IndicInstruct

https://ai4bharat.github.io/airavata

Data URLs

https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1

https://huggingface.co/collections/ai4bharat/airavata-evaluation-suite

https://huggingface.co/datasets/ai4bharat/human-eval

Risks & Boundaries

Limitations

Prone to hallucinations and factual errors; TruthfulQA scores remain low.

Open-ended generation and some NLG tasks lag behind stronger models.

When Not To Use

High-stakes or production systems without additional verification.

Tasks requiring high-quality machine translation or open-ended creative generation.

Failure Modes

Hallucinated facts or invented details in responses.

Weak or inconsistent open-ended generation quality.

Core Entities

Models

Airavata, OpenHathi, Llama 2 7B Chat, BactrianX-llama-7B, IndicTrans2

Metrics

chrF++, BLEURT, F1, Accuracy, Rouge-L, Likert scores (1-5), IFA/CNS/CQ rubric (0-2 scales)

Datasets

IndicInstruct (ai4bharat/indic-instruct-data-v0.1), FLAN-v2, Anthropic-HHH, Dolly, OpenAssistant, LMSYS-Chat (sampled), wikiHow (native Hindi subset), Anudesh, NMT (BPCC-Human), IndicXTREME, IndicNLG Suite, Flores, MMLU, BoolQ, HellaSwag, ARC, Winogrande, Multilingual HateCheck, Implicit Hate, Toxigen, TruthfulQA

Benchmarks

IndicXTREME, IndicNLG Suite, MMLU (translated), HellaSwag (translated), ARC (translated), BoolQ (translated), Multilingual HateCheck, TruthfulQA (translated subset)

Context Entities

Models

ChatGPT (GPT-3.5), GPT-4
