Airavata: an open-source Hindi instruction-tuned LLM plus datasets and evaluations

January 26, 2024 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.3

Cost Impact Score

0.4

Citation Count

2

Authors

Jay Gala, Thanmay Jayakumar, Jaavid Aktar Husain, Aswanth Kumar M, Mohammed Safi Ur Rahman Khan, Diptesh Kanojia, Ratish Puduppully, Mitesh M. Khapra, Raj Dabre, Rudra Murthy, Anoop Kunchukuttan

Why It Matters For Business

Airavata lowers the barrier to building Hindi language assistants by providing an open instruction-tuned model and data; use it for classification and assistant prototypes, but avoid high-stakes production without extra safety and factual checks.

Summary TLDR

Airavata is an open-source, Hindi instruction-tuned model built by LoRA fine-tuning the OpenHathi foundation model on a curated Hindi/English instruction mix (≈385k filtered examples). The team releases the training mixture, evaluation suite, and code. Airavata improves many Hindi NLU tasks versus the OpenHathi base model, shows mixed results on open-ended generation and translation, and trails closed-source models such as GPT-4 on instruction following and factual quality. The work highlights gaps in cross-lingual transfer and includes toxicity and truthfulness tests.

Problem Statement

LLM progress to date is English-centric. The Hindi ecosystem lacks instruction-tuned models, diverse training data, and robust evaluation. This paper builds and evaluates a Hindi instruction-tuned model and releases datasets and benchmarks to bootstrap research.

Main Contribution

Release Airavata: a Hindi instruction-tuned model built by LoRA fine-tuning OpenHathi.

Publish an instruction-tuning mixture (≈404k raw, 385k filtered Hindi/English examples) and two native Hindi instruction datasets (wikiHow, Anudesh).

Provide an evaluation suite: native Indic benchmarks, translated English benchmarks, toxicity and truthfulness tests, and a small human-eval protocol.

Report ablations (full fine-tuning vs. LoRA), checkpoint interpolation (factor 0.6), and full hyperparameters for reproducibility.

Key Findings

Instruction tuning substantially improves Hindi NLU on several benchmarks versus base OpenHathi.

Numbers: IndicXNLI 0-shot: OpenHathi 16.67 → Airavata 73.26 (+56.59) (Table 3)

Airavata improves sentiment and many NLU tasks but not translation or open-ended generation.

Numbers: IndicSentiment 0-shot: 72.89 → 95.81 (+22.92); Flores chrF++: 55.41 → 54.82 (-0.59) (Tables 3, 5)

Cross-lingual transfer from English to Hindi is limited, producing systematic gaps.

Numbers: English→Hindi gaps typically 5–15 points on evaluated tasks (paper observation, Table 4)

LoRA parameter-efficient tuning matched or outperformed full fine-tuning for mixed Hindi/English tasks.

Numbers: Ablation: LoRA gave consistent gains across Hindi NLU and English tasks (described in Sec 3.1)

Safety and truthfulness tests show mixed results: hate-speech detection is on par with the base model, while TruthfulQA improves only slightly and remains low.

Numbers: MHC Hindi ~70.2 accuracy; TruthfulQA English: OpenHathi 30.72 → Airavata 33.60 (Table 9)

Results

IndicSentiment (0-shot)

Value: Airavata 95.81

Baseline: OpenHathi 72.89

IndicXNLI (0-shot)

Value: Airavata 73.26

Baseline: OpenHathi 16.67

Accuracy

Value: Airavata 41.39

Baseline: OpenHathi 36.16

Flores (chrF++)

Value: Airavata 54.82

Baseline: OpenHathi 55.41

Accuracy

Value: Airavata 70.24

Baseline: OpenHathi 70.15
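
The deltas quoted in Key Findings can be re-derived from the baseline and Airavata scores. The short Python sketch below does that arithmetic for the named benchmarks only; the two rows labelled just "Accuracy" are left out because this summary does not say which benchmark they belong to. The names `REPORTED` and `score_deltas` are illustrative and not from the paper's code.

```python
# (benchmark, OpenHathi baseline, Airavata) as quoted in this summary.
REPORTED = [
    ("IndicSentiment (0-shot)", 72.89, 95.81),
    ("IndicXNLI (0-shot)", 16.67, 73.26),
    ("Flores (chrF++)", 55.41, 54.82),
    ("TruthfulQA (English)", 30.72, 33.60),
]


def score_deltas(rows):
    """Return Airavata minus baseline for each benchmark, rounded to 2 decimals."""
    return {name: round(tuned - base, 2) for name, base, tuned in rows}


for name, delta in score_deltas(REPORTED).items():
    # The explicit sign makes regressions (such as Flores) easy to spot.
    print(f"{name}: {delta:+.2f}")
```

Only Flores comes out negative, which matches the finding that instruction tuning helped NLU but not translation.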

Who Should Care

What To Try In 7 Days

Download Airavata and run zero-shot classification on Hindi customer intents to compare vs existing pipelines.

Use the released IndicInstruct dataset to fine-tune or LoRA-adapt your base model for specific Hindi tasks.

Run the provided evaluation suite (IndicXTREME + toxicity tests) on your own models to identify Hindi weaknesses.
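
For the zero-shot intent comparison suggested above, a small harness is usually enough to get a first accuracy number. The sketch below is one possible shape, not the paper's evaluation code: `generate` stands in for whatever inference call you use to query Airavata, and the prompt wording, the intent labels, and the two Hindi messages are made-up placeholders.

```python
def build_prompt(text, labels):
    """Format one customer message as a zero-shot classification prompt."""
    options = ", ".join(labels)
    return (
        "Classify the following customer message into one of these intents: "
        f"{options}.\nMessage: {text}\nIntent:"
    )


def parse_label(output, labels):
    """Return the first known label mentioned in the model output, else None."""
    lowered = output.lower()
    for label in labels:
        if label.lower() in lowered:
            return label
    return None


def zero_shot_accuracy(generate, examples, labels):
    """Fraction of (text, gold_label) pairs the model classifies correctly."""
    correct = 0
    for text, gold in examples:
        prediction = parse_label(generate(build_prompt(text, labels)), labels)
        correct += prediction == gold
    return correct / len(examples)


def fake_generate(prompt):
    """Stand-in for a real model call (assumption): a keyword rule on the message."""
    message = prompt.split("Message: ")[1]
    return "refund" if "पैसे" in message else "delivery"


LABELS = ["refund", "delivery"]
EXAMPLES = [
    ("मेरे पैसे वापस कर दीजिए", "refund"),
    ("मेरा ऑर्डर कब पहुंचेगा?", "delivery"),
]
print(zero_shot_accuracy(fake_generate, EXAMPLES, LABELS))
```

Substring matching in `parse_label` is deliberately naive; if one label is a prefix of another, a stricter parser would be needed. Swapping `fake_generate` for a real model call and for your existing pipeline gives the two numbers to compare.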

Optimization Features

Token Efficiency

  • Plans to pack multiple dataset examples per fine-tuning example (future work)
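
Packing is listed only as future work, so the following is a generic illustration of the idea rather than the authors' method. It greedily concatenates already-tokenized examples, separated by a hypothetical `sep_token`, into sequences of at most `max_len` tokens. A real training setup would also need attention masks or position resets so that packed examples do not attend to each other.

```python
def pack_examples(token_lists, max_len, sep_token):
    """Greedily pack tokenized examples into sequences of at most max_len tokens."""
    if max_len < 2:
        raise ValueError("max_len must leave room for one token plus a separator")
    packed, current = [], []
    for tokens in token_lists:
        # Truncate overly long examples so each piece fits in one sequence.
        piece = tokens[: max_len - 1] + [sep_token]
        if current and len(current) + len(piece) > max_len:
            packed.append(current)
            current = []
        current = current + piece
    if current:
        packed.append(current)
    return packed


# Four short examples fit into three sequences of length <= 5.
print(pack_examples([[1, 2], [3], [4, 5, 6], [7]], max_len=5, sep_token=0))
```

Fewer, fuller sequences mean less padding per batch, which is where the token-efficiency gain would come from.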

Model Optimization

  • LoRA
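
As a reminder of what LoRA changes, the sketch below shows the core idea in plain Python: the base weight matrix stays frozen and only a low-rank update B·A is trained. It is a toy illustration, not the Airavata training code; the class name `LoRALinear` and the rank and alpha values are assumptions, since this summary does not state the adapter configuration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (out x in) plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen base weights
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts small and random, B starts at zero, so the adapter
        # initially leaves the base model's output unchanged.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to a single input vector."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_params(self):
        """Number of adapter parameters (entries of A plus entries of B)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1, alpha=2)  # illustrative values only
print(layer.forward([1.0, 2.0, 3.0, 4.0]), layer.trainable_params())
```

Even in this 4×4 toy case the adapter trains 8 parameters instead of 16; the saving grows quickly with layer size, which is why LoRA is the cheaper option when it matches full fine-tuning.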

Training Optimization

  • Checkpoint averaging (interpolation factor 0.6 between epoch 3 and 4)
  • bfloat16 training, batch size 128, 4 epochs, LR 5e-4
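
Checkpoint interpolation is a per-parameter weighted average of two saved checkpoints. The sketch below shows the arithmetic on checkpoints represented as dictionaries of flat float lists. Which of the two checkpoints receives the 0.6 weight is not stated in this summary, so the argument order here is an assumption, and `interpolate_checkpoints` is a hypothetical helper name.

```python
def interpolate_checkpoints(ckpt_a, ckpt_b, factor):
    """Return factor * ckpt_a + (1 - factor) * ckpt_b for every parameter."""
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must be between 0 and 1")
    if ckpt_a.keys() != ckpt_b.keys():
        raise ValueError("checkpoints must have the same parameter names")
    merged = {}
    for name in ckpt_a:
        a, b = ckpt_a[name], ckpt_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [factor * x + (1.0 - factor) * y for x, y in zip(a, b)]
    return merged


# Toy checkpoints standing in for the epoch 3 and epoch 4 weights.
epoch3 = {"layer.weight": [1.0, 0.0]}
epoch4 = {"layer.weight": [0.0, 1.0]}
print(interpolate_checkpoints(epoch3, epoch4, factor=0.6))
```

With real checkpoints the same loop would run over tensors rather than lists, but the weighting is identical.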

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Prone to hallucinations and factual errors; TruthfulQA scores remain low.
  • Open-ended generation and some NLG tasks lag behind stronger models.
  • Cross-lingual transfer from English to Hindi shows 5–15 point gaps on many tasks.
  • Human evaluation is small-scale and single-annotator per item, so human metrics are preliminary.

When Not To Use

  • High-stakes or production systems without additional verification.
  • Tasks requiring high-quality machine translation or open-ended creative generation.
  • Applications demanding robust factual correctness out of the box.

Failure Modes

  • Hallucinated facts or invented details in responses.
  • Weak or inconsistent open-ended generation quality.
  • Limited knowledge transfer from English for specialized facts.
  • Possible vocabulary/tokenization gaps affecting code-mixed or regional terms.

Core Entities

Models

  • Airavata
  • OpenHathi
  • Llama 2 7B Chat
  • BactrianX-llama-7B
  • IndicTrans2

Metrics

  • chrF++
  • BLEURT
  • F1
  • Accuracy
  • Rouge-L
  • Likert scores (1-5)
  • IFA/CNS/CQ rubric (0-2 scales)

Datasets

  • IndicInstruct (ai4bharat/indic-instruct-data-v0.1)
  • FLAN-v2
  • Anthropic-HHH
  • Dolly
  • OpenAssistant
  • LMSYS-Chat (sampled)
  • wikiHow (native Hindi subset)
  • Anudesh
  • NMT (BPCC-Human)
  • IndicXTREME
  • IndicNLG Suite
  • Flores
  • MMLU
  • BoolQ
  • HellaSwag
  • ARC
  • Winogrande
  • Multilingual HateCheck
  • Implicit Hate
  • Toxigen
  • TruthfulQA

Benchmarks

  • IndicXTREME
  • IndicNLG Suite
  • MMLU (translated)
  • HellaSwag (translated)
  • ARC (translated)
  • BoolQ (translated)
  • Multilingual HateCheck
  • TruthfulQA (translated subset)

Context Entities

Models

  • ChatGPT (GPT-3.5)
  • GPT-4