Overview
- Production Readiness: 0.3
- Novelty Score: 0.3
- Cost Impact Score: 0.4
- Citation Count: 2
Why It Matters For Business
Airavata lowers the barrier to building Hindi language assistants by providing an open instruction-tuned model and data; use it for classification and assistant prototypes, but avoid high-stakes production without extra safety and factual checks.
Summary TLDR
Airavata is an open-source, Hindi instruction-tuned model built by LoRA fine-tuning the OpenHathi foundation model on a curated Hindi/English instruction mixture (≈385k examples after filtering). The team releases the training mixture, evaluation suite, and code. Airavata improves on the OpenHathi base model across many Hindi NLU tasks, shows mixed results on open-ended generation and translation, and trails closed-source models such as GPT-4 on instruction following and factual quality. The work also highlights gaps in cross-lingual transfer and reports toxicity and truthfulness evaluations.
Problem Statement
LLM progress to date is English-centric. For Hindi, instruction-tuned models, diverse training data, and robust evaluation are all lacking. This paper builds and evaluates a Hindi instruction-tuned model and releases datasets and benchmarks to bootstrap research.
Main Contribution
- Release Airavata, a Hindi instruction-tuned model built by LoRA fine-tuning OpenHathi.
- Publish an instruction-tuning mixture (≈404k raw, 385k filtered Hindi/English examples) and two native Hindi instruction datasets (wikiHow, Anudesh).
- Provide an evaluation suite: native Indic benchmarks, translated English benchmarks, toxicity and truthfulness tests, and a small human-evaluation protocol.
- Report ablations (full fine-tuning vs. LoRA), checkpoint interpolation (factor 0.6), and full hyperparameters for reproducibility.
Key Findings
- Instruction tuning substantially improves Hindi NLU on several benchmarks relative to the OpenHathi base model.
- Airavata improves sentiment classification and many NLU tasks, but the gains do not carry over consistently to translation or open-ended generation.
- Cross-lingual transfer from English to Hindi is limited, producing systematic gaps.
- Parameter-efficient LoRA tuning matched or outperformed full fine-tuning on mixed Hindi/English tasks.
- Safety and truthfulness results are mixed: hate-detection performance is similar, but gaps remain on TruthfulQA.
Results
[Results chart: IndicSentiment (0-shot), IndicXNLI (0-shot), Flores (chrF++); metrics shown: accuracy, chrF++]
What To Try In 7 Days
- Download Airavata and run zero-shot classification on Hindi customer intents, then compare the results against your existing pipeline.
- Use the released IndicInstruct dataset to fine-tune or LoRA-adapt your own base model for specific Hindi tasks.
- Run the provided evaluation suite (IndicXTREME plus toxicity tests) on your own models to identify Hindi weaknesses.
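The zero-shot intent comparison suggested above can be prototyped without committing to a particular inference stack. What follows is a minimal sketch, not the paper's evaluation code: `generate` stands in for any function that returns a model's text output for a prompt, and the prompt template, the intent labels, and the helper names (`build_prompt`, `parse_label`, `zero_shot_accuracy`) are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple


def build_prompt(text: str, labels: Sequence[str]) -> str:
    """Build a zero-shot classification prompt (hypothetical template)."""
    options = ", ".join(labels)
    return (
        "Classify the following Hindi customer message into exactly one intent.\n"
        f"Intents: {options}\n"
        f"Message: {text}\n"
        "Intent:"
    )


def parse_label(output: str, labels: Sequence[str]) -> Optional[str]:
    """Return the label mentioned earliest in the model output, or None."""
    lowered = output.lower()
    best: Optional[str] = None
    best_pos = len(lowered) + 1
    for label in labels:
        pos = lowered.find(label.lower())
        if pos == -1:
            continue
        # Earliest mention wins; on a tie, prefer the longer (more specific) label.
        if pos < best_pos or (pos == best_pos and best is not None and len(label) > len(best)):
            best, best_pos = label, pos
    return best


def zero_shot_accuracy(
    generate: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
    labels: Sequence[str],
) -> float:
    """Fraction of (text, gold_label) pairs the model classifies correctly."""
    pairs = list(examples)
    if not pairs:
        raise ValueError("examples must not be empty")
    correct = 0
    for text, gold in pairs:
        prediction = parse_label(generate(build_prompt(text, labels)), labels)
        correct += int(prediction == gold)
    return correct / len(pairs)
```

Wrapping the existing pipeline in the same `generate` signature and calling `zero_shot_accuracy` on the same examples gives a like-for-like comparison. Outputs that mention no known label count as errors, which is a deliberately strict choice.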
Optimization Features
Token Efficiency
- Plans to pack multiple dataset examples per fine-tuning example (future work)
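The summary lists packing only as future work and does not say how it would be done. As a rough illustration of the general idea, the sketch below groups examples by token length so that each packed training sequence stays within a maximum length. The first-fit-decreasing strategy and the function name `pack_examples` are assumptions made for illustration, not the authors' method.

```python
from typing import List, Sequence


def pack_examples(token_lengths: Sequence[int], max_len: int) -> List[List[int]]:
    """Group example indices into bins whose total token length is <= max_len.

    Uses first-fit decreasing: longest examples are placed first, each into the
    first bin with enough free space, opening a new bin when none fits.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    bins: List[List[int]] = []   # example indices per packed sequence
    free: List[int] = []         # remaining token budget per packed sequence
    order = sorted(range(len(token_lengths)), key=lambda i: token_lengths[i], reverse=True)
    for i in order:
        n = token_lengths[i]
        if n > max_len:
            raise ValueError(f"example {i} has {n} tokens, exceeding max_len={max_len}")
        for b, space in enumerate(free):
            if n <= space:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing bin can hold this example.
            bins.append([i])
            free.append(max_len - n)
    return bins
```

A real implementation would also need to keep attention from crossing example boundaries inside a packed sequence, which this sketch does not address.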
Model Optimization
- LoRA
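LoRA keeps the pretrained weight matrix W frozen and learns a low-rank update, so an adapted layer computes W x + (alpha / r) B A x, where only A (r × d_in) and B (d_out × r) are trained. The plain-Python sketch below illustrates that computation. The rank, alpha, and layer size are illustrative values rather than the settings used for Airavata, and the class name `LoRALinear` is hypothetical.

```python
import random
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight: Matrix, rank: int, alpha: float, seed: int = 0) -> None:
        if rank <= 0:
            raise ValueError("rank must be positive")
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight            # frozen pretrained weights, never updated
        self.scale = alpha / rank       # standard LoRA scaling factor
        # A starts as small random values, B as zeros, so the update is initially zero.
        self.A: Matrix = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B: Matrix = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self) -> int:
        """Number of parameters in A and B (the only ones that would be trained)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

For an 8 × 8 layer with rank 2, `trainable_params()` is 32 against 64 frozen weights. The savings grow quickly with layer size, because the adapter scales with r × (d_in + d_out) rather than d_in × d_out.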
Training Optimization
- Checkpoint averaging (interpolation factor 0.6 between epoch 3 and 4)
- bfloat16 training, batch size 128, 4 epochs, LR 5e-4
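Checkpoint interpolation of this kind is a per-parameter linear blend of two saved models. The sketch below shows the arithmetic on plain dictionaries of float lists. The summary does not say which checkpoint the 0.6 factor applies to, so the convention here (the factor weights the second checkpoint) is an assumption, as are the function name and the toy state format.

```python
from typing import Dict, List

State = Dict[str, List[float]]


def interpolate_checkpoints(state_a: State, state_b: State, factor: float) -> State:
    """Return (1 - factor) * state_a + factor * state_b for every parameter.

    Assumption: `factor` weights state_b (e.g. the later epoch); swap the
    arguments if the opposite convention is intended.
    """
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must be in [0, 1]")
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    merged: State = {}
    for name, values_a in state_a.items():
        values_b = state_b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - factor) * a + factor * b for a, b in zip(values_a, values_b)]
    return merged


# Hypothetical toy checkpoints standing in for the epoch 3 and epoch 4 weights.
epoch3 = {"layer.weight": [0.0, 1.0], "layer.bias": [2.0]}
epoch4 = {"layer.weight": [1.0, 3.0], "layer.bias": [4.0]}
blended = interpolate_checkpoints(epoch3, epoch4, factor=0.6)
```

With real models, the same blend would be applied tensor by tensor to the saved state dictionaries, or to the LoRA adapter weights alone if only those were trained.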
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Prone to hallucinations and factual errors; TruthfulQA scores remain low.
- Open-ended generation and some NLG tasks lag behind stronger models.
- Cross-lingual transfer from English to Hindi shows 5–15 point gaps on many tasks.
- Human evaluation is small-scale and single-annotator per item, so human metrics are preliminary.
When Not To Use
- High-stakes or production systems without additional verification.
- Tasks requiring high-quality machine translation or open-ended creative generation.
- Applications demanding robust factual correctness out of the box.
Failure Modes
- Hallucinated facts or invented details in responses.
- Weak or inconsistent open-ended generation quality.
- Limited knowledge transfer from English for specialized facts.
- Possible vocabulary/tokenization gaps affecting code-mixed or regional terms.
Core Entities
Models
- Airavata
- OpenHathi
- Llama 2 7B Chat
- BactrianX-llama-7B
- IndicTrans2
Metrics
- chrF++
- BLEURT
- F1
- Accuracy
- ROUGE-L
- Likert scores (1-5)
- IFA/CNS/CQ rubric (0-2 scales)
Datasets
- IndicInstruct (ai4bharat/indic-instruct-data-v0.1)
- FLAN-v2
- Anthropic-HHH
- Dolly
- OpenAssistant
- LMSYS-Chat (sampled)
- wikiHow (native Hindi subset)
- Anudesh
- NMT (BPCC-Human)
- IndicXTREME
- IndicNLG Suite
- Flores
- MMLU
- BoolQ
- HellaSwag
- ARC
- Winogrande
- Multilingual HateCheck
- Implicit Hate
- Toxigen
- TruthfulQA
Benchmarks
- IndicXTREME
- IndicNLG Suite
- MMLU (translated)
- HellaSwag (translated)
- ARC (translated)
- BoolQ (translated)
- Multilingual HateCheck
- TruthfulQA (translated subset)
Context Entities
Models
- ChatGPT (GPT-3.5)
- GPT-4

