Overview
The paper releases hands-on assets (model, data, code) and reports concrete evaluations. Results are solid for Hindi NLU but weaker for generation, and high-stakes use requires extra safety work.
Citations: 2
Evidence Strength: 0.60
Confidence: 0.77
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 40%
Production readiness: 30%
Novelty: 30%
Why It Matters For Business
Airavata lowers the barrier to building Hindi language assistants by providing an open instruction-tuned model and data; use it for classification and assistant prototypes, but avoid high-stakes production without extra safety and factual checks.
Summary TLDR
Airavata is an open-source, Hindi instruction-tuned model built by LoRA fine-tuning the OpenHathi foundation model on a curated Hindi/English instruction mix (≈385k filtered examples). The team releases the training mixture, evaluation suite, and code. Airavata improves many Hindi NLU tasks versus the OpenHathi base model, shows mixed results on open-ended generation and translation, and trails closed-source models (GPT-4) on instruction-following and factual quality. The work highlights gaps in cross-lingual transfer and reports toxicity and truthfulness evaluations.
Problem Statement
Existing LLM progress is English-centric. Hindi LLMs lack instruction-tuned models, diverse training data, and robust evaluation. This paper builds and evaluates a Hindi instruction-tuned model and releases datasets and benchmarks to bootstrap research.
Main Contribution
Release Airavata: a Hindi instruction-tuned model built by LoRA fine-tuning OpenHathi.
Publish an instruction-tuning mixture (≈404k raw, 385k filtered Hindi/English examples) and two native Hindi instruction datasets (wikiHow, Anudesh).
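
For readers unfamiliar with LoRA, the idea behind the fine-tuning step can be sketched in a few lines: the base weight matrix stays frozen, and only a low-rank update `B @ A` (scaled by `alpha / r`) is trained. The sketch below is a generic, pure-Python illustration of that formula, not the authors' training code; the helper names, the toy matrix sizes, the `alpha` value, and the initialization scale are all assumptions chosen for readability.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_lora(d_out, d_in, r, seed=0):
    """Create LoRA factors: A is small random (r x d_in), B is all zeros (d_out x r).

    The zero-initialized B means the adapted model starts out identical to the base model.
    The 0.01 standard deviation is an illustrative assumption.
    """
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    b = [[0.0] * r for _ in range(d_out)]
    return a, b


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the rank (number of rows of A)."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy frozen base weight (2 x 3); real model layers are far larger.
base_w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
lora_a, lora_b = init_lora(d_out=2, d_in=3, r=1)
adapted_w = lora_effective_weight(base_w, lora_a, lora_b, alpha=16)
```

Because only `A` and `B` are trained, the number of trainable parameters scales with the rank `r` rather than with the full size of each weight matrix, which is what makes this kind of adaptation cheap relative to full fine-tuning.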
Key Findings
Instruction tuning substantially improves Hindi NLU on several benchmarks versus base OpenHathi.
Airavata improves sentiment classification and many other NLU tasks, but results on translation and open-ended generation are mixed.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| IndicSentiment (0-shot) | Airavata 95.81 | OpenHathi 72.89 | +22.92 | IndicSentiment | Table 3: F1 | Table 3 |
| IndicXNLI (0-shot) | Airavata 73.26 | OpenHathi 16.67 | +56.59 | IndicXNLI | Table 3: F1 | Table 3 |
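
The Delta column is the Airavata score minus the OpenHathi baseline. A small check like the following recomputes it from the two rows shown above; the variable and function names are illustrative only.

```python
# (metric, Airavata score, OpenHathi baseline) copied from the rows above.
rows = [
    ("IndicSentiment (0-shot)", 95.81, 72.89),
    ("IndicXNLI (0-shot)", 73.26, 16.67),
]


def compute_deltas(rows):
    """Return {metric: model score minus baseline}, rounded to two decimals."""
    return {name: round(model - base, 2) for name, model, base in rows}


for name, delta in compute_deltas(rows).items():
    print(f"{name}: {delta:+.2f}")
```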
What To Try In 7 Days
Download Airavata and run zero-shot classification on Hindi customer intents to compare vs existing pipelines.
Use the released IndicInstruct dataset to fine-tune or LoRA-adapt your base model for specific Hindi tasks.
Run the provided evaluation suite (IndicXTREME + toxicity tests) on your own models to identify Hindi weaknesses.
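
As a starting point for the first item, the comparison against an existing pipeline can be sketched as a small scoring harness. The code below assumes you have already collected predicted labels from Airavata and from your current system for the same labeled Hindi intent set; it does not load or call the model. Macro-averaged F1 is an assumption here (the summary cites F1 without naming the averaging), and the intent labels and predictions are made-up placeholders.

```python
def macro_f1(gold, pred):
    """Macro-averaged F1 over every label that appears in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A label with no true positives, false positives, or false negatives scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def compare_systems(gold, candidate_pred, baseline_pred):
    """Score a candidate and a baseline on the same gold labels and report the gap."""
    candidate = macro_f1(gold, candidate_pred)
    baseline = macro_f1(gold, baseline_pred)
    return {"candidate": candidate, "baseline": baseline, "delta": candidate - baseline}


# Hypothetical customer-intent labels and predictions, for illustration only.
gold = ["refund", "billing", "refund", "billing"]
airavata_pred = ["refund", "billing", "refund", "billing"]
existing_pred = ["refund", "billing", "billing", "billing"]

report = compare_systems(gold, airavata_pred, existing_pred)
print(report)
```

Scoring both systems on an identical gold set keeps the delta comparable to the Airavata-versus-OpenHathi deltas reported in the Results table.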
Risks & Boundaries
Limitations
Prone to hallucinations and factual errors; TruthfulQA scores remain low.
Open-ended generation and some NLG tasks lag behind stronger models.
When Not To Use
High-stakes or production systems without additional verification.
Tasks requiring high-quality machine translation or open-ended creative generation.
Failure Modes
Hallucinated facts or invented details in responses.
Weak or inconsistent open-ended generation quality.

