Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
If you build search, QA, or assistant features for Indian users, IndicRAGSuite provides both a standardized benchmark (IndicMSMARCO) and large-scale training data (~14M triplets), which can reduce development time and improve retrieval across many Indian languages.
Summary TLDR
This paper builds two core resources to enable Retrieval-Augmented Generation (RAG) in Indian languages: (1) IndicMSMARCO, a human-verified multilingual benchmark of 1,000 MS MARCO queries translated into 13 Indian languages for retrieval and generation evaluation; and (2) a large training corpus of roughly 14 million question-answer-reasoning triplets, each paired with its source passage, derived from Wikipedia across 19 Indian languages, plus paragraph-level translations of MS MARCO train/dev into 14 Indian languages. Baselines show that modern dense retrievers (BGE-M3, multilingual e5-large) reach MRR ≈0.50 on several languages, but low-resource languages lag. Datasets and benchmark aim to standardize evaluation and to
Problem Statement
Indian languages lack both standardized benchmarks and large-scale multilingual training data for dense retrieval and RAG. Existing resources are English-centric or cover only a few Indian languages, causing poor retrieval performance and slow progress for Indian-language RAG systems.
Main Contribution
IndicMSMARCO: a human-verified multilingual retrieval benchmark (1,000 queries) across 13 Indian languages, created by LLaMA 3.3 70B translation followed by expert post-editing.
Large Wikipedia-based training corpus: about 14 million question-answer-reasoning triplets across 19 Indian languages generated by LLaMA 3.3 70B from paragraph-level Wikipedia, filtered for length and quality.
Paragraph-level translated MS MARCO: translation of MS MARCO train/dev into 14 Indian languages using IndicTrans3-beta to preserve context and search intent for supervised retriever training.
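As a rough illustration of what a paragraph-grounded triplet record and a length filter could look like, here is a minimal Python sketch. The summary only says the triplets were "filtered for length and quality"; the `Triplet` field names, the `passes_length_filter` helper, and every threshold below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Triplet:
    """One synthetic training record (field names are assumptions)."""
    language: str   # e.g. an ISO code such as "hi" or "te"
    passage: str    # source Wikipedia paragraph
    question: str
    answer: str
    reasoning: str


def passes_length_filter(
    t: Triplet,
    min_passage_words: int = 20,    # hypothetical threshold
    max_passage_words: int = 300,   # hypothetical threshold
    min_question_words: int = 3,    # hypothetical threshold
) -> bool:
    """Keep records whose passage and question have plausible lengths
    and whose answer and reasoning are non-empty."""
    n_passage = len(t.passage.split())
    n_question = len(t.question.split())
    return (
        min_passage_words <= n_passage <= max_passage_words
        and n_question >= min_question_words
        and bool(t.answer.strip())
        and bool(t.reasoning.strip())
    )


def filter_triplets(triplets: Iterable[Triplet]) -> List[Triplet]:
    """Apply the length filter with its default (assumed) thresholds."""
    return [t for t in triplets if passes_length_filter(t)]
```

Whitespace splitting is a crude word count; a real pipeline would likely add language-aware tokenization and a separate quality check.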
Key Findings
IndicMSMARCO provides a high-quality multilingual benchmark of real queries.
A large synthetic training corpus was produced from Wikipedia.
Modern dense retrievers reach about 0.50 MRR on several Indian languages.
Results
MRR (Hindi)
MRR (Telugu)
MRR (Malayalam / Tamil)
Who Should Care
What To Try In 7 Days
Run BGE-M3 and multilingual e5-large on IndicMSMARCO to replicate baseline MRR and find language gaps.
Fine-tune a dense retriever on a small slice of the Wikipedia triplets for a target language and compare MRR before/after.
Replace sentence-level translated data with the paper's paragraph-level translated MS MARCO examples and measure retrieval fidelity.
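The experiments above all come down to comparing MRR across models, languages, or training setups. The following is a minimal sketch of the metric in plain Python, assuming you already have a ranked list of passage IDs per query from whichever retriever you run. The function name, the dictionary layout, and the cutoff `k=10` are assumptions made for illustration; the summary does not state which cutoff the paper uses.

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,  # assumed cutoff, not taken from the paper
) -> float:
    """MRR@k: mean over queries of 1/rank of the first relevant
    passage within the top k results (0 if none is found)."""
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        gold = relevant.get(query_id, set())
        for rank, passage_id in enumerate(ranked_ids[:k], start=1):
            if passage_id in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


# Toy data: q1 finds its gold passage at rank 2, q2 at rank 1.
toy_rankings = {"q1": ["p3", "p1", "p2"], "q2": ["p9", "p8", "p7"]}
toy_relevant = {"q1": {"p1"}, "q2": {"p9"}}
print(mean_reciprocal_rank(toy_rankings, toy_relevant))  # (1/2 + 1/1) / 2 = 0.75
```

Computing this once per language, and once before and after fine-tuning, gives the per-language gaps and before/after deltas that the steps above ask for.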
Agent Features
Memory
- retrieval memory (paragraph-grounded passages)
Tool Use
- LLMs for translation and synthetic data generation
Architectures
- dense retrieval
Optimization Features
Training Optimization
- paragraph-level generation to retain context
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Machine translation and LLM-generated triplets can introduce semantic drift or errors; the authors mitigate this with human post-editing, but residual issues may remain.
- Low-resource languages (e.g., Assamese, Odia) still show substantially lower MRR, which suggests limits in dataset scale or Wikipedia coverage.
- The paper text provides no public code or explicit URLs for reproducing the entire pipeline.
When Not To Use
- In high-stakes domains (medical, legal) that require strictly human-verified ground truth, unless the synthetic data receives further human validation.
- If the target language lacks sufficient Wikipedia coverage, since the synthetic triplets may be noisy or sparse.
Failure Modes
- Hallucinated or ungrounded QA pairs from LLM generation harming retriever behavior.
- Translation artifacts that change query intent and degrade retrieval evaluation.
- Overfitting to Wikipedia-style language and reduced performance on web search queries.
Core Entities
Models
- LLaMA 3.3 70B
- LLaMA 3.1 8B Instruct
- Multilingual e5-small
- Multilingual e5-base
- Multilingual e5-large
- BGE-M3
- Llama2? (mentioned generically)
Metrics
- MRR
Datasets
- MS MARCO
- IndicMSMARCO
- Wikipedia-based triplet corpus (IndicRAGSuite)
- Translated MS MARCO (14 languages)
- INDIC-MARCO (prior work)
Benchmarks
- IndicMSMARCO
- MS MARCO
Context Entities
Models
- mDPR
- mContriever
- mE5
- text-embedding-ada-002
Metrics
- BLEU (mentioned for translation quality)
- MRR (primary retrieval metric)
Datasets
- NQ
- SQuAD
- TriviaQA
- BEIR
- MKQA
- TyDi QA
- MIRACL

