IndicRAGSuite: a 13-language retrieval benchmark plus ~14M synthetic QA triplets for Indian-language RAG

June 2, 2025, 6 min read

Overview

Decision Snapshot: Needs Validation

The datasets and benchmark are practical and immediately usable. Baseline retrievers reach MRR of roughly 0.50 on Hindi and Telugu, but no explicit deltas are reported, and synthetic data and LLM translation introduce quality risks that require validation.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.88

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Pasunuti Prasanjith, Prathmesh B More, Anoop Kunchukuttan, Raj Dabre

Why It Matters For Business

If you build search, QA, or assistant features for Indian users, IndicRAGSuite provides both a standard test (IndicMSMARCO) and large training data (~14M triplets) to reduce development time and improve retrieval in many Indian languages.

Who Should Care

Summary TLDR

This paper builds two core resources to enable Retrieval-Augmented Generation (RAG) in Indian languages: (1) IndicMSMARCO, a human-verified multilingual benchmark of 1,000 MS MARCO queries translated into 13 Indian languages for retrieval and generation evaluation; and (2) a large training corpus of roughly 14 million question-answer-reasoning triplets, each paired with its source Wikipedia passage, across 19 Indian languages, plus paragraph-level translations of MS MARCO train/dev into 14 Indian languages. Baselines show that modern dense retrievers (BGE-M3, multilingual e5-large) reach MRR ≈ 0.50 on several languages, but low-resource languages lag. Datasets and benchmark aim to standardize evaluation and to

Problem Statement

Indian languages lack both standardized benchmarks and large-scale multilingual training data for dense retrieval and RAG. Existing resources are English-centric or cover only a few Indian languages, causing poor retrieval performance and slow progress for Indian-language RAG systems.

Main Contribution

IndicMSMARCO: a human-verified multilingual retrieval benchmark (1,000 queries) across 13 Indian languages, created by LLaMA 3.3 70B translation followed by expert post-editing.

Large Wikipedia-based training corpus: about 14 million question-answer-reasoning triplets across 19 Indian languages generated by LLaMA 3.3 70B from paragraph-level Wikipedia, filtered for length and quality.
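
The summary says the triplets are filtered for length and quality but does not give the criteria. As a rough illustration only, a length filter over such records might look like the Python sketch below. The `Triplet` field names and every threshold are assumptions made for this example, not values from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Triplet:
    """One synthetic training record. Field names are assumptions."""
    language: str
    passage: str
    question: str
    answer: str
    reasoning: str


def passes_length_filter(
    t: Triplet,
    min_passage_chars: int = 200,   # hypothetical threshold
    max_passage_chars: int = 4000,  # hypothetical threshold
) -> bool:
    """Keep records with a mid-length passage and non-empty generated fields.

    len() counts Unicode code points, so sensible thresholds differ by script.
    """
    if not (min_passage_chars <= len(t.passage) <= max_passage_chars):
        return False
    return all(field.strip() for field in (t.question, t.answer, t.reasoning))


def filter_triplets(records: Iterable[Triplet]) -> List[Triplet]:
    """Return only the records that pass the length filter."""
    return [t for t in records if passes_length_filter(t)]
```

A quality filter would need more than length checks; this sketch only covers the length half of the description.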

Key Findings

IndicMSMARCO provides a high-quality multilingual benchmark of real queries.

Numbers: 1,000 queries; 13 languages

Practical Use: Use IndicMSMARCO for standardized, language-specific evaluation of retrievers and RAG systems instead of ad-hoc or English-only tests.

Evidence Ref: Section 3 / Abstract

A large synthetic training corpus was produced from Wikipedia.

Numbers: ≈14M triplets across 19 languages

Practical Use: Train or pretrain dense retrievers on this corpus to improve multilingual coverage and reduce data scarcity for many Indian languages.

Evidence Ref: Section 4.1.5 / Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
MRR (Hindi) | 0.52 | - | - | IndicMSMARCO | e5-large and BGE-M3 reach 0.52 on Hindi | Table 2
MRR (Telugu) | 0.50 | - | - | IndicMSMARCO | BGE-M3 reaches 0.50 on Telugu | Table 2
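
For reference, MRR (mean reciprocal rank) averages 1/rank of the first relevant passage for each query and scores 0 when no relevant passage is retrieved. A minimal Python sketch follows. The function name, input layout, and toy IDs are illustrative assumptions, not the paper's evaluation code. MS MARCO evaluations commonly truncate at rank 10 (MRR@10); this summary does not say whether Table 2 uses a cutoff, so `k` is left configurable.

```python
from typing import Dict, List, Optional, Set


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: Optional[int] = None,
) -> float:
    """Average of 1/rank of the first relevant passage per query.

    rankings: query id -> passage ids, best first (assumed layout).
    relevant: query id -> set of gold passage ids.
    k: optional cutoff; a first hit below rank k counts as a miss.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked in rankings.items():
        gold = relevant.get(qid, set())
        top = ranked if k is None else ranked[:k]
        for rank, pid in enumerate(top, start=1):
            if pid in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


# Toy data for illustration, not from the paper.
toy_rankings = {"q1": ["p3", "p1", "p7"], "q2": ["p2", "p9"], "q3": ["p4", "p5"]}
toy_relevant = {"q1": {"p1"}, "q2": {"p2"}, "q3": {"p8"}}
toy_mrr = mean_reciprocal_rank(toy_rankings, toy_relevant)  # (1/2 + 1 + 0) / 3 = 0.5
```

An MRR near 0.50 is what you would get if the first relevant passage always sat at rank 2, but the same score also arises from a mix of rank-1 hits and outright misses, so per-query inspection is still worthwhile.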

What To Try In 7 Days

Run BGE-M3 and multilingual e5-large on IndicMSMARCO to replicate baseline MRR and find language gaps.

Fine-tune a dense retriever on a small slice of the Wikipedia triplets for a target language and compare MRR before/after.

Replace sentence-level translated data with the paper's paragraph-level translated MS MARCO examples and measure retrieval fidelity.
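
The first two experiments both end in a per-language comparison. One way to tabulate it is sketched below in Python. The helper names, the 0.10 margin, and all scores are placeholders invented for this example; none of the numbers are results from the paper.

```python
from typing import Dict, List, Tuple


def compare_runs(
    before: Dict[str, float],
    after: Dict[str, float],
) -> List[Tuple[str, float, float, float]]:
    """Return (language, before, after, delta) rows, largest gain first."""
    rows = []
    for lang in sorted(before.keys() & after.keys()):
        delta = round(after[lang] - before[lang], 4)
        rows.append((lang, before[lang], after[lang], delta))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


def lagging_languages(scores: Dict[str, float], margin: float = 0.10) -> List[str]:
    """Languages whose MRR trails the best language by more than `margin`."""
    if not scores:
        return []
    best = max(scores.values())
    return sorted(lang for lang, s in scores.items() if best - s > margin)


# Placeholder MRR values keyed by ISO 639-1 code (hi, te, as). Not paper results.
baseline = {"hi": 0.40, "te": 0.38, "as": 0.22}
finetuned = {"hi": 0.43, "te": 0.44, "as": 0.30}

report = compare_runs(baseline, finetuned)
gaps = lagging_languages(finetuned)
```

Recording the delta explicitly for each language makes it easy to see whether fine-tuning helps the low-resource languages or only the ones that were already strong.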

Agent Features

Memory: retrieval memory (paragraph-grounded passages)
Tool Use: LLMs for translation and synthetic data generation
Architectures: dense retrieval

Optimization Features

Training Optimization: paragraph-level generation to retain context

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Machine translation and LLM-generated triplets can introduce semantic drift or errors; authors mitigate this with human post-editing but residual issues may remain.

Low-resource languages (e.g., Assamese, Odia) still show substantially lower MRR, indicating dataset scale or Wikipedia coverage limits.

When Not To Use

High-stakes domains (medical, legal) that require strictly human-verified ground truth, unless the synthetic data receives further human validation.

Target languages with insufficient Wikipedia coverage, where synthetic triplets may be noisy or sparse.

Failure Modes

Hallucinated or ungrounded QA pairs from LLM generation harming retriever behavior.

Translation artifacts that change query intent and degrade retrieval evaluation.
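
A cheap first screen for ungrounded QA pairs is to check how much of the answer actually appears in the source passage. The Python sketch below is a lexical heuristic invented for this example, not a method from the paper. It assumes whitespace-separated words, and it will under-score paraphrased or inflected answers, so low scores should trigger human review rather than automatic deletion.

```python
import unicodedata
from typing import List

# Punctuation stripped from token edges; includes the Devanagari danda marks.
_PUNCT = ".,;:!?\"'()[]{}।॥"


def _tokens(text: str) -> List[str]:
    """Normalize to NFC, casefold, split on whitespace, trim punctuation."""
    norm = unicodedata.normalize("NFC", text).casefold()
    stripped = (tok.strip(_PUNCT) for tok in norm.split())
    return [tok for tok in stripped if tok]


def answer_support(answer: str, passage: str) -> float:
    """Fraction of answer tokens that also occur in the passage (0.0 to 1.0)."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    passage_vocab = set(_tokens(passage))
    hits = sum(1 for tok in answer_tokens if tok in passage_vocab)
    return hits / len(answer_tokens)


# Toy Hindi passage: "Delhi is the capital of India."
toy_passage = "दिल्ली भारत की राजधानी है।"
supported = answer_support("दिल्ली", toy_passage)
unsupported = answer_support("मुंबई", toy_passage)
```

Splitting on whitespace instead of a `\w+` regex is deliberate: Python's `re` does not treat Indic vowel signs as word characters, so a regex tokenizer would cut words apart.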

Core Entities

Models

LLaMA 3.3 70B, LLaMA 3.1 8B Instruct, Multilingual e5-small, Multilingual e5-base, Multilingual e5-large, BGE-M3, Llama2? (mentioned generically)

Metrics

MRR

Datasets

MS MARCO, IndicMSMARCO, Wikipedia-based triplet corpus (IndicRAGSuite), Translated MS MARCO (14 languages), INDIC-MARCO (prior work)

Benchmarks

IndicMSMARCO, MS MARCO

Context Entities

Models

mDPR, mContriever, mE5, text-embedding-ada-002

Metrics

BLEU (mentioned for translation quality), MRR (primary retrieval metric)

Datasets

NQ, SQuAD, TriviaQA, BEIR, MKQA, TyDi QA, MIRACL