IndicRAGSuite: a 13-language retrieval benchmark plus ~14M synthetic QA triplets for Indian-language RAG

Overview

Decision Snapshot: Needs Validation

The datasets and benchmark are practical and immediately usable; baseline evaluations show real gains, but synthetic data and LLM translation introduce quality risks that require validation.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.88

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Pasunuti Prasanjith, Prathmesh B More, Anoop Kunchukuttan, Raj Dabre

Why It Matters For Business

If you build search, QA, or assistant features for Indian users, IndicRAGSuite provides both a standard test (IndicMSMARCO) and large training data (~14M triplets) to reduce development time and improve retrieval in many Indian languages.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO, Engineering Lead

Summary TLDR

This paper builds two core resources to enable Retrieval-Augmented Generation (RAG) in Indian languages: (1) IndicMSMARCO, a human-verified multilingual benchmark of 1,000 MS MARCO queries translated into 13 Indian languages for retrieval and generation evaluation; and (2) a large training corpus of roughly 14 million question-answer-reasoning triplets, each grounded in a Wikipedia passage, across 19 Indian languages, plus paragraph-level translations of MS MARCO train/dev into 14 Indian languages. Baselines show that modern dense retrievers (BGE-M3, multilingual e5-large) reach MRR ≈0.50 on several languages, but low-resource languages lag. Datasets and benchmark aim to standardize evaluation and to ...

Problem Statement

Indian languages lack both standardized benchmarks and large-scale multilingual training data for dense retrieval and RAG. Existing resources are English-centric or cover only a few Indian languages, causing poor retrieval performance and slow progress for Indian-language RAG systems.

Main Contribution

IndicMSMARCO: a human-verified multilingual retrieval benchmark (1,000 queries) across 13 Indian languages, created by LLaMA 3.3 70B translation followed by expert post-editing.

Large Wikipedia-based training corpus: about 14 million question-answer-reasoning triplets across 19 Indian languages generated by LLaMA 3.3 70B from paragraph-level Wikipedia, filtered for length and quality.
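
The summary says the triplets were "filtered for length and quality" but does not give the criteria. The following is a minimal sketch of what a length filter over such records could look like. The `Triplet` fields, the function names, and every threshold are illustrative assumptions, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Triplet:
    """One synthetic training record (field names are hypothetical)."""
    language: str
    passage: str
    question: str
    answer: str
    reasoning: str


def passes_length_filter(
    t: Triplet,
    min_passage_words: int = 20,   # assumed threshold, not from the paper
    max_passage_words: int = 500,  # assumed threshold, not from the paper
) -> bool:
    """Keep records whose passage length is in range and whose other fields are non-empty."""
    n_words = len(t.passage.split())  # whitespace tokenization as a rough word count
    return (
        min_passage_words <= n_words <= max_passage_words
        and bool(t.question.strip())
        and bool(t.answer.strip())
        and bool(t.reasoning.strip())
    )


def filter_triplets(triplets: Iterable[Triplet]) -> List[Triplet]:
    """Return only the records that pass the length filter."""
    return [t for t in triplets if passes_length_filter(t)]


# Toy records with placeholder text
good = Triplet("hi", " ".join(["word"] * 30), "What is this?", "An answer", "Stated in the passage")
short = Triplet("hi", "too short", "What is this?", "An answer", "Stated in the passage")
kept = filter_triplets([good, short])
```

A quality filter (for example, checking that the answer is supported by the passage) would need a separate step; this sketch covers length only.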

Key Findings

IndicMSMARCO provides a high-quality multilingual benchmark of real queries.

Numbers: 1,000 queries; 13 languages

Practical Use: Use IndicMSMARCO for standardized, language-specific evaluation of retrievers and RAG systems instead of ad-hoc or English-only tests.

Evidence Ref: Section 3 / Abstract

A large synthetic training corpus was produced from Wikipedia.

Numbers: ≈14M triplets across 19 languages

Practical Use: Train or pretrain dense retrievers on this corpus to improve multilingual coverage and reduce data scarcity for many Indian languages.

Evidence Ref: Section 4.1.5 / Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
MRR (Hindi)	0.52	—	—	IndicMSMARCO	e5-large and BGE-M3 reach 0.52 on Hindi	Table 2
MRR (Telugu)	0.50	—	—	IndicMSMARCO	BGE-M3 reaches 0.50 on Telugu	Table 2
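
For readers who want to check MRR values like those in the table, here is a minimal sketch of the metric. The cutoff `k=10` follows the usual MS MARCO convention but is an assumption here, since the summary does not state which cutoff the paper uses. The query and passage IDs are toy placeholders.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Return 1/rank of the first relevant passage within the top k, or 0.0 if none is found."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Average the reciprocal rank over every query that has relevance judgments."""
    if not qrels:
        return 0.0
    total = sum(reciprocal_rank(rankings.get(qid, []), rel, k) for qid, rel in qrels.items())
    return total / len(qrels)


# Toy example: q1 hits at rank 2, q2 at rank 1, q3 misses entirely
rankings = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p4", "p2"], "q3": ["p5", "p6", "p8"]}
qrels = {"q1": {"p1"}, "q2": {"p9"}, "q3": {"p0"}}
score = mean_reciprocal_rank(rankings, qrels)  # (0.5 + 1.0 + 0.0) / 3
```

Queries with no relevant passage in the top k contribute zero, so an MRR near 0.50 means the first relevant passage sits around rank 2 on average, or that hits at rank 1 are offset by misses.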

What To Try In 7 Days

Run BGE-M3 and multilingual e5-large on IndicMSMARCO to replicate baseline MRR and find language gaps.

Fine-tune a dense retriever on a small slice of the Wikipedia triplets for a target language and compare MRR before/after.

Replace sentence-level translated data with the paper's paragraph-level translated MS MARCO examples and measure retrieval fidelity.
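
The before/after comparison suggested above can be tracked with a small helper that reports an explicit delta per language. This is a sketch only: the function name is hypothetical, and the numbers below are made-up placeholders for illustration, not results from the paper.

```python
from typing import Dict, List, Tuple


def mrr_deltas(
    before: Dict[str, float],
    after: Dict[str, float],
) -> List[Tuple[str, float, float, float]]:
    """Return (language, before, after, delta) rows, worst delta first.

    Only languages present in both runs are compared.
    """
    rows = []
    for lang in sorted(before.keys() & after.keys()):
        delta = round(after[lang] - before[lang], 4)
        rows.append((lang, before[lang], after[lang], delta))
    rows.sort(key=lambda row: row[3])  # regressions surface at the top
    return rows


# Placeholder scores for two unnamed languages (NOT measured values)
baseline = {"lang_a": 0.40, "lang_b": 0.30}
finetuned = {"lang_a": 0.45, "lang_b": 0.28}
report = mrr_deltas(baseline, finetuned)
for lang, b, a, d in report:
    print(f"{lang}: {b:.2f} -> {a:.2f} (delta {d:+.2f})")
```

Sorting by delta puts any language that regressed after fine-tuning first, which is where a closer look at data quality is most likely to pay off.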

Agent Features

Memory

retrieval memory (paragraph-grounded passages)

Tool Use

LLMs for translation and synthetic data generation

Architectures

dense retrieval
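
As a rough illustration of dense retrieval, the sketch below ranks passages by cosine similarity between a query vector and passage vectors. In a real run the vectors would come from an encoder such as BGE-M3 or multilingual e5-large; the three-dimensional toy vectors and the function names here are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_passages(
    query_vec: Sequence[float],
    passage_vecs: Dict[str, Sequence[float]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every passage against the query and return the top_k by similarity."""
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for encoder output
passages = {"p1": [1.0, 0.0, 0.0], "p2": [0.6, 0.8, 0.0], "p3": [0.0, 0.0, 1.0]}
query = [0.8, 0.6, 0.0]
top = rank_passages(query, passages, top_k=2)
```

This brute-force scan is fine for a 1,000-query benchmark over a small passage pool; larger corpora would normally use an approximate nearest-neighbor index instead.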

Optimization Features

Training Optimization

paragraph-level generation to retain context

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Machine translation and LLM-generated triplets can introduce semantic drift or errors; authors mitigate this with human post-editing but residual issues may remain.

Low-resource languages (e.g., Assamese, Odia) still show substantially lower MRR, indicating dataset scale or Wikipedia coverage limits.

When Not To Use

When a high-stakes domain (medical, legal) requires strictly human-verified ground truth and the synthetic data has not been further validated by humans.

When the target language lacks sufficient Wikipedia coverage, since the synthetic triplets may then be noisy or sparse.

Failure Modes

Hallucinated or ungrounded QA pairs from LLM generation harming retriever behavior.

Translation artifacts that change query intent and degrade retrieval evaluation.
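
One cheap way to screen for ungrounded QA pairs is to check whether the answer string actually appears in its source passage. The sketch below is a hypothetical heuristic, not something the paper describes, and it only catches extractive answers; a correct but paraphrased answer would be flagged as ungrounded.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization, case-fold, and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


def answer_is_grounded(answer: str, passage: str) -> bool:
    """Return True if the normalized answer occurs verbatim in the normalized passage."""
    normalized_answer = normalize(answer)
    return bool(normalized_answer) and normalized_answer in normalize(passage)


# Toy checks with placeholder text
passage = "The capital of India is New Delhi."
grounded = answer_is_grounded("new   delhi", passage)
ungrounded = answer_is_grounded("Mumbai", passage)
```

NFC normalization matters for Indic scripts, where the same visible text can be encoded with different code point sequences. Flagged pairs are best sent for human review instead of being dropped automatically.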

Core Entities

Models

LLaMA 3.3 70B, LLaMA 3.1 8B Instruct, Multilingual e5-small, Multilingual e5-base, Multilingual e5-large, BGE-M3, Llama2? (mentioned generically)

Metrics

MRR

Datasets

MS MARCO, IndicMSMARCO, Wikipedia-based triplet corpus (IndicRAGSuite), Translated MS MARCO (14 languages), INDIC-MARCO (prior work)

Benchmarks

IndicMSMARCO, MS MARCO

Context Entities

Models

mDPR, mContriever, mE5, text-embedding-ada-002

Metrics

BLEU (mentioned for translation quality), MRR (primary retrieval metric)

Datasets

NQ, SQuAD, TriviaQA, BEIR, MKQA, TyDi QA, MIRACL
