IndicRAGSuite: a 13-language retrieval benchmark plus ~14M synthetic QA triplets for Indian-language RAG

June 2, 2025 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Pasunuti Prasanjith, Prathmesh B More, Anoop Kunchukuttan, Raj Dabre

Links

Abstract / PDF

Why It Matters For Business

If you build search, QA, or assistant features for Indian users, IndicRAGSuite provides both a standard test (IndicMSMARCO) and large training data (~14M triplets) to reduce development time and improve retrieval in many Indian languages.

Summary TLDR

This paper builds two core resources to enable Retrieval-Augmented Generation (RAG) in Indian languages: (1) IndicMSMARCO, a human-verified multilingual benchmark of 1,000 MS MARCO queries translated into 13 Indian languages for retrieval and generation evaluation; and (2) training data consisting of roughly 14 million question-answer-reasoning triplets generated from Wikipedia paragraphs across 19 Indian languages, together with paragraph-level translations of the MS MARCO train/dev sets into 14 Indian languages. Baselines show that modern dense retrievers (BGE-M3, multilingual e5-large) reach MRR ≈ 0.50 on several languages, while low-resource languages lag. The datasets and benchmark aim to standardize evaluation and to

Problem Statement

Indian languages lack both standardized benchmarks and large-scale multilingual training data for dense retrieval and RAG. Existing resources are English-centric or cover only a few Indian languages, causing poor retrieval performance and slow progress for Indian-language RAG systems.

Main Contribution

IndicMSMARCO: a human-verified multilingual retrieval benchmark (1,000 queries) across 13 Indian languages, created by LLaMA 3.3 70B translation followed by expert post-editing.

Large Wikipedia-based training corpus: about 14 million question-answer-reasoning triplets across 19 Indian languages generated by LLaMA 3.3 70B from paragraph-level Wikipedia, filtered for length and quality.
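This summary does not specify the length and quality filter, so the following is only a minimal sketch of what such a filter might look like. The `Triplet` record, its field names, and both word-count thresholds are assumptions made for illustration, not values from the paper.

```python
from dataclasses import dataclass


@dataclass
class Triplet:
    """One synthetic training record (field names are hypothetical)."""
    question: str
    answer: str
    reasoning: str
    passage: str  # source Wikipedia paragraph
    lang: str     # language code, e.g. "hi"


# Hypothetical bounds; the paper's actual thresholds are not given here.
MIN_PASSAGE_WORDS = 20
MAX_PASSAGE_WORDS = 300


def keep_triplet(t: Triplet) -> bool:
    """Return True if the record passes simple length and completeness checks."""
    n_words = len(t.passage.split())
    if not MIN_PASSAGE_WORDS <= n_words <= MAX_PASSAGE_WORDS:
        return False
    # Drop records where the generator left any field empty.
    return all(field.strip() for field in (t.question, t.answer, t.reasoning))


def filter_triplets(triplets):
    """Keep only the records that pass keep_triplet."""
    return [t for t in triplets if keep_triplet(t)]
```

Whitespace word counts are a crude proxy for length; a real pipeline would likely add model-based or human quality checks on top.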

Paragraph-level translated MS MARCO: translation of MS MARCO train/dev into 14 Indian languages using IndicTrans3-beta to preserve context and search intent for supervised retriever training.

Key Findings

IndicMSMARCO provides a high-quality multilingual benchmark of real queries.

Numbers: 1,000 queries; 13 languages

A large synthetic training corpus was produced from Wikipedia.

Numbers: ≈14M triplets across 19 languages

Modern dense retrievers reach about 0.50 MRR on several Indian languages.

Numbers: MRR up to 0.52 (Hindi), 0.50 (Telugu), 0.49 (Malayalam/Tamil)

Results

MRR (Hindi)

Value: 0.52

MRR (Telugu)

Value: 0.50

MRR (Malayalam / Tamil)

Value: 0.49
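For readers unfamiliar with the metric: MRR (mean reciprocal rank) averages, over all queries, the reciprocal of the rank at which the first relevant passage appears; a query with no relevant passage in the ranked list contributes 0. A minimal sketch follows. The function names are illustrative, and the cutoff `k=10` follows common MS MARCO practice and is an assumption here, since this summary does not state the cutoff used.

```python
def reciprocal_rank(ranked_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant passage in the top k, else 0.0."""
    for rank, passage_id in enumerate(ranked_ids[:k], start=1):
        if passage_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, qrels, k=10):
    """Average reciprocal rank over all judged queries.

    runs:  query id -> list of passage ids, best first
    qrels: query id -> set of relevant passage ids
    """
    if not qrels:
        return 0.0
    total = sum(
        reciprocal_rank(runs.get(query_id, []), relevant, k)
        for query_id, relevant in qrels.items()
    )
    return total / len(qrels)


# Toy example: first relevant hit at rank 2, at rank 3, and not found.
runs = {"q1": ["p3", "p1"], "q2": ["p9", "p8", "p2"], "q3": ["p5"]}
qrels = {"q1": {"p1"}, "q2": {"p2"}, "q3": {"p0"}}
score = mean_reciprocal_rank(runs, qrels)  # (1/2 + 1/3 + 0) / 3
```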

Who Should Care

What To Try In 7 Days

Run BGE-M3 and multilingual e5-large on IndicMSMARCO to replicate baseline MRR and find language gaps.

Fine-tune a dense retriever on a small slice of the Wikipedia triplets for a target language and compare MRR before/after.

Replace sentence-level translated data with the paper's paragraph-level translated MS MARCO examples and measure retrieval fidelity.
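Each of these experiments rests on the same ranking step: embed the query and the candidate passages, then order the passages by similarity to the query. A minimal sketch of that step, assuming the embeddings have already been computed by whichever retriever is under test. The vectors below are toy placeholders rather than real model output, cosine similarity is an assumed scoring choice, and `rank_passages` is a hypothetical helper, not part of any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_passages(query_vec, passage_vecs):
    """Return passage ids sorted by descending cosine similarity to the query.

    passage_vecs: passage id -> embedding vector
    """
    scored = [(cosine(query_vec, vec), pid) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored]


# Toy 2-d embeddings standing in for real retriever output.
query = [1.0, 0.0]
passages = {"a": [0.0, 1.0], "b": [1.0, 0.1], "c": [0.5, 0.5]}
ranking = rank_passages(query, passages)
```

Running the same ranking before and after fine-tuning, or with sentence-level versus paragraph-level training data, and scoring each ranking with MRR per language gives the comparisons described above.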

Agent Features

Memory

  • retrieval memory (paragraph-grounded passages)

Tool Use

  • LLMs for translation and synthetic data generation

Architectures

  • dense retrieval

Optimization Features

Training Optimization

  • paragraph-level generation to retain context

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Machine translation and LLM-generated triplets can introduce semantic drift or errors; authors mitigate this with human post-editing but residual issues may remain.
  • Low-resource languages (e.g., Assamese, Odia) still show substantially lower MRR, indicating dataset scale or Wikipedia coverage limits.
  • The paper's text provides no public code or explicit URLs for reproducing the full pipeline.

When Not To Use

  • In high-stakes domains (medical, legal) that require strictly human-verified ground truth, unless the synthetic data receives further human validation.
  • If the target language lacks sufficient Wikipedia coverage, because synthetic triplets may then be noisy or sparse.

Failure Modes

  • Hallucinated or ungrounded QA pairs from LLM generation harming retriever behavior.
  • Translation artifacts that change query intent and degrade retrieval evaluation.
  • Overfitting to Wikipedia-style language and reduced performance on web search queries.
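One cheap way to screen for the first failure mode is to check how much of a generated answer actually appears in its source passage. The sketch below is a hypothetical heuristic and not the authors' method; the function names, the punctuation set, and the `threshold` value are all assumptions for illustration.

```python
# Punctuation stripped from token edges; includes the danda used in several Indic scripts.
_PUNCT = ".,;:!?\"'()[]।"


def _tokens(text):
    """Split on whitespace, strip edge punctuation, lowercase, drop empties."""
    stripped = (tok.strip(_PUNCT).lower() for tok in text.split())
    return [tok for tok in stripped if tok]


def answer_support(answer, passage):
    """Fraction of answer tokens that also occur in the passage (0.0 to 1.0)."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    passage_tokens = set(_tokens(passage))
    hits = sum(tok in passage_tokens for tok in answer_tokens)
    return hits / len(answer_tokens)


def is_grounded(answer, passage, threshold=0.6):
    """Flag an answer as grounded when enough of its tokens occur in the passage."""
    return answer_support(answer, passage) >= threshold
```

Exact token overlap misses paraphrases and inflected forms, which are common in morphologically rich languages, so a low score is better treated as a cue for manual review than as proof of hallucination.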

Core Entities

Models

  • LLaMA 3.3 70B
  • LLaMA 3.1 8B Instruct
  • Multilingual e5-small
  • Multilingual e5-base
  • Multilingual e5-large
  • BGE-M3
  • Llama2? (mentioned generically)

Metrics

  • MRR

Datasets

  • MS MARCO
  • IndicMSMARCO
  • Wikipedia-based triplet corpus (IndicRAGSuite)
  • Translated MS MARCO (14 languages)
  • INDIC-MARCO (prior work)

Benchmarks

  • IndicMSMARCO
  • MS MARCO

Context Entities

Models

  • mDPR
  • mContriever
  • mE5
  • text-embedding-ada-002

Metrics

  • BLEU (mentioned for translation quality)
  • MRR (primary retrieval metric)

Datasets

  • NQ
  • SQuAD
  • TriviaQA
  • BEIR
  • MKQA
  • TyDi QA
  • MIRACL