MEMERAG: a native multilingual benchmark to evaluate RAG outputs and LLM-based evaluators

February 24, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico


Why It Matters For Business

If you deploy RAG across multiple languages, MEMERAG gives a native-language testbed to validate whether automatic evaluators match human judgements; use it before trusting LLM judges for quality gating.

Summary TLDR

MEMERAG is a released dataset and benchmark for evaluating retrieval-augmented generation (RAG) outputs in five native languages (EN, DE, ES, FR, HI). The authors collect 1,250 question–context–answer triplets (2,322 annotated sentences) with sentence-level human labels for faithfulness and relevance, show high inter-annotator agreement for coarse labels, and use the data to benchmark automatic evaluators (LLM-as-a-judge). Adding the paper's annotation guidelines to prompts (AG) and combining them with chain-of-thought (COT) raises balanced accuracy of automatic evaluators from ~60% to ~71% on the multilingual faithfulness task. The dataset targets multilingual, native-question evaluation of

Problem Statement

Existing RAG meta-evaluation datasets focus on English or translated data. Translations introduce biases and fail to capture native-language nuances. There is no native multilingual end-to-end meta-evaluation benchmark that includes human judgements of faithfulness and relevance for model-generated answers across languages.

Main Contribution

A native multilingual meta-evaluation benchmark for RAG (MEMERAG) covering five languages: English, German, Spanish, French, Hindi.

Sentence-level human annotations for faithfulness (Supported / Not supported / Challenging) and relevance with high coarse-grained inter-annotator agreement.

Reference baselines showing AG (annotation-guidelines) prompts and AG+COT improve automatic evaluator correlation with human judgements; dataset and code released on GitHub.

Key Findings

High coarse-grained inter-annotator agreement for faithfulness and relevance.

Numbers: Faithfulness Gwet's AC1 0.84–0.93; Relevance Gwet's AC1 0.95–1.00 (Table 2)

Dataset size and scope: 1,250 answers and 2,322 annotated sentences across 5 languages.

Numbers: #Q = 1,250; #S = 2,322 (Table 1)

Adding annotation-guidelines (AG) to prompts raises automatic evaluator balanced accuracy.

Numbers: Avg BAcc: ZS ~59–66 → AG ~62–72; AG+COT best ~64–72 (Table 5)

Supported (faithful) sentence rates vary by language.

Numbers: Supported: EN 65.2%, DE 71.2%, ES 65.7%, FR 62.0%, HI 73.8% (Table 3)
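The agreement figures above are reported as Gwet's AC1. As a rough illustration of what that statistic measures, here is a minimal sketch of the standard two-rater AC1 formula for nominal labels. The function name and the toy labels are assumptions made for this example; the paper's own computation (number of annotators, label set) may differ.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 for two raters who each assign one nominal label per item.

    `categories` is the full label scheme; if omitted, it is inferred from
    the labels that actually occur (an assumption of this sketch).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(rater_a)
    if categories is None:
        categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")

    # Observed agreement: share of items where both raters chose the same label.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # pi_k: average share of items placed in category k across the two raters.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    pi = [(counts_a[c] + counts_b[c]) / (2 * n) for c in categories]

    # Chance agreement under Gwet's model.
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


# Toy labels, invented for illustration only.
a = ["Supported", "Supported", "Supported", "Not supported"]
b = ["Supported", "Supported", "Not supported", "Not supported"]
ac1 = gwet_ac1(a, b)
```

On the toy labels the raters agree on 3 of 4 items, and AC1 comes out at 9/17 (about 0.53) after the chance correction.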

Results

Dataset size (annotated sentences)

Value: 2,322 sentences; 1,250 answers

Inter-annotator agreement (faithfulness, coarse)

Value: Gwet's AC1 0.84–0.93

Balanced accuracy (BAcc)

Value: AG prompts: 62.8–72.6 BAcc; AG+COT: 61.6–71.7 BAcc, depending on model

Baseline: Zero-shot: ~55.4–66.7 (Table 5)

Per-language best evaluator BAcc (AG+COT)

Value: ranges by language, e.g., EN ~68–74, DE ~73–76, HI ~72–75
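The results above are reported as balanced accuracy (BAcc), which is the mean of per-class recall and so is not inflated when most sentences are Supported. The sketch below shows the computation and a per-language breakdown. The function names, the `(language, human_label, predicted_label)` record layout, and the toy data are assumptions for illustration, not the paper's code.

```python
from collections import defaultdict


def balanced_accuracy(human, predicted):
    """Mean of per-class recall, with classes taken from the human labels."""
    if len(human) != len(predicted) or not human:
        raise ValueError("need two equally long, non-empty label lists")
    total = defaultdict(int)
    correct = defaultdict(int)
    for h, p in zip(human, predicted):
        total[h] += 1
        correct[h] += int(h == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


def bacc_by_language(records):
    """Group (language, human_label, predicted_label) records and score each language."""
    grouped = defaultdict(lambda: ([], []))
    for lang, human, predicted in records:
        grouped[lang][0].append(human)
        grouped[lang][1].append(predicted)
    return {lang: balanced_accuracy(h, p) for lang, (h, p) in grouped.items()}


# Toy data: an evaluator that always answers "supported".
human = ["supported"] * 8 + ["not_supported"] * 2
always_supported = ["supported"] * 10
plain_acc = sum(h == p for h, p in zip(human, always_supported)) / len(human)
bacc = balanced_accuracy(human, always_supported)

per_lang = bacc_by_language([
    ("EN", "supported", "supported"),
    ("EN", "not_supported", "supported"),
    ("DE", "supported", "supported"),
    ("DE", "not_supported", "not_supported"),
])
```

In the toy case the always-"supported" evaluator reaches 0.8 plain accuracy but only 0.5 balanced accuracy, which is why BAcc is the more informative number when Supported rates sit around 62–74%.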

Who Should Care

What To Try In 7 Days

Run a small set of your RAG outputs through MEMERAG to compare your evaluator's BAcc to the baselines.

Add the paper's annotation-guidelines (AG) to your LLM-evaluator prompts and measure BAcc uplift.

Check per-language evaluator performance; do not assume English results transfer to other languages.
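One way to try the steps above is to keep a single prompt builder and toggle the annotation guidelines and chain-of-thought on and off, so that the zero-shot, AG, and AG+COT variants differ only in those two switches. The sketch below is hypothetical: the prompt wording, the function names, the two-label output format, and the guideline placeholder are all assumptions, and the actual guideline text should be taken from the paper or its repository.

```python
def build_judge_prompt(context, question, sentence, guidelines=None, use_cot=False):
    """Assemble a faithfulness-judging prompt (hypothetical wording, not the paper's)."""
    parts = [
        "Judge whether the sentence below, taken from an answer, "
        "is supported by the context."
    ]
    if guidelines:
        # AG variant: prepend the human annotation guidelines.
        parts.append("Annotation guidelines:\n" + guidelines.strip())
    parts.append(f"Context:\n{context}\n\nQuestion:\n{question}\n\nSentence:\n{sentence}")
    parts.append("Valid labels: Supported, Not supported.")
    if use_cot:
        # COT variant: ask for reasoning first, label on the last line.
        parts.append(
            "Think step by step, then give the final answer on the last line "
            "as 'Label: <label>'."
        )
    else:
        parts.append("Reply with 'Label: <label>' only.")
    return "\n\n".join(parts)


def parse_label(response):
    """Return the lowercased label from the last 'Label:' line, or None if absent."""
    for line in reversed(response.strip().splitlines()):
        if line.strip().lower().startswith("label:"):
            return line.split(":", 1)[1].strip().lower()
    return None


# Placeholder only: substitute the real guideline text from the paper's release.
GUIDELINES = "<paste the annotation guidelines here>"

zero_shot = build_judge_prompt("ctx", "q?", "s.")
ag_cot = build_judge_prompt("ctx", "q?", "s.", guidelines=GUIDELINES, use_cot=True)
label = parse_label("The context never mentions the date.\nLabel: Not supported")
```

Scoring the parsed labels against human labels for each variant, per language, then gives the BAcc uplift the steps above ask for.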

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only five languages annotated (EN, DE, ES, FR, HI); not exhaustive for global deployment.
  • Questions are native but not parallel across languages, so direct cross-language comparisons are confounded.
  • Annotation subjectivity remains for fine-grained faithfulness labels (lower IAA).
  • Limited set of prompting strategies and LLMs tested; further transfer and fine-tuning not explored.

When Not To Use

  • When you need parallel multilingual test items for strict cross-language comparisons.
  • As a training corpus for large-scale multilingual fine-tuning; the dataset is intended for meta-evaluation and analysis, not massive model training.
  • If you require fine-grained label consensus; fine-grained labels show lower agreement.

Failure Modes

  • LLM-as-judge self-preference bias: evaluators may favor their own generations (noted risk).
  • Automatic evaluators struggle most with 'adds new information' and 'nuance shift' errors (Table 13).
  • Prompt sensitivity: zero-shot and naive prompts underperform compared to AG+COT.

Core Entities

Models

  • Claude 3 Sonnet
  • Llama 3 70B
  • Llama 3 8B
  • Mistral 7B
  • GPT-4o mini
  • Qwen 2.5 32B
  • Llama 3.2 11B
  • Llama 3.2 90B

Metrics

  • Balanced accuracy (BAcc)
  • Gwet's AC1
  • Fleiss' kappa

Datasets

  • MIRACL
  • MEMERAG
  • MEMERAG-Ext

Benchmarks

  • MEMERAG (this work)