MEMERAG: a native multilingual benchmark to evaluate RAG outputs and LLM-based evaluators

Overview

Decision Snapshot: Ready For Pilot

Dataset and annotations are high quality for coarse-grained faithfulness labels; results show consistent prompt effects but cover only five languages and a limited set of LLMs.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you deploy RAG across multiple languages, MEMERAG gives a native-language testbed to validate whether automatic evaluators match human judgements; use it before trusting LLM judges for quality gating.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

MEMERAG is a released dataset and benchmark for evaluating retrieval-augmented generation (RAG) outputs in five native languages (EN, DE, ES, FR, HI). The authors collect 1,250 question–context–answer triplets (2,322 annotated sentences) with sentence-level human labels for faithfulness and relevance, show high inter-annotator agreement for coarse labels, and use the data to benchmark automatic evaluators (LLM-as-a-judge). Adding the paper's annotation guidelines to prompts (AG) and combining them with chain-of-thought (COT) raises balanced accuracy of automatic evaluators from ~60% to ~71% on the multilingual faithfulness task. The dataset targets multilingual, native-question evaluation of

Problem Statement

Existing RAG meta-evaluation datasets focus on English or translated data. Translations introduce biases and fail to capture native-language nuances. There is no native multilingual end-to-end meta-evaluation benchmark that includes human judgements of faithfulness and relevance for model-generated answers across languages.

Main Contribution

A native multilingual meta-evaluation benchmark for RAG (MEMERAG) covering five languages: English, German, Spanish, French, Hindi.

Sentence-level human annotations for faithfulness (Supported / Not supported / Challenging) and relevance with high coarse-grained inter-annotator agreement.

Key Findings

High coarse-grained inter-annotator agreement for faithfulness and relevance.

Numbers: Faithfulness Gwet's AC1 0.84–0.93; Relevance Gwet's AC1 0.95–1.00 (Table 2)

Practical Use: You can trust sentence-level Supported/Not supported labels for meta-evaluation; use them to benchmark automatic evaluators.

Evidence Ref: Table 2
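For readers unfamiliar with the agreement statistic, here is a minimal sketch of Gwet's AC1 for the two-rater case. It assumes exactly two annotators per sentence and nominal labels; the paper's actual annotator setup and computation may differ, and the function name and `categories` parameter are illustrative.

```python
from collections import Counter

def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 for two raters labelling the same items (nominal labels).

    `categories` is the full label set of the scheme; if omitted, the labels
    actually observed are used (an assumption of this sketch).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    cats = list(categories) if categories else sorted(set(rater_a) | set(rater_b))
    q = len(cats)
    if q < 2:
        return 1.0  # only one category in play: agreement is trivially perfect

    # Observed agreement: share of items where both raters chose the same label.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: (1 / (q - 1)) * sum_k pi_k * (1 - pi_k),
    # where pi_k is the mean share of category k across the two raters.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(
        pi * (1 - pi)
        for pi in ((counts_a[c] + counts_b[c]) / (2 * n) for c in cats)
    ) / (q - 1)

    return (p_a - p_e) / (1 - p_e)
```

Unlike Cohen's kappa, AC1 stays stable when one label dominates, which is the usual situation when most answer sentences are Supported.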

Dataset size and scope: 1,250 answers and 2,322 annotated sentences across 5 languages.

Numbers: #Q=1,250; #S=2,322 (Table 1)

Practical Use: Enough data to test multilingual evaluator prompts and models, but not a massive training corpus for large-scale fine-tuning.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size (annotated sentences)	2,322 sentences; 1,250 answers	—	—	MEMERAG	Table 1; Section 3	Table 1
Inter-annotator agreement (faithfulness, coarse)	Gwet's AC1 0.84–0.93	—	—	Per language (EN/DE/ES/FR/HI)	Table 2 reports per-language Gwet's AC1 for faithfulness.	Table 2

What To Try In 7 Days

Run your automatic evaluator on a small set of MEMERAG examples and compare its balanced accuracy (BAcc) to the paper's baselines.

Add the paper's annotation guidelines (AG) to your LLM-evaluator prompts and measure the BAcc uplift.

Check per-language evaluator performance; do not assume English results transfer to other languages.
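The three steps above can be sketched as a small harness. This is a sketch under stated assumptions, not the authors' pipeline: the prompt wording, the `GUIDELINES` placeholder, the record schema (`lang`, `gold`, `pred`) and all function names are hypothetical, and the call to your LLM judge is left out.

```python
from collections import defaultdict

# Placeholder: the paper's annotation guidelines are not reproduced here.
GUIDELINES = "<paste the annotation guidelines from the MEMERAG release here>"

def build_prompt(context, question, sentence, use_guidelines=False, use_cot=False):
    """Assemble an evaluator prompt; the wording is illustrative only."""
    parts = ["Decide whether the answer sentence is supported by the context."]
    if use_guidelines:  # the AG variant
        parts.append("Annotation guidelines:\n" + GUIDELINES)
    parts.append(
        f"Context: {context}\nQuestion: {question}\nAnswer sentence: {sentence}"
    )
    if use_cot:  # the chain-of-thought variant
        parts.append("Think step by step, then give the final label on the last line.")
    parts.append("Label (Supported / Not supported):")
    return "\n\n".join(parts)

def balanced_accuracy(gold, pred):
    """Mean of per-class recall over the classes present in the gold labels."""
    hits, totals = defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        totals[g] += 1
        hits[g] += int(g == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

def per_language_bacc(records):
    """Group records by language and score each group separately."""
    by_lang = defaultdict(lambda: ([], []))
    for r in records:
        gold, pred = by_lang[r["lang"]]
        gold.append(r["gold"])
        pred.append(r["pred"])
    return {lang: balanced_accuracy(g, p) for lang, (g, p) in by_lang.items()}
```

Running the same records through prompts built with and without `use_guidelines` and `use_cot`, then comparing the per-language scores, gives the uplift measurement described above.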

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/amazon-science/MEMERAG

Data URLs

https://github.com/amazon-science/MEMERAG

Risks & Boundaries

Limitations

Only five languages annotated (EN, DE, ES, FR, HI); not exhaustive for global deployment.

Questions are native but not parallel across languages, so direct cross-language comparisons are confounded.

When Not To Use

When you need parallel multilingual test items for strict cross-language comparisons.

As a training corpus for large-scale multilingual fine-tuning; the dataset is intended for meta-evaluation and analysis, not massive model training.

Failure Modes

LLM-as-judge self-preference bias: evaluators may favor their own generations (noted risk).

Automatic evaluators struggle most with 'adds new information' and 'nuance shift' errors (Table 13).
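To see whether your own evaluator shares the second failure mode, one option is to break its misses down by the human error type. The sketch below assumes each record carries a fine-grained human `error_type` and the evaluator's `pred` label; both field names and the function name are hypothetical, not part of the MEMERAG release.

```python
from collections import defaultdict

def miss_rate_by_error_type(records):
    """Per error type, the share of human-flagged unsupported sentences
    that the evaluator wrongly labelled 'Supported'.

    records: dicts with 'error_type' and 'pred' keys (hypothetical schema);
    every record is assumed to be a sentence humans marked Not supported.
    """
    missed, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["error_type"]] += 1
        missed[r["error_type"]] += int(r["pred"] == "Supported")
    return {etype: missed[etype] / total[etype] for etype in total}
```

A high miss rate on 'adds new information' or 'nuance shift' would match the pattern the paper reports for its evaluators.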

Core Entities

Models

Claude 3 Sonnet, Llama 3 70B, Llama 3 8B, Mistral 7B, GPT-4o mini, Qwen 2.5 32B, Llama 3.2 11B, Llama 3.2 90B

Metrics

Accuracy, Gwet's AC1, Fleiss Kappa

Datasets

MIRACL, MEMERAG, MEMERAG-Ext

Benchmarks

MEMERAG (this work)

You May Also Want to Read

Large Llama models beat smaller ones on iCliniq medical QA; LLM judges help measure clinical quality

PanCanBench: 282 real patient questions + 3,130 expert rubrics to test LLM clinical completeness and factuality

A judge-free, n-gram benchmark that approximates GPT-4 judging for Japanese Q&A

Professional multilingual TruthfulQA shows truth gaps across languages but smaller than expected

Bi'an: a bilingual RAG hallucination benchmark plus small fine-tuned judge models