MEMERAG: a native multilingual benchmark to evaluate RAG outputs and LLM-based evaluators

February 24, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Dataset and annotations are high quality for coarse-grained faithfulness; results show consistent prompt effects but cover only five languages and a finite set of LLMs.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico

Why It Matters For Business

If you deploy RAG across multiple languages, MEMERAG gives a native-language testbed to validate whether automatic evaluators match human judgements; use it before trusting LLM judges for quality gating.

Summary TLDR

MEMERAG is a released dataset and benchmark for evaluating retrieval-augmented generation (RAG) outputs in five native languages (EN, DE, ES, FR, HI). The authors collect 1,250 question–context–answer triplets (2,322 annotated sentences) with sentence-level human labels for faithfulness and relevance, show high inter-annotator agreement for coarse labels, and use the data to benchmark automatic evaluators (LLM-as-a-judge). Adding the paper's annotation guidelines to prompts (AG) and combining them with chain-of-thought (COT) raises balanced accuracy of automatic evaluators from ~60% to ~71% on the multilingual faithfulness task. The dataset targets multilingual, native-question evaluation of

Problem Statement

Existing RAG meta-evaluation datasets focus on English or translated data. Translations introduce biases and fail to capture native-language nuances. There is no native multilingual end-to-end meta-evaluation benchmark that includes human judgements of faithfulness and relevance for model-generated answers across languages.

Main Contribution

A native multilingual meta-evaluation benchmark for RAG (MEMERAG) covering five languages: English, German, Spanish, French, Hindi.

Sentence-level human annotations for faithfulness (Supported / Not supported / Challenging) and relevance with high coarse-grained inter-annotator agreement.

Key Findings

High coarse-grained inter-annotator agreement for faithfulness and relevance.

Numbers: Faithfulness Gwet's AC1 0.84 to 0.93; Relevance Gwet's AC1 0.95 to 1.00 (Table 2)

Practical Use: You can trust sentence-level Supported/Not supported labels for meta-evaluation; use them to benchmark automatic evaluators.

Evidence Ref: Table 2
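To make the agreement numbers above easier to interpret, here is a minimal sketch of Gwet's AC1 for the two-rater case, using the standard definition AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and p_e is the chance agreement. The function name, the `categories` parameter, and the toy labels are illustrative assumptions; the paper's own computation (number of annotators, label set, handling of missing labels) may differ.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 for two raters who labeled the same items.

    `categories` is the full label set of the annotation scheme; if omitted,
    it is inferred from the observed labels (an assumption of this sketch).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    if categories is None:
        categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q < 2:
        return 1.0  # only one category in play: agreement is trivially perfect

    # Observed agreement: share of items where both raters chose the same label.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: based on the mean share pi_k of each category
    # across the two raters, p_e = sum_k pi_k * (1 - pi_k) / (q - 1).
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(
        pi * (1 - pi)
        for pi in ((counts_a[c] + counts_b[c]) / (2 * n) for c in categories)
    ) / (q - 1)

    return (p_a - p_e) / (1 - p_e)


# Toy labels (made up for illustration, not taken from MEMERAG).
annotator_1 = ["S", "S", "S", "N", "S", "S", "S", "S", "S", "S"]
annotator_2 = ["S", "S", "S", "N", "S", "S", "S", "S", "S", "N"]
ac1 = gwet_ac1(annotator_1, annotator_2)
```

Unlike Cohen's or Fleiss' kappa, AC1 stays stable when one label dominates, which is the usual reason to report it for skewed label distributions such as mostly-Supported sentences.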

Dataset size and scope: 1,250 answers and 2,322 annotated sentences across 5 languages.

Numbers: #Q=1,250; #S=2,322 (Table 1)

Practical Use: Enough data to test multilingual evaluator prompts and models, but not a massive training corpus for large-scale fine-tuning.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Dataset size (annotated sentences) | 2,322 sentences; 1,250 answers | - | - | MEMERAG | Table 1; Section 3 | Table 1
Inter-annotator agreement (faithfulness, coarse) | Gwet's AC1 0.84 to 0.93 | - | - | Per language (EN/DE/ES/FR/HI) | Table 2 reports per-language Gwet's AC1 for faithfulness. | Table 2

What To Try In 7 Days

Run your automatic evaluator on MEMERAG and compare its balanced accuracy (BAcc) against the paper's baselines.

Add the paper's annotation-guidelines (AG) to your LLM-evaluator prompts and measure BAcc uplift.

Check per-language evaluator performance; do not assume English results transfer to other languages.
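The measurement behind these steps can be sketched as follows: compute balanced accuracy (the mean of per-class recall) separately for each language, so that a strong English score cannot hide a weak score elsewhere. The record layout `(language, gold_label, predicted_label)`, the function names, and the toy data are assumptions made for illustration; they do not reflect the released dataset's actual schema.

```python
from collections import defaultdict


def balanced_accuracy(gold, pred, labels=("Supported", "Not supported")):
    """Mean of per-class recall; classes with no gold items are skipped."""
    recalls = []
    for label in labels:
        idx = [i for i, g in enumerate(gold) if g == label]
        if not idx:
            continue
        recalls.append(sum(pred[i] == label for i in idx) / len(idx))
    if not recalls:
        raise ValueError("no gold items for any of the given labels")
    return sum(recalls) / len(recalls)


def per_language_bacc(records):
    """records: iterable of (language, gold_label, predicted_label) tuples."""
    by_lang = defaultdict(lambda: ([], []))
    for lang, gold, pred in records:
        by_lang[lang][0].append(gold)
        by_lang[lang][1].append(pred)
    return {lang: balanced_accuracy(g, p) for lang, (g, p) in by_lang.items()}


# Toy evaluator output (made up for illustration, not MEMERAG data).
records = [
    ("EN", "Supported", "Supported"),
    ("EN", "Supported", "Supported"),
    ("EN", "Not supported", "Not supported"),
    ("EN", "Not supported", "Supported"),
    ("HI", "Supported", "Supported"),
    ("HI", "Supported", "Not supported"),
    ("HI", "Supported", "Supported"),
    ("HI", "Not supported", "Not supported"),
]
scores = per_language_bacc(records)
```

To estimate the uplift from a prompt change, run the same function on predictions from the baseline prompt and from the prompt with annotation guidelines, then subtract the per-language scores.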

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Only five languages annotated (EN, DE, ES, FR, HI); not exhaustive for global deployment.

Questions are native but not parallel across languages, so direct cross-language comparisons are confounded.

When Not To Use

When you need parallel multilingual test items for strict cross-language comparisons.

As a training corpus for large-scale multilingual fine-tuning: the dataset is intended for meta-evaluation and analysis, not massive model training.

Failure Modes

LLM-as-judge self-preference bias: evaluators may favor their own generations (noted risk).

Automatic evaluators struggle most with 'adds new information' and 'nuance shift' errors (Table 13).

Core Entities

Models

Claude 3 Sonnet, Llama 3 70B, Llama 3 8B, Mistral 7B, GPT-4o mini, Qwen 2.5 32B, Llama 3.2 11B, Llama 3.2 90B

Metrics

Accuracy, Gwet's AC1, Fleiss' Kappa

Datasets

MIRACL, MEMERAG, MEMERAG-Ext

Benchmarks

MEMERAG (this work)