Overview
The dataset and its annotations are high quality for coarse-grained faithfulness labels. Results show consistent prompt effects, but they cover only five languages and a limited set of LLMs.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
If you deploy RAG across multiple languages, MEMERAG gives a native-language testbed to validate whether automatic evaluators match human judgements; use it before trusting LLM judges for quality gating.
Who Should Care
Summary TLDR
MEMERAG is a released dataset and benchmark for evaluating retrieval-augmented generation (RAG) outputs in five native languages (EN, DE, ES, FR, HI). The authors collect 1,250 question–context–answer triplets (2,322 annotated sentences) with sentence-level human labels for faithfulness and relevance, show high inter-annotator agreement for coarse labels, and use the data to benchmark automatic evaluators (LLM-as-a-judge). Adding the paper's annotation guidelines to prompts (AG) and combining them with chain-of-thought (COT) raises balanced accuracy of automatic evaluators from ~60% to ~71% on the multilingual faithfulness task. The dataset targets multilingual, native-question evaluation of
Problem Statement
Existing RAG meta-evaluation datasets focus on English or translated data. Translations introduce biases and fail to capture native-language nuances. There is no native multilingual end-to-end meta-evaluation benchmark that includes human judgements of faithfulness and relevance for model-generated answers across languages.
Main Contribution
A native multilingual meta-evaluation benchmark for RAG (MEMERAG) covering five languages: English, German, Spanish, French, Hindi.
Sentence-level human annotations for faithfulness (Supported / Not supported / Challenging) and relevance with high coarse-grained inter-annotator agreement.
Key Findings
High coarse-grained inter-annotator agreement for faithfulness and relevance.
Dataset size and scope: 1,250 answers and 2,322 annotated sentences across 5 languages.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset size (annotated sentences) | 2,322 sentences; 1,250 answers | — | — | MEMERAG | Table 1; Section 3 | Table 1 |
| Inter-annotator agreement (faithfulness, coarse) | Gwet's AC1 0.84–0.93 | — | — | Per language (EN/DE/ES/FR/HI) | Table 2 reports per-language Gwet's AC1 for faithfulness. | Table 2 |
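
The agreement statistic in the table, Gwet's AC1, can be computed for your own annotation pass with a few lines of code. The following is a minimal sketch of the standard two-rater formula, not the authors' implementation: the function name `gwet_ac1`, its arguments, and the example labels are all hypothetical, and the paper may have used a multi-rater variant.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories):
    """Gwet's AC1 for two raters who labelled the same items.

    `categories` is the full label set of the scheme (an assumption the
    caller supplies), so unused labels still count toward K.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    unknown = (set(rater_a) | set(rater_b)) - set(categories)
    if unknown:
        raise ValueError(f"labels outside the scheme: {sorted(unknown)}")

    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # pi_k: average share of items the two raters assigned to category k.
    pi = [(counts_a[c] + counts_b[c]) / (2 * n) for c in categories]
    # Chance agreement under Gwet's model.
    p_e = sum(p * (1 - p) for p in pi) / (k - 1)

    return (p_a - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical labels for ten sentences, illustrative only.
    a = ["S", "S", "S", "N", "S", "N", "S", "S", "S", "S"]
    b = ["S", "S", "S", "S", "S", "N", "S", "S", "S", "S"]
    print(round(gwet_ac1(a, b, categories=["S", "N"]), 3))
```

AC1 is often preferred over Cohen's kappa when one label dominates, because its chance-agreement term does not grow as the label distribution becomes skewed.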
What To Try In 7 Days
Run your automatic evaluator on a small sample of MEMERAG and compare its balanced accuracy (BAcc) to the paper's baselines.
Add the paper's annotation guidelines (AG) to your LLM-evaluator prompts and measure the BAcc uplift.
Check per-language evaluator performance; do not assume English results transfer to other languages.
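The checks above all reduce to computing balanced accuracy (the mean of per-class recall) overall and per language. Here is a minimal sketch, assuming evaluator predictions and human labels are already available as plain tuples; the record layout and both function names are hypothetical and do not reflect the released data format.

```python
from collections import defaultdict


def balanced_accuracy(gold, pred):
    """Mean of per-class recall over the classes present in `gold`."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    total = defaultdict(int)
    correct = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def balanced_accuracy_by_language(records):
    """`records` is an iterable of (language, gold_label, predicted_label)."""
    gold_by_lang = defaultdict(list)
    pred_by_lang = defaultdict(list)
    for lang, gold, pred in records:
        gold_by_lang[lang].append(gold)
        pred_by_lang[lang].append(pred)
    return {
        lang: balanced_accuracy(gold_by_lang[lang], pred_by_lang[lang])
        for lang in gold_by_lang
    }


if __name__ == "__main__":
    # Hypothetical records, illustrative only.
    records = [
        ("en", "supported", "supported"),
        ("en", "not_supported", "supported"),
        ("hi", "supported", "supported"),
        ("hi", "not_supported", "not_supported"),
    ]
    for lang, score in sorted(balanced_accuracy_by_language(records).items()):
        print(lang, round(score, 2))
```

To measure the uplift from adding guidelines, run the same function on predictions from the baseline prompt and from the AG prompt, then subtract the scores per language.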
Risks & Boundaries
Limitations
Only five languages annotated (EN, DE, ES, FR, HI); not exhaustive for global deployment.
Questions are native but not parallel across languages, so direct cross-language comparisons are confounded.
When Not To Use
When you need parallel multilingual test items for strict cross-language comparisons.
As a training corpus for large-scale multilingual fine-tuning: the dataset is intended for meta-evaluation and analysis, not for training models at scale.
Failure Modes
LLM-as-a-judge self-preference bias: evaluators may favor their own generations (noted risk).
Automatic evaluators struggle most with 'adds new information' and 'nuance shift' errors (Table 13).

