Overview
Production Readiness: 0.7
Novelty Score: 0.6
Cost Impact Score: 0.6
Citation Count: 0
Why It Matters For Business
If you deploy RAG across multiple languages, MEMERAG gives a native-language testbed to validate whether automatic evaluators match human judgements; use it before trusting LLM judges for quality gating.
Summary TLDR
MEMERAG is a released dataset and benchmark for evaluating retrieval-augmented generation (RAG) outputs in five native languages (EN, DE, ES, FR, HI). The authors collect 1,250 question–context–answer triplets (2,322 annotated sentences) with sentence-level human labels for faithfulness and relevance, show high inter-annotator agreement for coarse labels, and use the data to benchmark automatic evaluators (LLM-as-a-judge). Adding the paper's annotation guidelines to prompts (AG) and combining them with chain-of-thought (COT) raises balanced accuracy of automatic evaluators from ~60% to ~71% on the multilingual faithfulness task. The dataset targets multilingual, native-question evaluation of
Problem Statement
Existing RAG meta-evaluation datasets focus on English or translated data. Translations introduce biases and fail to capture native-language nuances. There is no native multilingual end-to-end meta-evaluation benchmark that includes human judgements of faithfulness and relevance for model-generated answers across languages.
Main Contribution
- A native multilingual meta-evaluation benchmark for RAG (MEMERAG) covering five languages: English, German, Spanish, French, Hindi.
- Sentence-level human annotations for faithfulness (Supported / Not supported / Challenging) and relevance, with high coarse-grained inter-annotator agreement.
- Reference baselines showing that annotation-guidelines (AG) prompts and AG+COT improve automatic evaluators' agreement with human judgements; dataset and code released on GitHub.
Key Findings
- High coarse-grained inter-annotator agreement for faithfulness and relevance.
- Dataset size and scope: 1,250 answers and 2,322 annotated sentences across 5 languages.
- Adding annotation guidelines (AG) to prompts and combining them with chain-of-thought (COT) raises automatic-evaluator balanced accuracy from ~60% to ~71% on the multilingual faithfulness task.
- Supported (faithful) sentence rates vary by language.
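The AG and AG+COT prompt variants can be sketched as a small prompt builder. This is a minimal illustration, not the paper's actual prompts: the function name `build_judge_prompt`, the instruction wording, and `PLACEHOLDER_GUIDELINES` are all assumptions made for the example. Replace the placeholder with the annotation guidelines shipped with the dataset.

```python
from typing import Optional

# Placeholder text only: NOT the paper's annotation guidelines.
PLACEHOLDER_GUIDELINES = (
    "Label the sentence 'Supported' if every claim in it is backed by the "
    "context; otherwise label it 'Not supported'."
)


def build_judge_prompt(
    question: str,
    context: str,
    sentence: str,
    guidelines: Optional[str] = None,
    use_cot: bool = False,
) -> str:
    """Assemble a faithfulness-judging prompt (hypothetical wording).

    guidelines=None, use_cot=False -> zero-shot baseline
    guidelines set                 -> AG variant
    guidelines set, use_cot=True   -> AG+COT variant
    """
    parts = [
        "You are evaluating whether an answer sentence is faithful "
        "to the given context."
    ]
    if guidelines:
        parts.append("Annotation guidelines:\n" + guidelines)
    parts.append(f"Question: {question}")
    parts.append(f"Context: {context}")
    parts.append(f"Answer sentence: {sentence}")
    if use_cot:
        parts.append(
            "Reason step by step, then give the final label on a new line "
            "as 'Label: Supported' or 'Label: Not supported'."
        )
    else:
        parts.append(
            "Reply with only 'Label: Supported' or 'Label: Not supported'."
        )
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(
        build_judge_prompt(
            question="Wer schrieb Faust?",
            context="Faust ist eine Tragödie von Johann Wolfgang von Goethe.",
            sentence="Faust wurde von Goethe geschrieben.",
            guidelines=PLACEHOLDER_GUIDELINES,
            use_cot=True,
        )
    )
```

Keeping the three variants behind one function makes it straightforward to run the same items through each and compare the resulting labels against the human annotations.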
Results
Dataset size (annotated sentences): 2,322
Inter-annotator agreement (faithfulness, coarse)
Balanced accuracy (multilingual faithfulness): ~60% to ~71% with AG+COT
Per-language best evaluator BAcc (AG+COT)
Who Should Care
What To Try In 7 Days
- Run your automatic evaluator on a small sample of MEMERAG and compare its BAcc to the paper's baselines.
- Add the paper's annotation guidelines (AG) to your LLM-evaluator prompts and measure the BAcc uplift.
- Check per-language evaluator performance; do not assume English results transfer to other languages.
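The per-language check above can be sketched with a few lines of Python. Balanced accuracy is the mean of per-class recall, so it is not inflated when one label (for example Supported) dominates. The record layout `(language, human_label, predicted_label)` and the function names are assumptions made for this example, not the dataset's actual schema.

```python
from collections import defaultdict


def balanced_accuracy(human, predicted):
    """Mean per-class recall, over the classes present in the human labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for h, p in zip(human, predicted):
        total[h] += 1
        if h == p:
            correct[h] += 1
    if not total:
        raise ValueError("no labels given")
    return sum(correct[c] / total[c] for c in total) / len(total)


def bacc_by_language(records):
    """records: iterable of (language, human_label, predicted_label)."""
    grouped = defaultdict(lambda: ([], []))
    for lang, human, predicted in records:
        grouped[lang][0].append(human)
        grouped[lang][1].append(predicted)
    return {lang: balanced_accuracy(h, p) for lang, (h, p) in grouped.items()}


if __name__ == "__main__":
    # Toy data for illustration only; not taken from MEMERAG.
    toy = [
        ("en", "Supported", "Supported"),
        ("en", "Supported", "Supported"),
        ("en", "Not supported", "Supported"),
        ("en", "Not supported", "Not supported"),
        ("hi", "Supported", "Not supported"),
        ("hi", "Not supported", "Not supported"),
    ]
    print(bacc_by_language(toy))  # {'en': 0.75, 'hi': 0.5}
```

Running this once with the baseline prompt and once with the AG prompt gives the uplift per language rather than a single pooled number.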
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only five languages annotated (EN, DE, ES, FR, HI); not exhaustive for global deployment.
- Questions are native but not parallel across languages, so direct cross-language comparisons are confounded.
- Annotation subjectivity remains for fine-grained faithfulness labels (lower IAA).
- Limited set of prompting strategies and LLMs tested; further transfer and fine-tuning not explored.
When Not To Use
- When you need parallel multilingual test items for strict cross-language comparisons.
- As a training corpus for large-scale multilingual fine-tuning; the dataset is intended for meta-evaluation and analysis, not large-scale model training.
- If you require fine-grained label consensus; fine-grained labels show lower agreement.
Failure Modes
- LLM-as-judge self-preference bias: evaluators may favor their own generations (noted risk).
- Automatic evaluators struggle most with 'adds new information' and 'nuance shift' errors (Table 13).
- Prompt sensitivity: zero-shot and naive prompts underperform compared to AG+COT.
Core Entities
Models
- Claude 3 Sonnet
- Llama 3 70B
- Llama 3 8B
- Mistral 7B
- GPT-4o mini
- Qwen 2.5 32B
- Llama 3.2 11B
- Llama 3.2 90B
Metrics
- Balanced accuracy (BAcc)
- Gwet's AC1
- Fleiss Kappa
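The two agreement metrics can be sketched as follows, using the standard textbook definitions. Both share the same observed agreement and differ only in the chance-agreement term, which is why Gwet's AC1 stays stable when one label dominates while Fleiss' kappa can drop sharply. The function name `agreement_stats` and the input layout are assumptions for this example; it assumes every item has at least two ratings, ideally the same number per item, and that all labels appear in `categories`.

```python
from collections import Counter


def agreement_stats(ratings, categories):
    """ratings: one list of labels per item (one label per annotator).

    Returns observed agreement, Fleiss' kappa, and Gwet's AC1.
    """
    n_items = len(ratings)
    q = len(categories)
    if n_items == 0 or q < 2:
        raise ValueError("need at least one item and two categories")

    observed = 0.0
    prevalence = {c: 0.0 for c in categories}
    for item in ratings:
        r = len(item)
        if r < 2:
            raise ValueError("each item needs at least two ratings")
        counts = Counter(item)
        # Fraction of annotator pairs that agree on this item.
        observed += sum(k * (k - 1) for k in counts.values()) / (r * (r - 1))
        for c in categories:
            prevalence[c] += counts[c] / r
    observed /= n_items
    prevalence = {c: v / n_items for c, v in prevalence.items()}

    # Chance agreement: Fleiss uses sum(p^2); Gwet uses sum(p(1-p)) / (Q-1).
    pe_fleiss = sum(p * p for p in prevalence.values())
    pe_ac1 = sum(p * (1 - p) for p in prevalence.values()) / (q - 1)

    # Kappa is undefined when every rating falls in a single category.
    kappa = (
        float("nan")
        if pe_fleiss == 1.0
        else (observed - pe_fleiss) / (1 - pe_fleiss)
    )
    ac1 = (observed - pe_ac1) / (1 - pe_ac1)
    return {"observed": observed, "fleiss_kappa": kappa, "gwet_ac1": ac1}


if __name__ == "__main__":
    # Toy data: two annotators, one dominant label. Not from MEMERAG.
    toy = [["S", "S"], ["S", "S"], ["S", "S"], ["S", "N"]]
    print(agreement_stats(toy, categories=["S", "N"]))
```

On this toy input the observed agreement is 0.75, yet Fleiss' kappa is slightly negative (about -0.14) while Gwet's AC1 is 0.68, which illustrates why both are worth reporting when label distributions are skewed.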
Datasets
- MIRACL
- MEMERAG
- MEMERAG-Ext
Benchmarks
- MEMERAG (this work)

