Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
3
Why It Matters For Business
RAGElo cuts expert labeling cost by using synthetic queries and LLM judges to rank retrieval-augmented systems, so teams can iterate and pick retrieval or fusion strategies faster while keeping a small human calibration step.
Summary TLDR
RAGElo is an open-source toolkit that automates evaluation of retrieval-augmented QA systems for private corpora. It builds a synthetic test set by prompting LLMs on document passages, uses a strong LLM as a pairwise judge that sees the retrieved documents, and ranks systems via Elo-style tournaments. On Infineon product documents, LLM-judged rankings agree moderately with expert rankings (Kendall τ≈0.56). RAG-Fusion (query variation plus reciprocal rank fusion) often earns a higher Elo score and improves answer completeness, but it reduces precision; BM25 retrieval outperformed off-the-shelf embeddings in these experiments. Use RAGElo for fast, repeatable system comparisons, not as a drop-in replacement for expert QA.
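The Elo-style ranking step can be illustrated with a minimal sketch. This is not RAGElo's actual code: the function names, the K-factor of 32, the initial rating of 1000, and the choice to average over randomly shuffled game orders are all assumptions made for illustration. Each game is a pairwise verdict from the judge (1.0 if the first system wins, 0.0 if the second wins, 0.5 for a tie).

```python
import random
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def run_tournament(games, k=32.0, initial=1000.0, seed=None):
    """Play all games once in a shuffled order and return final ratings.

    games: list of (system_a, system_b, outcome) tuples, where outcome is
    1.0 (A wins), 0.0 (B wins), or 0.5 (tie). k and initial are assumed values.
    """
    rng = random.Random(seed)
    order = list(games)
    rng.shuffle(order)  # Elo is order-dependent, so each tournament shuffles
    ratings = defaultdict(lambda: initial)
    for a, b, outcome in order:
        delta = k * (outcome - expected_score(ratings[a], ratings[b]))
        ratings[a] += delta  # zero-sum update: A gains what B loses
        ratings[b] -= delta
    return dict(ratings)


def average_elo(games, n_tournaments=500, k=32.0, initial=1000.0):
    """Average ratings over many shuffled tournaments to smooth out order effects."""
    totals = defaultdict(float)
    for i in range(n_tournaments):
        for name, rating in run_tournament(games, k=k, initial=initial, seed=i).items():
            totals[name] += rating
    return {name: total / n_tournaments for name, total in totals.items()}
```

Calling `average_elo` on a list of judge verdicts returns one averaged rating per system; sorting by that rating gives the leaderboard. Averaging matters because a single pass through the games depends on the order in which they are played.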
Problem Statement
Enterprise RAG systems need repeatable, low-cost evaluation but lack large gold-standard QA sets and expert annotations. Standard n-gram metrics fail without reference answers. The paper asks whether synthetic queries plus LLM-as-judge and Elo tournaments can rank RAG variants reliably and whether RAG-Fusion gives better answers.
Main Contribution
RAGElo toolkit: automates retrieval evaluation, pairwise LLM judging, and Elo-style ranking for RAG systems.
A synthetic test-set pipeline: generate evaluation queries by prompting LLMs on long document passages with few-shot real queries.
Empirical comparison on Infineon product docs showing RAG-Fusion yields higher Elo and greater completeness but lower precision versus standard RAG.
Evidence that LLM-as-a-judge moderately aligns with domain experts (statistical correlations and Bland-Altman analysis).
Practical guidance: BM25 beat off-the-shelf embeddings for this domain; rank fusion improved retrieval ranks.
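Agreement between judge and expert rankings, reported in the summary as Kendall τ, can be computed along these lines. The sketch below implements the tau-b variant (which corrects for ties) in pure Python; whether the paper used tau-a or tau-b is not stated here, so that choice and the function name are assumptions.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (O(n^2) pairs)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: counted in neither tie total
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1  # both lists order this pair the same way
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0
```

Here `x` would hold the LLM judge's scores and `y` the expert scores for the same items. A value near 1 means the two orderings almost always agree, and a value near 0 means they are unrelated.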
Key Findings
LLM-as-a-judge moderately matches human experts.
RAG-Fusion achieved a higher Elo ranking than standard RAG on these queries.
RAG-Fusion improved completeness but reduced precision versus RAG (expert judgments).
BM25 outperformed off-the-shelf vector embeddings on retrieval for this domain.
LLM judge scores show a small positive bias relative to human scores, with wide disagreement on individual items.
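The bias finding comes from a Bland-Altman analysis, which looks at per-item differences between two raters. A minimal sketch, assuming paired numeric scores on the same scale; the function name and the conventional 1.96 multiplier for the limits of agreement are choices made here, not details taken from the paper.

```python
from statistics import mean, stdev


def bland_altman(judge_scores, expert_scores):
    """Return (bias, lower_limit, upper_limit) for paired scores.

    bias is the mean of (judge - expert); the limits of agreement are
    bias +/- 1.96 sample standard deviations of the differences.
    """
    if len(judge_scores) != len(expert_scores) or len(judge_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    diffs = [j - e for j, e in zip(judge_scores, expert_scores)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread
```

A positive `bias` means the judge scores higher than the experts on average. Wide limits mean that individual items can differ substantially even when the average difference is small, which is the pattern described in the finding above.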
Results
MRR@5 (very relevant)
Elo score (averaged over 500 tournaments)
Pairwise win % (BM25)
Judge vs expert agreement
Who Should Care
What To Try In 7 Days
Run RAGElo on a small slice of your internal docs to compare BM25 vs your current embedding retriever.
Generate synthetic evaluation queries by prompting an LLM on representative document passages with a few real queries as examples.
Run a quick RAGElo tournament between your baseline RAG and a RAG-Fusion variant to check completeness vs precision trade-offs.
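For the BM25-versus-embeddings comparison, MRR@5 is the retrieval metric used in the summary. A minimal sketch of that metric follows, assuming you already have ranked document IDs per query and a set of documents judged relevant for each query; the data structures and function name are hypothetical.

```python
def mrr_at_k(results, relevant, k=5):
    """Mean reciprocal rank of the first relevant document within the top k.

    results:  dict mapping query id -> ranked list of doc ids
    relevant: dict mapping query id -> set of relevant doc ids
    Queries with no relevant document in the top k contribute 0.
    """
    if not results:
        return 0.0
    total = 0.0
    for query_id, ranked_docs in results.items():
        relevant_docs = relevant.get(query_id, set())
        for rank, doc_id in enumerate(ranked_docs[:k], start=1):
            if doc_id in relevant_docs:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(results)
```

Computing `mrr_at_k` once per retriever on the same query set gives a directly comparable number for each.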
Agent Features
Planning
- query-variation generation for retrieval (RAG-Fusion)
Tool Use
- LLM-as-a-judge for pairwise comparisons
- reciprocal rank fusion (RRF) to combine rankings
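Reciprocal rank fusion can be sketched in a few lines. Each document scores the sum of 1/(k + rank) over every ranking in which it appears. The constant k=60 is the value commonly used in the RRF literature and is an assumption here, as is the function name; the summary does not state which value the paper used.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first.

    rankings: list of ranked lists (e.g. one per query variation).
    Ties in fused score are broken by doc id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

In a RAG-Fusion setup, each input list would be the retrieval result for one generated query variation, and the fused list is what gets passed to the answer generator.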
Frameworks
- RAGElo (Elo-based tournament for RAG evaluations)
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- LLM-as-a-judge shows only moderate agreement with experts and a small positive bias.
- Experiments run on a single internal product corpus; results may not generalize.
- Synthetic queries depend on prompt design and injected passages; they may not match live user behavior.
When Not To Use
- When you require gold-standard, human-verified reference answers for compliance or legal checks.
- When the judge LLM cannot access the same documents or context as the systems being evaluated.
- When strict product-level precision is mandatory and cannot be traded off against completeness.
Failure Modes
- Judge LLM hallucinates or misses domain facts despite seeing documents.
- RAG-Fusion produces comprehensive but imprecise answers that confuse downstream users.
- A mismatched embedding model yields poor kNN retrieval, which makes system comparisons misleading.
Core Entities
Models
- gpt-4-turbo
- gpt-4o
- Claude 3 Opus
- Claude 3 Sonnet
- Claude 3 Haiku
- multilingual-e5-base (embeddings)
Metrics
- MRR@5
- Elo score
- Pairwise win %
- Kendall τ
- Spearman ρ
- Bland-Altman bias and limits of agreement
- p-values (paired t-tests)
Datasets
- Infineon XENSIV Product Selection Guide (117-page corpus)
- Synthetic query pool N=840 (sampled 200 for eval)
Context Entities
Models
- GPT-4 turbo (used to generate synthetic queries and judge in some configs)
- Anthropic Claude 3 family (query generation)
Datasets
- User query examples from Infineon (23 seed queries used as few-shot prompts)

