RAGElo: use synthetic queries + LLM-as-judge + Elo tournaments to compare RAG vs RAG-Fusion on company docs

June 20, 2024 · 9 min

Overview

Decision Snapshot: Needs Validation

RAGElo is a practical, usable toolkit for comparative evaluation; evidence is moderate because experiments use a single enterprise corpus and LLM-judge agreement is only moderate, so expect to calibrate with human labels before production rollout.

Citations: 3

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Zackary Rackauckas, Arthur Câmara, Jakub Zavrel

Why It Matters For Business

RAGElo cuts expert labeling cost by using synthetic queries and LLM judges to rank retrieval-augmented systems, so teams can iterate and pick retrieval or fusion strategies faster while keeping a small human calibration step.

Summary TLDR

RAGElo is an open-source toolkit that automates evaluation of retrieval-augmented QA systems for private corpora. It builds a synthetic test set by prompting LLMs on document passages, uses a strong LLM as a pairwise judge that sees the retrieved documents, and ranks systems via Elo-style tournaments. On Infineon product documents, LLM-judged rankings agree moderately with experts (Kendall τ ≈ 0.56). RAG-Fusion (query variation + reciprocal rank fusion) often earns a higher Elo score and improves answer completeness, but reduces precision; BM25 retrieval outperformed off-the-shelf embeddings in these experiments. Use RAGElo for fast, repeatable system comparisons, not as a drop-in replacement for expert QA.

Problem Statement

Enterprise RAG systems need repeatable, low-cost evaluation but lack large gold-standard QA sets and expert annotations. Standard n-gram metrics fail without reference answers. The paper asks whether synthetic queries plus LLM-as-judge and Elo tournaments can rank RAG variants reliably and whether RAG-Fusion gives better answers.

Main Contribution

RAGElo toolkit: automates retrieval evaluation, pairwise LLM judging, and Elo-style ranking for RAG systems.

A synthetic test-set pipeline: generate evaluation queries by prompting LLMs on long document passages with few-shot real queries.
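The synthetic test-set step can be sketched as a prompt builder plus a parser for the model's reply. This is a minimal illustration, not the toolkit's actual prompt: the function names `build_query_prompt` and `parse_queries`, the prompt wording, and the default of five questions per passage are all assumptions made for this example.

```python
import re

def build_query_prompt(passage: str, seed_queries: list[str], n_queries: int = 5) -> str:
    """Assemble a few-shot prompt asking an LLM to write user questions for a passage.

    The wording is hypothetical; adapt it to your domain and model.
    """
    examples = "\n".join(f"- {q}" for q in seed_queries)
    return (
        "You are helping build an evaluation set for a document QA system.\n"
        "Here are examples of real user questions:\n"
        f"{examples}\n\n"
        f"Read the passage below and write {n_queries} new questions "
        "a user could answer from it, one per line.\n\n"
        f"Passage:\n{passage}\n"
    )

# Matches a leading bullet ("-", "*") or a list number ("1.", "2)").
_LIST_PREFIX = re.compile(r"^\s*(?:[-*]|\d+[.)])\s*")

def parse_queries(llm_output: str) -> list[str]:
    """Split the model's reply into clean questions, dropping case-insensitive duplicates."""
    seen: set[str] = set()
    queries: list[str] = []
    for line in llm_output.splitlines():
        q = _LIST_PREFIX.sub("", line).strip()
        if q and q.lower() not in seen:
            seen.add(q.lower())
            queries.append(q)
    return queries
```

The prompt string would be sent to whichever LLM client you already use; the call itself is left out because it depends on the provider.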

Key Findings

LLM-as-a-judge moderately matches human experts.

Numbers: Kendall τ ≈ 0.56, p < 0.01; Spearman ρ ≈ 0.59

Practical Use: LLM judges can speed up relative system comparisons, but individual scores are noisy, so calibrate with some expert labels before trusting absolute scores.

Evidence Ref: Section 7.1; Figure 4
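To check judge agreement on your own calibration set, a rank correlation such as Kendall τ can be computed directly from paired scores. The sketch below implements the simple tau-a variant, where tied pairs count as neither concordant nor discordant; this summary does not say which variant the paper used, so treat the choice as an assumption.

```python
from itertools import combinations

def kendall_tau(x: list[float], y: list[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs.

    x and y hold two sets of scores for the same items, for example
    LLM-judge scores and expert scores for the same answers.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means a tie in x or y; tau-a ignores it in the numerator
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A value near 1 means the judge orders items the way experts do; a value around 0.5 to 0.6, as reported here, means the ordering agrees for most pairs but disagrees on a noticeable minority.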

RAG-Fusion achieved higher Elo ranking than RAG on these queries.

Numbers: Elo: RAGF+BM25 = 571 vs RAG+BM25 = 487 (Table 6)

Practical Use: When comparing variants, include a RAG-Fusion configuration with reciprocal rank fusion; on similar engineering-document tasks, expect it to win more pairwise comparisons than plain RAG.

Evidence Ref: Table 6
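An Elo ranking averaged over many tournaments can be sketched as follows. This uses the standard Elo update with a 400-point scale; the K-factor of 32, the initial rating of 1000, and the function names are assumptions for illustration, and the toolkit's actual settings may differ (the reported scores suggest a different starting point).

```python
import random

def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def run_tournament(games, agents, k=32, initial=1000, rng=None):
    """Play all games once in random order and return the final ratings.

    games: iterable of (agent_a, agent_b, score_a), where score_a is
    1.0 for an A win, 0.0 for a B win, and 0.5 for a tie.
    """
    ratings = {a: float(initial) for a in agents}
    order = list(games)
    (rng or random).shuffle(order)
    for a, b, score_a in order:
        delta = k * (score_a - expected_score(ratings[a], ratings[b]))
        ratings[a] += delta  # zero-sum update: A gains what B loses
        ratings[b] -= delta
    return ratings

def average_elo(games, agents, n_tournaments=500, seed=0):
    """Average final ratings over many shuffled tournaments to reduce order effects."""
    rng = random.Random(seed)
    totals = {a: 0.0 for a in agents}
    for _ in range(n_tournaments):
        for agent, rating in run_tournament(games, agents, rng=rng).items():
            totals[agent] += rating
    return {a: t / n_tournaments for a, t in totals.items()}
```

Under this standard formula, an 84-point gap like the one reported corresponds to an expected win rate of roughly 62% for the higher-rated system, assuming the same 400-point scale applies.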

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| MRR@5 (very relevant) | RAGF BM25 = 0.855; RAG BM25 = 0.821 | RAG BM25 = 0.821 | +0.034 | Infineon queries; Table 4 | Table 4: MRR@5 very relevant | Table 4 |
| Elo score (averaged over 500 tournaments) | RAGF+BM25 = 571; RAG+BM25 = 487 | RAG+BM25 = 487 | +84 | 500 tournaments on sampled synthetic queries | Table 6 Elo rankings | Table 6 |
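For reference, MRR@5 is the mean over queries of the reciprocal rank of the first relevant document within the top five results, with zero when none appears. A minimal sketch follows; the function name and input layout are assumptions for illustration.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=5):
    """Mean reciprocal rank of the first relevant document within the top k.

    ranked_lists: one ranked list of document ids per query.
    relevant_sets: one set of relevant document ids per query.
    """
    if not ranked_lists or len(ranked_lists) != len(relevant_sets):
        raise ValueError("need one relevant set per non-empty list of rankings")
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)
```

To reproduce a "very relevant" variant, the relevant sets would contain only documents judged at the highest relevance level.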

What To Try In 7 Days

Run RAGElo on a small slice of your internal docs to compare BM25 vs your current embedding retriever.

Generate synthetic evaluation queries by prompting an LLM on representative document passages with a few real queries as examples.

Run a quick RAGElo tournament between your baseline RAG and a RAG-Fusion variant to check completeness vs precision trade-offs.
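A pairwise judging step for such a tournament could look like the sketch below. It is not the toolkit's API: the prompt wording, the `VERDICT:` output convention, and both function names are hypothetical, chosen only to show how a judge that sees the retrieved documents can return a score usable in an Elo update.

```python
import re

def build_judge_prompt(query: str, docs: list[str], answer_a: str, answer_b: str) -> str:
    """Build a pairwise judging prompt that includes the retrieved documents."""
    context = "\n\n".join(f"[{i}] {d}" for i, d in enumerate(docs, start=1))
    return (
        "You are judging two answers to the same question.\n"
        "Use only the retrieved documents as evidence.\n\n"
        f"Question: {query}\n\n"
        f"Retrieved documents:\n{context}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        "End your reply with exactly one line: "
        "VERDICT: A, VERDICT: B, or VERDICT: TIE."
    )

_VERDICT = re.compile(r"VERDICT:\s*(A|B|TIE)\b", re.IGNORECASE)

def parse_verdict(reply: str):
    """Map the judge's last verdict line to a score for answer A.

    Returns 1.0 (A wins), 0.0 (B wins), 0.5 (tie), or None if no verdict was found.
    """
    matches = _VERDICT.findall(reply)
    if not matches:
        return None
    return {"A": 1.0, "B": 0.0, "TIE": 0.5}[matches[-1].upper()]
```

Running each pair twice with the answer order swapped is a common way to reduce position bias in the judge; replies that return `None` are best re-queried or dropped rather than counted as ties.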

Agent Features

Planning
query-variation generation for retrieval (RAG-Fusion)
Tool Use
LLM-as-a-judge for pairwise comparisons
reciprocal rank fusion (RRF) to combine rankings
Frameworks
RAGElo (Elo-based tournament for RAG evaluations)
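Reciprocal rank fusion, the step RAG-Fusion uses to merge the result lists from several query variations, can be sketched in a few lines. The constant k = 60 is the value commonly used for RRF; whether the paper's experiments used it is not stated in this summary, so treat it as an assumption.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))
```

In a RAG-Fusion setup, each inner list would be the retriever's output for one generated query variation, and the fused list is what gets passed to the answer-generating LLM.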

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

LLM-as-a-judge shows only moderate agreement with experts and a small positive bias.

Experiments run on a single internal product corpus; results may not generalize.
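To measure judge bias on your own calibration labels, a Bland-Altman style summary compares judge and expert scores on the same items. The sketch below computes the mean difference (bias) and the usual 95% limits of agreement, bias ± 1.96 standard deviations; the function name and the score scale are assumptions for illustration.

```python
from statistics import mean, stdev

def bland_altman(judge_scores, expert_scores):
    """Return (bias, lower_limit, upper_limit) for paired scores.

    bias is the mean of (judge - expert); a positive value means the
    judge scores higher than the experts on average.
    """
    if len(judge_scores) != len(expert_scores) or len(judge_scores) < 2:
        raise ValueError("need at least 2 paired scores")
    diffs = [j - e for j, e in zip(judge_scores, expert_scores)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - spread, bias + spread
```

A small positive bias with wide limits would match the pattern described above: the judge is slightly generous on average, and individual scores can still differ substantially from expert scores.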

When Not To Use

When you require gold-standard, human-verified reference answers for compliance or legal checks.

When the judge LLM cannot access the same documents or context as the systems being evaluated.

Failure Modes

Judge LLM hallucinates or misses domain facts despite seeing documents.

RAG-Fusion produces comprehensive but imprecise answers that confuse downstream users.

Core Entities

Models

gpt-4-turbo
gpt-4o
Claude 3 Opus
Claude 3 Sonnet
Claude 3 Haiku
multilingual-e5-base (embeddings)

Metrics

MRR@5
Elo score
Pairwise win %
Kendall τ
Spearman ρ
Bland-Altman bias/pair limits
p-values (paired t-tests)

Datasets

Infineon XENSIV Product Selection Guide (117-page corpus)
Synthetic query pool N=840 (sampled 200 for eval)

Context Entities

Models

GPT-4 turbo (used to generate synthetic queries and judge in some configs)
Anthropic Claude 3 family (query generation)

Datasets

User query examples from Infineon (23 seed queries used as few-shot prompts)