RAGElo: use synthetic queries + LLM-as-judge + Elo tournaments to compare RAG vs RAG-Fusion on company docs

Overview

Decision Snapshot: Needs Validation

RAGElo is a practical, usable toolkit for comparative evaluation; evidence is moderate because experiments use a single enterprise corpus and LLM-judge agreement is only moderate, so expect to calibrate with human labels before production rollout.

Citations: 3

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Zackary Rackauckas, Arthur Câmara, Jakub Zavrel

Links

Abstract / PDF / Code

Why It Matters For Business

RAGElo cuts expert labeling cost by using synthetic queries and LLM judges to rank retrieval-augmented systems, so teams can iterate and pick retrieval or fusion strategies faster while keeping a small human calibration step.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

RAGElo is an open-source toolkit that automates evaluation of retrieval-augmented QA systems for private corpora. It builds a synthetic test set by prompting LLMs on document passages, uses a strong LLM as a pairwise judge that sees the retrieved documents, and ranks systems via Elo-style tournaments. On Infineon product documents, LLM-judged rankings agree moderately with experts (Kendall τ≈0.56). RAG-Fusion (query variation + reciprocal rank fusion) often earns a higher Elo score and improves answer completeness, but at some cost to precision; BM25 retrieval outperformed off-the-shelf embeddings in these experiments. Use RAGElo for fast, repeatable system comparisons, not as a drop-in replacement for expert QA.

Problem Statement

Enterprise RAG systems need repeatable, low-cost evaluation but lack large gold-standard QA sets and expert annotations. Standard n-gram metrics fail without reference answers. The paper asks whether synthetic queries plus LLM-as-judge and Elo tournaments can rank RAG variants reliably and whether RAG-Fusion gives better answers.

Main Contribution

RAGElo toolkit: automates retrieval evaluation, pairwise LLM judging, and Elo-style ranking for RAG systems.

A synthetic test-set pipeline: generate evaluation queries by prompting LLMs on long document passages with few-shot real queries.
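
The Elo-style ranking named in the first contribution can be sketched as follows. This is a minimal illustration, not RAGElo's actual implementation: the initial rating of 1000, the K-factor of 32, and the 400-point scale are conventional Elo defaults assumed here, and the function names are hypothetical. Averaging over 500 shuffled tournaments mirrors the setup reported in the Results table.

```python
import random
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def run_tournament(games, initial=1000.0, k=32.0, seed=None):
    """Play one Elo tournament over (agent_a, agent_b, score_a) games.

    score_a is 1.0 if the judge preferred A, 0.0 if it preferred B,
    and 0.5 for a tie. Games are shuffled because Elo updates depend
    on match order. All defaults are assumptions, not paper values.
    """
    rng = random.Random(seed)
    order = list(games)
    rng.shuffle(order)
    ratings = defaultdict(lambda: initial)
    for a, b, score_a in order:
        exp_a = expected_score(ratings[a], ratings[b])
        delta = k * (score_a - exp_a)
        ratings[a] += delta  # A gains what B loses, so the total is conserved
        ratings[b] -= delta
    return dict(ratings)


def average_elo(games, n_tournaments=500, **kwargs):
    """Average ratings over many shuffled tournaments to damp order effects."""
    totals = defaultdict(float)
    for i in range(n_tournaments):
        for agent, rating in run_tournament(games, seed=i, **kwargs).items():
            totals[agent] += rating
    return {agent: total / n_tournaments for agent, total in totals.items()}
```

Each game would come from one pairwise LLM judgment, for example `("ragf_bm25", "rag_bm25", 1.0)` when the judge prefers the RAG-Fusion answer.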

Key Findings

LLM-as-a-judge moderately matches human experts.

Numbers: Kendall τ ≈ 0.56, p < 0.01; Spearman ρ ≈ 0.59

Practical Use: LLM judges can speed up relative system comparisons, but individual scores are noisy, so calibrate with some expert labels before trusting absolute scores.

Evidence Ref: Section 7.1; Figure 4
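
To make the agreement statistic concrete, here is a small sketch of Kendall's τ computed between two sets of scores for the same items (for example, expert scores and LLM-judge scores). It implements the simple tau-a variant, which ignores ties; which variant the paper used is not stated here, so treat this as an illustration only. The function name is hypothetical.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


expert = [1, 2, 3, 4, 5]
llm_judge = [1, 3, 2, 5, 4]  # two adjacent pairs swapped
print(kendall_tau(expert, llm_judge))  # 0.6
```

A value of 1 means the two raters order every pair the same way, and 0 means no association; the reported τ ≈ 0.56 sits in between, which is why a human calibration step is still advised.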

RAG-Fusion achieved higher Elo ranking than RAG on these queries.

Numbers: Elo: RAGF+BM25 = 571 vs RAG+BM25 = 487 (Table 6)

Practical Use: When comparing variants, include a RAG-Fusion configuration with rank fusion; on similar engineering-document tasks, expect it to win more pairwise comparisons overall.

Evidence Ref: Table 6
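
As a rough reading of the 84-point gap, assume the standard logistic Elo model with a 400-point scale (an assumption; the scale used in the paper is not stated here). The expected share of pairwise wins for RAGF+BM25 over RAG+BM25 would then be:

```latex
E_{\text{RAGF}} = \frac{1}{1 + 10^{-(571 - 487)/400}}
               = \frac{1}{1 + 10^{-0.21}}
               \approx \frac{1}{1 + 0.617}
               \approx 0.62
```

Under that assumption, the gap corresponds to RAG-Fusion being preferred in roughly 6 of 10 head-to-head judgments, a consistent but not overwhelming advantage.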

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
MRR@5 (very relevant)	RAGF BM25 = 0.855; RAG BM25 = 0.821	RAG BM25 = 0.821	+0.034	Infineon queries; Table 4	Table 4: MRR@5 very relevant	Table 4
Elo score (averaged over 500 tournaments)	RAGF+BM25 = 571; RAG+BM25 = 487	RAG+BM25 = 487	+84	500 tournaments on sampled synthetic queries	Table 6 Elo rankings	Table 6
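
For reference, MRR@5 in the first row can be computed as sketched below: for each query, take the reciprocal of the rank of the first relevant document within the top 5 (0 if none), then average over queries. The input layout and function name are assumptions made for this illustration.

```python
def mrr_at_k(relevance_lists, k=5):
    """Mean reciprocal rank of the first relevant hit within the top k.

    relevance_lists holds one list per query; each entry is True when the
    document at that rank was judged relevant (rank 1 comes first).
    """
    if not relevance_lists:
        raise ValueError("need at least one query")
    total = 0.0
    for flags in relevance_lists:
        for rank, is_relevant in enumerate(flags[:k], start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(relevance_lists)


judgments = [
    [True, False, False, False, False],          # first hit at rank 1 -> 1.0
    [False, True, False, False, False],          # first hit at rank 2 -> 0.5
    [False, False, False, False, False, True],   # hit at rank 6 is outside top 5 -> 0.0
]
print(mrr_at_k(judgments))  # 0.5
```

For the "very relevant" variant in the table, a document would count as relevant only when the judge assigns it the highest relevance grade.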

What To Try In 7 Days

Run RAGElo on a small slice of your internal docs to compare BM25 vs your current embedding retriever.

Generate synthetic evaluation queries by prompting an LLM on representative document passages with a few real queries as examples.

Run a quick RAGElo tournament between your baseline RAG and a RAG-Fusion variant to check completeness vs precision trade-offs.
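
The synthetic-query step in the list above can start from a prompt builder like the one below. The prompt wording, function name, and parameters are hypothetical and are not the prompt used in the paper; the sketch only shows the shape of the approach (a document passage plus a few real queries as few-shot examples). Sending the prompt to an LLM is left out on purpose.

```python
def build_query_prompt(passage, seed_queries, n_queries=5):
    """Assemble a few-shot prompt asking an LLM for questions about a passage.

    The wording below is illustrative only, not the paper's prompt.
    """
    if not seed_queries:
        raise ValueError("provide at least one real query as a few-shot example")
    examples = "\n".join(f"- {q}" for q in seed_queries)
    return (
        "You write questions that a user might ask about product documentation.\n"
        f"Example user questions:\n{examples}\n\n"
        f"Document passage:\n{passage}\n\n"
        f"Write {n_queries} new questions that this passage can answer, one per line."
    )


prompt = build_query_prompt(
    passage="Sensor X measures barometric pressure.",
    seed_queries=["Which sensor suits automotive use?"],
    n_queries=3,
)
print(prompt)
```

Keeping the seed queries real (the paper reports 23 of them) is what steers the generated questions toward the style of actual users.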

Agent Features

Planning

query-variation generation for retrieval (RAG-Fusion)

Tool Use

LLM-as-a-judge for pairwise comparisons

reciprocal rank fusion (RRF) to combine rankings
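
Reciprocal rank fusion is simple enough to sketch in a few lines: each document's fused score is the sum of 1 / (k + rank) over every ranked list it appears in. The constant k = 60 is the value commonly used for RRF; whether the paper used it is not stated here, and the function name is hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# One ranked list per query variation generated for RAG-Fusion.
per_query_rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
print(reciprocal_rank_fusion(per_query_rankings))  # ['b', 'a', 'c', 'd']
```

Documents that rank well across several query variations rise to the top, which fits the reported pattern of more complete but sometimes less precise answers.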

Frameworks

RAGElo (Elo-based tournament for RAG evaluations)

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/zetaalphavector/ragelo

Risks & Boundaries

Limitations

LLM-as-a-judge shows only moderate agreement with experts and a small positive bias.

Experiments run on a single internal product corpus; results may not generalize.

When Not To Use

When you require gold-standard, human-verified reference answers for compliance or legal checks.

When the judge LLM cannot access the same documents or context as the systems being evaluated.

Failure Modes

Judge LLM hallucinates or misses domain facts despite seeing documents.

RAG-Fusion produces comprehensive but imprecise answers that confuse downstream users.

Core Entities

Models

gpt-4-turbo

gpt-4o

Claude 3 Opus

Claude 3 Sonnet

Claude 3 Haiku

multilingual-e5-base (embeddings)

Metrics

MRR@5

Elo score

Pairwise win %

Kendall τ

Spearman ρ

Bland-Altman bias/pair limits

p-values (paired t-tests)

Datasets

Infineon XENSIV Product Selection Guide (117-page corpus)

Synthetic query pool N=840 (sampled 200 for eval)

Context Entities

Models

GPT-4 turbo (used to generate synthetic queries and judge in some configs)

Anthropic Claude 3 family (query generation)

Datasets

User query examples from Infineon (23 seed queries used as few-shot prompts)

