RAGElo: use synthetic queries + LLM-as-judge + Elo tournaments to compare RAG vs RAG-Fusion on company docs

June 20, 2024 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

3

Authors

Zackary Rackauckas, Arthur Câmara, Jakub Zavrel

Links

Abstract / PDF

Why It Matters For Business

RAGElo cuts expert labeling cost by using synthetic queries and LLM judges to rank retrieval-augmented systems, so teams can iterate and pick retrieval or fusion strategies faster while keeping a small human calibration step.

Summary TLDR

RAGElo is an open-source toolkit that automates evaluation of retrieval-augmented QA systems over private corpora. It builds a synthetic test set by prompting LLMs on document passages, uses a strong LLM as a pairwise judge that sees the retrieved documents, and ranks systems via Elo-style tournaments. On Infineon product documents, LLM-judged rankings agree moderately with experts (Kendall τ ≈ 0.56). RAG-Fusion (query variation plus reciprocal rank fusion) generally earns a higher Elo score and improves answer completeness but reduces precision; BM25 retrieval outperformed off-the-shelf embeddings in these experiments. Use RAGElo for fast, repeatable system comparisons, not as a drop-in replacement for expert QA.

Problem Statement

Enterprise RAG systems need repeatable, low-cost evaluation but lack large gold-standard QA sets and expert annotations. Standard n-gram metrics fail without reference answers. The paper asks whether synthetic queries plus LLM-as-judge and Elo tournaments can rank RAG variants reliably and whether RAG-Fusion gives better answers.

Main Contribution

RAGElo toolkit: automates retrieval evaluation, pairwise LLM judging, and Elo-style ranking for RAG systems.

A synthetic test-set pipeline: generate evaluation queries by prompting LLMs on long document passages with few-shot real queries.

Empirical comparison on Infineon product docs showing RAG-Fusion yields higher Elo and greater completeness but lower precision versus standard RAG.

Evidence that LLM-as-a-judge moderately aligns with domain experts (statistical correlations and Bland-Altman analysis).

Practical guidance: BM25 beat off-the-shelf embeddings for this domain; rank fusion improved retrieval ranks.
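The Elo-style ranking step can be illustrated with a minimal sketch. It assumes the pairwise judge verdicts have already been collected as (agent A, agent B, score for A) tuples; the K-factor, the initial rating, and the function names are illustrative assumptions, not values or APIs taken from RAGElo. Because Elo updates depend on game order, the sketch shuffles the games and averages ratings over many tournaments, in the spirit of the averaging the paper reports.

```python
import random


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A scores against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def play_tournament(games, k=32.0, initial=1000.0, rng=None):
    """Run one Elo pass over the games in random order.

    games: list of (agent_a, agent_b, score_a), score_a in {1.0, 0.5, 0.0}.
    k and initial are assumed values, not the paper's settings.
    """
    rng = rng or random.Random()
    ratings = {}
    for a, b, _ in games:
        ratings.setdefault(a, initial)
        ratings.setdefault(b, initial)
    order = list(games)
    rng.shuffle(order)
    for a, b, score_a in order:
        delta = k * (score_a - expected_score(ratings[a], ratings[b]))
        ratings[a] += delta  # zero-sum update: A gains what B loses
        ratings[b] -= delta
    return ratings


def average_elo(games, n_tournaments=500, seed=0, **kwargs):
    """Average ratings over many shuffled tournaments to damp order effects."""
    rng = random.Random(seed)
    totals = {}
    for _ in range(n_tournaments):
        for name, rating in play_tournament(games, rng=rng, **kwargs).items():
            totals[name] = totals.get(name, 0.0) + rating
    return {name: total / n_tournaments for name, total in totals.items()}


# Hypothetical judge verdicts: 5 wins, 3 ties, 2 losses for "ragf" against "rag".
games = ([("ragf", "rag", 1.0)] * 5
         + [("ragf", "rag", 0.5)] * 3
         + [("ragf", "rag", 0.0)] * 2)
elo = average_elo(games, n_tournaments=50)
```

With these made-up verdicts, `elo["ragf"]` ends above `elo["rag"]`, and the two ratings still sum to twice the initial rating because each update is zero-sum.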

Key Findings

LLM-as-a-judge moderately matches human experts.

Numbers: Kendall τ ≈ 0.56, p < 0.01; Spearman ρ ≈ 0.59

RAG-Fusion achieved higher Elo ranking than RAG on these queries.

Numbers: Elo RAGF+BM25 = 571 vs RAG+BM25 = 487 (Table 6)

RAG-Fusion improved completeness but reduced precision versus RAG (expert judgments).

Numbers: paired t-tests give completeness p ≈ 0.01 (RAGF > RAG) and precision p ≈ 0.04 (RAG > RAGF)

BM25 outperformed off-the-shelf vector embeddings on retrieval for this domain.

Numbers: MRR@5 (very relevant) is 0.855 for RAGF BM25 vs 0.396 for RAGF KNN, and 0.821 for RAG BM25 vs 0.407 for RAG KNN

LLM judge scores show a small positive bias and wide individual differences vs humans.

Numbers: Bland-Altman bias ≈ 0.12; limits of agreement ≈ -1.17 to 1.41
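For readers unfamiliar with Bland-Altman analysis, the sketch below shows how a bias and limits of agreement are commonly computed from paired judge and expert scores. It assumes the conventional construction (mean difference ± 1.96 sample standard deviations); the summary does not state which multiplier the paper used, though the reported limits are symmetric around the bias (the midpoint of -1.17 and 1.41 is 0.12), as this construction would produce. The function name and example scores are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(judge_scores, expert_scores, z=1.96):
    """Return (bias, lower_limit, upper_limit) for paired scores.

    bias is the mean of (judge - expert); the limits are bias ± z * SD of
    the differences. z = 1.96 is the conventional choice, assumed here.
    """
    diffs = [j - e for j, e in zip(judge_scores, expert_scores)]
    bias = mean(diffs)
    spread = z * stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - spread, bias + spread


# Made-up scores for illustration only.
bias, lower, upper = bland_altman([3, 4, 5], [2, 4, 4])
```

A positive bias means the judge scores higher than the experts on average; wide limits mean individual judgments can still differ substantially even when the average gap is small.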

Results

MRR@5 (very relevant)

Value: RAGF BM25 = 0.855; RAG BM25 = 0.821

Baseline: RAG BM25 = 0.821

Elo score (averaged over 500 tournaments)

Value: RAGF+BM25 = 571; RAG+BM25 = 487

Baseline: RAG+BM25 = 487

Pairwise win % (BM25)

Value: RAGF 49%; RAG 14.5%; Tie 36.5%

Baseline: tie/uniform

Judge vs expert agreement

Value: Kendall τ ≈ 0.56; Spearman ρ ≈ 0.59

Baseline: no agreement
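To make the agreement numbers concrete, here is a small standard-library sketch of Kendall's τ between judge and expert scores. It implements the tie-aware τ-b variant, which is an assumption: the summary does not say which variant the paper computed. The function name and the example scores are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length score lists.

    Counts concordant and discordant pairs; pairs tied in only one list
    enter the denominator, pairs tied in both lists are ignored.
    """
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Made-up judge and expert scores for the same five answers.
judge = [2, 1, 0, 2, 1]
expert = [2, 1, 1, 1, 0]
tau = kendall_tau_b(judge, expert)
```

A value of 1 means the judge orders answers exactly as the experts do, 0 means no association, and -1 means a fully reversed order.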

Who Should Care

What To Try In 7 Days

Run RAGElo on a small slice of your internal docs to compare BM25 vs your current embedding retriever.

Generate synthetic evaluation queries by prompting an LLM on representative document passages with a few real queries as examples.

Run a quick RAGElo tournament between your baseline RAG and a RAG-Fusion variant to check completeness vs precision trade-offs.
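The synthetic-query step above can be sketched as follows. This is a minimal illustration, not the paper's prompt or RAGElo's API: the prompt wording, the function names, and the `llm` callable (any function that takes a prompt string and returns the model's text reply) are all assumptions you would replace with your own client.

```python
def build_query_prompt(passage, seed_queries, n_queries=3):
    """Assemble a few-shot prompt asking for queries answerable from a passage.

    The wording is hypothetical; tune it for your own domain.
    """
    examples = "\n".join(f"- {q}" for q in seed_queries)
    return (
        "Here are examples of questions real users ask:\n"
        f"{examples}\n\n"
        "Read the following document passage:\n"
        f"{passage}\n\n"
        f"Write {n_queries} new questions in the same style that this "
        "passage can answer. Return one question per line."
    )


def generate_queries(passages, seed_queries, llm, n_queries=3):
    """Call a user-supplied llm(prompt) -> str once per passage."""
    queries = []
    for passage in passages:
        reply = llm(build_query_prompt(passage, seed_queries, n_queries))
        for line in reply.splitlines():
            cleaned = line.lstrip("-• ").strip()  # drop bullet markers
            if cleaned:
                queries.append(cleaned)
    return queries


# Stand-in model for a dry run; swap in a real LLM client call.
def fake_llm(prompt):
    return "- Which sensor supports I2C?\n- What is the supply voltage range?"


sample = generate_queries(["(passage text)"], ["Which radar sensor is smallest?"], fake_llm)
```

Keeping the model behind a plain callable makes it easy to try different generators, and to review a sample of the generated queries by hand before running a tournament on them.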

Agent Features

Planning

  • query-variation generation for retrieval (RAG-Fusion)

Tool Use

  • LLM-as-a-judge for pairwise comparisons
  • reciprocal rank fusion (RRF) to combine rankings
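Reciprocal rank fusion, listed above, is simple enough to sketch directly. Each document receives a score of 1 / (k + rank) from every ranking it appears in, and the scores are summed. The constant k = 60 is a commonly used default and an assumption here; the summary does not state the value used in the paper. The function name and document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    rankings: list of lists, each ordered best-first (for RAG-Fusion, one
    list per query variation). k dampens the weight of top ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Three hypothetical result lists from three query variations.
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
])
```

Here `fused` is `["b", "a", "c", "d"]`: document "b" ranks first twice and second once, so it accumulates the largest score.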

Frameworks

  • RAGElo (Elo-based tournament for RAG evaluations)

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • LLM-as-a-judge shows only moderate agreement with experts and a small positive bias.
  • Experiments run on a single internal product corpus; results may not generalize.
  • Synthetic queries depend on prompt design and injected passages; they may not match live user behavior.

When Not To Use

  • When you require gold-standard, human-verified reference answers for compliance or legal checks.
  • When the judge LLM cannot access the same documents or context as the systems being evaluated.
  • When strict product-level precision is mandatory and cannot be compensated by completeness.

Failure Modes

  • Judge LLM hallucinates or misses domain facts despite seeing documents.
  • RAG-Fusion produces comprehensive but imprecise answers that confuse downstream users.
  • Embedding model mismatch yields poor KNN retrieval, misleading comparisons.

Core Entities

Models

  • gpt-4-turbo
  • gpt-4o
  • Claude 3 Opus
  • Claude 3 Sonnet
  • Claude 3 Haiku
  • multilingual-e5-base (embeddings)

Metrics

  • MRR@5
  • Elo score
  • Pairwise win %
  • Kendall τ
  • Spearman ρ
  • Bland-Altman bias and limits of agreement
  • p-values (paired t-tests)
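Of the metrics above, MRR@5 is the one most teams will compute themselves. A minimal sketch follows, assuming each query comes with a set of document IDs judged relevant (for the "very relevant" variant reported here, that set would contain only documents given the top relevance grade). The function name and data are hypothetical.

```python
def mrr_at_k(retrieved_per_query, relevant_per_query, k=5):
    """Mean reciprocal rank of the first relevant document in the top k.

    retrieved_per_query: list of ranked doc-ID lists, one per query.
    relevant_per_query: list of sets of relevant doc IDs, same order.
    A query with no relevant document in its top k contributes 0.
    """
    if not retrieved_per_query:
        return 0.0
    total = 0.0
    for retrieved, relevant in zip(retrieved_per_query, relevant_per_query):
        for rank, doc_id in enumerate(retrieved[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(retrieved_per_query)


# Hits at rank 1 and rank 3, plus one miss: (1 + 1/3 + 0) / 3.
score = mrr_at_k(
    [["a", "b"], ["x", "y", "z"], ["m"]],
    [{"a"}, {"z"}, {"q"}],
)
```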

Datasets

  • Infineon XENSIV Product Selection Guide (117-page corpus)
  • Synthetic query pool N=840 (sampled 200 for eval)

Context Entities

Models

  • GPT-4 turbo (used to generate synthetic queries and judge in some configs)
  • Anthropic Claude 3 family (query generation)

Datasets

  • User query examples from Infineon (23 seed queries used as few-shot prompts)