RAGElo: use synthetic queries + LLM-as-judge + Elo tournaments to compare RAG vs RAG-Fusion on company docs
RAGElo cuts expert labeling cost by using synthetic queries and LLM judges to rank retrieval-augmented systems, so teams can iterate and pick retrieval or fusion strategies faster while keeping a small human calibration step.
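To make the tournament step concrete, here is a minimal sketch of how pairwise judge verdicts could be turned into Elo ratings. It is an illustration only, not RAGElo's actual implementation: the function names, the K-factor of 32, the starting rating of 1000, and the verdicts themselves are assumptions made up for this example.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def run_elo(matches, k: float = 32.0, initial: float = 1000.0) -> dict:
    """Update ratings from (system_a, system_b, score_a) tuples.

    score_a is 1.0 if the judge preferred A, 0.0 if it preferred B,
    and 0.5 for a tie. k and initial are assumed defaults.
    """
    ratings = defaultdict(lambda: initial)
    for a, b, score_a in matches:
        e_a = expected_score(ratings[a], ratings[b])
        delta = k * (score_a - e_a)
        ratings[a] += delta  # A gains what B loses, so the total is conserved
        ratings[b] -= delta
    return dict(ratings)


# Hypothetical judge verdicts on four synthetic queries (made-up data).
matches = [
    ("rag", "rag_fusion", 0.0),
    ("rag", "rag_fusion", 0.5),
    ("rag", "rag_fusion", 0.0),
    ("rag", "rag_fusion", 1.0),
]
ratings = run_elo(matches)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Sequential Elo updates depend on match order, so in practice one might shuffle the matches and average the ratings over several runs.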
Key finding
LLM-as-a-judge assessments agree moderately with those of human experts.
Numbers: Kendall τ ≈ 0.56 (p < 0.01); Spearman ρ ≈ 0.59
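For readers who want to run the same kind of agreement check on their own calibration set, the following sketch computes both rank correlations in plain Python. It assumes no tied scores (so it is Kendall's tau-a and the closed-form Spearman formula), does not compute p-values, and uses made-up scores; the exact variant behind the reported figures is not stated here.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def ranks(values):
    """1-based ranks of the values. Assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)). Assumes no ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical scores for five answers (made-up data, not from the study).
human = [1, 2, 3, 4, 5]
judge = [2, 1, 3, 5, 4]
tau = kendall_tau(human, judge)
rho = spearman_rho(human, judge)
print(f"Kendall tau = {tau:.2f}, Spearman rho = {rho:.2f}")
```

If SciPy is available, `scipy.stats.kendalltau` and `scipy.stats.spearmanr` handle ties and also return p-values.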

