Use all LLMs as judges: a fast, democratic way to rank models that matches human preference

May 19, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Yanbin Yin, Kun Zhou, Zhen Wang, Xiangdong Zhang, Yifei Shao, Shibo Hao, Yi Gu, Jieyuan Liu, Somanshu Singla, Tianyang Liu, Eric P. Xing, Zhengzhong Liu, Haojian Jin, Zhiting Hu

Links

Abstract / PDF

Why It Matters For Business

De-Arena lets teams benchmark many models cheaply while producing rankings that closely track human preference. It cuts evaluation cost and single-judge bias, so product and platform teams can maintain fair leaderboards and pick models for deployment faster.

Summary TLDR

The paper presents De-Arena, an automatic LLM evaluation system in which every evaluated model also serves as a judge. It combines coarse-to-fine incremental ranking (binary-search insertion followed by a local rerank) with an automatic representative-question selector. On 66 models and nine fine-grained dimensions, De-Arena reaches a Spearman correlation of up to 0.974 with the human-voted Chatbot Arena ranking, while using about 4x fewer pairwise comparisons than full ranking. The method reduces single-judge bias, scales better as more models join, and produces stable Elo-style leaderboards.

Problem Statement

Human pairwise voting scales poorly (millions of votes) and single-LLM judges introduce style or self-preference bias. We need an automatic, scalable method that matches human preferences, reduces single-judge bias, and keeps costs reasonable when evaluating many models across many fine-grained dimensions.

Main Contribution

De-Arena: a fully automatic, democratic evaluation framework where evaluated LLMs also vote on pairs.

Coarse-to-fine incremental ranking: binary-search insertion for a rough position, then local in-window reranking to refine.

Representative question selection: ranking-based filter that picks a small set of questions that yield consistent rankings.

Practical evaluation: applied to 66 LLMs across nine fine-grained dimensions and compared to multiple baselines and Chatbot Arena.
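The coarse-to-fine ranking step can be illustrated with a minimal Python sketch. It assumes an unweighted majority vote over judge callables and a best-first ranking list; the names `majority_prefers`, `binary_insert`, `local_rerank`, and `rank_models`, as well as the toy `strength` table, are hypothetical, and the paper's actual vote aggregation and window logic may differ.

```python
from typing import Callable, Dict, List

Judge = Callable[[str, str], bool]  # judge(a, b) is True if it prefers a over b


def majority_prefers(a: str, b: str, judges: List[Judge]) -> bool:
    """True if a strict majority of judges prefer model a over model b."""
    votes = sum(1 for judge in judges if judge(a, b))
    return votes * 2 > len(judges)


def binary_insert(ranking: List[str], new_model: str,
                  prefers: Callable[[str, str], bool]) -> int:
    """Coarse step: binary-search a rough slot in a best-first ranking."""
    lo, hi = 0, len(ranking)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefers(new_model, ranking[mid]):
            hi = mid          # new model belongs above ranking[mid]
        else:
            lo = mid + 1      # new model belongs below ranking[mid]
    ranking.insert(lo, new_model)
    return lo


def local_rerank(ranking: List[str], pos: int, window: int,
                 prefers: Callable[[str, str], bool]) -> None:
    """Fine step: insertion-sort only the slots within `window` of `pos`."""
    start = max(0, pos - window)
    end = min(len(ranking) - 1, pos + window)
    for i in range(start + 1, end + 1):
        j = i
        while j > start and prefers(ranking[j], ranking[j - 1]):
            ranking[j], ranking[j - 1] = ranking[j - 1], ranking[j]
            j -= 1


def rank_models(models: List[str], judges: List[Judge], window: int = 1) -> List[str]:
    """Insert models one at a time: coarse binary search, then local refinement."""
    ranking: List[str] = []

    def prefers(a: str, b: str) -> bool:
        return majority_prefers(a, b, judges)

    for model in models:
        pos = binary_insert(ranking, model, prefers)
        local_rerank(ranking, pos, window, prefers)
    return ranking


# Toy setup (hypothetical): two accurate judges and one contrarian judge.
strength: Dict[str, int] = {"a": 5, "b": 3, "c": 9, "d": 1}
judges: List[Judge] = [
    lambda x, y: strength[x] > strength[y],
    lambda x, y: strength[x] > strength[y],
    lambda x, y: strength[x] < strength[y],  # always votes the wrong way
]
final_ranking = rank_models(["a", "b", "c", "d"], judges, window=1)
print(final_ranking)
```

In this toy run the two accurate judges outvote the contrarian one, which is the intuition behind using many judges instead of a single one. Binary insertion needs on the order of log n comparisons per new model instead of n, and the local window only re-examines near neighbours.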

Key Findings

De-Arena aligns closely with human judges (Chatbot Arena).

Numbers: Spearman ρ = 0.974 (Overall, 66 LLMs)

De-Arena reduces comparison work versus full pairwise ranking.

Numbers: Avg comparisons per model: 521,495 vs. full-sample 2,245,874 (MT-Bench)

Democratic multi-judge setup beats single authoritative judges.

Numbers: De-Arena ρ = 0.956 vs. single-judge range ρ = 0.815–0.938 (MT-Bench)

Ranking stability improves as more judge models join.

Numbers: Correlation increases with judge pool size (best at 26 judges; Figure 3)

Representative question selection yields better consistency than other simple heuristics.
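One way to picture a ranking-based question filter is the sketch below. It is an assumption, not the paper's selector: each question is scored by how often its own model ranking orders pairs of models the same way as a reference ranking, and the top `k` questions are kept. The function names and toy data are hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_agreement(ranking: List[str], reference: List[str]) -> float:
    """Fraction of model pairs that both rankings order the same way."""
    pos_q = {m: i for i, m in enumerate(ranking)}
    pos_ref = {m: i for i, m in enumerate(reference)}
    pairs = list(combinations(reference, 2))
    agree = sum(
        1 for x, y in pairs
        if (pos_q[x] < pos_q[y]) == (pos_ref[x] < pos_ref[y])
    )
    return agree / len(pairs)


def select_questions(per_question: Dict[str, List[str]],
                     reference: List[str], k: int) -> List[str]:
    """Keep the k questions whose rankings agree most with the reference."""
    scored = sorted(
        per_question,
        key=lambda q: pairwise_agreement(per_question[q], reference),
        reverse=True,
    )
    return scored[:k]


# Toy data (hypothetical): best-first model rankings induced by each question.
reference = ["a", "b", "c", "d"]
per_question = {
    "q1": ["a", "b", "c", "d"],  # identical to the reference
    "q2": ["d", "c", "b", "a"],  # fully reversed
    "q3": ["a", "b", "d", "c"],  # one swapped pair
}
selected = select_questions(per_question, reference, k=2)
print(selected)
```

Questions like `q2`, whose induced ranking contradicts the consensus, add noise to the leaderboard and are dropped first under this heuristic.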

Results

Spearman correlation with Chatbot Arena (Overall)

Value: 0.974

Baseline: Chatbot Arena (human)

Spearman correlation with Chatbot Arena (Math)

Value: 0.959

Baseline: Chatbot Arena (human)

Average comparison (judge) counts per model (MT-Bench)

Value: 521,495

Baseline: Full-sample ranking, 2,245,874

Improvement over single-LLM judges (Spearman)

Value: De-Arena 0.956 vs. single judges 0.815–0.938

Baseline: Single-LLM judge methods
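To make the reported quantities concrete, the sketch below computes a Spearman correlation between two score tables and the comparison reduction implied by the MT-Bench counts above. The model names and scores are made-up placeholders, not values from the paper, and the rank formula used assumes no tied scores.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each model to its rank (1 = highest score); assumes no ties."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(order)}


def spearman(auto: Dict[str, float], human: Dict[str, float]) -> float:
    """Spearman rho over the models present in both leaderboards (no ties)."""
    common = sorted(set(auto) & set(human))
    ra = ranks({m: auto[m] for m in common})
    rh = ranks({m: human[m] for m in common})
    n = len(common)
    d2 = sum((ra[m] - rh[m]) ** 2 for m in common)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Placeholder leaderboards (hypothetical scores for five hypothetical models).
auto_scores = {"m1": 1250, "m2": 1180, "m3": 1100, "m4": 1020, "m5": 980}
human_scores = {"m1": 1270, "m2": 1150, "m3": 1160, "m4": 1000, "m5": 990}
rho = spearman(auto_scores, human_scores)

# Reduction factor from the reported MT-Bench average comparison counts.
reduction = 2_245_874 / 521_495

print(f"rho = {rho:.3f}, reduction = {reduction:.2f}x")
```

In the placeholder data only one adjacent pair (`m2`, `m3`) is swapped between the two leaderboards, which gives ρ = 0.9. The reported counts work out to roughly a 4.3x reduction in comparisons per model.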

Who Should Care

What To Try In 7 Days

Run De-Arena on a small set (10–20) of internal and public models to compare with current metrics.

Collect 100 open-ended questions for one capability and run the representative question selector to pick top ~32 examples.

Set window size to 1 and base seeds to 6 (recommended) to minimize judge counts while keeping accuracy (per ablations).

Check correlation to any existing human or gold-standard

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Emergent group bias remains possible if many models share training data or family tendencies.
  • Performance measured against Chatbot Arena; alignment to other human populations or languages is untested.
  • Requires a large and diverse judge pool to reach top reliability; small pools perform worse.
  • Relies on collected open-source questions; question pools may miss niche or adversarial cases.

When Not To Use

  • When legal, safety, or high-stakes decisions require human oversight and traceability.
  • When you only have a handful of judge models (low diversity).
  • For narrow tasks where closed-form ground truth accuracy is required instead of preference-based ranking.

Failure Modes

  • Groupthink: many similar models amplify shared biases and distort rankings.
  • Self-preference leakage: certain models may still bias votes toward similar-family outputs.
  • Contaminated question pools can produce misleading rankings if they favor a model's training data.
  • Insertion-order edge cases if initial seed set is poorly chosen, though experiments show low variance.

Core Entities

Models

  • GPT-4 / GPT-4o
  • LLaMA-3-70B
  • Gemma-2-27B
  • Qwen2-72B
  • Meta-LLaMA-3.3-70B-instruct
  • ChatGPT-4o-latest
  • Mix of 66 evaluated LLMs (see Table 12)

Metrics

  • Spearman correlation (ρ) vs Chatbot Arena
  • Average judge (comparison) counts
  • Elo scores
  • Win-rate distributions
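For readers unfamiliar with Elo scores, a minimal sketch of the classic Elo update for one pairwise result follows. This is the standard textbook formula with an assumed K-factor of 32; the exact Elo-style fitting used for the De-Arena leaderboards may differ.

```python
from typing import Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, a_wins: bool,
               k: float = 32.0) -> Tuple[float, float]:
    """Return updated (r_a, r_b) after one comparison; k=32 is an assumption."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta


# Two equally rated models: the winner gains exactly what the loser drops.
new_a, new_b = elo_update(1000.0, 1000.0, a_wins=True)
print(new_a, new_b)
```

An upset (a lower-rated model winning) moves ratings more than an expected result, because the update is proportional to the gap between the actual and the expected outcome.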

Datasets

  • MT-Bench
  • MT-Bench math sub-dimensions (Algebra, Geometry, Probability)
  • Various open-source open-ended question pools (collected per-dimension)

Benchmarks

  • Chatbot Arena (human)
  • MixEval
  • LiveBench
  • WildBench
  • Alpaca Eval 2.0
  • Auto Arena
  • PRD
  • Closed-ended benchmarks (CompassAcademic, MMLUPRO, etc.)