Use all LLMs as judges: a fast, democratic way to rank models that matches human preference

May 19, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

De-Arena demonstrates robust correlation with human judgments across many models and dimensions, with ablations and stability checks; evidence is empirical but centered on 66 LLMs and nine dimensions.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yanbin Yin, Kun Zhou, Zhen Wang, Xiangdong Zhang, Yifei Shao, Shibo Hao, Yi Gu, Jieyuan Liu, Somanshu Singla, Tianyang Liu, Eric P. Xing, Zhengzhong Liu, Haojian Jin, Zhiting Hu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

De-Arena lets teams benchmark many models cheaply while producing rankings close to human preference. It reduces cost and single-judge bias, so product and platform teams can maintain fair leaderboards and pick models for deployment faster.

Who Should Care

Summary TLDR

The paper presents De-Arena, an automatic LLM-evaluation system where every evaluated model also serves as a judge. It combines a coarse-to-fine incremental ranking (binary search + local rerank) and an automatic representative-question selector. On 66 models and nine fine-grained dimensions, De-Arena reaches up to 0.974 Spearman correlation with human-vote Chatbot Arena, while cutting pairwise comparisons by about 4x vs full ranking. The method reduces single-judge bias, scales better as more models join, and outputs stable Elo-style leaderboards.
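As a rough illustration of what an "Elo-style" leaderboard means, the sketch below applies the standard online Elo update to a list of pairwise outcomes. This is the textbook formula, not necessarily the scoring De-Arena uses; the function names, the initial rating of 1000, and the K-factor of 16 are assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: Dict[str, float], winner: str, loser: str, k: float = 16.0) -> None:
    """Shift rating points from loser to winner (zero-sum update)."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def elo_leaderboard(
    outcomes: Iterable[Tuple[str, str]],
    initial: float = 1000.0,  # assumed starting rating
    k: float = 16.0,          # assumed K-factor
) -> List[Tuple[str, float]]:
    """Build a leaderboard from (winner, loser) pairs, best model first."""
    ratings: Dict[str, float] = {}
    for winner, loser in outcomes:
        ratings.setdefault(winner, initial)
        ratings.setdefault(loser, initial)
        update_elo(ratings, winner, loser, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical outcomes: each pair is (winner, loser).
    for name, rating in elo_leaderboard([("a", "b"), ("a", "c"), ("b", "c")]):
        print(f"{name}: {rating:.1f}")
```

Because online Elo depends on the order of the outcomes, a production leaderboard would typically shuffle and average over several passes or fit all outcomes jointly.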

Problem Statement

Human pairwise voting scales poorly (millions of votes) and single-LLM judges introduce style or self-preference bias. We need an automatic, scalable method that matches human preferences, reduces single-judge bias, and keeps costs reasonable when evaluating many models across many fine-grained dimensions.

Main Contribution

De-Arena: a fully automatic, democratic evaluation framework where evaluated LLMs also vote on pairs.

Coarse-to-fine incremental ranking: binary-search insertion for a rough position, then local in-window reranking to refine.
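The coarse-to-fine idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: `prefers(a, b)` is a hypothetical callback standing in for the aggregated vote of the judge models, and the fine step here is a bounded neighbour swap, whereas the paper's in-window rerank may work differently.

```python
from typing import Callable, List

# Hypothetical signature: prefers(a, b) is True when the judge pool ranks
# model `a` above model `b` (for example, by majority vote over questions).
Prefers = Callable[[str, str], bool]


def binary_insert(ranking: List[str], new_model: str, prefers: Prefers) -> int:
    """Coarse step: find a rough position with O(log n) comparisons."""
    lo, hi = 0, len(ranking)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefers(new_model, ranking[mid]):
            hi = mid       # new model belongs above ranking[mid]
        else:
            lo = mid + 1   # new model belongs below ranking[mid]
    ranking.insert(lo, new_model)
    return lo


def local_rerank(ranking: List[str], pos: int, prefers: Prefers, window: int = 1) -> int:
    """Fine step: re-compare with neighbours at most `window` slots away."""
    model = ranking[pos]
    lo = max(0, pos - window)
    hi = min(len(ranking) - 1, pos + window)
    # Move up while the model beats its upper neighbour.
    while pos > lo and prefers(model, ranking[pos - 1]):
        ranking[pos - 1], ranking[pos] = ranking[pos], ranking[pos - 1]
        pos -= 1
    # Move down while the lower neighbour beats the model.
    while pos < hi and prefers(ranking[pos + 1], model):
        ranking[pos], ranking[pos + 1] = ranking[pos + 1], ranking[pos]
        pos += 1
    return pos


def add_model(ranking: List[str], new_model: str, prefers: Prefers, window: int = 1) -> int:
    """Insert `new_model` into `ranking` (best first) and return its final index."""
    pos = binary_insert(ranking, new_model, prefers)
    return local_rerank(ranking, pos, prefers, window)
```

With a noisy comparator the binary search can land a few slots off, which is what the local step is meant to correct; with a perfectly consistent comparator the local step makes no changes.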

Key Findings

De-Arena aligns closely with human judges (Chatbot Arena).

Numbers: Spearman ρ = 0.974 (Overall, 66 LLMs)

Practical Use: Use De-Arena as a cost-effective proxy for human preference in many benchmarking tasks; expect near-human agreement on rankings for large model pools.

Evidence Ref: Table 2
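To check this kind of alignment on your own model pool, Spearman's ρ between two rankings can be computed directly. The sketch below uses the closed-form formula for tie-free rankings, ρ = 1 − 6·Σd² / (n(n² − 1)); the function name and the example model names are placeholders, and rankings with ties would need a rank-averaging variant instead.

```python
from typing import List, Sequence


def spearman_rho(ranking_a: Sequence[str], ranking_b: Sequence[str]) -> float:
    """Spearman rho for two tie-free rankings of the same items (best first)."""
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two items")
    if len(ranking_b) != n or len(set(ranking_a)) != n or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same unique items")
    position_b = {item: i for i, item in enumerate(ranking_b)}
    # Sum of squared rank differences between the two orderings.
    d_squared = sum((i - position_b[item]) ** 2 for i, item in enumerate(ranking_a))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Placeholder rankings; substitute your automatic and human leaderboards.
    automatic: List[str] = ["model_a", "model_b", "model_c", "model_d"]
    human: List[str] = ["model_a", "model_c", "model_b", "model_d"]
    print(spearman_rho(automatic, human))
```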

De-Arena reduces comparison work versus full pairwise ranking.

Numbers: Avg comparisons per model: 521,495 vs Full Sample 2,245,874 (MT-Bench)

Practical Use: You can evaluate many models with roughly 4x fewer model-vote operations, lowering compute and time needed to maintain leaderboards.

Evidence Ref: Table 4
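The "roughly 4x" figure follows from the two counts quoted in this finding. A quick check, using only those numbers (the variable names are made up for this example):

```python
# Comparison counts quoted from the finding (MT-Bench, Table 4).
de_arena_comparisons = 521_495
full_sample_comparisons = 2_245_874

# How many times fewer comparisons De-Arena needs than full sampling.
reduction_factor = full_sample_comparisons / de_arena_comparisons

# Share of the full-sample comparisons that are avoided.
saved_fraction = 1 - de_arena_comparisons / full_sample_comparisons

print(f"{reduction_factor:.2f}x fewer comparisons, {saved_fraction:.1%} saved")
```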

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Spearman correlation with Chatbot Arena (Overall) | 0.974 | Chatbot Arena (human) | N/A | 66 LLMs | Table 2 reports De-Arena 0.974 overall correlation with Chatbot Arena on 66 models | Table 2
Spearman correlation with Chatbot Arena (Math) | 0.959 | Chatbot Arena (human) | N/A | 66 LLMs (Math) | Table 2 shows De-Arena Math correlation 0.959 on 66 models | Table 2

What To Try In 7 Days

Run De-Arena on a small set (10–20) of internal and public models to compare with current metrics.

Collect 100 open-ended questions for one capability and run the representative question selector to pick top ~32 examples.

Set window size to 1 and base seeds to 6 (recommended) to minimize judge counts while keeping accuracy (per ablations).

Check correlation to any existing human or gold-standard

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Emergent group bias remains possible if many models share training data or family tendencies.

Performance measured against Chatbot Arena; alignment to other human populations or languages is untested.

When Not To Use

When legal, safety, or high-stakes decisions require human oversight and traceability.

When you only have a handful of judge models (low diversity).

Failure Modes

Groupthink: many similar models amplify shared biases and distort rankings.

Self-preference leakage: certain models may still bias votes toward similar-family outputs.

Core Entities

Models

GPT-4 / GPT-4o
LLaMA-3-70B
Gemma-2-27B
Qwen2-72B
Meta-LLaMA-3.3-70B-instruct
ChatGPT-4o-latest
Mix of 66 evaluated LLMs (see Table 12)

Metrics

Spearman correlation (ρ) vs Chatbot Arena
Average judge (comparison) counts
Elo scores
Win-rate distributions

Datasets

MT-Bench
MT-Bench math sub-dimensions (Algebra, Geometry, Probability)
Various open-source open-ended question pools (collected per-dimension)

Benchmarks

Chatbot Arena (human)
MixEval
LiveBench
WildBench
Alpaca Eval 2.0
Auto Arena
PRD
Closed-ended benchmarks (CompassAcademic, MMLUPRO, etc.)