Use all LLMs as judges: a fast, democratic way to rank models that matches human preference

Overview

Decision Snapshot: Needs Validation

De-Arena demonstrates robust correlation with human judgments across many models and dimensions, with ablations and stability checks; evidence is empirical but centered on 66 LLMs and nine dimensions.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yanbin Yin, Kun Zhou, Zhen Wang, Xiangdong Zhang, Yifei Shao, Shibo Hao, Yi Gu, Jieyuan Liu, Somanshu Singla, Tianyang Liu, Eric P. Xing, Zhengzhong Liu, Haojian Jin, Zhiting Hu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

De-Arena lets teams benchmark many models cheaply and comparably to humans; it reduces cost and single-judge bias so product and platform teams can maintain fair leaderboards and pick models for deployment faster.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The paper presents De-Arena, an automatic LLM-evaluation system where every evaluated model also serves as a judge. It combines a coarse-to-fine incremental ranking (binary search + local rerank) and an automatic representative-question selector. On 66 models and nine fine-grained dimensions, De-Arena reaches a Spearman correlation of up to 0.974 with the human-voted Chatbot Arena rankings, while cutting pairwise comparisons by about 4x vs full ranking. The method reduces single-judge bias, scales better as more models join, and outputs stable Elo-style leaderboards.
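To make the "every model is a judge" idea concrete, here is a minimal sketch of pooling judge votes on one pair and feeding the result into a standard Elo update. The judge signature, the `vote_share` helper, and the K-factor of 16 are assumptions for illustration; this summary does not state how De-Arena weights votes or computes its Elo-style scores.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature: judge(question, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def vote_share(judges: Sequence[Judge], question: str,
               answer_a: str, answer_b: str) -> float:
    """Fraction of judges preferring answer A (0.5 when no valid votes exist)."""
    votes = Counter(judge(question, answer_a, answer_b) for judge in judges)
    total = votes["A"] + votes["B"]
    return votes["A"] / total if total else 0.5


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 16.0) -> Tuple[float, float]:
    """Standard Elo update; score_a in [0, 1] is the observed outcome for A.

    The K-factor is an assumed value, not one taken from the paper.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Passing the vote share (rather than a hard win/loss) into `elo_update` is one possible design choice: it lets a split jury move ratings less than a unanimous one.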

Problem Statement

Human pairwise voting scales poorly (millions of votes) and single-LLM judges introduce style or self-preference bias. We need an automatic, scalable method that matches human preferences, reduces single-judge bias, and keeps costs reasonable when evaluating many models across many fine-grained dimensions.

Main Contribution

De-Arena: a fully automatic, democratic evaluation framework where evaluated LLMs also vote on pairs.

Coarse-to-fine incremental ranking: binary-search insertion for a rough position, then local in-window reranking to refine.
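The coarse-to-fine idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `beats` is a hypothetical comparator standing in for the pooled judge decision, the ranking is stored best-first, and the fine step is a plain insertion sort over the window.

```python
from typing import Callable, List

# Hypothetical comparator: beats(a, b) is True when the judge pool prefers a over b.
Beats = Callable[[str, str], bool]


def insert_model(ranking: List[str], new_model: str, beats: Beats,
                 window: int = 1) -> int:
    """Insert new_model into a best-first ranking and return its final index."""
    # Coarse step: binary search for a rough position (about log2(n) comparisons).
    lo, hi = 0, len(ranking)
    while lo < hi:
        mid = (lo + hi) // 2
        if beats(new_model, ranking[mid]):
            hi = mid          # new model belongs above ranking[mid]
        else:
            lo = mid + 1      # new model belongs below ranking[mid]
    ranking.insert(lo, new_model)

    # Fine step: re-sort only the neighbors within `window` places of that position.
    start = max(0, lo - window)
    end = min(len(ranking), lo + window + 1)
    segment = ranking[start:end]
    for i in range(1, len(segment)):
        j = i
        while j > 0 and beats(segment[j], segment[j - 1]):
            segment[j], segment[j - 1] = segment[j - 1], segment[j]
            j -= 1
    ranking[start:end] = segment
    return ranking.index(new_model)
```

The local pass exists because pairwise judgments can be noisy or non-transitive, so the binary search may land a model a place or two off; re-comparing only nearby models corrects that without paying for a full pairwise sweep.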

Key Findings

De-Arena aligns closely with human judges (Chatbot Arena).

Numbers: Spearman ρ = 0.974 (Overall, 66 LLMs)

Practical Use: Use De-Arena as a cost-effective proxy for human preference in many benchmarking tasks; expect near-human agreement on rankings for large model pools.

Evidence Ref: Table 2
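For teams that want to run the same kind of check against their own human rankings, a minimal sketch of Spearman's ρ for two tie-free rankings is shown here. The function name and the list-of-model-names input format are assumptions; it uses the textbook formula ρ = 1 − 6Σd² / (n(n² − 1)), which only holds when there are no ties.

```python
from typing import Sequence


def spearman_rho(ranking_a: Sequence[str], ranking_b: Sequence[str]) -> float:
    """Spearman rank correlation between two tie-free rankings of the same models."""
    n = len(ranking_a)
    if n < 2 or len(ranking_b) != n:
        raise ValueError("rankings must have the same length, at least 2")
    if len(set(ranking_a)) != n or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same distinct models")
    # Position of each model in the second ranking.
    pos_b = {model: i for i, model in enumerate(ranking_b)}
    # Sum of squared rank differences across all models.
    d_squared = sum((i - pos_b[model]) ** 2 for i, model in enumerate(ranking_a))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

A value near 1 means the automatic ranking orders models almost exactly as the human leaderboard does; if ties are present, a library routine that handles them would be the safer choice.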

De-Arena reduces comparison work versus full pairwise ranking.

Numbers: Avg comparisons per model: 521,495 vs Full Sample 2,245,874 (MT-Bench)

Practical Use: You can evaluate many models with roughly 4x fewer model-vote operations, lowering compute and time needed to maintain leaderboards.

Evidence Ref: Table 4
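The "roughly 4x" figure follows directly from the two comparison counts quoted above. A short worked check, using only those two numbers (the variable names are illustrative):

```python
# Average comparisons per model on MT-Bench, as quoted in the Numbers line.
de_arena_comparisons = 521_495
full_sample_comparisons = 2_245_874

# How many times fewer comparisons De-Arena needs than full pairwise ranking.
reduction_factor = full_sample_comparisons / de_arena_comparisons

# Share of the full-ranking comparison budget that is avoided.
fraction_saved = 1 - de_arena_comparisons / full_sample_comparisons

print(f"{reduction_factor:.2f}x fewer comparisons, {fraction_saved:.1%} saved")
```

This works out to about 4.3x fewer comparisons, or a little over three quarters of the full-ranking budget avoided.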

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Spearman correlation with Chatbot Arena (Overall)	0.974	Chatbot Arena (human)	N/A	66 LLMs	Table 2 reports De-Arena 0.974 overall correlation with Chatbot Arena on 66 models	Table 2
Spearman correlation with Chatbot Arena (Math)	0.959	Chatbot Arena (human)	N/A	66 LLMs (Math)	Table 2 shows De-Arena Math correlation 0.959 on 66 models	Table 2

What To Try In 7 Days

Run De-Arena on a small set (10–20) of internal and public models to compare with current metrics.

Collect 100 open-ended questions for one capability and run the representative question selector to pick top ~32 examples.

Set window size to 1 and base seeds to 6 (recommended) to minimize judge counts while keeping accuracy (per ablations).

Check correlation to any existing human or gold-standard

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/maitrix-org/de-arena

Data URLs

https://github.com/maitrix-org/de-arena (promised release)

Risks & Boundaries

Limitations

Emergent group bias remains possible if many models share training data or family tendencies.

Performance measured against Chatbot Arena; alignment to other human populations or languages is untested.

When Not To Use

When legal, safety, or high-stakes decisions require human oversight and traceability.

When you only have a handful of judge models (low diversity).

Failure Modes

Groupthink: many similar models amplify shared biases and distort rankings.

Self-preference leakage: certain models may still bias votes toward similar-family outputs.

Core Entities

Models

GPT-4 / GPT-4o; LLaMA-3-70B; Gemma-2-27B; Qwen2-72B; Meta-LLaMA-3.3-70B-instruct; ChatGPT-4o-latest; Mix of 66 evaluated LLMs (see Table 12)

Metrics

Spearman correlation (ρ) vs Chatbot Arena; Average judge (comparison) counts; Elo scores; Win-rate distributions

Datasets

MT-Bench; MT-Bench math sub-dimensions (Algebra, Geometry, Probability); Various open-source open-ended question pools (collected per-dimension)

Benchmarks

Chatbot Arena (human); MixEval; LiveBench; WildBench; Alpaca Eval 2.0; Auto Arena; PRD; Closed-ended benchmarks (CompassAcademic, MMLUPRO, etc.)
