Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
De-Arena lets teams benchmark many models cheaply while producing rankings that track human preferences. It lowers evaluation cost and single-judge bias, so product and platform teams can maintain fair leaderboards and choose models for deployment faster.
Summary TLDR
The paper presents De-Arena, an automatic LLM-evaluation system where every evaluated model also serves as a judge. It combines a coarse-to-fine incremental ranking (binary search + local rerank) and an automatic representative-question selector. On 66 models and nine fine-grained dimensions, De-Arena reaches up to 0.974 Spearman correlation with the human-voted Chatbot Arena, while cutting pairwise comparisons by about 4x relative to full pairwise ranking. The method reduces single-judge bias, scales better as more models join, and outputs stable Elo-style leaderboards.
Problem Statement
Human pairwise voting scales poorly (millions of votes) and single-LLM judges introduce style or self-preference bias. We need an automatic, scalable method that matches human preferences, reduces single-judge bias, and keeps costs reasonable when evaluating many models across many fine-grained dimensions.
Main Contribution
De-Arena: a fully automatic, democratic evaluation framework where evaluated LLMs also vote on pairs.
Coarse-to-fine incremental ranking: binary-search insertion for a rough position, then local in-window reranking to refine.
Representative question selection: ranking-based filter that picks a small set of questions that yield consistent rankings.
Practical evaluation: applied to 66 LLMs across nine fine-grained dimensions and compared to multiple baselines and Chatbot Arena.
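The coarse-to-fine step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Judge` callable, the simple majority vote, and the win-count rerank inside the window are all assumptions made for the example, and the function names are hypothetical.

```python
from typing import Callable, List

# Hypothetical judge signature: judge(a, b) is True if the judge prefers model a over model b.
Judge = Callable[[str, str], bool]


def majority_prefers(a: str, b: str, judges: List[Judge]) -> bool:
    """Return True if strictly more than half of the judges prefer a over b."""
    votes = sum(1 for judge in judges if judge(a, b))
    return votes * 2 > len(judges)


def binary_insert(ranking: List[str], new_model: str, judges: List[Judge]) -> int:
    """Coarse step: binary-search a rough slot in a best-first ranking and insert there."""
    lo, hi = 0, len(ranking)
    while lo < hi:
        mid = (lo + hi) // 2
        if majority_prefers(new_model, ranking[mid], judges):
            hi = mid       # new_model beats ranking[mid], so it belongs above it
        else:
            lo = mid + 1   # otherwise it belongs below ranking[mid]
    ranking.insert(lo, new_model)
    return lo


def local_rerank(ranking: List[str], pos: int, window: int, judges: List[Judge]) -> None:
    """Fine step (assumed form): reorder models within `window` slots of pos by local wins."""
    start = max(0, pos - window)
    end = min(len(ranking), pos + window + 1)
    segment = ranking[start:end]
    wins = {
        m: sum(majority_prefers(m, other, judges) for other in segment if other != m)
        for m in segment
    }
    # sorted() is stable, so models with equal win counts keep their current order.
    ranking[start:end] = sorted(segment, key=lambda m: -wins[m])


def add_model(ranking: List[str], new_model: str, judges: List[Judge], window: int = 1) -> List[str]:
    """Insert new_model with the coarse step, then refine its neighbourhood."""
    pos = binary_insert(ranking, new_model, judges)
    local_rerank(ranking, pos, window, judges)
    return ranking
```

Binary insertion needs on the order of log2(n) pairwise comparisons per new model instead of n, which is the likely source of the reduced comparison count; the window rerank then corrects small errors from noisy votes near the insertion point.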
Key Findings
De-Arena aligns closely with human judges (Chatbot Arena).
De-Arena reduces comparison work versus full pairwise ranking.
Democratic multi-judge setup beats single authoritative judges.
Ranking stability improves as more judge models join.
Representative question selection yields better consistency than other simple heuristics.
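Alignment with Chatbot Arena is reported as a Spearman correlation between two leaderboards. As a minimal sketch, assuming both rankings cover the same models and contain no ties, the coefficient can be computed with the standard rank-difference formula; the function name and input format are hypothetical.

```python
from typing import List


def spearman_rho(rank_a: List[str], rank_b: List[str]) -> float:
    """Spearman correlation between two best-first rankings of the same models (no ties)."""
    if len(rank_a) != len(set(rank_a)) or set(rank_a) != set(rank_b) or len(rank_a) != len(rank_b):
        raise ValueError("rankings must list the same models exactly once each")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two models")
    pos_b = {model: i for i, model in enumerate(rank_b)}
    # Sum of squared differences between each model's position in the two rankings.
    d2 = sum((i - pos_b[model]) ** 2 for i, model in enumerate(rank_a))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

A value of 1.0 means the two leaderboards order the models identically, and -1.0 means one is the exact reverse of the other.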
Results
Spearman correlation with Chatbot Arena (Overall)
Spearman correlation with Chatbot Arena (Math)
Average comparison (judge) counts per model (MT-Bench)
Improvement over single-LLM judges (Spearman)
Who Should Care
What To Try In 7 Days
Run De-Arena on a small set (10–20) of internal and public models to compare with current metrics.
Collect 100 open-ended questions for one capability and run the representative question selector to pick top ~32 examples.
Set window size to 1 and base seeds to 6 (recommended) to minimize judge counts while keeping accuracy (per ablations).
Check correlation to any existing human or gold-standard
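One way to prototype the representative question selector is sketched below. It is an assumption-laden illustration rather than the paper's procedure: each question is scored by how well the ranking it induces on its own agrees (by Spearman correlation) with the ranking from all questions pooled, and the top `k` questions are kept. The `per_question` structure, which maps each question id to per-model scores such as win counts, is hypothetical.

```python
from typing import Dict, List


def rank_by_score(scores: Dict[str, float]) -> List[str]:
    """Best-first ranking; ties are broken by model name for determinism."""
    return sorted(scores, key=lambda m: (-scores[m], m))


def spearman_rho(rank_a: List[str], rank_b: List[str]) -> float:
    """Spearman correlation between two best-first rankings of the same models."""
    n = len(rank_a)
    pos_b = {model: i for i, model in enumerate(rank_b)}
    d2 = sum((i - pos_b[model]) ** 2 for i, model in enumerate(rank_a))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def select_representative(per_question: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Keep the k questions whose single-question ranking best matches the pooled ranking.

    Assumes every question scores the same set of at least two models.
    """
    totals: Dict[str, float] = {}
    for scores in per_question.values():
        for model, score in scores.items():
            totals[model] = totals.get(model, 0.0) + score
    reference = rank_by_score(totals)
    agreement = {
        q: spearman_rho(rank_by_score(scores), reference)
        for q, scores in per_question.items()
    }
    # Highest agreement first; question id breaks ties.
    return sorted(agreement, key=lambda q: (-agreement[q], q))[:k]
```

With a pool of 100 questions, calling `select_representative(per_question, 32)` would return the 32 question ids that agree most with the pooled ranking under this assumed criterion.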
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Emergent group bias remains possible if many models share training data or family tendencies.
- Performance measured against Chatbot Arena; alignment to other human populations or languages is untested.
- Requires a large and diverse judge pool to reach top reliability; small pools perform worse.
- Relies on collected open-source questions; question pools may miss niche or adversarial cases.
When Not To Use
- When legal, safety, or high-stakes decisions require human oversight and traceability.
- When you only have a handful of judge models (low diversity).
- For narrow tasks where closed-form ground truth accuracy is required instead of preference-based ranking.
Failure Modes
- Groupthink: many similar models amplify shared biases and distort rankings.
- Self-preference leakage: certain models may still bias votes toward similar-family outputs.
- Contaminated question pools can produce misleading rankings if they favor a model's training data.
- Insertion-order edge cases can arise if the initial seed set is poorly chosen, though experiments show low variance.
Core Entities
Models
- GPT-4 / GPT-4o
- LLaMA-3-70B
- Gemma-2-27B
- Qwen2-72B
- Meta-LLaMA-3.3-70B-instruct
- ChatGPT-4o-latest
- Mix of 66 evaluated LLMs (see Table 12)
Metrics
- Spearman correlation (ρ) vs Chatbot Arena
- Average judge (comparison) counts
- Elo scores
- Win-rate distributions
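For readers unfamiliar with Elo scores, the sketch below shows the standard Elo update for one pairwise comparison. The source only says the leaderboards are Elo-style, so the K-factor of 32, the 400-point scale, and the function names here are assumptions taken from the classic formulation, not necessarily the paper's settings.

```python
from typing import Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Return updated ratings after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

In a multi-judge setting, `score_a` could be the fraction of judges that preferred model A, which keeps the update rule unchanged while using every vote.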
Datasets
- MT-Bench
- MT-Bench math sub-dimensions (Algebra, Geometry, Probability)
- Various open-source open-ended question pools (collected per-dimension)
Benchmarks
- Chatbot Arena (human)
- MixEval
- LiveBench
- WildBench
- Alpaca Eval 2.0
- Auto Arena
- PRD
- Closed-ended benchmarks (CompassAcademic, MMLUPRO, etc.)

