Overview
De-Arena correlates strongly with human judgments across many models and dimensions, and the result is supported by ablations and stability checks; the evidence is empirical and limited to 66 LLMs and nine dimensions.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
De-Arena lets teams benchmark many models cheaply, with rankings that closely track human judgments. It lowers evaluation cost and reduces single-judge bias, so product and platform teams can maintain fair leaderboards and pick models for deployment faster.
Who Should Care
Summary TLDR
The paper presents De-Arena, an automatic LLM-evaluation system in which every evaluated model also serves as a judge. It combines a coarse-to-fine incremental ranking (binary search + local rerank) with an automatic representative-question selector. On 66 models and nine fine-grained dimensions, De-Arena reaches up to 0.974 Spearman correlation with the human-voted Chatbot Arena leaderboard, while cutting pairwise comparisons by about 4x versus full ranking. The method reduces single-judge bias, scales better as more models join, and outputs stable Elo-style leaderboards.
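To make "Elo-style leaderboard" concrete, here is a minimal sketch of the textbook Elo update applied to one pairwise outcome. This is the standard formula, not necessarily the exact scoring rule used in the paper; the function names, the starting ratings, and the `k` value of 32 are illustrative assumptions.

```python
from typing import Dict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Update ratings in place after `winner` is preferred over `loser`.

    `k` controls the step size; 32 is a common default, assumed here.
    """
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta  # zero-sum: total rating mass is conserved


# Hypothetical models starting from the same rating.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
elo_update(ratings, winner="model_a", loser="model_b")
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Sorting models by their final rating yields the leaderboard; feeding in many aggregated pairwise votes in sequence is one simple way to obtain such ratings.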
Problem Statement
Human pairwise voting scales poorly (millions of votes) and single-LLM judges introduce style or self-preference bias. We need an automatic, scalable method that matches human preferences, reduces single-judge bias, and keeps costs reasonable when evaluating many models across many fine-grained dimensions.
Main Contribution
De-Arena: a fully automatic, democratic evaluation framework where evaluated LLMs also vote on pairs.
Coarse-to-fine incremental ranking: binary-search insertion for a rough position, then local in-window reranking to refine.
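The two contributions above can be sketched together: judges vote on each pair by majority, a new model is placed by binary-search insertion, and its neighborhood is then re-sorted within a small window. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `Judge` callable, the strict-majority rule, and the insertion-sort refinement are all hypothetical stand-ins, and in De-Arena the judges would be the evaluated models themselves.

```python
from typing import Callable, List, Sequence

# Assumption: a judge is any callable returning True when it prefers `a` over `b`.
Judge = Callable[[str, str], bool]


def majority_prefers(a: str, b: str, judges: Sequence[Judge]) -> bool:
    """True if a strict majority of judges prefer `a` over `b`."""
    votes = sum(1 for judge in judges if judge(a, b))
    return votes * 2 > len(judges)


def coarse_insert(ranking: List[str], new_model: str, judges: Sequence[Judge]) -> int:
    """Binary-search the slot for `new_model` in `ranking` (best first) and insert it."""
    lo, hi = 0, len(ranking)
    while lo < hi:
        mid = (lo + hi) // 2
        if majority_prefers(new_model, ranking[mid], judges):
            hi = mid          # new model wins: continue toward the top
        else:
            lo = mid + 1      # new model loses: continue toward the bottom
    ranking.insert(lo, new_model)
    return lo


def local_rerank(ranking: List[str], pos: int, judges: Sequence[Judge], window: int = 1) -> None:
    """Re-sort the slice within `window` positions of `pos` using pairwise votes."""
    lo = max(0, pos - window)
    hi = min(len(ranking), pos + window + 1)
    segment = ranking[lo:hi]
    # Insertion sort always terminates, even if votes are not perfectly transitive.
    for i in range(1, len(segment)):
        j = i
        while j > 0 and majority_prefers(segment[j], segment[j - 1], judges):
            segment[j], segment[j - 1] = segment[j - 1], segment[j]
            j -= 1
    ranking[lo:hi] = segment


def add_model(ranking: List[str], new_model: str, judges: Sequence[Judge], window: int = 1) -> List[str]:
    """Coarse placement followed by fine-grained local refinement."""
    pos = coarse_insert(ranking, new_model, judges)
    local_rerank(ranking, pos, judges, window)
    return ranking
```

Each insertion needs roughly log2(n) coarse comparisons plus a handful inside the window, instead of comparing the new model against every model already ranked.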
Key Findings
De-Arena aligns closely with human judges (Chatbot Arena).
De-Arena reduces comparison work versus full pairwise ranking.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Spearman correlation with Chatbot Arena (Overall) | 0.974 | Chatbot Arena (human) | N/A | 66 LLMs | Table 2 reports De-Arena 0.974 overall correlation with Chatbot Arena on 66 models | Table 2 |
| Spearman correlation with Chatbot Arena (Math) | 0.959 | Chatbot Arena (human) | N/A | 66 LLMs (Math) | Table 2 shows De-Arena Math correlation 0.959 on 66 models | Table 2 |
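For readers who want to reproduce this kind of comparison on their own leaderboards, the sketch below computes Spearman's rank correlation between two rankings of the same models. It uses the classic no-ties formula, 1 - 6 * sum(d^2) / (n * (n^2 - 1)); the function name and the model names in the example are hypothetical, and the paper's own evaluation script may differ.

```python
from typing import Sequence


def spearman_from_rankings(ranking_a: Sequence[str], ranking_b: Sequence[str]) -> float:
    """Spearman correlation between two orderings of the same items (no ties)."""
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two items")
    if len(set(ranking_a)) != n or len(ranking_b) != n or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same unique items")
    position_in_b = {model: i for i, model in enumerate(ranking_b)}
    # Sum of squared rank differences across all models.
    d_squared = sum((i - position_in_b[model]) ** 2 for i, model in enumerate(ranking_a))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical example: an automatic ranking versus a human-vote ranking.
auto_rank = ["a", "b", "c", "d", "e"]
human_rank = ["a", "c", "b", "d", "e"]   # one adjacent swap
rho = spearman_from_rankings(auto_rank, human_rank)
```

A value near 1 means the two leaderboards order the models almost identically; a value near -1 means they are nearly reversed.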
What To Try In 7 Days
Run De-Arena on a small set (10–20) of internal and public models to compare with current metrics.
Collect 100 open-ended questions for one capability and run the representative question selector to pick top ~32 examples.
Set window size to 1 and base seeds to 6 (recommended) to minimize judge counts while keeping accuracy (per ablations).
Check correlation to any existing human or gold-standard
Risks & Boundaries
Limitations
Emergent group bias remains possible if many models share training data or family tendencies.
Performance measured against Chatbot Arena; alignment to other human populations or languages is untested.
When Not To Use
When legal, safety, or high-stakes decisions require human oversight and traceability.
When you only have a handful of judge models (low diversity).
Failure Modes
Groupthink: many similar models amplify shared biases and distort rankings.
Self-preference leakage: certain models may still bias votes toward similar-family outputs.

