Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
If you use LLMs to evaluate other models or products, weighting judges by inferred reliability reduces bias, yields calibrated uncertainty, and saves annotation cost when judges or data are limited.
Summary TLDR
The paper extends the classic Bradley–Terry–Luce (BTL) pairwise ranking model with judge-specific discrimination parameters to model differing reliability among LLM judges. The method jointly learns model quality scores and judge reliabilities from pairwise comparisons without ground-truth labels. The estimator is identifiable under normalization, has provable consistency and asymptotic normality, and yields valid confidence intervals. Empirically, weighted aggregation improves alignment with human preferences, reduces confidence-interval width (e.g., 13.5% reduction in one study), and is more data-efficient than standard unweighted aggregation, especially with few judges or comparisons.
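As a reading aid, the model described above can be written out as follows. This is a sketch under assumptions: the symbol \gamma_k for a judge's discrimination parameter appears later in this summary, while \theta_i for a model's quality score and the logistic link are the standard BTL conventions assumed here rather than notation confirmed by the paper.

```latex
% Judge-aware BTL (assumed notation): judge k compares models i and j.
\[
  \Pr(i \succ j \mid \text{judge } k)
    = \sigma\bigl(\gamma_k(\theta_i - \theta_j)\bigr),
  \qquad
  \sigma(x) = \frac{1}{1 + e^{-x}} .
\]
% Log-likelihood over comparisons t = 1..n, with y_t = 1 if i_t is preferred.
\[
  \ell(\theta, \gamma)
    = \sum_{t=1}^{n} \Bigl[
        y_t \log \sigma\bigl(\gamma_{k_t}(\theta_{i_t} - \theta_{j_t})\bigr)
        + (1 - y_t) \log \sigma\bigl(-\gamma_{k_t}(\theta_{i_t} - \theta_{j_t})\bigr)
      \Bigr].
\]
```

Setting every \gamma_k = 1 recovers the unweighted BTL model. A judge with \gamma_k near 0 returns close to coin-flip verdicts regardless of the score gap, so its comparisons carry little weight in the fit.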
Problem Statement
LLM-as-a-judge pipelines assume all judge models are equally reliable. This ignores real heterogeneity across judge LLMs and can produce biased rankings and invalid uncertainty estimates (more data can make evaluations confidently wrong). The goal is to infer both model quality and judge reliability from pairwise comparisons, without ground truth, and produce calibrated rankings with confidence intervals.
Main Contribution
Introduce a judge-aware generalization of the BTL model that multiplies score differences by judge-specific discrimination parameters to capture judge reliability.
Prove identifiability (under simple normalizations) and establish consistency and asymptotic normality of the MLE, which enable Wald confidence intervals for score differences and ranks.
Show empirically on simulations and real benchmarks (MT-Bench, Chatbot Arena, UltraFeedback, and an in-house 45-model dataset) that judge-aware weighting improves alignment with human preferences and increases sample efficiency versus unweighted BTL.
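The joint estimation described in the contributions above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a logistic link, plain gradient ascent on the log-likelihood, and one possible normalization (scores sum to zero, mean judge discrimination equals one). The function name `fit_judge_aware_btl`, the clipping floor, the step size, and the simulated data are all hypothetical choices made for the example.

```python
import numpy as np

def fit_judge_aware_btl(y, i_idx, j_idx, k_idx, n_models, n_judges,
                        lr=0.3, steps=3000):
    """Gradient-ascent sketch for P(i beats j | judge k) = sigmoid(gamma_k * (theta_i - theta_j)).

    y[t] is 1.0 when model i_idx[t] was preferred over j_idx[t] by judge k_idx[t].
    """
    y = np.asarray(y, dtype=float)
    theta = np.zeros(n_models)   # model quality scores
    gamma = np.ones(n_judges)    # judge discrimination parameters
    n = len(y)
    for _ in range(steps):
        diff = theta[i_idx] - theta[j_idx]
        p = 1.0 / (1.0 + np.exp(-gamma[k_idx] * diff))
        resid = y - p            # derivative of the log-likelihood w.r.t. the logit
        g_theta = (np.bincount(i_idx, weights=resid * gamma[k_idx], minlength=n_models)
                   - np.bincount(j_idx, weights=resid * gamma[k_idx], minlength=n_models))
        g_gamma = np.bincount(k_idx, weights=resid * diff, minlength=n_judges)
        theta = theta + lr * g_theta / n
        gamma = gamma + lr * g_gamma / n
        # Assumed safeguard: keep discriminations positive (not from the paper).
        gamma = np.clip(gamma, 1e-3, None)
        # Assumed normalization: mean(gamma) = 1 and sum(theta) = 0.
        scale = gamma.mean()
        gamma = gamma / scale
        theta = (theta - theta.mean()) * scale
    return theta, gamma

# Simulated example: 4 models, 3 judges, the third judge is nearly uninformative.
rng = np.random.default_rng(0)
true_theta = np.array([1.5, 0.5, -0.5, -1.5])
true_gamma = np.array([2.0, 1.0, 0.2])
n = 6000
i_idx = rng.integers(0, 4, size=n)
j_idx = (i_idx + rng.integers(1, 4, size=n)) % 4   # guarantees j != i
k_idx = rng.integers(0, 3, size=n)
p_true = 1.0 / (1.0 + np.exp(-true_gamma[k_idx] * (true_theta[i_idx] - true_theta[j_idx])))
y = (rng.random(n) < p_true).astype(float)

theta_hat, gamma_hat = fit_judge_aware_btl(y, i_idx, j_idx, k_idx, 4, 3)
print("scores:", np.round(theta_hat, 2))
print("judge discriminations:", np.round(gamma_hat, 2))
```

The joint likelihood is not concave in the scores and discriminations together, so a production fit would likely want a more careful optimizer and convergence checks than this fixed-step loop.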
Key Findings
Judge-aware aggregation produces narrower confidence intervals than unweighted aggregation on real data.
Weighted rankings align more closely with human evaluations than unweighted rankings on Chatbot Arena.
The estimator converges at the theoretical rate and yields valid confidence intervals under the model.
Under judge heterogeneity, confidence intervals from unweighted BTL under-cover, and coverage gets worse as more data is added.
Weighted aggregation improves sample efficiency in limited-data regimes.
Results
Average CI width (in-house 45-model study)
Average CI width (alternate reporting)
Spearman correlation with human ranking (Chatbot Arena)
Convergence rate (simulations)
Who Should Care
What To Try In 7 Days
Run judge-aware aggregation on existing pairwise LLM-judge outputs to estimate judge reliabilities and recompute model ranks.
Compare confidence intervals from judge-aware vs unweighted fits to see if uncertainty narrows.
Subsample comparisons to measure how many fewer examples the weighted model needs to reach acceptable ranking stability.
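For the interval comparison in the steps above, a Wald interval for a score difference can be computed from fitted scores and their estimated covariance. The sketch below assumes you already have a covariance matrix for the scores (for example from the inverse of the observed information of whichever fit you run). The function name `wald_ci_difference` and the numbers in the example are hypothetical.

```python
import math
from statistics import NormalDist

def wald_ci_difference(theta_hat, cov, i, j, level=0.95):
    """Wald interval for theta_i - theta_j given an estimated covariance matrix."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for level=0.95
    diff = theta_hat[i] - theta_hat[j]
    # Var(a - b) = Var(a) + Var(b) - 2 Cov(a, b)
    var = cov[i][i] + cov[j][j] - 2.0 * cov[i][j]
    half_width = z * math.sqrt(var)
    return diff - half_width, diff + half_width

# Made-up illustration values, not results from the paper.
theta_hat = [1.0, 0.2]
cov = [[0.04, 0.01], [0.01, 0.09]]
lo, hi = wald_ci_difference(theta_hat, cov, 0, 1)
print(f"95% CI for theta_0 - theta_1: ({lo:.3f}, {hi:.3f})")
```

Running this on both the judge-aware and the unweighted fit, and averaging `hi - lo` over model pairs, gives one way to compare interval widths between the two approaches.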
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Assumes judge reliability is constant across tasks; does not model per-task judge drift.
- Relies on i.i.d. random design for theoretical guarantees; real collection bias may violate this.
- One judge with inferred γ≈0 caused numerical instability in experiments.
- Method identifies scores up to scale/location and needs normalizations for uniqueness.
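The scale and location ambiguity noted above can be illustrated directly: shifting all scores by a constant, or multiplying scores by c while dividing every γ by c, leaves each product γ·(score difference) unchanged. The sketch below applies one possible normalization (scores sum to zero, mean γ equals one), which is an assumption since this summary does not state which normalization the paper uses, and checks that predicted win probabilities are unaffected. The helper names are hypothetical.

```python
import numpy as np

def normalize(theta, gamma):
    """Assumed normalization: sum(theta) = 0 and mean(gamma) = 1."""
    theta = np.asarray(theta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    scale = gamma.mean()
    return (theta - theta.mean()) * scale, gamma / scale

def win_prob(theta, gamma, i, j, k):
    """P(model i beats model j according to judge k) under a logistic link."""
    return 1.0 / (1.0 + np.exp(-gamma[k] * (theta[i] - theta[j])))

theta_raw = np.array([3.0, 1.0, 2.0])
gamma_raw = np.array([0.5, 1.5, 4.0])
theta_n, gamma_n = normalize(theta_raw, gamma_raw)

# The two parameterizations give identical predictions for every (i, j, k).
same = all(
    np.isclose(win_prob(theta_raw, gamma_raw, i, j, k),
               win_prob(theta_n, gamma_n, i, j, k))
    for i in range(3) for j in range(3) for k in range(3)
)
print(same)
```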
When Not To Use
- When judges can be adversarial or systematically biased in ways not captured by a discrimination parameter.
- When judge reliability clearly varies by task or prompt and you cannot model task-dependent γ.
- When you lack a connected comparison graph linking all candidate models.
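The connectivity condition in the last item can be checked before fitting. The sketch below treats each compared pair as an undirected edge and uses a union-find structure to test whether all models fall in one component. The function name `is_connected` is hypothetical, and stricter conditions (for example on who beat whom) may also matter for a given estimator; those are not checked here.

```python
def is_connected(n_models, pairs):
    """Return True if the undirected comparison graph links all models."""
    parent = list(range(n_models))

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b   # merge the two components

    return len({find(x) for x in range(n_models)}) == 1

chain = is_connected(4, [(0, 1), (1, 2), (2, 3)])   # all four models linked
split = is_connected(4, [(0, 1), (2, 3)])           # two separate islands
print(chain, split)
```

If the check fails, scores in different components have no common reference point, so their relative ranking cannot be determined from the comparisons alone.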
Failure Modes
- Misspecification: if judge behavior cannot be captured by a scalar discrimination parameter, rankings and intervals may be misleading.
- Numerical instability when some γ estimates approach zero.
- If comparison sampling is heavily nonuniform, i.i.d. assumptions for inference may fail.
Core Entities
Models
- GPT-4
- GPT-3.5
- Vicuna-13B
- Alpaca-13B
- LLaMA-13B
- Claude-v1
- Claude-Instant-v1
Metrics
- Spearman correlation
- Pearson correlation
- Confidence interval width
- Mean squared error (MSE)
- Empirical coverage
Datasets
- MT-Bench
- Chatbot Arena
- UltraFeedback
- in-house 45-model dataset
Benchmarks
- MT-Bench
- Chatbot Arena
- UltraFeedback

