Model rankings from LLM judges become less biased and get valid confidence intervals when you down-weight unreliable judges

January 29, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Mingyuan Xu, Xinzi Tan, Jiawei Wu, Doudou Zhou

Why It Matters For Business

If you use LLMs to evaluate other models or products, weighting judges by inferred reliability reduces bias, yields calibrated uncertainty, and saves annotation cost when judges or data are limited.

Summary TLDR

The paper extends the classic Bradley–Terry–Luce (BTL) pairwise ranking model with judge-specific discrimination parameters to model differing reliability among LLM judges. The method jointly learns model quality scores and judge reliabilities from pairwise comparisons without ground-truth labels. The estimator is identifiable under normalization, has provable consistency and asymptotic normality, and yields valid confidence intervals. Empirically, weighted aggregation improves alignment with human preferences, reduces confidence-interval width (e.g., 13.5% reduction in one study), and is more data-efficient than standard unweighted aggregation, especially with few judges or comparisons.

Problem Statement

LLM-as-a-judge pipelines assume all judge models are equally reliable. This ignores real heterogeneity across judge LLMs and can produce biased rankings and invalid uncertainty estimates (more data can make evaluations confidently wrong). The goal is to infer both model quality and judge reliability from pairwise comparisons, without ground truth, and produce calibrated rankings with confidence intervals.

Main Contribution

Introduce a judge-aware generalization of the BTL model that multiplies score differences by judge-specific discrimination parameters to capture judge reliability.

Prove identifiability (under simple normalizations) and establish consistency and asymptotic normality of the MLE, which enable Wald confidence intervals for score differences and ranks.

Show empirically on simulations and real benchmarks (MT-Bench, Chatbot Arena, UltraFeedback, and an in-house 45-model dataset) that judge-aware weighting improves alignment with human preferences and increases sample efficiency versus unweighted BTL.
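The model described in the first contribution can be written out as follows. This is a sketch based on the summary's wording ("multiplies score differences by judge-specific discrimination parameters"). The symbol θ for model quality scores and the notation i ≻ j are assumptions made here for illustration. γ is the judge parameter named in the Limitations section.

```latex
% Probability that judge k prefers model i over model j
P(i \succ j \mid \text{judge } k)
  = \sigma\bigl(\gamma_k (\theta_i - \theta_j)\bigr)
  = \frac{1}{1 + \exp\bigl(-\gamma_k (\theta_i - \theta_j)\bigr)}

% Standard (unweighted) BTL is the special case \gamma_k = 1 for every judge k.
% A Wald interval for a score difference then takes the usual form
(\hat\theta_i - \hat\theta_j) \pm z_{1-\alpha/2}\,
  \widehat{\mathrm{se}}(\hat\theta_i - \hat\theta_j)
```

A judge with large γ_k turns small score gaps into near-deterministic preferences, while a judge with γ_k near zero behaves like a coin flip and contributes little information.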

Key Findings

Judge-aware aggregation produces narrower confidence intervals than unweighted aggregation on real data.

Numbers: Average CI width 0.185 vs 0.211 (13.5% narrower)

Weighted rankings align more closely with human evaluations than unweighted rankings on Chatbot Arena.

Numbers: Spearman ρ = 0.9955 (weighted) vs ρ = 0.9699 (unweighted)

Estimator converges at the theoretical rate and yields valid confidence intervals under the model.

Numbers: MSE slopes ≈ -1 in log-log plots (consistent with an O(1/T) rate)

Under judge heterogeneity, confidence intervals from unweighted BTL under-cover, and coverage gets worse as more data is added.

Numbers: Empirical coverage is below 95% and declines as T increases in simulations

Weighted aggregation improves sample efficiency in limited-data regimes.

Numbers: The weighted model reaches a given correlation with fewer comparisons across judge budgets K ∈ {4, 8, 12, 16}
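The rank-agreement numbers above are Spearman correlations between a model ranking and a human ranking. A minimal sketch of that computation is shown here, assuming score vectors without ties. The function name `spearman_rho` and the example scores are hypothetical and not taken from the paper.

```python
import numpy as np


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation for two score vectors without ties."""
    # argsort of argsort converts scores to 0-based ranks
    rank_a = np.argsort(np.argsort(scores_a))
    rank_b = np.argsort(np.argsort(scores_b))
    # Spearman's rho is the Pearson correlation of the ranks
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


# Hypothetical scores for four models: the two middle models swap places
judge_scores = [1.0, 2.0, 3.0, 4.0]
human_scores = [1.0, 3.0, 2.0, 4.0]
rho = spearman_rho(judge_scores, human_scores)
```

With tied scores, a tie-aware implementation such as `scipy.stats.spearmanr` would be the safer choice.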

Results

Average CI width (in-house 45-model study)

Value: 0.185 (weighted) vs 0.211 (unweighted)

Baseline: unweighted BTL

Average CI width (alternate reporting)

Value: 0.264 (weighted) vs 0.278 (unweighted)

Baseline: unweighted BTL

Spearman correlation with human ranking (Chatbot Arena)

Value: ρ = 0.9955 (weighted) vs ρ = 0.9699 (unweighted)

Baseline: unweighted BTL

Convergence rate (simulations)

Value: MSE slopes ≈ -1 in log-log plots

Baseline: theoretical O(1/T) rate
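The convergence check amounts to fitting a straight line to log(MSE) against log(T) and reading off the slope. One way to do that is sketched below. The function name `loglog_slope` and the synthetic MSE values are illustrative assumptions, not the paper's data.

```python
import numpy as np


def loglog_slope(sample_sizes, mse_values):
    """Least-squares slope of log(MSE) versus log(T)."""
    # np.polyfit returns coefficients from highest degree down: [slope, intercept]
    slope, _intercept = np.polyfit(np.log(sample_sizes), np.log(mse_values), deg=1)
    return float(slope)


# Synthetic example: an MSE that decays exactly like 5 / T
T = np.array([100.0, 1_000.0, 10_000.0])
slope = loglog_slope(T, 5.0 / T)
```

A slope close to -1 is what an O(1/T) rate predicts. A clearly shallower slope would suggest the estimator is converging more slowly than theory expects.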

Who Should Care

What To Try In 7 Days

Run judge-aware aggregation on existing pairwise LLM-judge outputs to estimate judge reliabilities and recompute model ranks.

Compare confidence intervals from judge-aware vs unweighted fits to see if uncertainty narrows.

Subsample comparisons to measure how many fewer examples the weighted model needs to reach acceptable ranking stability.
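A first experiment along these lines could start from the sketch below, which fits judge-specific discrimination parameters and model scores by maximum likelihood. It is not the authors' implementation. Plain gradient ascent, the `(i, j, k, y)` record layout, the `exp` parameterization that keeps every γ positive, the normalization (scores sum to zero, γ has geometric mean one), and the step-size defaults are all assumptions chosen for illustration.

```python
import numpy as np


def fit_judge_aware_btl(comparisons, n_models, n_judges, lr=0.5, n_iter=2000):
    """Fit P(i beats j | judge k) = sigmoid(gamma_k * (theta_i - theta_j)).

    comparisons: rows of (i, j, k, y), where y = 1 if judge k preferred
    model i over model j and y = 0 otherwise (hypothetical layout).
    Returns (theta, gamma).
    """
    data = np.asarray(comparisons, dtype=float)
    i = data[:, 0].astype(int)
    j = data[:, 1].astype(int)
    k = data[:, 2].astype(int)
    y = data[:, 3]
    n = len(y)

    theta = np.zeros(n_models)
    log_gamma = np.zeros(n_judges)  # gamma = exp(log_gamma) stays positive

    for _ in range(n_iter):
        gamma = np.exp(log_gamma)
        diff = theta[i] - theta[j]
        p = 1.0 / (1.0 + np.exp(-gamma[k] * diff))
        resid = y - p  # derivative of the log-likelihood w.r.t. the logit

        # Gradient w.r.t. scores: +gamma_k * resid for i, -gamma_k * resid for j
        w = resid * gamma[k]
        g_theta = (np.bincount(i, weights=w, minlength=n_models)
                   - np.bincount(j, weights=w, minlength=n_models))
        # Gradient w.r.t. log(gamma_k): resid * gamma_k * (theta_i - theta_j)
        g_log_gamma = np.bincount(k, weights=w * diff, minlength=n_judges)

        theta += lr * g_theta / n
        log_gamma += lr * g_log_gamma / n

        # Re-normalize along directions that leave the likelihood unchanged
        theta -= theta.mean()
        shift = log_gamma.mean()
        log_gamma -= shift
        theta *= np.exp(shift)

    return theta, np.exp(log_gamma)
```

Sorting `theta` gives the reliability-weighted ranking, and `gamma` gives a relative reliability per judge. If one judge is close to pure noise, its `log_gamma` can drift toward large negative values, so clipping it or fixing a reference judge may be needed in practice.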

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Assumes judge reliability is constant across tasks; does not model per-task judge drift.
  • Relies on i.i.d. random design for theoretical guarantees; real collection bias may violate this.
  • One judge with inferred γ≈0 caused numerical instability in experiments.
  • Method identifies scores up to scale/location and needs normalizations for uniqueness.
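The scale/location ambiguity in the last limitation can be made explicit. The block below is a sketch that assumes the model form σ(γ_k(θ_i − θ_j)) with θ denoting model scores. The specific normalization shown is one common choice and is not necessarily the one used in the paper.

```latex
% For any shift b and any scale c > 0, the transformed parameters
\theta_i' = c\,\theta_i + b, \qquad \gamma_k' = \gamma_k / c
% give identical comparison probabilities, because
\gamma_k'(\theta_i' - \theta_j')
  = \frac{\gamma_k}{c}\, c\,(\theta_i - \theta_j)
  = \gamma_k(\theta_i - \theta_j)

% One possible normalization that removes both degrees of freedom:
\sum_i \theta_i = 0, \qquad \prod_k \gamma_k = 1
```

Only differences in θ and ratios of γ are meaningful, so reported scores and reliabilities should always be read relative to the chosen normalization.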

When Not To Use

  • When judges can be adversarial or systematically biased in ways not captured by a discrimination parameter.
  • When judge reliability clearly varies by task or prompt and you cannot model task-dependent γ.
  • When you lack a connected comparison graph linking all candidate models.
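The connectivity condition in the last item can be checked before fitting. A small union-find sketch follows. The function name and the input layout (a model count plus a list of compared index pairs) are assumptions made for illustration.

```python
def is_comparison_graph_connected(n_models, pairs):
    """Return True if every model is linked to every other via comparisons."""
    parent = list(range(n_models))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)  # merge the two components

    # Connected means all models share a single root
    return len({find(m) for m in range(n_models)}) == 1


# Hypothetical examples with four models
chain = is_comparison_graph_connected(4, [(0, 1), (1, 2), (2, 3)])
split = is_comparison_graph_connected(4, [(0, 1), (2, 3)])
```

If the check fails, scores in separate components cannot be placed on a common scale, so additional comparisons bridging the components are needed first.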

Failure Modes

  • Misspecification: if judge behavior cannot be captured by a scalar discrimination parameter, rankings and intervals may be misleading.
  • Numerical instability when some γ estimates approach zero.
  • If comparison sampling is heavily nonuniform, i.i.d. assumptions for inference may fail.

Core Entities

Models

  • GPT-4
  • GPT-3.5
  • Vicuna-13B
  • Alpaca-13B
  • LLaMA-13B
  • Claude-v1
  • Claude-Instant-v1

Metrics

  • Spearman correlation
  • Pearson correlation
  • Confidence interval width
  • Mean squared error (MSE)
  • Empirical coverage

Datasets

  • MT-Bench
  • Chatbot Arena
  • UltraFeedback
  • in-house 45-model dataset

Benchmarks

  • MT-Bench
  • Chatbot Arena
  • UltraFeedback