Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
You can get valid, narrower confidence intervals for model evaluation while labeling far fewer examples by using the EIF estimator or tuned PPI++. This cuts annotation cost and makes model comparisons more decisive.
Summary TLDR
LLM-as-a-judge workflows use many LLM labels and few human labels. This paper unifies two debiasing families, measurement-error corrections (Rogan-Gladen/MLE) and surrogate-based methods (PPI/PPI++), via efficient influence functions (EIF). For mean metrics (e.g., win rate), the EIF estimator (or optimally tuned PPI++) is asymptotically optimal in variance. Simulations and a 140k human-preference dataset show that EIF/PPI++ give much narrower confidence intervals with valid coverage, while Rogan-Gladen intervals can be extremely wide when judge accuracy or calibration size is low. Code is available in the authors' GitHub repository.
Problem Statement
LLM judges are noisy proxies for human labels. Practitioners often have a large pool of LLM judgments and a small calibration set with human labels. Naive averages are biased; existing fixes (misclassification correction vs surrogate correction) differ in efficiency. The problem: how to combine the two datasets to produce valid, low-variance estimates and calibrated confidence intervals for mean outcomes.
Main Contribution
Unified view: cast Rogan-Gladen, PPI, PPI++, and MLE as approximations to the efficient influence-function (EIF) solution for mean estimation with surrogate labels.
Derived an explicit EIF estimator and proved it attains the semiparametric efficiency bound.
Showed that for binary outcomes, optimally tuned PPI++ is asymptotically equivalent to the EIF (and MLE).
Combined theory, simulations, and real-data benchmarks to demonstrate that EIF/PPI++ give tighter CIs and better finite-sample behavior than Rogan-Gladen, especially with small calibration budgets.
Released implementation and comparison utilities at https://github.com/yiqunchen/debias-llm-as-a-judge .
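For reference, the estimators named above can be written in one common notation. This is a sketch under assumed notation (n, N, θ, q0, q1, λ, and m are symbols chosen here and may differ from the paper's), and the one-step form shown is the standard construction under a random calibration split, not necessarily the authors' exact expression.

```latex
% Assumed setup: calibration set {(Y_i, \hat Y_i)}_{i=1}^{n} with human labels,
% unlabeled set {\hat Y_j}_{j=1}^{N} with judge labels only; target \theta = E[Y].
\begin{align*}
\hat\theta_{\mathrm{naive}} &= \frac{1}{N}\sum_{j=1}^{N}\hat Y_j \\
\hat\theta_{\mathrm{RG}} &= \frac{\hat p + \hat q_0 - 1}{\hat q_0 + \hat q_1 - 1},
  \qquad \hat p = \hat\theta_{\mathrm{naive}},\;
  \hat q_1 = \hat P(\hat Y = 1 \mid Y = 1),\;
  \hat q_0 = \hat P(\hat Y = 0 \mid Y = 0) \\
\hat\theta_{\mathrm{PPI}} &= \frac{1}{N}\sum_{j=1}^{N}\hat Y_j
  + \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - \hat Y_i\bigr) \\
\hat\theta_{\mathrm{PPI++}}(\lambda) &= \frac{1}{n}\sum_{i=1}^{n} Y_i
  + \lambda\Bigl(\frac{1}{N}\sum_{j=1}^{N}\hat Y_j - \frac{1}{n}\sum_{i=1}^{n}\hat Y_i\Bigr) \\
\hat\theta_{\mathrm{one\text{-}step}} &= \frac{1}{n+N}\sum_{k=1}^{n+N}\hat m(\hat Y_k)
  + \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - \hat m(\hat Y_i)\bigr),
  \qquad \hat m(\hat y) \approx E[Y \mid \hat Y = \hat y]
\end{align*}
```

Setting λ = 1 in the PPI++ expression recovers PPI, which is why PPI can be read as one untuned member of the same family.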
Key Findings
EIF-based estimators (and optimally tuned PPI++) produce substantially narrower confidence intervals than standard PPI in simulations.
Rogan-Gladen intervals can blow up when judge accuracy or calibration size is low.
On real human-preference data, EIF/PPI++/MLE achieved valid coverage with the narrowest intervals; PPI was wider; RG was very wide.
PPI is unbiased but not generally efficient; for binary outcomes, optimally tuned PPI++ is asymptotically equivalent to the EIF estimator and is therefore efficient.
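A short variance calculation shows why untuned PPI can be wider than tuned PPI++. The notation is assumed for this sketch: n human-labeled examples, N judge-only examples, independent samples, and population moments written as σ.

```latex
% \hat\theta(\lambda) = \bar Y_{cal} + \lambda(\bar{\hat Y}_{unl} - \bar{\hat Y}_{cal});
% PPI is the special case \lambda = 1.
\begin{align*}
\mathrm{Var}\bigl(\hat\theta(\lambda)\bigr)
  &= \frac{\sigma_Y^2}{n} - \frac{2\lambda\,\sigma_{Y\hat Y}}{n}
     + \lambda^2\sigma_{\hat Y}^2\Bigl(\frac{1}{n} + \frac{1}{N}\Bigr) \\
\lambda^\star &= \frac{\sigma_{Y\hat Y}}{(1 + n/N)\,\sigma_{\hat Y}^2} \\
\mathrm{Var}\bigl(\hat\theta(\lambda^\star)\bigr)
  &= \frac{\sigma_Y^2}{n}\Bigl(1 - \frac{\rho^2}{1 + n/N}\Bigr),
  \qquad \rho = \frac{\sigma_{Y\hat Y}}{\sigma_Y\,\sigma_{\hat Y}}
\end{align*}
```

The choice λ = 1 minimizes this variance only when σ_{YŶ} = (1 + n/N) σ_Ŷ², which generally does not hold for a noisy judge. The stronger the correlation ρ between judge and human labels, the larger the saving over using human labels alone.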
Results
CI width (simulations)
Accuracy
Real-data mean CI width
Bias
Who Should Care
What To Try In 7 Days
Collect a small random calibration set (aim for 5–10% of the eval data) and keep the rest as an LLM-only test set.
Run PPI++ (tune λ) or the EIF one-step estimator using a simple linear or GAM calibration for E[Y | Yhat].
Compare CI widths and coverage vs naive and Rogan-Gladen; expect narrower CIs with valid coverage from EIF/PPI++ on binary outcomes.
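The steps above can be tried on synthetic data before touching real annotations. The following is a minimal sketch, not the authors' implementation: the function names, the simulated judge (sensitivity q1, specificity q0), and the sample sizes are all assumptions made for illustration. Here λ is chosen to minimize the plug-in variance estimate, which is one reasonable way to tune it.

```python
import numpy as np

def simulate(n_cal, n_unl, theta=0.6, q1=0.9, q0=0.6, seed=0):
    """Binary human labels Y and a noisy binary judge Yhat (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    y = rng.binomial(1, theta, size=n_cal + n_unl)
    yhat = rng.binomial(1, np.where(y == 1, q1, 1.0 - q0))
    # Rows are i.i.d., so taking the first n_cal rows is a random calibration split.
    return y[:n_cal], yhat[:n_cal], yhat[n_cal:]

def naive(yhat_unl):
    """Uncorrected mean of judge labels."""
    return float(np.mean(yhat_unl))

def rogan_gladen(y_cal, yhat_cal, yhat_unl):
    """Point estimate only; sensitivity/specificity come from the calibration set."""
    q1_hat = np.mean(yhat_cal[y_cal == 1])
    q0_hat = 1.0 - np.mean(yhat_cal[y_cal == 0])
    return float((np.mean(yhat_unl) + q0_hat - 1.0) / (q0_hat + q1_hat - 1.0))

def ppi_pp(y_cal, yhat_cal, yhat_unl, lam=None, z=1.96):
    """PPI++-style mean estimate with a normal-approximation CI.

    lam=1.0 gives plain PPI; lam=None tunes lam to minimize the
    plug-in variance estimate computed below.
    """
    n, N = len(y_cal), len(yhat_unl)
    var_cal = np.var(yhat_cal, ddof=1)
    var_unl = np.var(yhat_unl, ddof=1)
    if lam is None:
        cov = np.cov(y_cal, yhat_cal, ddof=1)[0, 1]
        lam = cov / (var_cal + (n / N) * var_unl)
    est = np.mean(y_cal) + lam * (np.mean(yhat_unl) - np.mean(yhat_cal))
    # Calibration and unlabeled samples are independent, so variances add.
    var_est = np.var(y_cal - lam * yhat_cal, ddof=1) / n + lam**2 * var_unl / N
    half = z * np.sqrt(var_est)
    return float(est), (float(est - half), float(est + half)), float(lam)

y_cal, yhat_cal, yhat_unl = simulate(n_cal=500, n_unl=9500)
est_ppi, ci_ppi, _ = ppi_pp(y_cal, yhat_cal, yhat_unl, lam=1.0)
est_pp, ci_pp, lam = ppi_pp(y_cal, yhat_cal, yhat_unl)

print("naive       :", naive(yhat_unl))
print("Rogan-Gladen:", rogan_gladen(y_cal, yhat_cal, yhat_unl))
print("PPI         :", est_ppi, "CI width", ci_ppi[1] - ci_ppi[0])
print("PPI++       :", est_pp, "CI width", ci_pp[1] - ci_pp[0], "lambda", lam)
```

To check coverage rather than a single draw, one would repeat the simulation over many seeds and count how often each interval contains the true theta. With real data, replace `simulate` with your own calibration and judge-only arrays.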
Optimization Features
Inference Optimization
- Efficient Inference
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Assumes calibration labels are missing completely at random (random split).
- Rogan-Gladen relies on good estimates of sensitivity/specificity; unstable with tiny calibration sets.
- EIF requires a consistent estimator of E[Y | Yhat]; poor calibration fits reduce efficiency.
- Instance-dependent or covariate-dependent judge errors and distribution shift need separate treatment.
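To make the E[Y | Yhat] requirement concrete, here is a minimal sketch of a one-step estimator for a binary judge, where the calibration model is simply the mean human label within each judge label. The function names and the group-mean calibration are assumptions for illustration; a linear or GAM fit would take the place of `fit_group_means` for continuous judge scores, and the sketch relies on the random-split assumption listed above.

```python
import numpy as np

def fit_group_means(y_cal, yhat_cal):
    """Estimate m(v) = E[Y | Yhat = v] for v in {0, 1} from the calibration set."""
    overall = float(np.mean(y_cal))
    table = {}
    for v in (0, 1):
        mask = yhat_cal == v
        # Fall back to the overall mean if a judge label never appears.
        table[v] = float(np.mean(y_cal[mask])) if mask.any() else overall
    return table

def one_step(y_cal, yhat_cal, yhat_unl):
    """Plug-in mean of m(Yhat) over all examples plus a calibration-residual correction."""
    m = fit_group_means(y_cal, yhat_cal)
    m_cal = np.where(yhat_cal == 1, m[1], m[0])
    m_unl = np.where(yhat_unl == 1, m[1], m[0])
    plug_in = np.concatenate([m_cal, m_unl]).mean()
    correction = np.mean(y_cal - m_cal)
    return float(plug_in + correction)

# Tiny hand-checkable example (made-up labels).
y_cal = np.array([1, 1, 0, 0, 1, 0])
yhat_cal = np.array([1, 1, 1, 0, 0, 0])
yhat_unl = np.array([1, 1, 1, 0])
estimate = one_step(y_cal, yhat_cal, yhat_unl)
print(estimate)
```

With group means fitted on the same calibration set, the residuals sum to zero within each group, so the correction term vanishes and the estimate is just the average of the fitted values. The correction matters when the calibration model is a parametric fit that does not reproduce the group means exactly.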
When Not To Use
- When calibration labels are non-random or the calibration/test split is shifted.
- When judge errors depend on input features and you do not model those features.
- If you cannot estimate E[Y | Yhat] reasonably (no calibration signal).
Failure Modes
- Extreme RG interval inflation when q0+q1 is close to 1 (the denominator q0+q1−1 of the Rogan-Gladen correction approaches zero) or calibration is tiny.
- Loss of efficiency if the calibration model for E[Y|Yhat] is misspecified.
- Undercoverage if label shift occurs between calibration and test sets.
- High variance when calibration sample size is too small (e.g., 1% labeled).
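The Rogan-Gladen failure mode can be seen directly from its formula. This small sketch uses hypothetical helper names and treats q0 and q1 as known specificity and sensitivity. It shows how any error in the observed judge positive rate is multiplied by 1/|q0+q1−1|.

```python
def rogan_gladen(p_hat, q0, q1):
    """Rogan-Gladen correction: (p_hat + q0 - 1) / (q0 + q1 - 1)."""
    denom = q0 + q1 - 1.0
    if abs(denom) < 1e-12:
        raise ValueError("q0 + q1 = 1: the judge carries no information about Y")
    return (p_hat + q0 - 1.0) / denom

def amplification(q0, q1):
    """Factor by which an error in p_hat is scaled in the corrected estimate."""
    return 1.0 / abs(q0 + q1 - 1.0)

# A reasonably accurate judge versus one that is barely better than chance.
for q0, q1 in [(0.90, 0.90), (0.70, 0.70), (0.55, 0.50)]:
    print(f"q0={q0:.2f} q1={q1:.2f} amplification={amplification(q0, q1):.2f}")
```

In practice q0 and q1 are themselves estimated from the calibration set, and their errors are inflated by the same small denominator, so the interval widens further when calibration data are scarce.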
Core Entities
Models
- GPT-4o-mini
- GPT-5.2
- Claude Opus 4
- Gemini 2.5 Flash
- Gemini 2.5 Pro
- Qwen3-235B
Metrics
- mean win rate
- bias
- coverage probability
- confidence interval width
- RMSE
Datasets
- arena-human-preference-140k

