Make LLM judges reliable and cheaper: EIF and tuned PPI give tighter, valid evaluation with few labels

January 8, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Yiqun T Chen, Sizhu Lu, Sijia Li, Moran Guo, Shengyi Li

Links

Abstract / PDF

Why It Matters For Business

You can get valid, narrower confidence intervals for model evaluation while labeling far fewer examples by using EIF or tuned PPI, which saves annotation cost and gives more precise decisions about model comparisons.

Summary TLDR

LLM-as-a-judge workflows use many LLM labels and few human labels. This paper unifies two debiasing families—measurement-error corrections (Rogan-Gladen/MLE) and surrogate-based methods (PPI/PPI++)—via efficient influence functions (EIF). For mean metrics (e.g., win rate), the EIF (or optimally tuned PPI++) is asymptotically optimal in variance. Simulations and a 140k human-preference dataset show EIF/PPI++ give much narrower confidence intervals and valid coverage; Rogan-Gladen can be extremely wide when judge accuracy or calibration size is low. Code is provided at the authors' GitHub.

Problem Statement

LLM judges are noisy proxies for human labels. Practitioners often have a large pool of LLM judgments and a small calibration set with human labels. Naive averages are biased; existing fixes (misclassification correction vs surrogate correction) differ in efficiency. The problem: how to combine the two datasets to produce valid, low-variance estimates and calibrated confidence intervals for mean outcomes.
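The bias of the naive average can be made concrete for a binary judge. The sketch below assumes the judge is summarized by a sensitivity `q1` and a specificity `q0` (this mapping of the q0/q1 notation is an assumption, and `naive_judge_mean` is a hypothetical helper name, not part of the authors' code).

```python
def naive_judge_mean(theta: float, q0: float, q1: float) -> float:
    """Long-run fraction of positive judge labels.

    theta: true mean of the human label (e.g. true win rate)
    q1:    P(judge = 1 | human = 1), sensitivity (assumed notation)
    q0:    P(judge = 0 | human = 0), specificity (assumed notation)
    """
    # True positives kept by the judge plus true negatives flipped to positive.
    return theta * q1 + (1.0 - theta) * (1.0 - q0)


theta = 0.7
for q in (0.9, 0.8, 0.6):
    expected = naive_judge_mean(theta, q, q)
    print(f"q0=q1={q}: naive mean -> {expected:.2f}, bias {expected - theta:+.2f}")
```

With a true win rate of 0.7 and q0 = q1 = 0.8, the judge-only average converges to 0.62 rather than 0.7, and collecting more judge labels does not remove that gap.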

Main Contribution

Unified view: cast Rogan-Gladen, PPI, PPI++, and MLE as approximations to the efficient influence-function (EIF) solution for mean estimation with surrogate labels.

Derived an explicit EIF estimator and proved it attains the semiparametric efficiency bound.

Showed that for binary outcomes, optimally tuned PPI++ is asymptotically equivalent to the EIF (and MLE).

Combined theory, simulations, and real-data benchmarks to show that EIF/PPI++ give tighter CIs and better finite-sample behavior than Rogan-Gladen, especially with small calibration budgets.

Released implementation and comparison utilities at https://github.com/yiqunchen/debias-llm-as-a-judge.
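For orientation, the estimators named above can be written in their commonly used forms. The notation here is an assumption and may differ from the paper's: the calibration set holds n pairs (Y_i, Ŷ_i), the judge-only set holds N judge labels Ŷ_j, and m(ŷ) = E[Y | Ŷ = ŷ] is the calibration function.

```latex
% Assumed notation: labeled pairs (Y_i, \hat{Y}_i), i = 1..n; judge-only labels \hat{Y}_j, j = 1..N.
\begin{align*}
\hat\theta_{\mathrm{PPI}}
  &= \frac{1}{N}\sum_{j=1}^{N}\hat{Y}_j
   + \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i-\hat{Y}_i\bigr),\\
\hat\theta_{\lambda}
  &= \frac{1}{n}\sum_{i=1}^{n}Y_i
   + \lambda\Bigl(\frac{1}{N}\sum_{j=1}^{N}\hat{Y}_j-\frac{1}{n}\sum_{i=1}^{n}\hat{Y}_i\Bigr),
  \qquad
  \lambda^{\star}=\frac{\operatorname{Cov}(Y,\hat{Y})}{(1+n/N)\,\operatorname{Var}(\hat{Y})},\\
\hat\theta_{\mathrm{EIF}}
  &= \frac{1}{n+N}\sum_{k=1}^{n+N}\hat{m}(\hat{Y}_k)
   + \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i-\hat{m}(\hat{Y}_i)\bigr),
  \qquad
  m(\hat{y})=\mathbb{E}[Y\mid\hat{Y}=\hat{y}].
\end{align*}
```

Setting λ = 1 in the second line recovers PPI, and λ = 0 gives the labeled-only mean. When Ŷ is binary, m is linear in Ŷ with slope Cov(Y, Ŷ)/Var(Ŷ), and substituting it into the third line gives the second line with λ = λ*, which is one way to see the binary-case equivalence stated above.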

Key Findings

EIF-based estimators (and optimally tuned PPI++) produce substantially narrower confidence intervals than standard PPI in simulations.

Numbers: CI width 35–55% smaller vs PPI (Section 5.3; Fig. 4)

Rogan-Gladen intervals can blow up when judge accuracy or calibration size is low.

Numbers: RG intervals ≈10× wider at q0=q1=0.6 in sims (Section 5.3; Fig. 4)
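A minimal sketch of the classical Rogan-Gladen correction shows where the instability comes from. It assumes `q1` is the judge's sensitivity and `q0` its specificity, treats both as known, and uses hypothetical function names.

```python
def rogan_gladen(p_hat: float, q0: float, q1: float) -> float:
    """Correct the raw judge-positive rate p_hat for misclassification.

    q1: sensitivity, q0: specificity (assumed notation).
    """
    denom = q0 + q1 - 1.0
    if denom <= 0:
        raise ValueError("requires q0 + q1 > 1 (judge better than chance)")
    theta = (p_hat + q0 - 1.0) / denom
    # Clip to [0, 1] because sampling noise can push the ratio outside the range.
    return min(1.0, max(0.0, theta))


def noise_amplification(q0: float, q1: float) -> float:
    """Factor by which the standard error of p_hat is scaled (q0, q1 treated as known)."""
    return 1.0 / (q0 + q1 - 1.0)


for q in (0.9, 0.8, 0.6):
    print(f"q0=q1={q}: SE of p_hat is multiplied by {noise_amplification(q, q):.2f}")
```

The division by q0 + q1 - 1 scales every error in the numerator, so the interval widens quickly as the judge approaches chance. This sketch covers only the noise in `p_hat`; in practice q0 and q1 are themselves estimated from the calibration set, which adds further variance beyond what is shown here.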

On real human-preference data, EIF/PPI++/MLE achieved valid coverage with the narrowest intervals; PPI was wider; RG was very wide.

Numbers: Real-data mean CI widths: EIF/PPI++/MLE 0.27–0.30; PPI 0.35–0.44; RG 0.76–0.96 (Section 6.2)

PPI is unbiased but not generally efficient; for binary outcomes optimally tuned PPI++ equals EIF and thus is efficient.

Numbers: Proven equivalence of PPI++ and EIF in binary case (Section 3, propositions)

Results

CI width (simulations)

Value: EIF/PPI++ 35–55% narrower vs PPI

Baseline: PPI

Accuracy

Value: RG intervals ≈10× wider

Baseline: EIF/PPI++

Real-data mean CI width

Value: EIF/PPI++/MLE 0.27–0.30; PPI 0.35–0.44; RG 0.76–0.96

Baseline: naive/RG comparison

Bias

Value: PPI unbiased; naive biased; RG shows larger finite-sample bias

Baseline: naive

What To Try In 7 Days

Collect a small random calibration set (aim for 5–10% of the eval data) and keep the rest as the LLM-only test set.

Run PPI++ (tuning λ) or the EIF one-step estimator, using a simple linear or GAM calibration model for E[Y|Yhat].

Compare CI widths and coverage vs naive and Rogan-Gladen; expect narrower CIs with valid coverage from EIF/PPI++ on binary outcomes.
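The steps above can be sketched with NumPy alone. This is an illustrative implementation of the standard PPI++ mean estimator with a plug-in λ, not the authors' released code; `ppi_pp_mean` and the simulated judge (80% accurate, 10% random calibration split) are assumptions made for the example.

```python
import numpy as np


def ppi_pp_mean(y_lab, yhat_lab, yhat_unlab, lam=None, z=1.96):
    """PPI++-style estimate of E[Y] with a normal-approximation CI.

    lam=0.0 is the labeled-only mean, lam=1.0 is plain PPI,
    lam=None plugs in the variance-minimizing lambda.
    """
    y_lab = np.asarray(y_lab, dtype=float)
    yhat_lab = np.asarray(yhat_lab, dtype=float)
    yhat_unlab = np.asarray(yhat_unlab, dtype=float)
    n, N = len(y_lab), len(yhat_unlab)
    if n < 2 or N < 2:
        raise ValueError("need at least two labeled and two judge-only examples")
    if lam is None:
        # Plug-in lambda* = Cov(Y, Yhat) / ((1 + n/N) Var(Yhat)).
        var_hat = np.var(np.concatenate([yhat_lab, yhat_unlab]), ddof=1)
        cov = np.cov(y_lab, yhat_lab, ddof=1)[0, 1]
        lam = 0.0 if var_hat == 0 else cov / ((1.0 + n / N) * var_hat)
    lam = float(lam)
    est = y_lab.mean() + lam * (yhat_unlab.mean() - yhat_lab.mean())
    # The labeled and judge-only samples are independent, so variances add.
    var = (np.var(y_lab - lam * yhat_lab, ddof=1) / n
           + lam ** 2 * np.var(yhat_unlab, ddof=1) / N)
    half = z * np.sqrt(var)
    return est, (est - half, est + half), lam


# Simulated example (assumed setup): true win rate 0.7, judge correct 80% of the time.
rng = np.random.default_rng(0)
total, frac_labeled = 5000, 0.10
y = rng.binomial(1, 0.7, size=total)
judge_correct = rng.random(total) < 0.8
yhat = np.where(judge_correct, y, 1 - y)
is_lab = rng.random(total) < frac_labeled  # random calibration split

print(f"naive judge mean: {yhat.mean():.3f}")
for name, lam in [("labeled only", 0.0), ("PPI", 1.0), ("PPI++ (tuned)", None)]:
    est, (lo, hi), used = ppi_pp_mean(y[is_lab], yhat[is_lab], yhat[~is_lab], lam=lam)
    print(f"{name:14s} est={est:.3f}  CI=({lo:.3f}, {hi:.3f})  "
          f"width={hi - lo:.3f}  lambda={used:.2f}")
```

On real data, replace the simulated arrays with the human labels and judge labels from the calibration set and the judge labels from the remaining examples. Coverage can only be checked by repeating the split many times or by simulation, since a single interval either contains the target or does not.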

Optimization Features

Inference Optimization

  • Efficient Inference

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Assumes calibration labels are missing completely at random (random split).
  • Rogan-Gladen relies on good estimates of sensitivity/specificity; unstable with tiny calibration sets.
  • EIF requires a consistent estimator of E[Y | Yhat]; poor calibration fits reduce efficiency.
  • Instance-dependent or covariate-dependent judge errors and distribution shift need separate treatment.
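When the judge output is discrete (for example a binary win/loss vote), E[Y | Yhat] can be estimated by simple group means on the calibration set. The sketch below is a one-step estimator under that assumption and under a random calibration split; `eif_mean_discrete` is a hypothetical name, and continuous judge scores would need a regression model (linear, GAM) in place of the group means.

```python
import numpy as np


def eif_mean_discrete(y_lab, yhat_lab, yhat_unlab, z=1.96):
    """One-step estimate of E[Y] using group means as the estimate of m(yhat) = E[Y | Yhat]."""
    y_lab = np.asarray(y_lab, dtype=float)
    yhat_lab = np.asarray(yhat_lab)
    yhat_unlab = np.asarray(yhat_unlab)
    n = len(y_lab)
    yhat_all = np.concatenate([yhat_lab, yhat_unlab])

    # Fit m_hat on the calibration set; fall back to the overall labeled mean
    # for judge values that never appear among the labeled examples.
    overall = y_lab.mean()
    m_hat = {}
    for v in np.unique(yhat_all):
        mask = yhat_lab == v
        m_hat[v] = y_lab[mask].mean() if mask.any() else overall

    m_all = np.array([m_hat[v] for v in yhat_all])
    resid = y_lab - m_all[:n]  # labeled examples come first in yhat_all

    # Plug-in term over all data plus a residual correction from labeled data.
    # With in-sample group means the correction averages to zero; it is kept
    # so the same formula works with cross-fitted or parametric m_hat.
    est = m_all.mean() + resid.mean()
    var = np.var(m_all, ddof=1) / len(m_all) + np.var(resid, ddof=1) / n
    half = z * np.sqrt(var)
    return est, (est - half, est + half)


est, (lo, hi) = eif_mean_discrete(
    y_lab=[1, 0, 1, 1],
    yhat_lab=[1, 0, 0, 1],
    yhat_unlab=[1, 1, 0, 1, 0, 1, 1, 0],
)
print(f"estimate={est:.3f}, CI=({lo:.3f}, {hi:.3f})")
```

If the calibration fit is poor, the residual term becomes noisier and the interval widens, which is the loss of efficiency described in the list above.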

When Not To Use

  • When calibration labels are not a random sample, or the calibration and test sets differ in distribution.
  • When judge errors depend on input features and you do not model those features.
  • If you cannot estimate E[Y | Yhat] reasonably (no calibration signal).

Failure Modes

  • Extreme RG interval inflation when q0+q1 is close to 1 or calibration is tiny.
  • Loss of efficiency if the calibration model for E[Y|Yhat] is misspecified.
  • Undercoverage if label shift occurs between calibration and test sets.
  • High variance when calibration sample size is too small (e.g., 1% labeled).

Core Entities

Models

  • GPT-4o-mini
  • GPT-5.2
  • Claude Opus 4
  • Gemini 2.5 Flash
  • Gemini 2.5 Pro
  • Qwen3-235B

Metrics

  • mean win rate
  • bias
  • coverage probability
  • confidence interval width
  • RMSE

Datasets

  • arena-human-preference-140k