Overview
The paper combines formal efficiency theory with simulations and a real preference dataset. The resulting estimators are directly implementable and can reduce annotation costs for evaluation pipelines.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
By using the EIF estimator or tuned PPI++, you can get valid, narrower confidence intervals for model evaluation while labeling far fewer examples. That lowers annotation cost and supports more precise decisions when comparing models.
Who Should Care
Summary TLDR
LLM-as-a-judge workflows use many LLM labels and few human labels. This paper unifies two debiasing families—measurement-error corrections (Rogan-Gladen/MLE) and surrogate-based methods (PPI/PPI++)—via efficient influence functions (EIF). For mean metrics (e.g., win rate), the EIF (or optimally tuned PPI++) is asymptotically optimal in variance. Simulations and a 140k human-preference dataset show EIF/PPI++ give much narrower confidence intervals and valid coverage; Rogan-Gladen can be extremely wide when judge accuracy or calibration size is low. Code is provided at the authors' GitHub.
Problem Statement
LLM judges are noisy proxies for human labels. Practitioners often have a large pool of LLM judgments and a small calibration set with human labels. Naive averages are biased; existing fixes (misclassification correction vs surrogate correction) differ in efficiency. The problem: how to combine the two datasets to produce valid, low-variance estimates and calibrated confidence intervals for mean outcomes.
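To make the bias of the naive average concrete, here is a minimal sketch for a binary outcome. It assumes q1 is the judge's sensitivity and q0 its specificity, which is one reading of the q0/q1 notation used in the results; the function name is illustrative and not taken from the paper's code.

```python
def expected_naive_mean(theta: float, q0: float, q1: float) -> float:
    """Expected judge-positive rate when the true positive rate is theta.

    Assumed notation (not confirmed by the source):
      q1 = P(judge says 1 | human says 1)  (sensitivity)
      q0 = P(judge says 0 | human says 0)  (specificity)
    """
    # True positives kept by the judge plus true negatives flipped to positive.
    return theta * q1 + (1.0 - theta) * (1.0 - q0)


if __name__ == "__main__":
    theta = 0.7
    for q in (1.0, 0.8, 0.6):
        naive = expected_naive_mean(theta, q, q)
        print(f"q0=q1={q:.1f}: naive mean {naive:.2f}, bias {naive - theta:+.2f}")
```

With a true rate of 0.7 and q0 = q1 = 0.6, the naive average converges to 0.54 rather than 0.7, and collecting more LLM judgments does not shrink that gap.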
Main Contribution
Unified view: casts Rogan-Gladen, PPI, PPI++, and MLE as approximations to the efficient influence-function (EIF) solution for mean estimation with surrogate labels.
Explicit EIF estimator: derives the estimator and proves that it attains the semiparametric efficiency bound.
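For orientation, the estimators named above can be written in one notation. These are the standard textbook forms under a random labeled/unlabeled split, not a transcription of the paper's derivation. The symbols are assumptions introduced here: n labeled pairs (Y_i, Ŷ_i), N judge-only labels Ŷ_j, calibration-set estimates q̂0 and q̂1 of specificity and sensitivity, and a fitted calibration function μ̂.

```latex
\begin{align*}
\hat\theta_{\mathrm{RG}} &= \frac{\hat p + \hat q_0 - 1}{\hat q_0 + \hat q_1 - 1},
  \qquad \hat p = \frac{1}{N}\sum_{j=1}^{N}\hat Y_j, \\
\hat\theta_{\mathrm{PPI++}}(\lambda) &= \frac{1}{n}\sum_{i=1}^{n} Y_i
  + \lambda\left(\frac{1}{N}\sum_{j=1}^{N}\hat Y_j - \frac{1}{n}\sum_{i=1}^{n}\hat Y_i\right), \\
\hat\theta_{\mathrm{EIF}} &= \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - \hat\mu(\hat Y_i)\bigr)
  + \frac{1}{n+N}\sum_{k=1}^{n+N}\hat\mu(\hat Y_k),
  \qquad \mu(\hat y) = \mathbb{E}[Y \mid \hat Y = \hat y].
\end{align*}
```

Setting λ = 1 gives standard PPI. When Ŷ is binary, μ is linear in Ŷ, so the EIF form and PPI++ with the variance-minimizing λ = Cov(Y, Ŷ) / ((1 + n/N) Var(Ŷ)) agree up to estimation error.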
Key Findings
EIF-based estimators (and optimally tuned PPI++) produce substantially narrower confidence intervals than standard PPI in simulations.
Rogan-Gladen intervals can blow up when judge accuracy or calibration size is low.
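The Rogan-Gladen blow-up can be read off the correction formula itself. The sketch below uses the standard form of the correction, with q1 as sensitivity and q0 as specificity (an assumed reading of the notation); the helper `error_amplification` is a hypothetical name used only for illustration.

```python
def rogan_gladen(p_hat: float, q0: float, q1: float) -> float:
    """Rogan-Gladen correction of an observed judge-positive rate p_hat.

    The denominator q0 + q1 - 1 shrinks toward zero as the judge
    approaches chance level, which is what inflates the intervals.
    """
    denom = q0 + q1 - 1.0
    if denom <= 0.0:
        raise ValueError("judge must be better than chance: q0 + q1 > 1")
    return (p_hat + q0 - 1.0) / denom


def error_amplification(q0: float, q1: float) -> float:
    """Change in the corrected estimate per unit change in p_hat."""
    return 1.0 / (q0 + q1 - 1.0)


if __name__ == "__main__":
    for q in (0.9, 0.75, 0.6, 0.55):
        factor = error_amplification(q, q)
        print(f"q0=q1={q:.2f}: noise in p_hat is scaled by {factor:.1f}x")
```

Sampling noise in the judge-positive rate is multiplied by 1/(q0 + q1 - 1), and noise in q0 and q1 estimated from a small calibration set adds to that, so the interval widens quickly as the judge weakens.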
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| CI width (simulations) | EIF/PPI++ 35–55% narrower vs PPI | PPI | 35–55% reduction | Simulation grid (Section 5.3) | Figure 4 | Section 5.3, Figure 4 |
| CI width (low judge accuracy) | RG intervals ≈10× wider | EIF/PPI++ | ≈10× wider | q0=q1=0.6 simulation (Section 5.3) | Figure 4 and text | Section 5.3, Figure 4 |
What To Try In 7 Days
Collect a small random calibration set (aim for 5–10% of the eval data) and hold the rest as an LLM-only test set.
Run PPI++ (tuning λ) or the EIF one-step estimator, using a simple linear or GAM calibration model for E[Y|Yhat].
Compare CI widths and coverage against the naive and Rogan-Gladen estimators; expect narrower CIs with valid coverage from EIF/PPI++ on binary outcomes.
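The steps above can be sketched end to end on synthetic data. This is a minimal illustration and not the authors' code: the simulation parameters (true rate 0.6, q0 = q1 = 0.85, 10,000 items, 5% calibration) are assumptions chosen for the example, and the PPI++ estimator uses the standard plug-in variance-minimizing λ with a normal-approximation interval.

```python
import math
import random
from statistics import NormalDist, fmean


def simulate(n_total, theta, q0, q1, rng):
    """Draw (human label, judge label) pairs for a binary outcome."""
    pairs = []
    for _ in range(n_total):
        y = 1 if rng.random() < theta else 0
        correct = rng.random() < (q1 if y == 1 else q0)
        pairs.append((y, y if correct else 1 - y))
    return pairs


def _var(xs):
    m = fmean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def _cov(xs, ys):
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def ppi_plus_plus(y_lab, f_lab, f_unlab, alpha=0.05):
    """Mean estimate with a plug-in variance-minimizing lambda and a normal CI."""
    n, N = len(y_lab), len(f_unlab)
    var_y, var_f, cov = _var(y_lab), _var(f_lab), _cov(y_lab, f_lab)
    lam = cov / ((1.0 + n / N) * var_f) if var_f > 0 else 0.0
    est = fmean(y_lab) + lam * (fmean(f_unlab) - fmean(f_lab))
    # Variance of the labeled rectifier term plus the unlabeled judge mean.
    var_est = (var_y - 2 * lam * cov + lam**2 * var_f) / n + lam**2 * var_f / N
    half = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(max(var_est, 0.0))
    return est, (est - half, est + half), lam


def labeled_only(y_lab, alpha=0.05):
    """Classical mean and normal CI from the human labels alone."""
    est = fmean(y_lab)
    half = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(_var(y_lab) / len(y_lab))
    return est, (est - half, est + half)


rng = random.Random(0)
data = simulate(10_000, theta=0.6, q0=0.85, q1=0.85, rng=rng)
rng.shuffle(data)                      # random split, so labels are missing at random
calib, rest = data[:500], data[500:]   # 5% calibration set
y_lab = [y for y, _ in calib]
f_lab = [f for _, f in calib]
f_unlab = [f for _, f in rest]         # human labels in `rest` are never used

est, ci, lam = ppi_plus_plus(y_lab, f_lab, f_unlab)
base, base_ci = labeled_only(y_lab)
print(f"PPI++       : {est:.3f}  CI width {ci[1] - ci[0]:.3f}  lambda {lam:.2f}")
print(f"labeled only: {base:.3f}  CI width {base_ci[1] - base_ci[0]:.3f}")
```

Because λ minimizes the plug-in variance and λ = 0 recovers the labeled-only estimator, the PPI++ interval here is never wider than the labeled-only one. For non-binary judge scores, the `lam * f` terms would be replaced by a fitted calibration function for E[Y|Yhat], which is where model misspecification can cost efficiency.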
Optimization Features
Inference Optimization
Risks & Boundaries
Limitations
Assumes calibration labels are missing completely at random (random split).
Rogan-Gladen relies on good estimates of sensitivity/specificity; unstable with tiny calibration sets.
When Not To Use
When calibration labels are not randomly sampled, or when there is distribution shift between the calibration and test sets.
When judge errors depend on input features and you do not model those features.
Failure Modes
Extreme RG interval inflation when q0+q1 is close to 1 (the correction divides by q0+q1-1) or when the calibration set is tiny.
Loss of efficiency if the calibration model for E[Y|Yhat] is misspecified.

