Train an LLM judge that learns which training examples matter and boosts Best-of-N code selection

January 29, 2026 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Peijia Qin, Ruiyi Zhang, Qi Cao, Pengtao Xie

Why It Matters For Business

DAJ improves which sampled code solution gets chosen at inference, raising accuracy especially on hard cases without changing core models; that can cut debugging cost and improve customer-facing code reliability.

Summary TLDR

DAJ trains a reasoning-based LLM judge using bi-level data reweighting so that the judge focuses on hard, in-distribution, and trajectory-aligned examples. The judge produces step-by-step code evaluations and is trained with verifiable (auto-checkable) rewards. On LiveCodeBench, DAJ reaches 84.7% pass@1 (vs 77.1% for o4-mini (high)) and raises pass@1 on the hardest problems to 73.8%, up from 67.2% for the best baseline. On BigCodeBench it matches or slightly beats prior test-time scaling methods (35.9% vs 35.2% for Skywork-o1 PRM). The approach is plug-in: keep your policy model, add DAJ as an inference-time judge, and improve selection quality without changing the base model.

Problem Statement

Training LLM judges for Best-of-N selection fails in practice because training data differ from test-time data in three ways: (1) easy problems dominate, (2) task distributions shift over time and across platforms, and (3) training trajectories often come from cheaper models that behave differently from the models used at inference time. DAJ addresses this by learning importance weights over the training data (at the domain or instance level) through bi-level optimization: the weights are tuned so that a judge trained on the reweighted data generalizes to a held-out meta set aligned with the test distribution.
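One common way to write a bi-level reweighting objective of this kind is sketched below. The notation is assumed for illustration and is not taken from the paper: \theta denotes the judge parameters, w_i(\phi) the weight of training example i produced by a weighting function with parameters \phi (one weight per domain or per instance), \ell the per-example training loss, and the meta examples x_j^{meta} form the held-out meta set.

```latex
\begin{aligned}
\min_{\phi}\quad
  & \mathcal{L}_{\text{meta}}\bigl(\theta^{*}(\phi)\bigr)
    = \frac{1}{M}\sum_{j=1}^{M} \ell\bigl(x_j^{\text{meta}};\,\theta^{*}(\phi)\bigr) \\
\text{s.t.}\quad
  & \theta^{*}(\phi)
    = \arg\min_{\theta}\; \sum_{i=1}^{N} w_i(\phi)\,\ell\bigl(x_i^{\text{train}};\,\theta\bigr)
\end{aligned}
```

The inner problem trains the judge on weighted data; the outer problem adjusts the weights so that the resulting judge does well on the meta set.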

Main Contribution

A bi-level data-reweighting framework that learns domain- or instance-level weights to train an LLM judge focused on hard, in-distribution, and trajectory-aligned examples.

A reasoning-based judge that generates step-by-step verification before selecting candidates, trained with verifiable rewards (no human reasoning labels required).

Comprehensive experiments showing DAJ improves Best-of-N selection on LiveCodeBench and BigCodeBench and across multiple policy models.
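To make the idea of a verifiable reward for a pairwise judge concrete, here is a minimal sketch. It assumes each training pair contains exactly one candidate that passes the unit tests, and that the judge ends its reasoning with a line such as `Verdict: A`. The output format and the names `parse_verdict` and `verifiable_reward` are hypothetical, not the paper's implementation.

```python
import re
from typing import Optional

# Hypothetical output convention: the judge's reasoning ends with "Verdict: A" or "Verdict: B".
VERDICT_RE = re.compile(r"Verdict:\s*([AB])\b")


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return the last verdict ('A' or 'B') found in the judge's text, or None."""
    matches = VERDICT_RE.findall(judge_output)
    return matches[-1] if matches else None


def verifiable_reward(judge_output: str, a_passes: bool, b_passes: bool) -> float:
    """Reward 1.0 if the judge picked the candidate that passes the tests, else 0.0.

    The pass/fail flags come from executing the candidates against unit tests,
    so no human-written reasoning labels are needed.
    """
    if a_passes == b_passes:
        raise ValueError("pair must contain exactly one passing candidate")
    verdict = parse_verdict(judge_output)
    if verdict is None:
        return 0.0  # unparseable output earns no reward
    correct = "A" if a_passes else "B"
    return 1.0 if verdict == correct else 0.0
```

Only the final verdict is checked here; the step-by-step reasoning is rewarded indirectly through whether it leads to the right choice.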

Key Findings

DAJ improves overall pass@1 on LiveCodeBench to 84.7%.

Numbers: 84.7% overall; +7.6 pts vs o4-mini (high)

DAJ lifts hard-problem pass rates substantially.

Numbers: 73.8% on hard problems vs 67.2% for the best baseline

DAJ matches or edges prior methods on BigCodeBench.

Numbers: 35.9% overall vs 35.2% (Skywork-o1 PRM)

All reweighting strategies help; instance-net performs best in experiments.

Numbers: preference-optimization average of 83.5% with instance net vs 82.2% with no reweighting

Results

LiveCodeBench overall pass@1

Value: 84.7%

Baseline: o4-mini (high) 77.1%

LiveCodeBench hard split pass@1

Value: 73.8%

Baseline: best baseline 67.2%

BigCodeBench overall pass@1

Value: 35.9%

Baseline: Skywork-o1 PRM 35.2%

Average across policy models (pass@1)

Value: 62.1%

Baseline: Random 58.4%
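The gaps between DAJ and each baseline can be recomputed from the figures above. The short script below uses only the numbers listed in this section; the relative-gain column is an added convenience and is not a figure reported in the source.

```python
# (benchmark, DAJ pass@1, baseline name, baseline pass@1), copied from the results above.
RESULTS = [
    ("LiveCodeBench overall", 84.7, "o4-mini (high)", 77.1),
    ("LiveCodeBench hard split", 73.8, "best baseline", 67.2),
    ("BigCodeBench overall", 35.9, "Skywork-o1 PRM", 35.2),
    ("Average across policy models", 62.1, "Random", 58.4),
]


def gains(daj: float, baseline: float):
    """Return (absolute gain in points, relative gain in percent), rounded to one decimal."""
    absolute = round(daj - baseline, 1)
    relative = round(100.0 * (daj - baseline) / baseline, 1)
    return absolute, relative


for name, daj, base_name, base in RESULTS:
    abs_gain, rel_gain = gains(daj, base)
    print(f"{name}: +{abs_gain} pts vs {base_name} ({rel_gain}% relative)")
```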

Who Should Care

What To Try In 7 Days

Build a small meta set aligned to your target tasks (recent or hardest problems) and hold it out for validation.

Fine-tune a lightweight reasoning judge with LoRA and pairwise prompts that produce step-by-step analysis.

Implement instance-net reweighting: train an MLP that predicts sample weights from loss and tune meta-learning rates via one-step unrolling.
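The last step can be illustrated with a toy, dependency-free sketch of loss-based instance reweighting with one-step unrolling. Everything here is a stand-in chosen for readability, not the paper's implementation: the "judge" is a scalar model `y_hat = theta * x`, the instance net is a single sigmoid unit `sigmoid(a * loss + b)` rather than an MLP, the loss fed to the instance net is treated as a constant, and the gradients are written out by hand. The function names, data, and learning rates are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def loss_and_grad(theta: float, x: float, y: float):
    """Squared error of the toy model y_hat = theta * x, and its gradient in theta."""
    err = theta * x - y
    return err * err, 2.0 * err * x


def instance_weights(losses, a: float, b: float):
    """Toy instance net: maps each (detached) loss to a weight in (0, 1)."""
    return [sigmoid(a * l + b) for l in losses]


def meta_step(theta, a, b, train, meta, lr=0.05, meta_lr=0.5):
    """One bi-level update: adjust (a, b) on the meta set, then update theta."""
    stats = [loss_and_grad(theta, x, y) for x, y in train]
    losses = [s[0] for s in stats]
    grads = [s[1] for s in stats]
    w = instance_weights(losses, a, b)
    n = len(train)

    # Inner step (one-step unrolling): a virtual update of theta on the weighted loss.
    theta_virtual = theta - lr * sum(wi * gi for wi, gi in zip(w, grads)) / n

    # Outer objective: gradient of the mean meta loss at the virtual parameters.
    meta_grad = sum(loss_and_grad(theta_virtual, x, y)[1] for x, y in meta) / len(meta)

    # Chain rule through the virtual update: d(theta_virtual)/d(w_i) = -lr * g_i / n.
    grad_a = grad_b = 0.0
    for wi, li, gi in zip(w, losses, grads):
        dmeta_dw = meta_grad * (-lr * gi / n)
        dw_dz = wi * (1.0 - wi)  # derivative of the sigmoid
        grad_a += dmeta_dw * dw_dz * li
        grad_b += dmeta_dw * dw_dz

    a -= meta_lr * grad_a
    b -= meta_lr * grad_b

    # Real update of theta using the refreshed weights.
    w = instance_weights(losses, a, b)
    theta -= lr * sum(wi * gi for wi, gi in zip(w, grads)) / n
    return theta, a, b


# Two "easy" examples (slope 1) and two target-like examples (slope 3).
TRAIN = [(1.0, 1.0), (2.0, 2.0), (1.0, 3.0), (2.0, 6.0)]
# The meta set follows the target distribution (slope 3).
META = [(1.5, 4.5), (0.5, 1.5)]
```

Calling `meta_step(1.0, 0.0, 0.0, TRAIN, META)` starts from a model that already fits the easy examples. In this toy setup, one step makes `a` positive, so high-loss, target-like examples receive larger weights and `theta` moves further toward the meta-set slope than a uniformly weighted step would. In a real setup an autodiff framework would replace the hand-written chain rule.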

Optimization Features

Training Optimization

  • bi-level data reweighting
  • preference optimization (DPO, KTO, ORPO)
  • GRPO

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires a held-out meta set that closely matches test-time tasks; constructing it needs care.
  • Adds inference latency: the judge runs multi-round pairwise reasoning for Best-of-N selection.
  • Effectiveness depends on the quality of candidate-generation policies; poor base models limit selection gains.
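The latency cost of pairwise judging can be estimated by counting judge calls. The sketch below assumes a single-elimination (knockout) bracket, which is one plausible way to run multi-round pairwise selection; the summary does not specify the exact procedure. `judge` is a hypothetical callable that returns 0 if it prefers the first candidate and 1 otherwise.

```python
from typing import Callable, List, Tuple


def knockout_select(
    candidates: List[str], judge: Callable[[str, str], int]
) -> Tuple[str, int]:
    """Pick one candidate via pairwise knockout rounds.

    Returns (winner, number_of_judge_calls). With N candidates the bracket
    makes N - 1 judge calls, spread over about log2(N) sequential rounds.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    calls = 0
    while len(pool) > 1:
        next_round = []
        # Compare neighbours pairwise; each comparison is one judge call.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            calls += 1
            next_round.append(a if judge(a, b) == 0 else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # the odd one out gets a bye
        pool = next_round
    return pool[0], calls
```

Comparisons within a round are independent, so they can be issued in parallel; the sequential depth, rather than the total call count, then drives wall-clock latency.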

When Not To Use

  • When you cannot afford extra inference-time compute or latency for judge reasoning.
  • When no reliable meta set can be built that matches target tasks.
  • If candidate generators are extremely weak and rarely produce correct solutions.

Failure Modes

  • The judge may overfit to the meta set if it is too small or unrepresentative.
  • Reweighting can upweight noisy but meta-correlated samples, amplifying dataset artifacts.
  • Performance gains shrink for policy models that rarely produce correct candidates (low ceiling).
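The low-ceiling failure mode can be checked before training a judge by measuring how much headroom selection has at all. The sketch below assumes you already have, for each problem, a list of pass/fail flags for its N sampled candidates; the function names are hypothetical.

```python
from typing import List, Sequence


def random_pick_rate(results: Sequence[Sequence[bool]]) -> float:
    """Expected pass@1 when one candidate per problem is picked uniformly at random."""
    return sum(sum(flags) / len(flags) for flags in results) / len(results)


def oracle_rate(results: Sequence[Sequence[bool]]) -> float:
    """Pass rate of a perfect selector: a problem counts if any candidate passes."""
    return sum(1 for flags in results if any(flags)) / len(results)


def selection_headroom(results: Sequence[Sequence[bool]]) -> float:
    """Upper bound on what any judge can add over random selection."""
    return oracle_rate(results) - random_pick_rate(results)


# Toy data: 4 problems, 4 sampled candidates each (True means the tests pass).
EXAMPLE: List[List[bool]] = [
    [True, False, False, False],
    [False, False, False, False],  # no correct candidate: no judge can rescue this
    [True, True, False, False],
    [False, False, False, True],
]
```

If `selection_headroom` is close to zero for your policy model, even a perfect judge has little to gain, and improving candidate generation is likely the better investment.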

Core Entities

Models

  • Qwen3-Coder-30B-A3B
  • Qwen2.5-Coder-32B
  • DeepSeek V3.2 Speciale
  • o4-mini (high)
  • Qwen2.5-Coder-14B (fine-tuned judge)
  • Qwen3-1.7B (judge RL backbone)

Metrics

  • Accuracy

Datasets

  • LiveCodeBench
  • BigCodeBench

Benchmarks

  • LiveCodeBench
  • BigCodeBench