Train an LLM judge that learns which training examples matter and boosts Best-of-N code selection

January 29, 2026 · 6 min

Overview

Decision Snapshot: Needs Validation

The method is practical: it uses standard tools (LoRA, Betty) and reports wins on public benchmarks, but it requires extra training and inference compute. Evidence comes from leaderboard comparisons and ablations on two benchmarks.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Peijia Qin, Ruiyi Zhang, Qi Cao, Pengtao Xie

Why It Matters For Business

DAJ improves which sampled code solution gets chosen at inference, raising accuracy especially on hard cases without changing core models; that can cut debugging cost and improve customer-facing code reliability.

Summary TLDR

DAJ trains a reasoning-based LLM judge using bi-level data reweighting so the judge focuses on hard, in-distribution, and trajectory-aligned examples. The judge produces step-by-step code evaluations and is trained with verifiable (automatically checkable) rewards. On LiveCodeBench, DAJ reaches 84.7% pass@1, ahead of the reported baselines, and improves the hardest-case pass rate (73.8% vs. 67.2%). On BigCodeBench it matches or beats prior test-time scaling methods (35.9%). The approach is plug-in: keep your policy model, add DAJ as an inference-time judge, and gain selection quality without changing the base model.
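To make the plug-in idea concrete, here is a minimal sketch of Best-of-N selection driven by a pairwise judge. The pairing scheme (a sequential knockout) is an assumption, since this summary does not specify how DAJ schedules its comparisons, and `toy_judge` with its `HIDDEN_QUALITY` table is a hypothetical stand-in for the real LLM judge call.

```python
from typing import Callable, Dict, List


def best_of_n(candidates: List[str],
              judge_prefers_first: Callable[[str, str], bool]) -> str:
    """Select one candidate with a sequential knockout of pairwise comparisons.

    The running winner is compared against each remaining candidate,
    so N candidates cost N - 1 judge calls.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    winner = candidates[0]
    for challenger in candidates[1:]:
        if not judge_prefers_first(winner, challenger):
            winner = challenger
    return winner


# Hypothetical stand-in for the judge: in practice this would prompt the
# fine-tuned judge model with both solutions and parse its verdict.
HIDDEN_QUALITY: Dict[str, float] = {"sol_a": 0.2, "sol_b": 0.9, "sol_c": 0.5}


def toy_judge(first: str, second: str) -> bool:
    return HIDDEN_QUALITY[first] >= HIDDEN_QUALITY[second]


selected = best_of_n(["sol_a", "sol_b", "sol_c"], toy_judge)
```

The policy model that produced the candidates is never touched; only the selection step changes.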

Problem Statement

Training LLM judges for Best-of-N selection fails in practice because training data differ from test-time data in three ways: (1) easy problems dominate, (2) task distributions shift over time and by platform, and (3) training trajectories often come from cheaper models that behave differently from inference-time models. DAJ learns importance weights over training data (domain- or instance-level) through bi-level optimization that targets judge generalization on a held-out meta set aligned to the test distribution.
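The bi-level setup described above can be written in a generic form. This is a standard data-reweighting formulation under assumed notation, not necessarily the paper's exact objective: $\theta$ are the judge parameters, $w_i$ the weight of training example $x_i$, $\ell$ the judge's training loss, $x^{\mathrm{meta}}_j$ the held-out meta examples, and $\eta$, $\beta$ the inner and meta learning rates.

```latex
\begin{align}
  % Inner problem: train the judge on weighted training data
  \theta^{*}(w) &= \arg\min_{\theta} \sum_{i=1}^{n} w_i \, \ell\big(x_i; \theta\big) \\
  % Outer problem: choose weights so the trained judge does well on the meta set
  w^{*} &= \arg\min_{w} \frac{1}{m} \sum_{j=1}^{m} \ell\big(x^{\mathrm{meta}}_j; \theta^{*}(w)\big) \\
  % One-step unrolling: approximate \theta^{*}(w) with a single gradient step
  \theta'(w) &= \theta - \eta \, \nabla_{\theta} \sum_{i=1}^{n} w_i \, \ell\big(x_i; \theta\big) \\
  w &\leftarrow w - \beta \, \nabla_{w} \frac{1}{m} \sum_{j=1}^{m} \ell\big(x^{\mathrm{meta}}_j; \theta'(w)\big)
\end{align}
```

For domain-level weighting, all examples in a domain share one $w$; for instance-level weighting, $w_i$ is produced by a small network from per-example features such as the loss.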

Main Contribution

A bi-level data-reweighting framework that learns domain- or instance-level weights to train an LLM judge focused on hard, in-distribution, and trajectory-aligned examples.

A reasoning-based judge that generates step-by-step verification before selecting candidates, trained with verifiable rewards (no human reasoning labels required).
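A verifiable reward for a pairwise judge can be sketched as follows: the judge writes its reasoning, ends with a verdict, and the reward checks that verdict against unit-test outcomes. The `Verdict: A` / `Verdict: B` output format and the zero reward for uninformative pairs are assumptions made for illustration; the summary does not describe the paper's actual reward shaping.

```python
import re
from typing import Optional


def parse_choice(judge_output: str) -> Optional[str]:
    """Return the last 'Verdict: A' or 'Verdict: B' found in the judge's text.

    The verdict format is a hypothetical convention for this sketch.
    """
    matches = re.findall(r"Verdict:\s*([AB])\b", judge_output)
    return matches[-1] if matches else None


def verifiable_reward(judge_output: str, a_passes: bool, b_passes: bool) -> float:
    """Reward 1.0 when the judge picks the candidate that passes the tests.

    No human-written reasoning labels are needed: correctness of the final
    choice is checked automatically from test results.
    """
    choice = parse_choice(judge_output)
    if choice is None:
        return 0.0  # unparseable output earns nothing
    if a_passes == b_passes:
        return 0.0  # uninformative pair; a real pipeline would likely filter these
    correct = "A" if a_passes else "B"
    return 1.0 if choice == correct else 0.0
```

Only the final choice is scored, so the step-by-step analysis is free-form and shaped indirectly by whether it leads to the right verdict.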

Key Findings

DAJ improves overall pass@1 on LiveCodeBench to 84.7%.

Numbers: 84.7% overall; +7.6 pts vs. o4-mini (high)

Practical Use: Using DAJ as the judge for Best-of-N selection can raise overall correct-selection rates notably on LiveCodeBench-style tasks.

Evidence Ref: Table 2; Section 5.2

DAJ lifts hard-problem pass rates substantially.

Numbers: 73.8% on hard problems vs. 67.2% for best baseline

Practical Use: If your workload contains many hard examples, train a data-weighted judge to prioritize difficult samples and improve worst-case performance.

Evidence Ref: Table 2; Section 5.2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| LiveCodeBench overall pass@1 | 84.7% | o4-mini (high) 77.1% | +7.6 pts | LiveCodeBench (overall) | Table 2; Section 5.2 | Table 2 |
| LiveCodeBench hard split pass@1 | 73.8% | Best baseline 67.2% | +6.6 pts | LiveCodeBench (hard) | Table 2; Section 5.2 | Table 2 |

What To Try In 7 Days

Build a small meta set aligned to your target tasks (recent or hardest problems) and hold it out for validation.

Fine-tune a lightweight reasoning judge with LoRA and pairwise prompts that produce step-by-step analysis.

Implement instance-net reweighting: train an MLP that predicts sample weights from loss and tune meta-learning rates via one-step unrolling.
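The reweighting step in the last item can be prototyped on a toy problem before touching an LLM. The sketch below is an illustration under strong simplifying assumptions: the "judge" is a single scalar `theta` with squared-error loss, the instance net is one logistic unit `sigmoid(a * loss + b)` rather than an MLP, and the meta-gradient is derived by hand for one unrolled step. All names are hypothetical; a real setup would use an autodiff framework or a bi-level library such as Betty.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_with_reweighting(train_y: List[float], meta_y: List[float],
                           steps: int = 200, inner_lr: float = 0.5,
                           meta_lr: float = 1.0) -> Tuple[float, float, float]:
    """Toy instance-level reweighting with a one-step unrolled meta-gradient.

    Returns (theta, a, b): the model parameter and the weight-net parameters.
    """
    theta, a, b = 0.0, 0.0, 0.0
    n = len(train_y)
    for _ in range(steps):
        # Per-example loss and gradient of 0.5 * (theta - y)^2.
        losses = [0.5 * (theta - y) ** 2 for y in train_y]
        grads = [theta - y for y in train_y]
        # The loss is treated as a fixed input feature of the weight net.
        weights = [sigmoid(a * l + b) for l in losses]

        # Virtual (unrolled) model step under the current weights.
        theta_virtual = theta - inner_lr * sum(w * g for w, g in zip(weights, grads)) / n

        # Gradient of the meta loss with respect to theta_virtual.
        d_meta = sum(theta_virtual - m for m in meta_y) / len(meta_y)

        # Chain rule through the virtual step into the weight-net parameters.
        slopes = [w * (1.0 - w) for w in weights]  # derivative of sigmoid
        d_theta_da = -inner_lr * sum(s * l * g for s, l, g in zip(slopes, losses, grads)) / n
        d_theta_db = -inner_lr * sum(s * g for s, g in zip(slopes, grads)) / n
        a -= meta_lr * d_meta * d_theta_da
        b -= meta_lr * d_meta * d_theta_db

        # Actual model step with the refreshed weights.
        weights = [sigmoid(a * l + b) for l in losses]
        theta -= inner_lr * sum(w * g for w, g in zip(weights, grads)) / n
    return theta, a, b


# Eight "easy" examples (target 0.0), two meta-aligned ones (target 1.0).
train_targets = [0.0] * 8 + [1.0] * 2
meta_targets = [1.0, 1.0]
theta_1, a_1, b_1 = train_with_reweighting(train_targets, meta_targets, steps=1)
```

After the first meta step `a_1` is positive, meaning the weight net has started to upweight high-loss examples, which here are exactly the ones aligned with the meta set. `inner_lr` and `meta_lr` are the two rates to tune.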

Optimization Features

Training Optimization
bi-level data reweighting, preference optimization (DPO, KTO, ORPO), GRPO

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires a held-out meta set that closely matches test-time tasks; constructing it needs care.

Adds inference latency: the judge runs multi-round pairwise reasoning for Best-of-N selection.
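For the meta-set limitation above, one simple way to carve out a held-out meta set is to split by release date and difficulty, following the "recent or hardest problems" heuristic. The `Problem` fields and the cutoff rule are hypothetical; adapt them to whatever metadata your task pool actually carries.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date
    difficulty: str  # assumed labels: "easy", "medium", "hard"


def split_meta_set(problems: List[Problem], cutoff: date,
                   include_hard: bool = True) -> Tuple[List[Problem], List[Problem]]:
    """Return (train, meta): meta holds recent problems and, optionally, hard ones.

    The two lists are disjoint, so the meta set stays held out from training.
    """
    train: List[Problem] = []
    meta: List[Problem] = []
    for p in problems:
        is_recent = p.released >= cutoff
        is_hard = include_hard and p.difficulty == "hard"
        (meta if (is_recent or is_hard) else train).append(p)
    return train, meta


pool = [
    Problem("p1", date(2024, 3, 1), "easy"),
    Problem("p2", date(2024, 9, 1), "hard"),
    Problem("p3", date(2025, 6, 1), "medium"),
    Problem("p4", date(2025, 7, 1), "easy"),
]
train_set, meta_set = split_meta_set(pool, cutoff=date(2025, 1, 1))
```

Checking the size and difficulty mix of the resulting meta set before training helps guard against the overfitting risk of a set that is too small or unrepresentative.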

When Not To Use

When you cannot afford extra inference-time compute or latency for judge reasoning.

When no reliable meta-set can be built that matches target tasks.

Failure Modes

Judge may overfit to meta set if meta set is too small or unrepresentative.

Reweighting can upweight noisy but meta-correlated samples, amplifying dataset artifacts.

Core Entities

Models

Qwen3-Coder-30B-A3B, Qwen2.5-Coder-32B, DeepSeek V3.2 Speciale, o4-mini (high), Qwen2.5-Coder-14B (fine-tuned judge), Qwen3-1.7B (judge RL backbone)

Metrics

Accuracy

Datasets

LiveCodeBench, BigCodeBench

Benchmarks

LiveCodeBench, BigCodeBench