Overview
The method is practical: it uses standard tools (LoRA, Betty) and reports wins on public benchmarks, but it requires extra training and inference compute. Evidence comes from leaderboard comparisons and ablations on two benchmarks.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
DAJ improves the selection among sampled code solutions at inference time, raising accuracy, especially on hard cases, without changing the core model. That can cut debugging cost and improve the reliability of customer-facing code.
Summary TLDR
DAJ trains a reasoning-based LLM judge using bi-level data reweighting so the judge focuses on hard, in-distribution, and trajectory-aligned examples. The judge produces step-by-step code evaluations and is trained with verifiable (auto-checkable) rewards. On LiveCodeBench, DAJ reaches 84.7% pass@1, above the reported baselines, and improves hardest-case pass rates (73.8% vs. 67.2%). On BigCodeBench it matches or beats prior test-time scaling methods (35.9%). The approach is plug-in: keep your policy model, add DAJ as an inference-time judge, and gain selection quality without changing the base model.
Problem Statement
Training LLM judges for Best-of-N selection fails in practice because training data differ from test-time data in three ways: (1) easy problems dominate, (2) task distributions shift over time and by platform, and (3) training trajectories often come from cheaper models that behave differently from inference-time models. To address this, DAJ learns importance weights over training data (domain- or instance-level) via a bi-level optimization that optimizes judge generalization on a held-out meta set aligned to the test distribution.
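The bi-level setup described above can be written compactly. This is a generic data-reweighting formulation, not necessarily the paper's exact objective; the symbols $w$ (data weights), $\theta$ (judge parameters), $\ell_i$ (per-example training loss), and $\mathcal{L}_{\text{meta}}$ (loss on the held-out meta set) are notation assumed here for illustration.

```latex
\begin{aligned}
\min_{w} \quad & \mathcal{L}_{\text{meta}}\bigl(\theta^{*}(w)\bigr) \\
\text{s.t.} \quad & \theta^{*}(w) = \arg\min_{\theta} \sum_{i} w_i \, \ell_i(\theta)
\end{aligned}
```

The inner problem trains the judge on reweighted data; the outer problem adjusts the weights so that the trained judge does well on the meta set. With domain-level weights, all examples from the same domain share one $w_i$.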
Main Contribution
A bi-level data-reweighting framework that learns domain- or instance-level weights to train an LLM judge focused on hard, in-distribution, and trajectory-aligned examples.
A reasoning-based judge that generates step-by-step verification before selecting candidates, trained with verifiable rewards (no human reasoning labels required).
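A minimal sketch of what a verifiable reward for a pairwise judge could look like, assuming each training pair has exactly one candidate that passes the unit tests. The verdict format (`Final verdict: A`), `parse_verdict`, and `verifiable_reward` are hypothetical names for illustration; the source does not specify how the judge states its choice or how the reward is shaped.

```python
import re

# Hypothetical verdict format; the source does not specify one.
VERDICT_PATTERN = re.compile(r"Final verdict:\s*([AB])\b")


def parse_verdict(judge_output: str):
    """Return 'A' or 'B' from the judge's reasoning text, or None if absent.

    If the judge states several verdicts, the last one is used.
    """
    matches = VERDICT_PATTERN.findall(judge_output)
    return matches[-1] if matches else None


def verifiable_reward(judge_output: str, a_passes: bool, b_passes: bool) -> float:
    """Reward 1.0 when the judge picks the candidate that passes the tests.

    Assumes exactly one candidate in the pair passes, so the correct
    answer can be checked automatically without human reasoning labels.
    """
    if a_passes == b_passes:
        raise ValueError("pair must contain exactly one passing candidate")
    correct = "A" if a_passes else "B"
    return 1.0 if parse_verdict(judge_output) == correct else 0.0
```

Only the final choice is checked against test outcomes here; the step-by-step reasoning itself is not scored, which is why no reasoning annotations are needed.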
Key Findings
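DAJ improves overall pass@1 on LiveCodeBench to 84.7%, up from 77.1% for o4-mini (high), a gain of 7.6 points.
DAJ lifts hard-problem pass@1 on LiveCodeBench to 73.8%, up from 67.2% for the best baseline, a gain of 6.6 points.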
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| LiveCodeBench overall pass@1 | 84.7% | o4-mini (high) 77.1% | +7.6 pts | LiveCodeBench (overall) | Table 2; Section 5.2 | Table 2 |
| LiveCodeBench hard split pass@1 | 73.8% | Best baseline 67.2% | +6.6 pts | LiveCodeBench (hard) | Table 2; Section 5.2 | Table 2 |
What To Try In 7 Days
Build a small meta set aligned to your target tasks (recent or hardest problems) and hold it out for validation.
Fine-tune a lightweight reasoning judge with LoRA and pairwise prompts that produce step-by-step analysis.
Implement instance-net reweighting: train an MLP that predicts sample weights from loss and tune meta-learning rates via one-step unrolling.
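To make the one-step unrolling idea concrete, here is a toy sketch in plain Python. It swaps in simpler pieces than the steps above describe: domain-level weights instead of an instance-net MLP, a scalar model with loss `0.5 * (theta * x - y) ** 2` instead of an LLM judge, and hand-derived gradients instead of Betty. All names (`grad`, `reweight_step`, the `easy`/`hard` domains, the learning rates) are assumptions made for illustration.

```python
def grad(theta, x, y):
    """Gradient of the toy loss 0.5 * (theta * x - y) ** 2 with respect to theta."""
    return (theta * x - y) * x


def reweight_step(theta, weights, train, meta, lr=0.05, meta_lr=0.5):
    """One inner SGD step on weighted training loss, then one meta step on weights."""
    n = len(train)
    grads = [(domain, grad(theta, x, y)) for domain, x, y in train]

    # Inner step: theta' = theta - lr * gradient of the weighted training loss.
    theta_new = theta - lr * sum(weights[d] * g for d, g in grads) / n

    # Outer step: meta-loss gradient at theta', chained through the inner step.
    meta_grad = sum(grad(theta_new, x, y) for x, y in meta) / len(meta)
    updated = {}
    for d, w in weights.items():
        # d(theta') / d(w_d) for the single unrolled step.
        dtheta_dw = -lr * sum(g for dd, g in grads if dd == d) / n
        updated[d] = max(0.0, w - meta_lr * meta_grad * dtheta_dw)

    total = sum(updated.values())
    if total == 0.0:
        raise ValueError("all domain weights collapsed to zero")
    return theta_new, {d: w / total for d, w in updated.items()}


# Toy data: the meta set follows the "hard" relationship (y = 3x), not the "easy" one (y = x).
train = [("easy", 1.0, 1.0), ("easy", 2.0, 2.0), ("hard", 1.0, 3.0), ("hard", 2.0, 6.0)]
meta = [(1.0, 3.0), (2.0, 6.0)]

theta, weights = 0.0, {"easy": 0.5, "hard": 0.5}
for _ in range(50):
    theta, weights = reweight_step(theta, weights, train, meta)
```

Because the meta set matches the hard domain, the meta-gradient pushes weight toward `hard` and away from `easy`. In a real setup the same chain rule would be handled by automatic differentiation rather than written out by hand.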
Risks & Boundaries
Limitations
Requires a held-out meta set that closely matches test-time tasks; constructing it needs care.
Adds inference latency: the judge runs multi-round pairwise reasoning for Best-of-N selection.
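The latency cost can be estimated with a small sketch of pairwise Best-of-N selection. The knockout-tournament structure, `tournament_select`, and `toy_judge` are assumptions for illustration; the source only states that selection uses multi-round pairwise reasoning, not how rounds are organized.

```python
def tournament_select(candidates, judge):
    """Pick one candidate via knockout rounds of pairwise judge comparisons.

    Returns (winner, rounds, judge_calls). `judge(a, b)` must return a or b.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    rounds = calls = 0
    while len(pool) > 1:
        # Compare neighbours pairwise; an unpaired last candidate advances for free.
        winners = [judge(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        calls += len(winners)
        if len(pool) % 2 == 1:
            winners.append(pool[-1])
        pool = winners
        rounds += 1
    return pool[0], rounds, calls


# Toy judge: prefers the candidate with the higher hidden quality score.
quality = {f"cand_{i}": score for i, score in enumerate([3, 7, 1, 9, 4, 2, 8, 5])}


def toy_judge(a, b):
    return a if quality[a] >= quality[b] else b


best, rounds, calls = tournament_select(list(quality), toy_judge)
```

Under this structure, N candidates need N - 1 judge calls in total. Comparisons within a round are independent and could run in parallel, so wall-clock latency scales with the number of rounds while compute scales with the number of calls, and each call includes a full step-by-step reasoning generation.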
When Not To Use
When you cannot afford extra inference-time compute or latency for judge reasoning.
When no reliable meta-set can be built that matches target tasks.
Failure Modes
Judge may overfit to meta set if meta set is too small or unrepresentative.
Reweighting can upweight noisy but meta-correlated samples, amplifying dataset artifacts.

