Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
DAJ improves which sampled code solution is selected at inference time, raising accuracy, especially on hard cases, without changing the underlying model. That can cut debugging cost and improve the reliability of customer-facing code.
Summary TLDR
DAJ trains a reasoning-based LLM judge using bi-level data reweighting so the judge focuses on hard, in-distribution, and trajectory-aligned examples. The judge produces step-by-step code evaluations and is trained with verifiable (auto-checkable) rewards. On LiveCodeBench, DAJ reaches 84.7% pass@1 (compared with reported baselines) and improves the hardest-case pass rate (73.8% vs. 67.2%). On BigCodeBench it matches or beats prior test-time scaling methods (35.9%). The approach is plug-in: keep your policy model, add DAJ as an inference-time judge, and gain selection quality without changing the base model.
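To make "verifiable (auto-checkable) rewards" concrete, here is a minimal Python sketch. It assumes a binary reward that is 1.0 when the candidate chosen by the judge passes all unit tests and 0.0 otherwise; the summary does not state the exact reward, and the names `passes_all` and `verifiable_reward` are hypothetical.

```python
from typing import Any, Callable, Sequence, Tuple

# A test case is (positional arguments, expected return value).
TestCase = Tuple[tuple, Any]


def passes_all(candidate: Callable[..., Any], tests: Sequence[TestCase]) -> bool:
    """Return True only if the candidate returns the expected value on every test."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crashing candidate counts as a failure.
            return False
    return True


def verifiable_reward(
    judge_choice: int,
    candidates: Sequence[Callable[..., Any]],
    tests: Sequence[TestCase],
) -> float:
    """Assumed binary reward: 1.0 if the judge's pick passes the tests, else 0.0."""
    return 1.0 if passes_all(candidates[judge_choice], tests) else 0.0


# Toy pair of candidate solutions for "add two numbers".
candidates = [lambda a, b: a - b, lambda a, b: a + b]
tests = [((1, 2), 3), ((0, 0), 0)]

reward_if_judge_picks_first = verifiable_reward(0, candidates, tests)
reward_if_judge_picks_second = verifiable_reward(1, candidates, tests)
```

Because the reward is computed by running tests, no human-written reasoning labels are needed to score the judge's decision.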
Problem Statement
Training LLM judges for Best-of-N selection fails in practice because training data differ from test-time data in three ways: (1) easy problems dominate, (2) task distributions shift over time and by platform, and (3) training trajectories often come from cheaper models that behave differently from inference-time models. To close these gaps, DAJ learns importance weights over training data (domain- or instance-level) via a bi-level optimization that optimizes judge generalization on a held-out meta set aligned to the test distribution.
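One common way to write a bi-level reweighting objective of this kind is sketched below. The notation is an assumption made for illustration, not taken from the paper: $\theta$ denotes the judge parameters, $\phi$ the parameters that produce the weights $w_\phi(i)$, $\ell_i$ the training loss on example $i$, and $\eta$ the inner learning rate.

```latex
% Outer problem: choose weights so the trained judge does well on the meta set.
\min_{\phi} \; \mathcal{L}_{\text{meta}}\bigl(\theta^{*}(\phi)\bigr)
\quad \text{s.t.} \quad
\theta^{*}(\phi) = \arg\min_{\theta} \sum_{i \in \mathcal{D}_{\text{train}}} w_{\phi}(i)\, \ell_i(\theta)

% One-step unrolled approximation of the inner solution.
\theta'(\phi) = \theta - \eta \, \nabla_{\theta} \sum_{i \in \mathcal{D}_{\text{train}}} w_{\phi}(i)\, \ell_i(\theta)

% Resulting hypergradient used to update the weights.
\nabla_{\phi} \mathcal{L}_{\text{meta}}\bigl(\theta'(\phi)\bigr)
= - \eta \sum_{i \in \mathcal{D}_{\text{train}}}
\bigl( \nabla_{\theta} \mathcal{L}_{\text{meta}}(\theta')^{\top} \nabla_{\theta} \ell_i(\theta) \bigr)\,
\nabla_{\phi} w_{\phi}(i)
```

Under this reading, an example is upweighted when its training gradient points in the same direction as the meta-set gradient, which is how hard, in-distribution examples can come to dominate training.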
Main Contribution
- A bi-level data-reweighting framework that learns domain- or instance-level weights to train an LLM judge focused on hard, in-distribution, and trajectory-aligned examples.
- A reasoning-based judge that generates step-by-step verification before selecting candidates, trained with verifiable rewards (no human reasoning labels required).
- Comprehensive experiments showing DAJ improves Best-of-N selection on LiveCodeBench and BigCodeBench and across multiple policy models.
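The sketch below illustrates how a pairwise judge can drive Best-of-N selection. It assumes a single-elimination (knockout) schedule, which is one plausible arrangement and not necessarily the one DAJ uses. The `Judge` signature and `toy_judge` are hypothetical stand-ins for a call to the trained reasoning judge.

```python
from typing import Callable, List, Sequence

# Hypothetical judge interface: returns 0 if the first candidate is preferred, 1 otherwise.
Judge = Callable[[str, str, str], int]


def best_of_n(problem: str, candidates: Sequence[str], judge: Judge) -> str:
    """Pick one candidate via rounds of pairwise comparisons (knockout bracket)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool: List[str] = list(candidates)
    while len(pool) > 1:
        next_round: List[str] = []
        # Compare neighbours pairwise; the preferred one advances.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if judge(problem, a, b) == 0 else b)
        # With an odd pool size, the last candidate advances without a comparison.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


def toy_judge(problem: str, a: str, b: str) -> int:
    """Stand-in for the LLM judge: prefers a candidate that calls sorted()."""
    return 0 if "sorted(" in a else 1


selected = best_of_n(
    "Return the list in ascending order.",
    ["return xs", "return sorted(xs)", "return xs[::-1]"],
    toy_judge,
)
```

A knockout bracket needs N - 1 judge calls for N candidates, which is one reason judge latency matters at inference time.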
Key Findings
- DAJ improves overall pass@1 on LiveCodeBench to 84.7%.
- DAJ lifts pass rates on the hardest LiveCodeBench problems (73.8% vs. 67.2%).
- DAJ matches or edges out prior methods on BigCodeBench (35.9%).
- All reweighting strategies help; the instance-net variant performs best in the reported experiments.
Results
LiveCodeBench overall pass@1
84.7%
LiveCodeBench hard split pass@1
73.8% (vs. 67.2%)
BigCodeBench overall pass@1
35.9%
Average across policy models (pass@1)
Who Should Care
What To Try In 7 Days
- Build a small meta set aligned to your target tasks (recent or hardest problems) and hold it out for validation.
- Fine-tune a lightweight reasoning judge with LoRA and pairwise prompts that produce step-by-step analysis.
- Implement instance-net reweighting: train an MLP that predicts sample weights from per-sample loss, update it by one-step unrolling against the meta set, and tune the meta-learning rate.
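The reweighting step can be prototyped on a toy problem before touching an LLM. The sketch below swaps in the simpler domain-level variant (one learnable logit per domain) instead of an instance-net MLP, and a scalar linear model instead of a judge. All names and hyperparameters are hypothetical. What it does show is the one-step unrolling: take a virtual weighted update, measure the meta-set gradient at the updated parameter, and push that signal back to the weights through the chain rule.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def grad_sq_loss(w: float, x: float, y: float) -> float:
    """Derivative of (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def train_domain_weights(train, meta, steps=300, lr=0.05, meta_lr=0.5):
    """train: list of (domain, x, y); meta: list of (x, y). Returns (w, weights)."""
    w = 0.0
    logits = {domain: 0.0 for domain, _, _ in train}  # one logit per domain
    n = len(train)
    for _ in range(steps):
        weights = {d: sigmoid(z) for d, z in logits.items()}
        grads = [(d, grad_sq_loss(w, x, y)) for d, x, y in train]
        # Inner step: one weighted gradient update of the model parameter.
        w_next = w - lr * sum(weights[d] * g for d, g in grads) / n
        # Outer signal: gradient of the meta loss at the updated parameter.
        meta_grad = sum(grad_sq_loss(w_next, x, y) for x, y in meta) / len(meta)
        # Chain rule through the inner step: d(meta loss)/d(logit of domain d).
        for d in logits:
            a = weights[d]
            domain_grad = sum(g for dd, g in grads if dd == d)
            dw_dlogit = -lr * a * (1.0 - a) * domain_grad / n
            logits[d] -= meta_lr * meta_grad * dw_dlogit
        w = w_next
    return w, {d: sigmoid(z) for d, z in logits.items()}


# "easy" data follows y = x, "hard" data follows y = 3x; the meta set matches "hard".
train = [("easy", 1.0, 1.0), ("easy", 2.0, 2.0), ("hard", 1.0, 3.0), ("hard", 2.0, 6.0)]
meta = [(1.5, 4.5)]
w, weights = train_domain_weights(train, meta)
```

In this toy setup the domain whose gradients agree with the meta set ends up with the larger weight. An instance-net version would replace the per-domain logits with a small network that maps each sample's loss to a weight.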
Optimization Features
Training Optimization
- bi-level data reweighting
- preference optimization (DPO, KTO, ORPO)
- GRPO
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires a held-out meta set that closely matches test-time tasks; constructing it needs care.
- Adds inference latency: the judge runs multi-round pairwise reasoning for Best-of-N selection.
- Effectiveness depends on the quality of candidate-generation policies; poor base models limit selection gains.
When Not To Use
- When you cannot afford extra inference-time compute or latency for judge reasoning.
- When no reliable meta set can be built that matches target tasks.
- If candidate generators are extremely weak and rarely produce correct solutions.
Failure Modes
- The judge may overfit to the meta set if it is too small or unrepresentative.
- Reweighting can upweight noisy but meta-correlated samples, amplifying dataset artifacts.
- Performance gains shrink for policy models that rarely produce correct candidates (low ceiling).
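The low-ceiling failure mode can be checked before investing in a judge. The sketch below uses made-up pass/fail data and hypothetical function names to compare the accuracy of the selected candidates with the selection ceiling, meaning the fraction of problems for which at least one sampled candidate is correct. No judge can exceed that ceiling.

```python
from typing import List, Sequence


def selection_ceiling(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems with at least one passing candidate."""
    return sum(any(r) for r in results) / len(results)


def selected_accuracy(results: Sequence[Sequence[bool]], chosen: Sequence[int]) -> float:
    """Fraction of problems where the chosen candidate passes (pass@1 after selection)."""
    return sum(r[c] for r, c in zip(results, chosen)) / len(results)


# Made-up data: 4 problems, 2 candidates each; True means the candidate passes its tests.
results: List[List[bool]] = [
    [True, False],
    [False, False],  # no correct candidate: selection cannot help here
    [False, True],
    [True, True],
]
chosen = [0, 0, 0, 1]  # index picked by a hypothetical judge for each problem

ceiling = selection_ceiling(results)
accuracy = selected_accuracy(results, chosen)
headroom = ceiling - accuracy  # the most a better judge could still gain
```

If the ceiling is already close to the accuracy of naive selection, improving the candidate generator is likely to pay off more than improving the judge.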
Core Entities
Models
- Qwen3-Coder-30B-A3B
- Qwen2.5-Coder-32B
- DeepSeek V3.2 Speciale
- o4-mini (high)
- Qwen2.5-Coder-14B (fine-tuned judge)
- Qwen3-1.7B (judge RL backbone)
Metrics
- Accuracy
Datasets
- LiveCodeBench
- BigCodeBench
Benchmarks
- LiveCodeBench
- BigCodeBench

