Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
MCTS-Judge can substantially improve automated code-evaluation accuracy while using smaller models and fewer reasoning tokens. That lowers evaluation cost, reduces dependence on handcrafted tests, and gives CI/QA workflows richer explanations for each verdict.
Summary TLDR
This paper introduces MCTS-Judge: a practical framework that runs a Monte Carlo Tree Search (MCTS) around an LLM at test time to improve automated code correctness judgments. It adds a global-local node selection (UCT + LLM self-assessment) and a fully LLM-driven simulated execution reward (auto-generated test cases + LLM-as-interpreter). Across three benchmarks and five base models, MCTS-Judge boosts accuracy (e.g., DeepSeek-Coder-V2-16B: APPS 41.0% → 80.0%), works with smaller models and fewer tokens than some large reasoning models, and benefits from increasing tree depth, rollouts, and test-case sampling.
Problem Statement
LLM-as-a-Judge is cheap but unreliable for reasoning-heavy tasks like code evaluation. The paper asks: can we improve judge reliability by adding test-time compute (System-2 style reasoning) instead of larger models or retraining?
Main Contribution
MCTS-Judge: wrap an LLM with Monte Carlo Tree Search to decompose code-evaluation into multiple subtasks and build multi-step reasoning trajectories.
Global-local node selection: combine UCT (Upper Confidence Bound for Trees) with LLM-driven self-assessment to balance global exploration and local trajectory refinement.
Simulated execution reward: generate test cases with GPT-4o, then run masked input-output simulations where the LLM acts as an interpreter; reward trajectories only if simulated outputs match expected outputs.
Extensive evaluation on APPS, HumanEval-X, and BigCodeBench across five LLMs showing large, consistent gains and better robustness without reference code.
Empirical evidence of a test-time scaling law: increasing rollouts, tree depth, and test-case sampling improves accuracy.
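The global-local selection idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Node` class, the `self_assess` callback, and the additive `local_weight` blending are assumptions made for this example, since this summary does not state how MCTS-Judge combines the two signals. Only the UCT term follows the standard formula.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """Hypothetical search-tree node holding visit and reward statistics."""
    name: str
    visits: int = 0
    total_reward: float = 0.0
    children: List["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float = 1.41) -> float:
    """Standard UCT: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try unvisited children first
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return exploit + explore


def select_child(
    parent: Node,
    self_assess: Optional[Callable[[Node], float]] = None,
    local_weight: float = 0.0,
) -> Node:
    """Pick the child with the best global (UCT) plus local (self-assessed) score.

    `self_assess` stands in for an LLM call that rates a node in [0, 1];
    the additive blend is an assumption of this sketch.
    """
    def score(child: Node) -> float:
        s = uct_score(child, parent.visits)
        if self_assess is not None and s != math.inf:
            s += local_weight * self_assess(child)
        return s

    return max(parent.children, key=score)
```

With `local_weight=0.0` the function reduces to plain UCT, which makes it easy to compare search behavior with and without the local signal.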
Key Findings
MCTS-Judge raised DeepSeek-Coder-V2-16B-Instruct accuracy on APPS from 41.0% to 80.0%.
Average accuracy improved by 14.34 percentage points across five base models and three benchmarks when using MCTS-Judge.
MCTS-Judge is more token-efficient: with DeepSeek-Coder-V2-16B-Instruct it used 2,065 reasoning tokens, versus 5,631 for o1-preview (~300B), while achieving higher APPS accuracy (80% vs 75%).
The simulated execution reward outperformed self-consistency and self-evaluation rewards by about 13 percentage points on APPS in ablations.
MCTS-Judge degrades less when reference code is absent (e.g., a 6% accuracy drop on BigCodeBench), whereas some baselines drop much more.
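To put the headline numbers in proportion, the arithmetic can be reproduced directly. The figures below are copied from the findings above; the helper names are illustrative only.

```python
def percentage_point_gain(before: float, after: float) -> float:
    """Absolute accuracy change in percentage points."""
    return after - before


def token_ratio(ours: int, theirs: int) -> float:
    """Fraction of the comparison model's reasoning tokens that we used."""
    return ours / theirs


# APPS accuracy for DeepSeek-Coder-V2-16B-Instruct, without and with MCTS-Judge.
gain = percentage_point_gain(41.0, 80.0)   # 39.0 percentage points

# Reasoning tokens: MCTS-Judge with DeepSeek (2,065) vs o1-preview (5,631).
ratio = token_ratio(2065, 5631)            # about 0.37, i.e. roughly 63% fewer tokens
```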
Results
Accuracy
Reasoning tokens (APPS, DeepSeek w/ MCTS-Judge)
Robustness without reference code (BigCodeBench)
Who Should Care
What To Try In 7 Days
Wrap your current code-evaluation LLM with an MCTS controller and test on a small APPS/HumanEval-X subset.
Replace your vote-based judge with a simulated-execution reward: auto-generate 3 unit test cases and have the LLM act as an interpreter.
Compare token usage and accuracy vs your current judge to estimate cost trade-offs (measure reasoning tokens).
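The simulated-execution reward from the second step can be prototyped before any LLM is wired in. The sketch below assumes an all-or-nothing reward over string-compared outputs, which matches the description in this summary but may differ from the paper in detail. `Interpreter` and `stub_interpreter` are hypothetical names; the stub really executes the code so the example runs offline, whereas in MCTS-Judge the interpreter role is played by an LLM.

```python
from typing import Callable, Sequence, Tuple

TestCase = Tuple[str, str]               # (input, expected output)
Interpreter = Callable[[str, str], str]  # (code, input) -> predicted output


def simulated_execution_reward(
    code: str, tests: Sequence[TestCase], interpret: Interpreter
) -> float:
    """Return 1.0 only if every simulated output matches its expected output."""
    if not tests:
        raise ValueError("at least one test case is required")
    for test_input, expected in tests:
        predicted = interpret(code, test_input)
        if predicted.strip() != expected.strip():
            return 0.0
    return 1.0


def stub_interpreter(code: str, test_input: str) -> str:
    """Demo stand-in for the LLM-as-interpreter call.

    Assumes `code` defines solve(x) for an integer x. It uses exec, so only
    pass trusted demo code; replace this function with a model call in practice.
    """
    namespace: dict = {}
    exec(code, namespace)
    return str(namespace["solve"](int(test_input)))


# Three test cases, mirroring the "auto-generate 3 unit test cases" suggestion.
tests = [("1", "2"), ("2", "4"), ("5", "10")]
candidate = "def solve(x):\n    return x * 2\n"
reward = simulated_execution_reward(candidate, tests, stub_interpreter)
```

Keeping the interpreter behind a plain callable makes it straightforward to swap the stub for a model call and to compare the simulated verdicts against real sandboxed execution on the same test cases.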
Agent Features
Memory
- Trajectory history (actions and outcomes) used for self-assessment
Planning
- Monte Carlo Tree Search planning
- Global-local node selection (UCT + LLM self-assessment)
Tool Use
- LLM used as evaluator and interpreter
- GPT-4o for test-case generation
Frameworks
- UCT
- Simulated execution reward
Is Agentic
true
Architectures
- MCTS + LLM
- UCT-guided search
Optimization Features
Token Efficiency
- Accuracy
Infra Optimization
- Experiments run on a single H100 80GB GPU; hyperparameters tuned for this environment
System Optimization
- LoRA
Inference Optimization
- Test-time compute scaling via MCTS (tree depth, rollouts, test-case sampling)
- Weighted sampling of trajectories to focus compute on high-reward paths
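One simple way to realize reward-weighted sampling is shown below. It is a sketch under assumptions: the summary does not describe the paper's sampling scheme, so drawing trajectories in proportion to their reward (with a uniform fallback when all rewards are zero) is a choice made for this example, and `sample_trajectories` is a hypothetical helper.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_trajectories(
    trajectories: Sequence[T],
    rewards: Sequence[float],
    k: int,
    seed: Optional[int] = None,
) -> List[T]:
    """Draw k trajectories with replacement, proportionally to reward."""
    if len(trajectories) != len(rewards):
        raise ValueError("trajectories and rewards must have the same length")
    rng = random.Random(seed)
    # Negative rewards are clipped so they never receive probability mass.
    weights: Optional[List[float]] = [max(r, 0.0) for r in rewards]
    if sum(weights) == 0:
        weights = None  # fall back to uniform sampling
    return rng.choices(trajectories, weights=weights, k=k)
```

For example, `sample_trajectories(["t1", "t2"], [0.0, 1.0], k=3)` can only return `"t2"`, because the zero-reward trajectory carries no weight.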
Reproducibility
Data Urls
- APPS (public)
- HumanEval-X (public)
- BigCodeBench (public)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Requires reliable test-case generation (paper uses GPT-4o); weak test cases limit reward fidelity.
- Simulated execution relies on the LLM-as-interpreter and may inherit hallucination risks.
- Adds test-time compute and latency compared to single-shot prompts.
- Performance is sensitive to hyperparameters (tree depth, rollouts, test counts).
When Not To Use
- When you can run real, sandboxed code execution and prefer ground-truth execution.
- In hard real-time systems where added inference latency is unacceptable.
- If you cannot afford a reliable test-case generator or are restricted from calling external LLMs.
Failure Modes
- False positives if generated test cases miss corner cases.
- LLM hallucinations during simulated execution lead to incorrect rewards.
- Overfitting judge behavior to the simulated-execution reward design.
- High variance from insufficient rollouts or too-shallow trees.
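The variance point can be made concrete with a back-of-the-envelope estimate. As a deliberate simplification, the sketch treats each rollout's reward as an independent Bernoulli draw with success probability `p`. Real MCTS rollouts are correlated through the shared tree, so this is only a rough guide to how the noise shrinks as rollouts are added.

```python
import math


def reward_standard_error(p: float, n_rollouts: int) -> float:
    """Standard error of the mean of n independent Bernoulli(p) rewards."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n_rollouts < 1:
        raise ValueError("n_rollouts must be at least 1")
    return math.sqrt(p * (1.0 - p) / n_rollouts)


# Quadrupling the rollouts halves the noise in the estimated reward.
se_few = reward_standard_error(0.5, 4)    # 0.25
se_many = reward_standard_error(0.5, 16)  # 0.125
```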
Core Entities
Models
- DeepSeek-Coder-V2-16B-Instruct
- Qwen-QwQ-32B
- GPT-o1-preview
- GPT-o1-mini
- Qwen2.5-Coder-14B
- Mistralai-Codestral-22B
- Llama-3.1-8B-Instruct
- GPT-4o-mini
- GPT-4o (used for test-case generation)
Metrics
- Accuracy
- Reasoning tokens (# tokens)
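Both metrics are easy to compute side by side when comparing judges. The sketch below assumes each judge run yields a boolean verdict plus a reasoning-token count; `JudgeResult` and `summarize` are hypothetical names for illustration, and how tokens are counted depends on the model API in use.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class JudgeResult:
    """One judged sample: predicted correctness and tokens spent reasoning."""
    verdict: bool
    reasoning_tokens: int


def summarize(results: Sequence[JudgeResult], labels: Sequence[bool]) -> Dict[str, float]:
    """Compute accuracy against ground-truth labels and mean reasoning tokens."""
    if not results or len(results) != len(labels):
        raise ValueError("results and labels must be non-empty and equal length")
    correct = sum(r.verdict == y for r, y in zip(results, labels))
    return {
        "accuracy": correct / len(labels),
        "mean_reasoning_tokens": sum(r.reasoning_tokens for r in results) / len(results),
    }
```

Running `summarize` once per judge on the same labeled subset gives a like-for-like view of the accuracy versus token-cost trade-off.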
Datasets
- APPS
- HumanEval-X
- BigCodeBench
Benchmarks
- APPS
- HumanEval-X
- BigCodeBench

