MCTS-Judge: Use Monte Carlo Tree Search at test time to double LLM judge accuracy on code tasks

February 18, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The method is practical: it converts inference compute into better judgments via MCTS and simulated tests. Ablations show that the reward design and node selection are the key drivers of the gains.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 70%

Authors

Yutong Wang, Pengliang Ji, Chaoqun Yang, Kaixin Li, Ming Hu, Jiaoyang Li, Guillaume Sartoretti

Links

Abstract / PDF / Data

Why It Matters For Business

MCTS-Judge can greatly improve automated code-evaluation accuracy using smaller models and fewer tokens, cutting evaluation cost and reducing dependence on handcrafted tests while providing richer explanations for CI/QA workflows.

Summary TLDR

This paper introduces MCTS-Judge: a practical framework that runs a Monte Carlo Tree Search (MCTS) around an LLM at test time to improve automated code correctness judgments. It adds a global-local node selection (UCT + LLM self-assessment) and a fully LLM-driven simulated execution reward (auto-generated test cases + LLM-as-interpreter). Across three benchmarks and five base models, MCTS-Judge boosts accuracy (e.g., DeepSeek-Coder-V2-16B: APPS 41.0% → 80.0%), works with smaller models and fewer tokens than some large reasoning models, and benefits from increasing tree depth, rollouts, and test-case sampling.

Problem Statement

LLM-as-a-Judge is cheap but unreliable for reasoning-heavy tasks like code evaluation. The paper asks: can we improve judge reliability by adding test-time compute (System-2 style reasoning) instead of larger models or retraining?

Main Contribution

MCTS-Judge: wrap an LLM with Monte Carlo Tree Search to decompose code-evaluation into multiple subtasks and build multi-step reasoning trajectories.

Global-local node selection: combine UCT (Upper Confidence Bound for Trees) with LLM-driven self-assessment to balance global exploration and local trajectory refinement.
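The global-local idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the global term is the standard UCT score (mean reward plus `c * sqrt(ln(N) / n)`), while the linear blend with a local score, the `weight` parameter, the `Node` class, and the `self_assess` callable (a stand-in for an LLM call returning a score in [0, 1]) are all assumptions made for the example.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One reasoning step (subtask) in the search tree. Hypothetical structure."""
    name: str
    visits: int = 0
    total_reward: float = 0.0
    children: List["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are tried first
    mean = child.total_reward / child.visits
    bonus = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return mean + bonus


def select_child(
    parent: Node,
    self_assess: Callable[[Node], float],
    weight: float = 0.5,
    c: float = 1.4,
) -> Node:
    """Pick the child maximizing UCT + weight * local self-assessment.

    The additive blend is an assumption for illustration only.
    """
    if not parent.children:
        raise ValueError("parent has no children to select from")

    def key(child: Node):
        local = self_assess(child)
        # The second tuple element breaks ties (e.g. among unvisited children).
        return (uct_score(child, parent.visits, c) + weight * local, local)

    return max(parent.children, key=key)
```

With `weight=0.0` the selection reduces to plain UCT; raising `weight` lets the local self-assessment override a child that merely has a higher average reward so far.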

Key Findings

MCTS-Judge raised DeepSeek-Coder-V2-16B-Instruct accuracy on APPS from 41.0% to 80.0%.

Numbers: APPS 41.0% → 80.0%

Practical Use: You can dramatically improve code-judgment accuracy by adding MCTS-based test-time compute to an existing LLM instead of retraining a bigger model.

Evidence Ref: Table 1, Sec. 4.2

Average accuracy improved by 14.34 percentage points across five base models and three benchmarks when using MCTS-Judge.

Numbers: Average improvement +14.34 pp

Practical Use: The approach is model-agnostic, so you can try it as a plug-in judge for different LLMs to get consistent gains.

Evidence Ref: Sec. 4.2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 80.0% | 41.0% (Vanilla) | +39.0 pp | APPS | MCTS-Judge raises DeepSeek-Coder 41.0% → 80.0% on APPS | Table 1, Sec. 4.2 |
| Accuracy | 14.34 pp | Base models without MCTS-Judge | +14.34 pp | Average across 5 LLMs on 3 benchmarks | Average improvement quoted in paper | Sec. 4.2 |
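The deltas above are absolute differences in percentage points (pp), which is not the same as relative improvement. A small sketch of the arithmetic for the APPS row, using only the two numbers reported there (the function names are illustrative):

```python
def pp_delta(baseline_pct: float, value_pct: float) -> float:
    """Absolute change in percentage points."""
    return value_pct - baseline_pct


def relative_gain_pct(baseline_pct: float, value_pct: float) -> float:
    """Change relative to the baseline, expressed as a percentage."""
    return (value_pct - baseline_pct) / baseline_pct * 100.0


# APPS row: 41.0% -> 80.0%
print(pp_delta(41.0, 80.0))                     # 39.0 pp
print(round(relative_gain_pct(41.0, 80.0), 1))  # 95.1 (% relative gain)
```

A +39.0 pp change from a 41.0% baseline is roughly a 95% relative gain, which is where the "nearly double" reading of the APPS result comes from.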

What To Try In 7 Days

Wrap your current code-evaluation LLM with an MCTS controller and test on a small APPS/HumanEval-X subset.

Replace your vote-based judge with a simulated-execution reward: auto-generate 3 unit test cases and have the LLM act as an interpreter.

Compare token usage and accuracy vs your current judge to estimate cost trade-offs (measure reasoning tokens).
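As a starting point for the second item, here is a minimal sketch of a simulated-execution reward. It assumes a hypothetical `interpret(code, test_input)` callable that wraps an LLM prompted to act as an interpreter and returns the predicted output as text. Scoring by the fraction of matching test cases is an assumption for illustration; the paper's exact reward rule may differ.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[str, str]  # (input text, expected output text)


def simulated_execution_reward(
    code: str,
    tests: List[TestCase],
    interpret: Callable[[str, str], str],
) -> float:
    """Fraction of test cases whose simulated output matches the expected output."""
    if not tests:
        return 0.0  # no evidence either way
    passed = sum(
        1
        for test_input, expected in tests
        if interpret(code, test_input).strip() == expected.strip()
    )
    return passed / len(tests)


# Toy stand-in for the LLM interpreter: pretends the code doubles its input.
def toy_interpret(code: str, test_input: str) -> str:
    return str(int(test_input) * 2)


tests = [("1", "2"), ("2", "4"), ("3", "7")]  # the last case is expected to fail
reward = simulated_execution_reward("def f(x): return 2 * x", tests, toy_interpret)
```

In a real trial, `toy_interpret` would be replaced by a call to your judge model, and `tests` by the auto-generated cases.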

Agent Features

Memory
Trajectory history (actions and outcomes) used for self-assessment
Planning
Monte Carlo Tree Search planning; Global-local node selection (UCT + LLM self-assessment)
Tool Use
LLM used as evaluator and interpreter; GPT-4o for test-case generation
Frameworks
UCT; Simulated execution reward
Is Agentic

Yes

Architectures
MCTS + LLM; UCT-guided search

Optimization Features

Token Efficiency
Accuracy
Infra Optimization
Experiments run on a single H100 80GB GPU; hyperparameters tuned for this environment
System Optimization
LoRA
Inference Optimization
Test-time compute scaling via MCTS (tree depth, rollouts, test-case sampling); Weighted sampling of trajectories to focus compute on high-reward paths
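One way to picture reward-weighted trajectory sampling is a softmax over per-trajectory rewards. The sketch below is an assumption about how such weighting could work, not the paper's procedure; `temperature`, `seed`, and the function names are illustrative.

```python
import math
import random
from typing import List, Optional, Sequence


def softmax_weights(rewards: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Turn rewards into sampling probabilities; lower temperature is greedier."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def sample_trajectories(
    trajectories: Sequence[str],
    rewards: Sequence[float],
    k: int,
    temperature: float = 1.0,
    seed: Optional[int] = None,
) -> List[str]:
    """Draw k trajectories (with replacement), favoring high-reward ones."""
    rng = random.Random(seed)
    weights = softmax_weights(rewards, temperature)
    return rng.choices(list(trajectories), weights=weights, k=k)
```

Lowering `temperature` concentrates the budget on the best-scoring paths, while raising it spreads compute more evenly across candidates.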

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

APPS (public); HumanEval-X (public); BigCodeBench (public)

Risks & Boundaries

Limitations

Requires reliable test-case generation (paper uses GPT-4o); weak test cases limit reward fidelity.

Simulated execution relies on the LLM-as-interpreter and may inherit hallucination risks.

When Not To Use

When you can run real, sandboxed code execution and prefer ground-truth execution.

In hard real-time systems where added inference latency is unacceptable.

Failure Modes

False positives if generated test cases miss corner cases.

LLM hallucinations during simulated execution lead to incorrect rewards.
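The first failure mode is easy to reproduce with ordinary code. In the toy example below (written for this summary, not taken from the paper), a buggy function passes three plausible generated tests and fails only on a corner case that the tests never cover.

```python
from typing import Callable, List, Sequence, Tuple

Case = Tuple[Sequence[int], int]  # (input list, expected maximum)


def buggy_max(values: Sequence[int]) -> int:
    """Intended to return the maximum, but starts from 0 instead of values[0]."""
    best = 0  # bug: wrong whenever every element is negative
    for v in values:
        if v > best:
            best = v
    return best


def passes(fn: Callable[[Sequence[int]], int], tests: List[Case]) -> bool:
    """True if fn returns the expected value on every test case."""
    return all(fn(list(inp)) == expected for inp, expected in tests)


weak_tests: List[Case] = [([1, 2, 3], 3), ([5, 4], 5), ([0, 7, 2], 7)]
corner_case: Case = ([-3, -1, -2], -1)

print(passes(buggy_max, weak_tests))     # True: the bug goes unnoticed
print(passes(buggy_max, [corner_case]))  # False: all-negative input exposes it
```

A judge that rewards this code based only on `weak_tests` would report a false positive, which is why test-case quality bounds the fidelity of the reward.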

Core Entities

Models

DeepSeek-Coder-V2-16B-Instruct; Qwen-QwQ-32B; GPT-o1-preview; GPT-o1-mini; Qwen2.5-Coder-14B; Mistralai-Codestral22B; Llama-3.1-8B-Instruct; GPT-4o-mini; GPT-4o (used for test-case generation)

Metrics

Accuracy; Reasoning tokens (# tokens)
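For a side-by-side comparison of judges, both metrics can be computed from per-sample records. The sketch below assumes a hypothetical `JudgeRecord` layout; how reasoning tokens are counted depends on your provider and is left as an input.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JudgeRecord:
    """One judged sample. Field names are illustrative assumptions."""
    predicted_correct: bool  # the judge's verdict on the code
    actually_correct: bool   # ground-truth label from the benchmark
    reasoning_tokens: int    # tokens the judge spent on this sample


def summarize(records: List[JudgeRecord]) -> Dict[str, float]:
    """Accuracy of the verdicts and mean reasoning tokens per sample."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    hits = sum(r.predicted_correct == r.actually_correct for r in records)
    tokens = sum(r.reasoning_tokens for r in records)
    return {"accuracy": hits / n, "mean_reasoning_tokens": tokens / n}
```

Running `summarize` once per judge on the same sample set gives the accuracy-versus-tokens pair needed for a cost trade-off estimate.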

Datasets

APPS; HumanEval-X; BigCodeBench

Benchmarks

APPS; HumanEval-X; BigCodeBench