MCTS-Judge: Use Monte Carlo Tree Search at test time to double LLM judge accuracy on code tasks

February 18, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

0

Authors

Yutong Wang, Pengliang Ji, Chaoqun Yang, Kaixin Li, Ming Hu, Jiaoyang Li, Guillaume Sartoretti

Links

Abstract / PDF

Why It Matters For Business

MCTS-Judge can greatly improve automated code-evaluation accuracy using smaller models and fewer tokens, cutting evaluation cost and reducing dependence on handcrafted tests while providing richer explanations for CI/QA workflows.

Summary TLDR

This paper introduces MCTS-Judge: a practical framework that runs a Monte Carlo Tree Search (MCTS) around an LLM at test time to improve automated code correctness judgments. It adds a global-local node selection (UCT + LLM self-assessment) and a fully LLM-driven simulated execution reward (auto-generated test cases + LLM-as-interpreter). Across three benchmarks and five base models, MCTS-Judge boosts accuracy (e.g., DeepSeek-Coder-V2-16B: APPS 41.0% → 80.0%), works with smaller models and fewer tokens than some large reasoning models, and benefits from increasing tree depth, rollouts, and test-case sampling.

Problem Statement

LLM-as-a-Judge is cheap but unreliable for reasoning-heavy tasks like code evaluation. The paper asks: can we improve judge reliability by adding test-time compute (System-2 style reasoning) instead of larger models or retraining?

Main Contribution

MCTS-Judge: wrap an LLM with Monte Carlo Tree Search to decompose code-evaluation into multiple subtasks and build multi-step reasoning trajectories.

Global-local node selection: combine UCT (Upper Confidence Bound for Trees) with LLM-driven self-assessment to balance global exploration and local trajectory refinement.

Simulated execution reward: generate test cases with GPT-4o, then run masked input-output simulations where the LLM acts as an interpreter; reward trajectories only if simulated outputs match expected outputs.

Extensive evaluation on APPS, HumanEval-X, and BigCodeBench across five LLMs showing large, consistent gains and better robustness without reference code.

Empirical evidence of a test-time scaling law: increasing rollouts, tree depth, and test-case sampling improves accuracy.
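The global-local selection described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Node` class, the function names, the exploration constant `c`, and the "shortlist by UCT, then rescore with the LLM" split are all assumptions made for this example, and `self_assess` stands in for an LLM call that scores a candidate step given the trajectory so far.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One reasoning step (a subtask judgment) in the search tree."""
    label: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0


def uct_score(node: Node, c: float = 1.41) -> float:
    """Standard UCT: mean reward plus an exploration bonus."""
    if node.visits == 0:
        return math.inf  # unvisited children are tried first
    parent_visits = node.parent.visits if node.parent else node.visits
    exploit = node.total_reward / node.visits
    explore = c * math.sqrt(math.log(parent_visits) / node.visits)
    return exploit + explore


def select_child(
    parent: Node,
    self_assess: Callable[[Node], float],
    top_k: int = 2,
) -> Node:
    """Global step: shortlist children by UCT.
    Local step: pick the shortlisted child the LLM rates highest.
    (The shortlist-then-rescore split is an assumption of this sketch.)
    """
    if not parent.children:
        raise ValueError("parent has no children to select from")
    shortlist = sorted(parent.children, key=uct_score, reverse=True)[:top_k]
    return max(shortlist, key=self_assess)


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Add a rollout reward to every node on the path back to the root."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```

In a real judge, `self_assess` would prompt the model with the trajectory history; here any callable returning a float works, which keeps the search logic testable without an LLM.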

Key Findings

MCTS-Judge raised DeepSeek-Coder-V2-16B-Instruct accuracy on APPS from 41.0% to 80.0%.

Numbers: APPS: 41.0% → 80.0%

Average accuracy improved by 14.34 percentage points across five base models and three benchmarks when using MCTS-Judge.

Numbers: Average improvement: +14.34 pp

MCTS-Judge is more token-efficient: with DeepSeek-Coder-V2-16B-Instruct it used 2,065 reasoning tokens, versus 5,631 for o1-preview (~300B), while achieving higher APPS accuracy (80% vs 75%).

Numbers: 2,065 tokens (16B, 80%) vs 5,631 tokens (~300B, 75%)
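To make the token comparison concrete, the arithmetic can be worked through with a small helper. The function name and the returned keys are hypothetical; the inputs are the figures reported above. Note that this compares reasoning tokens only and says nothing about per-token price, which differs between models.

```python
def token_comparison(tokens_a: float, acc_a: float,
                     tokens_b: float, acc_b: float) -> dict:
    """Compare judge A against judge B on token use and accuracy."""
    ratio = tokens_a / tokens_b
    return {
        "token_ratio": ratio,                      # A's tokens as a fraction of B's
        "token_reduction_pct": 100 * (1 - ratio),  # how many fewer tokens A uses
        "accuracy_gain_pp": acc_a - acc_b,         # percentage-point difference
    }


# Figures from the finding above: MCTS-Judge on a 16B model vs o1-preview.
result = token_comparison(2065, 80.0, 5631, 75.0)
print(f"{result['token_reduction_pct']:.1f}% fewer tokens, "
      f"{result['accuracy_gain_pp']:+.1f} pp accuracy")
# prints "63.3% fewer tokens, +5.0 pp accuracy"
```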

The simulated execution reward outperformed self-consistency and self-evaluation rewards by about 13 percentage points on APPS in ablations.

Numbers: RM Ours + UCT+LLM: 80% vs RM SE/SC variants: ~65–78%

MCTS-Judge degrades less when reference code is absent (e.g., a 6% drop on BigCodeBench), while some baselines drop much more.

Numbers: BigCodeBench w/o reference: −6% (Ours) vs −22.5% (CodeJudge)

Results

Accuracy (APPS, DeepSeek-Coder-V2-16B-Instruct)

Value: 80.0%

Baseline: 41.0% (Vanilla)

Average accuracy improvement (five base models, three benchmarks)

Value: +14.34 pp

Baseline: Base models without MCTS-Judge

Reasoning tokens (APPS, DeepSeek w/ MCTS-Judge)

Value: 2,065 tokens

Baseline: o1-preview: 5,631 tokens

Robustness without reference code (BigCodeBench)

Value: −6.0% relative drop

Baseline: With reference code

What To Try In 7 Days

Wrap your current code-evaluation LLM with an MCTS controller and test on a small APPS/HumanEval-X subset.

Replace your vote-based judge with a simulated-execution reward: auto-generate 3 unit test cases and have the LLM act as an interpreter.

Compare token usage and accuracy vs your current judge to estimate cost trade-offs (measure reasoning tokens).
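For the simulated-execution step in the list above, a first prototype might look like the sketch below. It is an illustration under stated assumptions, not the paper's code: `interpret` is a hypothetical callable standing in for an LLM prompted to predict a program's output with the expected output hidden, and the rule that maps simulated pass/fail onto a reward for a "correct" or "incorrect" verdict is this example's own choice.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[str, str]  # (test input, expected output)


def simulated_execution_reward(
    code: str,
    verdict_is_correct: bool,
    test_cases: List[TestCase],
    interpret: Callable[[str, str], str],
) -> float:
    """Return 1.0 when the judge's verdict agrees with simulated execution.

    `interpret(code, test_input)` is assumed to wrap an LLM call that
    predicts the program's output; it never sees the expected output.
    """
    if not test_cases:
        raise ValueError("at least one test case is required")
    all_pass = all(
        interpret(code, test_input).strip() == expected.strip()
        for test_input, expected in test_cases
    )
    # Reward the trajectory only if its verdict matches the simulated outcome.
    return 1.0 if all_pass == verdict_is_correct else 0.0


# Stand-in interpreter for local testing: pretends the code doubles its input.
def fake_interpreter(code: str, test_input: str) -> str:
    return str(int(test_input) * 2)


cases = [("2", "4"), ("5", "10"), ("0", "0")]
reward = simulated_execution_reward("def f(x): return 2 * x", True, cases, fake_interpreter)
```

Swapping `fake_interpreter` for a real model call is the only change needed to try this on an APPS or HumanEval-X subset; where sandboxed execution is available, real execution remains the more trustworthy check.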

Agent Features

Memory

  • Trajectory history (actions and outcomes) used for self-assessment

Planning

  • Monte Carlo Tree Search planning
  • Global-local node selection (UCT + LLM self-assessment)

Tool Use

  • LLM used as evaluator and interpreter
  • GPT-4o for test-case generation

Frameworks

  • UCT
  • Simulated execution reward

Is Agentic

true

Architectures

  • MCTS + LLM
  • UCT-guided search

Optimization Features

Token Efficiency

  • Accuracy

Infra Optimization

  • Experiments run on a single H100 80GB GPU; hyperparameters tuned for this environment

System Optimization

  • LoRA

Inference Optimization

  • Test-time compute scaling via MCTS (tree depth, rollouts, test-case sampling)
  • Weighted sampling of trajectories to focus compute on high-reward paths
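One simple way to realize reward-weighted trajectory sampling is shown below. The summary does not specify the paper's weighting scheme, so the additive `floor` (which keeps low-reward paths from being starved entirely) and the function name are assumptions of this sketch.

```python
import random
from typing import List, Optional, Sequence


def sample_trajectories(
    trajectories: Sequence[str],
    rewards: Sequence[float],
    k: int,
    rng: Optional[random.Random] = None,
    floor: float = 0.05,
) -> List[str]:
    """Draw k trajectories with probability proportional to reward + floor."""
    if len(trajectories) != len(rewards):
        raise ValueError("trajectories and rewards must have the same length")
    rng = rng or random.Random()
    # Clamp negative rewards to zero, then add the floor so every path
    # keeps a small chance of being revisited.
    weights = [max(r, 0.0) + floor for r in rewards]
    return rng.choices(list(trajectories), weights=weights, k=k)


picks = sample_trajectories(
    ["high-reward path", "low-reward path"],
    [0.9, 0.1],
    k=1000,
    rng=random.Random(0),
)
```

Raising `k` is one of the test-time compute knobs; the trade-off is more LLM calls per judgment.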

Reproducibility

Data Urls

  • APPS (public)
  • HumanEval-X (public)
  • BigCodeBench (public)

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires reliable test-case generation (paper uses GPT-4o); weak test cases limit reward fidelity.
  • Simulated execution relies on the LLM-as-interpreter and may inherit hallucination risks.
  • Adds test-time compute and latency compared to single-shot prompts.
  • Performance sensitive to hyperparameters (tree depth, rollouts, test counts).

When Not To Use

  • When you can run real, sandboxed code execution and prefer ground-truth execution.
  • In hard real-time systems where added inference latency is unacceptable.
  • If you cannot afford a reliable test-case generator or are restricted from calling external LLMs.

Failure Modes

  • False positives if generated test cases miss corner cases.
  • LLM hallucinations during simulated execution lead to incorrect rewards.
  • Overfitting judge behavior to the simulated-exec reward design.
  • High variance from insufficient rollouts or too-shallow trees.

Core Entities

Models

  • DeepSeek-Coder-V2-16B-Instruct
  • Qwen-QwQ-32B
  • GPT-o1-preview
  • GPT-o1-mini
  • Qwen2.5-Coder-14B
  • Mistralai-Codestral-22B
  • Llama-3.1-8B-Instruct
  • GPT-4o-mini
  • GPT-4o (used for test-case generation)

Metrics

  • Accuracy
  • Reasoning tokens (# tokens)

Datasets

  • APPS
  • HumanEval-X
  • BigCodeBench

Benchmarks

  • APPS
  • HumanEval-X
  • BigCodeBench