MCTS-Judge: Use Monte Carlo Tree Search at test time to double LLM judge accuracy on code tasks

Overview

Decision Snapshot: Needs Validation

The method is practical: it converts inference compute to better judgments via MCTS and simulated tests; ablations show reward and node selection are key drivers of gains.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 70%

Authors

Yutong Wang, Pengliang Ji, Chaoqun Yang, Kaixin Li, Ming Hu, Jiaoyang Li, Guillaume Sartoretti

Links

Abstract / PDF / Data

Why It Matters For Business

MCTS-Judge can greatly improve automated code-evaluation accuracy using smaller models and fewer tokens, cutting evaluation cost and reducing dependence on handcrafted tests while providing richer explanations for CI/QA workflows.

Who Should Care

ML Engineer, Engineering Lead, Data Scientist, CTO, Product Manager

Summary TLDR

This paper introduces MCTS-Judge: a practical framework that runs a Monte Carlo Tree Search (MCTS) around an LLM at test time to improve automated code correctness judgments. It adds a global-local node selection (UCT + LLM self-assessment) and a fully LLM-driven simulated execution reward (auto-generated test cases + LLM-as-interpreter). Across three benchmarks and five base models, MCTS-Judge boosts accuracy (e.g., DeepSeek-Coder-V2-16B: APPS 41.0% → 80.0%), works with smaller models and fewer tokens than some large reasoning models, and benefits from increasing tree depth, rollouts, and test-case sampling.

Problem Statement

LLM-as-a-Judge is cheap but unreliable for reasoning-heavy tasks like code evaluation. The paper asks: can we improve judge reliability by adding test-time compute (System-2 style reasoning) instead of larger models or retraining?

Main Contribution

MCTS-Judge: wrap an LLM with Monte Carlo Tree Search to decompose code-evaluation into multiple subtasks and build multi-step reasoning trajectories.

Global-local node selection: combine UCT (Upper Confidence Bound for Trees) with LLM-driven self-assessment to balance global exploration and local trajectory refinement.
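The global-local selection idea can be sketched in a few lines. This is a minimal illustration, assuming the standard UCT formula and a hypothetical additive blend with a self-assessment score. The names `Node`, `uct_score`, `select_child`, `self_assess`, and `local_weight` are illustrative and not taken from the paper, whose actual combination rule may differ.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    name: str
    visits: int = 0
    total_reward: float = 0.0
    children: List["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float = 1.41) -> float:
    """Standard UCT: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def select_child(
    parent: Node,
    self_assess: Callable[[Node], float],
    local_weight: float = 0.5,
) -> Node:
    """Pick a child by UCT plus a weighted self-assessment score.

    ASSUMPTION: the additive blend is a stand-in for the paper's
    global-local rule. `self_assess` stands in for an LLM call that
    rates how promising the current trajectory looks (0.0 to 1.0).
    """
    parent_visits = max(parent.visits, 1)

    def score(child: Node) -> float:
        return uct_score(child, parent_visits) + local_weight * self_assess(child)

    return max(parent.children, key=score)
```

With `self_assess` returning a constant, the choice reduces to plain UCT; raising `local_weight` lets the self-assessment override the global statistics.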

Key Findings

MCTS-Judge raised DeepSeek-Coder-V2-16B-Instruct accuracy on APPS from 41.0% to 80.0%.

Numbers: APPS: 41.0% → 80.0%

Practical Use: You can dramatically improve code-judgment accuracy by adding MCTS-based test-time compute to an existing LLM instead of retraining a bigger model.

Evidence Ref: Table 1, Sec. 4.2

Average accuracy improved by 14.34 percentage points across five base models and three benchmarks when using MCTS-Judge.

Numbers: Average improvement: +14.34 pp

Practical Use: The approach is model-agnostic; try it as a plug-in judge for different LLMs to get consistent gains.

Evidence Ref: Sec. 4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	80.0%	41.0% (Vanilla)	+39.0 pp	APPS	MCTS-Judge raises DeepSeek-Coder-V2-16B-Instruct 41.0% → 80.0% on APPS	Table 1, Sec. 4.2
Average accuracy gain	+14.34 pp	Base models without MCTS-Judge	+14.34 pp	Average across 5 LLMs on 3 benchmarks	Average improvement quoted in paper	Sec. 4.2
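The deltas in the table are percentage-point differences, not relative changes. The short sketch below recomputes the APPS row from the two reported accuracies; the helper names `pp_delta` and `relative_gain` are illustrative only.

```python
def pp_delta(baseline_pct: float, new_pct: float) -> float:
    """Absolute change in percentage points."""
    return round(new_pct - baseline_pct, 2)


def relative_gain(baseline_pct: float, new_pct: float) -> float:
    """Ratio of new accuracy to baseline accuracy."""
    return new_pct / baseline_pct


# APPS, DeepSeek-Coder-V2-16B-Instruct: 41.0% -> 80.0%
print(pp_delta(41.0, 80.0))                 # 39.0
print(round(relative_gain(41.0, 80.0), 2))  # 1.95
```

The ratio of roughly 1.95 is what the headline refers to as doubling; it applies to this single model and benchmark, while the cross-model average gain is 14.34 pp.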

What To Try In 7 Days

Wrap your current code-evaluation LLM with an MCTS controller and test on a small APPS/HumanEval-X subset.

Replace your vote-based judge with a simulated-execution reward: auto-generate 3 unit test cases and have the LLM act as an interpreter.
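A simulated-execution reward of this kind might look like the sketch below. It assumes the reward is the fraction of generated test cases the model "passes" when acting as an interpreter, which may not match the paper's exact reward definition. `simulated_execution_reward`, `interpret`, and `fake_interpret` are hypothetical names; in practice `interpret` would wrap an LLM call that predicts the program's output for a given input.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[str, str]  # (input, expected_output)


def simulated_execution_reward(
    code: str,
    test_cases: List[TestCase],
    interpret: Callable[[str, str], str],
) -> float:
    """Fraction of test cases whose predicted output matches the expected one.

    ASSUMPTION: `interpret(code, test_input)` is an LLM-backed function that
    returns the output it believes the code would produce. No code is run.
    """
    if not test_cases:
        return 0.0
    passed = sum(
        1
        for test_input, expected in test_cases
        if interpret(code, test_input).strip() == expected.strip()
    )
    return passed / len(test_cases)


def fake_interpret(code: str, test_input: str) -> str:
    """Deterministic stand-in for the LLM: pretends the code doubles its input."""
    return str(int(test_input) * 2)


cases = [("1", "2"), ("2", "4"), ("3", "7")]  # last case is expected to fail
reward = simulated_execution_reward("def f(x): return 2 * x", cases, fake_interpret)
```

Keeping the interpreter behind a plain callable makes it easy to swap in real sandboxed execution later and compare the two rewards on the same test cases.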

Compare token usage and accuracy vs your current judge to estimate cost trade-offs (measure reasoning tokens).
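One way to set up that comparison is to log each judgment with its ground-truth label and token count, then summarize per judge. The record layout and names below (`JudgeRun`, `summarize`) are assumptions for illustration, not part of the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JudgeRun:
    predicted_correct: bool   # the judge's verdict on the submission
    actually_correct: bool    # ground truth, e.g. from real test execution
    reasoning_tokens: int     # tokens spent producing the verdict


def summarize(runs: List[JudgeRun]) -> Dict[str, float]:
    """Return judge accuracy and mean reasoning tokens per judgment."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    accuracy = sum(r.predicted_correct == r.actually_correct for r in runs) / n
    mean_tokens = sum(r.reasoning_tokens for r in runs) / n
    return {"accuracy": accuracy, "mean_tokens": mean_tokens}


baseline = [
    JudgeRun(True, True, 100),
    JudgeRun(True, False, 100),
    JudgeRun(False, False, 100),
    JudgeRun(False, True, 100),
]
stats = summarize(baseline)
```

Running `summarize` on the same submissions for both judges gives an accuracy gain and a token cost that can be weighed against each other directly.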

Agent Features

Memory

Trajectory history (actions and outcomes) used for self-assessment

Planning

Monte Carlo Tree Search planning

Global-local node selection (UCT + LLM self-assessment)

Tool Use

LLM used as evaluator and interpreter

GPT-4o for test-case generation

Frameworks

UCT

Simulated execution reward

Is Agentic

Yes

Architectures

MCTS + LLM

UCT-guided search

Optimization Features

Token Efficiency

Accuracy

Infra Optimization

Experiments were run on a single H100 80GB GPU; hyperparameters were tuned for this environment

System Optimization

LoRA

Inference Optimization

Test-time compute scaling via MCTS (tree depth, rollouts, test-case sampling)

Weighted sampling of trajectories to focus compute on high-reward paths
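Weighted trajectory sampling can be illustrated with the standard library alone. The sketch below assumes weights proportional to each trajectory's reward, which this summary does not confirm; `sample_trajectories` is a hypothetical helper, not the paper's procedure.

```python
import random
from typing import List, Optional, Sequence


def sample_trajectories(
    trajectories: Sequence[str],
    rewards: Sequence[float],
    k: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw k trajectories with probability proportional to reward.

    ASSUMPTION: reward-proportional weights. Negative rewards are clipped
    to zero, and if every weight is zero the draw falls back to uniform.
    """
    if len(trajectories) != len(rewards):
        raise ValueError("trajectories and rewards must have the same length")
    rng = rng or random.Random()
    weights = [max(r, 0.0) for r in rewards]
    if sum(weights) == 0:
        weights = [1.0] * len(trajectories)
    return rng.choices(list(trajectories), weights=weights, k=k)


picked = sample_trajectories(
    ["low-reward path", "high-reward path"],
    [0.0, 1.0],
    k=5,
    rng=random.Random(0),
)
```

Passing a seeded `random.Random` keeps runs reproducible, which helps when comparing rollout budgets.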

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

APPS (public), HumanEval-X (public), BigCodeBench (public)

Risks & Boundaries

Limitations

Requires reliable test-case generation (paper uses GPT-4o); weak test cases limit reward fidelity.

Simulated execution relies on the LLM-as-interpreter and may inherit hallucination risks.

When Not To Use

When you can run real, sandboxed code execution and prefer ground-truth execution.

In hard real-time systems where added inference latency is unacceptable.

Failure Modes

False positives if generated test cases miss corner cases.

LLM hallucinations during simulated execution lead to incorrect rewards.

Core Entities

Models

DeepSeek-Coder-V2-16B-Instruct, Qwen-QwQ-32B, GPT-o1-preview, GPT-o1-mini, Qwen2.5-Coder-14B, Mistralai-Codestral22B, Llama-3.1-8B-Instruct, GPT-4o-mini, GPT-4o (used for test-case generation)

Metrics

Accuracy, Reasoning tokens (# tokens)

Datasets

APPS, HumanEval-X, BigCodeBench

Benchmarks

APPS, HumanEval-X, BigCodeBench
