Use Monte Carlo Tree Self-Refine (MCTSr) to boost LLaMA-3 8B on hard math problems

June 11, 20246 min

Overview

Decision SnapshotReady For Pilot

Method shows clear, repeatable gains on public math benchmarks using an open 8B model, but it needs compute for rollouts and shows limits on very hard Olympiad items.

Citations6

Evidence Strength0.75

Confidence0.86

Risk Signals10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 70%

Authors

Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MCTSr lets smaller open LLMs solve many multi-step math tasks nearly as well as large closed models, reducing reliance on costly closed APIs for arithmetic and structured reasoning workloads.

Who Should Care

Summary TLDR

The paper introduces MCT Self-Refine (MCTSr): an algorithm that wraps a language model inside Monte Carlo Tree Search (MCTS), using iterative self-feedback and a constrained self-scoring step to refine mathematical solutions. With LLaMA3-8B, MCTSr improves accuracy steadily with more rollouts — e.g., MATH overall jumps from 24.36% (Zero-Shot CoT) to 58.24% (8-rollouts). On GSM8K it reaches 96.66% (8-rollouts), close to GPT-4's 97.1% on that benchmark. Gains are larger on easier items and show ceilings on the hardest benchmarks.

Problem Statement

Large language models still make logical or numeric errors on multi-step math problems. The paper asks: can we combine MCTS (systematic search) with model-driven self-refinement and self-evaluation to make a small open model (LLaMA3-8B) solve Olympiad-level math more reliably?

Main Contribution

MCTSr algorithm: integrate MCTS with iterative self-refine and constrained self-reward to search answer variants.

Dynamic pruning and UCT/UCB adaptations to suit stochastic LLM outputs and an unbounded refine action space.

Key Findings

More MCTSr rollouts steadily improve accuracy on math benchmarks.

NumbersMATH overall: 24.36%58.24% (Zero-Shot CoT → 8-rollouts)

Practical UseIf you can afford extra compute, run multi-rollout MCTSr (4→8 rollouts) to get large gains on multi-step math tasks.

Evidence RefTable 2; Section 4.3

LLaMA3-8B with 8-rollouts MCTSr reaches near state-of-the-art on standard arithmetic benchmarks.

NumbersGSM8K: 74.07%96.66% (Zero-Shot CoT → 8-rollouts); GPT-4 GSM8K: 97.1%

Practical UseYou can get GPT-4-like GSM8K accuracy using an 8B open model plus MCTSr, saving on model licensing while keeping high arithmetic performance.

Evidence RefTable 1 and Table 4; Section 4.2

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Accuracy96.66%74.07% (Zero-Shot CoT)+22.59 ppGSM8K (1319 examples)Table 1; Section 4.2Table 1
Accuracy45.49%25.47% (Zero-Shot CoT)+20.02 ppGSM-Hard (1319 examples)Table 1; Section 4.2Table 1

What To Try In 7 Days

Run MCTSr with LLaMA3-8B on a representative arithmetic dataset (start with 4 then 8 rollouts).

Add the paper's self-reward prompt and full-score suppression to your evaluation loop.

Measure cost vs accuracy and decide rollout budget for production cases.

Agent Features

Memory
short-term search tree (node history)
Planning
Monte Carlo Tree SearchUCT/UCB selection
Tool Use
self-refine (iterative critique)self-evaluation (model-scored rewards)
Frameworks
MCTSr
Is Agentic

Yes

Architectures
single-agent

Optimization Features

Inference Optimization
dynamic pruning (child cap + Q-based criterion)

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Data URLs

GSM8K, GSM-Hard, MATH, AIME, Math Odyssey, OlympiadBench (public benchmarks cited)

Risks & Boundaries

Limitations

Preliminary work: components need broader ablation and tuning.

High compute cost: multiple rollouts multiply inference cost and latency.

When Not To Use

When you need low-latency or cheap inference (real-time apps).

When absolute correctness is mandatory without human review.

Failure Modes

Self-reward smoothing or bias reduces discrimination between variants.

Search can waste compute on low-quality refinements if pruning thresholds mis-set.

Core Entities

Models

LLaMA3-8BGPT-4Claude 3Gemini 1.5-Pro

Metrics

success rateAccuracysolved count

Datasets

GSM8KGSM-HardMATHAIMEMath OdysseyOlympiadBench

Benchmarks

GSM8KMATHMath OdysseyAIMEOlympiadBench