Use Monte Carlo Tree Self-Refine (MCTSr) to boost LLaMA-3 8B on hard math problems

June 11, 20246 min

Overview

Production Readiness

0.5

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

6

Authors

Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang

Links

Abstract / PDF

Why It Matters For Business

MCTSr lets smaller open LLMs solve many multi-step math tasks nearly as well as large closed models, reducing reliance on costly closed APIs for arithmetic and structured reasoning workloads.

Summary TLDR

The paper introduces MCT Self-Refine (MCTSr): an algorithm that wraps a language model inside Monte Carlo Tree Search (MCTS), using iterative self-feedback and a constrained self-scoring step to refine mathematical solutions. With LLaMA3-8B, MCTSr improves accuracy steadily with more rollouts — e.g., MATH overall jumps from 24.36% (Zero-Shot CoT) to 58.24% (8-rollouts). On GSM8K it reaches 96.66% (8-rollouts), close to GPT-4's 97.1% on that benchmark. Gains are larger on easier items and show ceilings on the hardest benchmarks.

Problem Statement

Large language models still make logical or numeric errors on multi-step math problems. The paper asks: can we combine MCTS (systematic search) with model-driven self-refinement and self-evaluation to make a small open model (LLaMA3-8B) solve Olympiad-level math more reliably?

Main Contribution

MCTSr algorithm: integrate MCTS with iterative self-refine and constrained self-reward to search answer variants.

Dynamic pruning and UCT/UCB adaptations to suit stochastic LLM outputs and an unbounded refine action space.

Extensive empirical evaluation on GSM8K, GSM-Hard, MATH, AIME, Math Odyssey, and OlympiadBench showing large accuracy gains with more rollouts. Code released at the cited GitHub.

Key Findings

More MCTSr rollouts steadily improve accuracy on math benchmarks.

NumbersMATH overall: 24.36% → 58.24% (Zero-Shot CoT → 8-rollouts)

LLaMA3-8B with 8-rollouts MCTSr reaches near state-of-the-art on standard arithmetic benchmarks.

NumbersGSM8K: 74.07% → 96.66% (Zero-Shot CoT → 8-rollouts); GPT-4 GSM8K: 97.1%

Performance still caps on very hard or Olympiad-level problems.

NumbersAIME: 2.36% → 11.79% (Zero-Shot → 8-rollouts); OlympiadBench: 1.25% → 7.76%

Results

Accuracy

Value96.66%

Baseline74.07% (Zero-Shot CoT)

Accuracy

Value45.49%

Baseline25.47% (Zero-Shot CoT)

Accuracy

Value58.24%

Baseline24.36% (Zero-Shot CoT)

Accuracy

Value49.36%

Baseline17.22% (Zero-Shot CoT)

Accuracy

Value11.79%

Baseline2.36% (Zero-Shot CoT)

benchmark comparison

ValueGPT-4: 97.1% (GSM8K); GPT-4: 73.4% (MATH)

Who Should Care

What To Try In 7 Days

Run MCTSr with LLaMA3-8B on a representative arithmetic dataset (start with 4 then 8 rollouts).

Add the paper's self-reward prompt and full-score suppression to your evaluation loop.

Measure cost vs accuracy and decide rollout budget for production cases.

Agent Features

Memory

  • short-term search tree (node history)

Planning

  • Monte Carlo Tree Search
  • UCT/UCB selection

Tool Use

  • self-refine (iterative critique)
  • self-evaluation (model-scored rewards)

Frameworks

  • MCTSr

Is Agentic

true

Architectures

  • single-agent

Optimization Features

Inference Optimization

  • dynamic pruning (child cap + Q-based criterion)

Reproducibility

Data Urls

  • GSM8K, GSM-Hard, MATH, AIME, Math Odyssey, OlympiadBench (public benchmarks cited)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Preliminary work: components need broader ablation and tuning.
  • High compute cost: multiple rollouts multiply inference cost and latency.
  • Ceiling on hardest problems: large remaining gap on top-tier Olympiad items.
  • Relies on self-reward scoring that can be smooth or biased without constraints.

When Not To Use

  • When you need low-latency or cheap inference (real-time apps).
  • When absolute correctness is mandatory without human review.
  • For tasks outside structured multi-step reasoning without further validation.

Failure Modes

  • Self-reward smoothing or bias reduces discrimination between variants.
  • Search can waste compute on low-quality refinements if pruning thresholds mis-set.
  • Diminishing returns on very hard problems despite more rollouts.

Core Entities

Models

  • LLaMA3-8B
  • GPT-4
  • Claude 3
  • Gemini 1.5-Pro

Metrics

  • success rate
  • Accuracy
  • solved count

Datasets

  • GSM8K
  • GSM-Hard
  • MATH
  • AIME
  • Math Odyssey
  • OlympiadBench

Benchmarks

  • GSM8K
  • MATH
  • Math Odyssey
  • AIME
  • OlympiadBench