Overview
Production Readiness
0.5
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
6
Why It Matters For Business
MCTSr lets smaller open LLMs solve many multi-step math tasks nearly as well as large closed models, reducing reliance on costly closed APIs for arithmetic and structured reasoning workloads.
Summary TLDR
The paper introduces MCT Self-Refine (MCTSr): an algorithm that wraps a language model inside Monte Carlo Tree Search (MCTS), using iterative self-feedback and a constrained self-scoring step to refine mathematical solutions. With LLaMA3-8B, MCTSr improves accuracy steadily with more rollouts — e.g., MATH overall jumps from 24.36% (Zero-Shot CoT) to 58.24% (8-rollouts). On GSM8K it reaches 96.66% (8-rollouts), close to GPT-4's 97.1% on that benchmark. Gains are larger on easier items and show ceilings on the hardest benchmarks.
Problem Statement
Large language models still make logical or numeric errors on multi-step math problems. The paper asks: can we combine MCTS (systematic search) with model-driven self-refinement and self-evaluation to make a small open model (LLaMA3-8B) solve Olympiad-level math more reliably?
Main Contribution
MCTSr algorithm: integrate MCTS with iterative self-refine and constrained self-reward to search answer variants.
Dynamic pruning and UCT/UCB adaptations to suit stochastic LLM outputs and an unbounded refine action space.
Extensive empirical evaluation on GSM8K, GSM-Hard, MATH, AIME, Math Odyssey, and OlympiadBench showing large accuracy gains with more rollouts. Code released at the cited GitHub.
Key Findings
More MCTSr rollouts steadily improve accuracy on math benchmarks.
LLaMA3-8B with 8-rollouts MCTSr reaches near state-of-the-art on standard arithmetic benchmarks.
Performance still caps on very hard or Olympiad-level problems.
Results
Accuracy
Accuracy
Accuracy
Accuracy
Accuracy
benchmark comparison
Who Should Care
What To Try In 7 Days
Run MCTSr with LLaMA3-8B on a representative arithmetic dataset (start with 4 then 8 rollouts).
Add the paper's self-reward prompt and full-score suppression to your evaluation loop.
Measure cost vs accuracy and decide rollout budget for production cases.
Agent Features
Memory
- short-term search tree (node history)
Planning
- Monte Carlo Tree Search
- UCT/UCB selection
Tool Use
- self-refine (iterative critique)
- self-evaluation (model-scored rewards)
Frameworks
- MCTSr
Is Agentic
true
Architectures
- single-agent
Optimization Features
Inference Optimization
- dynamic pruning (child cap + Q-based criterion)
Reproducibility
Data Urls
- GSM8K, GSM-Hard, MATH, AIME, Math Odyssey, OlympiadBench (public benchmarks cited)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Preliminary work: components need broader ablation and tuning.
- High compute cost: multiple rollouts multiply inference cost and latency.
- Ceiling on hardest problems: large remaining gap on top-tier Olympiad items.
- Relies on self-reward scoring that can be smooth or biased without constraints.
When Not To Use
- When you need low-latency or cheap inference (real-time apps).
- When absolute correctness is mandatory without human review.
- For tasks outside structured multi-step reasoning without further validation.
Failure Modes
- Self-reward smoothing or bias reduces discrimination between variants.
- Search can waste compute on low-quality refinements if pruning thresholds mis-set.
- Diminishing returns on very hard problems despite more rollouts.
Core Entities
Models
- LLaMA3-8B
- GPT-4
- Claude 3
- Gemini 1.5-Pro
Metrics
- success rate
- Accuracy
- solved count
Datasets
- GSM8K
- GSM-Hard
- MATH
- AIME
- Math Odyssey
- OlympiadBench
Benchmarks
- GSM8K
- MATH
- Math Odyssey
- AIME
- OlympiadBench

