Use Monte Carlo Tree Self-Refine (MCTSr) to boost LLaMA-3 8B on hard math problems

Overview

Decision SnapshotReady For Pilot

Method shows clear, repeatable gains on public math benchmarks using an open 8B model, but it needs compute for rollouts and shows limits on very hard Olympiad items.

Citations6

Evidence Strength0.75

Confidence0.86

Risk Signals10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 70%

Authors

Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MCTSr lets smaller open LLMs solve many multi-step math tasks nearly as well as large closed models, reducing reliance on costly closed APIs for arithmetic and structured reasoning workloads.

Who Should Care

ML Engineer Product Manager CTO Engineering Lead Data Scientist

Summary TLDR

The paper introduces MCT Self-Refine (MCTSr): an algorithm that wraps a language model inside Monte Carlo Tree Search (MCTS), using iterative self-feedback and a constrained self-scoring step to refine mathematical solutions. With LLaMA3-8B, MCTSr improves accuracy steadily with more rollouts — e.g., MATH overall jumps from 24.36% (Zero-Shot CoT) to 58.24% (8-rollouts). On GSM8K it reaches 96.66% (8-rollouts), close to GPT-4's 97.1% on that benchmark. Gains are larger on easier items and show ceilings on the hardest benchmarks.

Problem Statement

Large language models still make logical or numeric errors on multi-step math problems. The paper asks: can we combine MCTS (systematic search) with model-driven self-refinement and self-evaluation to make a small open model (LLaMA3-8B) solve Olympiad-level math more reliably?

Main Contribution

MCTSr algorithm: integrate MCTS with iterative self-refine and constrained self-reward to search answer variants.

Dynamic pruning and UCT/UCB adaptations to suit stochastic LLM outputs and an unbounded refine action space.

Key Findings

More MCTSr rollouts steadily improve accuracy on math benchmarks.

NumbersMATH overall: 24.36% → 58.24% (Zero-Shot CoT → 8-rollouts)

Practical UseIf you can afford extra compute, run multi-rollout MCTSr (4→8 rollouts) to get large gains on multi-step math tasks.

Evidence RefTable 2; Section 4.3

LLaMA3-8B with 8-rollouts MCTSr reaches near state-of-the-art on standard arithmetic benchmarks.

NumbersGSM8K: 74.07% → 96.66% (Zero-Shot CoT → 8-rollouts); GPT-4 GSM8K: 97.1%

Practical UseYou can get GPT-4-like GSM8K accuracy using an 8B open model plus MCTSr, saving on model licensing while keeping high arithmetic performance.

Evidence RefTable 1 and Table 4; Section 4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	96.66%	74.07% (Zero-Shot CoT)	+22.59 pp	GSM8K (1319 examples)	Table 1; Section 4.2	Table 1
Accuracy	45.49%	25.47% (Zero-Shot CoT)	+20.02 pp	GSM-Hard (1319 examples)	Table 1; Section 4.2	Table 1

What To Try In 7 Days

Run MCTSr with LLaMA3-8B on a representative arithmetic dataset (start with 4 then 8 rollouts).

Add the paper's self-reward prompt and full-score suppression to your evaluation loop.

Measure cost vs accuracy and decide rollout budget for production cases.

Agent Features

Memory

short-term search tree (node history)

Planning

Monte Carlo Tree SearchUCT/UCB selection

Tool Use

self-refine (iterative critique)self-evaluation (model-scored rewards)

Frameworks

MCTSr

Is Agentic

Yes

Architectures

single-agent

Optimization Features

Inference Optimization

dynamic pruning (child cap + Q-based criterion)

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Code URLs

https://github.com/trotsky1997/MathBlackBox

Data URLs

GSM8K, GSM-Hard, MATH, AIME, Math Odyssey, OlympiadBench (public benchmarks cited)

Risks & Boundaries

Limitations

Preliminary work: components need broader ablation and tuning.

High compute cost: multiple rollouts multiply inference cost and latency.

When Not To Use

When you need low-latency or cheap inference (real-time apps).

When absolute correctness is mandatory without human review.

Failure Modes

Self-reward smoothing or bias reduces discrimination between variants.

Search can waste compute on low-quality refinements if pruning thresholds mis-set.

Core Entities

Models

LLaMA3-8BGPT-4Claude 3Gemini 1.5-Pro

Metrics

success rateAccuracysolved count

Datasets

GSM8KGSM-HardMATHAIMEMath OdysseyOlympiadBench

Benchmarks

GSM8KMATHMath OdysseyAIMEOlympiadBench

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

More MCTSr rollouts steadily improve accuracy on math benchmarks.

LLaMA3-8B with 8-rollouts MCTSr reaches near state-of-the-art on standard arithmetic benchmarks.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

LEXAM — 340 real law exams, 4.9k questions, and an expert-validated LLM judge for legal reasoning

Key finding

MULTICOM: a multilingual commonsense generation benchmark showing LLMs are better in English

Key finding

ID-MoCQA: 15,590 bilingual Indonesian multi-hop cultural QA items show models can identify regions but fail at situational cultural answers

Key finding

ERI: 57,750 engineering instruction-response items across 9 fields to test LLM reasoning and agent tool-use

Key finding

ElecBench — a domain benchmark that tests LLMs on power-dispatch scenarios across six practical metrics.

Key finding