Use an LLM to self-evaluate during MCTS and cluster answers to improve multi-step reasoning without extra reward models

June 9, 20256 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

0

Authors

Mengsong Wu, Di Zhang, Yuqiang Li, Dongzhan Zhou, Wenliang Chen

Links

Abstract / PDF

Why It Matters For Business

SELT improves multi-step reasoning and tool-call accuracy without costly fine-tuning, so teams can boost correctness on complex QA and agent tasks by running an intelligent search layer at inference time.

Summary TLDR

SELT modifies Monte Carlo Tree Search (MCTS) for LLM inference by (1) replacing external reward models with LLM self-evaluation via a Bayesian-adjusted UCT score and (2) decomposing tasks into atomic subtasks and spectral-clustering simulated answers to pick representative responses. Running SELT with Llama-3.1-8B (100 search steps) improves accuracy and F1 on sampled MMLU and Seal-Tools tasks versus 1-shot, CoT, and vanilla MCTS, at the cost of higher compute during search. Code is available.

Problem Statement

LLMs struggle on multi-step or tool-using reasoning when single prompts or chain-of-thought templates are insufficient. Prior MCTS approaches need external reward models and fine-tuning, which adds cost and domain bias. SELT aims to guide MCTS using the LLM itself and reduce redundant/low-quality reasoning paths.

Main Contribution

Self-evaluation MCTS: replace external reward models by scoring candidates with the LLM and a Bayesian-adjusted UCT.

Task decomposition + modes: break problems into atomic LLM subtasks (T/F, Choice, FITB, SA) and four inference modes (Learn, Think, Mimic, Recite).

Semantic clustering: spectral clustering of simulated answers per node and using cluster representatives to stabilize evaluation and reduce redundancy.

Key Findings

On MMLU (selected domain splits), SELT (sentence-level) raises solved-rate compared to 1-shot CoT

NumbersMathematics: SELT Both 62% vs 1-shot CoT 49% (Table 2)

On Seal-Tools single-tool calling, SELT improves end-task F1

NumbersSingle-tool F1: SELT (Picked, Sα) 87.67 vs 1-shot 83.26 (+4.41) (Table 3)

Sentence-level (generate next sentence) search outperforms response-level search

NumbersMultiple reported splits show sentence-level SELT > response-level SELT (Tables 2–3)

Results

MMLU — Mathematics (sentence-level, 'Both')

Value62.0%

Baseline1-shot CoT 49.0%

Seal-Tools — Single-tool F1 (Picked, Sα)

Value87.67

Baseline1-shot F1 83.26

Seal-Tools — Multiple-tool F1 (Picked, Sα)

Value94.45

Baseline1-shot F1 91.46

Who Should Care

What To Try In 7 Days

Run SELT with Llama-3.1-8B-Instruct on a small task subset using vLLM and T=100 to reproduce gains.

Use sentence-level decomposition and enable clustering (spectral clustering + TF-IDF) to stabilize answers.

Compare three settings quickly: raw MCTS (S_raw), exploitation-modified (Sα), and Sα+β to see trade-offs.

Agent Features

Planning

  • MCTS-based planning
  • LoRA

Tool Use

  • evaluates and calls external tools (tested on Seal-Tools)

Frameworks

  • vLLM

Is Agentic

true

Architectures

  • binary-tree MCTS

Optimization Features

Infra Optimization

  • uses vLLM for faster inference

System Optimization

  • binary tree to control branching and search space

Inference Optimization

  • prioritize deeper expansion (50% bias to best child)
  • UCT exploitation modified with Bayesian averaging (µ_Tree, C_β)
  • LoRA

Reproducibility

Data Urls

  • MMLU (Hendrycks et al., 2020)
  • Seal-Tools (Wu et al., 2024)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on the LLM's self-evaluation which can be biased or inaccurate and may propagate errors.
  • High computational cost: search with T=100 is slower than single-prompt methods and may not suit low-latency requirements.
  • Experiments use selected splits of MMLU and subsets of Seal-Tools; generality to other open-ended domains is not shown.

When Not To Use

  • Real-time or low-latency services where many search steps add unacceptable delay.
  • Tasks where the LLM is known to self-evaluate poorly or where external ground truth is required for safety-critical decisions.
  • Very large action spaces where binary-tree expansion or 100 steps is insufficient or too costly.

Failure Modes

  • Self-evaluation bias causes the tree to favor confidently wrong branches.
  • Clustering may group distinct correct answers, hiding valid minority solutions.
  • Compute cost leads to infeasible scaling on large datasets or many concurrent requests.

Core Entities

Models

  • Llama-3.1-8B-Instruct

Metrics

  • Accuracy
  • precision
  • recall
  • F1

Datasets

  • MMLU (selected splits: abstract algebra, college physics, college chemistry)
  • Seal-Tools (Seal-Tools / Seal-Tools subset)

Benchmarks

  • MMLU
  • Seal-Tools