Use an LLM to judge each reasoning step and guide stochastic beam search to reduce error accumulation

May 1, 20237 min

Overview

Decision SnapshotNeeds Validation

The method is practical for batch or offline multi-step tasks when you can access token logits and accept extra compute; gains are largest on long chains and structured reasoning.

Citations10

Evidence Strength0.70

Confidence0.80

Risk Signals9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, Qizhe Xie

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product relies on long multi-step reasoning (math QA, multi-hop extraction, program generation), adding a step-level LLM evaluator can raise correctness without model fine-tuning; expect higher compute and need for token-level access.

Who Should Care

Summary TLDR

The paper adds a stepwise self-evaluation signal to multi-step decoding. At each intermediate reasoning step, the same or another LLM scores whether the step looks correct. That correctness score is combined with the LM's generation probability to guide a stochastic beam-search over chains. The approach improves few-shot reasoning accuracy on arithmetic, symbolic and some commonsense tasks (notably for longer chains) but raises compute and needs access to token logits. Code is released.

Problem Statement

Long multi-step reasoning chains let errors accumulate and make the search space huge. Existing sampling or majority-vote fixes act at the final answer level and do not steer generation step-by-step. This paper seeks a lightweight, prompt-based way to control per-step correctness during decoding.

Main Contribution

A stepwise self-evaluation constraint that scores the correctness of each intermediate reasoning step using LLM prompts (multiple-choice style).

A stochastic beam-search decoding algorithm that combines generation probability and stepwise correctness to guide search over reasoning chains.

Key Findings

Self-evaluation guided decoding improves few-shot accuracy on major benchmarks with Codex backbone

NumbersGSM8K +6.34%; AQuA +9.56%; StrategyQA +5.46%

Practical UseExpect single-digit absolute accuracy gains on math and multi-hop QA when you can run stepwise evaluation with a strong LLM.

Evidence RefAbstract; Table 1

Method gives better performance for longer reasoning chains

NumbersGSM8K gains by reasoning length: +5.49% (<7 steps), +7.82% (79], +9.78% (≥9)

Practical UseApply stepwise evaluation especially for tasks that need many intermediate steps (multi-hop math, complex planning).

Evidence RefTable 4

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Accuracy85.5%PAL + self-consistency 80.4%+5.1ppGSM8KOurs-PAL 85.5% vs PAL,SC 80.4% (Table 1)Table 1
Accuracy64.2%PAL,SC 54.7% / other baseline 58.6%+9.56% (reported abstract)AQuAOurs-PAL 64.2% (Table 1, Abstract)Table 1; Abstract

What To Try In 7 Days

Implement a simple multiple-choice self-eval prompt and evaluate it on a small set of chain-of-thought examples.

Run stochastic beam search with k=3–5 and n=2 rollouts (cheap setting) to compare to plain sampling/self-consistency.

Tune sampling temperature (γ) and sampling-decay (α) with validation: start γ∈[0.4,0.8], α=0.5, τ small for CoT, larger for PAL-like programs.

Optimization Features

Token Efficiency
tradeoff tuning (n, k, τ) to reduce tokens

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusYes
LicenseUnknown

Data URLs

GSM8K, AQuA, SVAMP, ASDiv, TabMWP, StrategyQA, CommonsenseQA (public benchmarks referenced in paper)

Risks & Boundaries

Limitations

Requires access to token-level likelihoods (logits), so not directly usable with some commercial APIs.

Additional compute from rollouts and evaluation can be substantial if n and k are large.

When Not To Use

APIs that don't return token logits (e.g., some ChatGPT/GPT-4 API variants).

Very low-latency or tight-cost production constraints.

Failure Modes

Evaluator gives high scores to plausible but logically flawed chains (false positives).

Evaluator is sensitive to small wording or variable-name changes and can under-score correct steps.

Core Entities

Models

Codex (code-davinci-002)Llama-2 (13B)ChatGPT (gpt-3.5-turbo)GPT-4

Metrics

Accuracytoken cost (# tokens)

Datasets

GSM8KAQuASVAMPASDivTabMWPBIG-Bench Date UnderstandingBIG-Bench Object CountingCommonsenseQAStrategyQASports Understanding (BIG-Bench)

Benchmarks

Arithmetic reasoningSymbolic reasoning (BIG-Bench subsets)Commonsense reasoning