Use an LLM to judge each reasoning step and guide stochastic beam search to reduce error accumulation

May 1, 20237 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

10

Authors

Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, Qizhe Xie

Links

Abstract / PDF

Why It Matters For Business

If your product relies on long multi-step reasoning (math QA, multi-hop extraction, program generation), adding a step-level LLM evaluator can raise correctness without model fine-tuning; expect higher compute and need for token-level access.

Summary TLDR

The paper adds a stepwise self-evaluation signal to multi-step decoding. At each intermediate reasoning step, the same or another LLM scores whether the step looks correct. That correctness score is combined with the LM's generation probability to guide a stochastic beam-search over chains. The approach improves few-shot reasoning accuracy on arithmetic, symbolic and some commonsense tasks (notably for longer chains) but raises compute and needs access to token logits. Code is released.

Problem Statement

Long multi-step reasoning chains let errors accumulate and make the search space huge. Existing sampling or majority-vote fixes act at the final answer level and do not steer generation step-by-step. This paper seeks a lightweight, prompt-based way to control per-step correctness during decoding.

Main Contribution

A stepwise self-evaluation constraint that scores the correctness of each intermediate reasoning step using LLM prompts (multiple-choice style).

A stochastic beam-search decoding algorithm that combines generation probability and stepwise correctness to guide search over reasoning chains.

Empirical gains on arithmetic, symbolic and some commonsense benchmarks; detailed hyperparameter and cost analyses.

Practical recipes (prompts, temperature schedules) and public code to reproduce the method.

Key Findings

Self-evaluation guided decoding improves few-shot accuracy on major benchmarks with Codex backbone

NumbersGSM8K +6.34%; AQuA +9.56%; StrategyQA +5.46%

Method gives better performance for longer reasoning chains

NumbersGSM8K gains by reasoning length: +5.49% (<7 steps), +7.82% (7–9], +9.78% (≥9)

Comparable or better cost-efficiency under matched token budgets with an open model

NumbersLlama-2: Ours 46.1% (12.6k tokens) vs baseline 41.8% (13.9k tokens)

Correctness confidence (LLM self-eval) is more discriminatory than raw LM probability for detecting logic errors

NumbersDistributional separation shown in Figures 5 and 11 (correct vs incorrect)

There is non-trivial compute overhead when using many rollouts per beam

NumbersSingle-chain PAL setup used n=16 rollouts and costs ≈3× more than self-consistency on GSM8K

Results

Accuracy

Value85.5%

BaselinePAL + self-consistency 80.4%

Accuracy

Value64.2%

BaselinePAL,SC 54.7% / other baseline 58.6%

Accuracy

Value77.2%

BaselineCoT 73.2%

Cost-efficiency (Llama-2, GSM8K)

Value46.1% @ 12.6k tokens

Baseline41.8% @ 13.9k tokens

Who Should Care

What To Try In 7 Days

Implement a simple multiple-choice self-eval prompt and evaluate it on a small set of chain-of-thought examples.

Run stochastic beam search with k=3–5 and n=2 rollouts (cheap setting) to compare to plain sampling/self-consistency.

Tune sampling temperature (γ) and sampling-decay (α) with validation: start γ∈[0.4,0.8], α=0.5, τ small for CoT, larger for PAL-like programs.

Optimization Features

Token Efficiency

  • tradeoff tuning (n, k, τ) to reduce tokens

Reproducibility

Data Urls

  • GSM8K, AQuA, SVAMP, ASDiv, TabMWP, StrategyQA, CommonsenseQA (public benchmarks referenced in paper)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Requires access to token-level likelihoods (logits), so not directly usable with some commercial APIs.
  • Additional compute from rollouts and evaluation can be substantial if n and k are large.
  • Less helpful for short reasoning chains or tasks where LMs already have low-diversity high-confidence outputs.

When Not To Use

  • APIs that don't return token logits (e.g., some ChatGPT/GPT-4 API variants).
  • Very low-latency or tight-cost production constraints.
  • Short single-step QA where stepwise calibration has little effect.

Failure Modes

  • Evaluator gives high scores to plausible but logically flawed chains (false positives).
  • Evaluator is sensitive to small wording or variable-name changes and can under-score correct steps.
  • If the base LM is overconfident and lacks diversity, the guided search may not find better alternatives.

Core Entities

Models

  • Codex (code-davinci-002)
  • Llama-2 (13B)
  • ChatGPT (gpt-3.5-turbo)
  • GPT-4

Metrics

  • Accuracy
  • token cost (# tokens)

Datasets

  • GSM8K
  • AQuA
  • SVAMP
  • ASDiv
  • TabMWP
  • BIG-Bench Date Understanding
  • BIG-Bench Object Counting
  • CommonsenseQA
  • StrategyQA
  • Sports Understanding (BIG-Bench)

Benchmarks

  • Arithmetic reasoning
  • Symbolic reasoning (BIG-Bench subsets)
  • Commonsense reasoning