Use an LLM to judge each reasoning step and guide stochastic beam search to reduce error accumulation

Overview

Decision SnapshotNeeds Validation

The method is practical for batch or offline multi-step tasks when you can access token logits and accept extra compute; gains are largest on long chains and structured reasoning.

Citations10

Evidence Strength0.70

Confidence0.80

Risk Signals9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, Qizhe Xie

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product relies on long multi-step reasoning (math QA, multi-hop extraction, program generation), adding a step-level LLM evaluator can raise correctness without model fine-tuning; expect higher compute and need for token-level access.

Who Should Care

Product Manager CTO ML Engineer Data Scientist Engineering Lead

Summary TLDR

The paper adds a stepwise self-evaluation signal to multi-step decoding. At each intermediate reasoning step, the same or another LLM scores whether the step looks correct. That correctness score is combined with the LM's generation probability to guide a stochastic beam-search over chains. The approach improves few-shot reasoning accuracy on arithmetic, symbolic and some commonsense tasks (notably for longer chains) but raises compute and needs access to token logits. Code is released.

Problem Statement

Long multi-step reasoning chains let errors accumulate and make the search space huge. Existing sampling or majority-vote fixes act at the final answer level and do not steer generation step-by-step. This paper seeks a lightweight, prompt-based way to control per-step correctness during decoding.

Main Contribution

A stepwise self-evaluation constraint that scores the correctness of each intermediate reasoning step using LLM prompts (multiple-choice style).

A stochastic beam-search decoding algorithm that combines generation probability and stepwise correctness to guide search over reasoning chains.

Key Findings

Self-evaluation guided decoding improves few-shot accuracy on major benchmarks with Codex backbone

NumbersGSM8K +6.34%; AQuA +9.56%; StrategyQA +5.46%

Practical UseExpect single-digit absolute accuracy gains on math and multi-hop QA when you can run stepwise evaluation with a strong LLM.

Evidence RefAbstract; Table 1

Method gives better performance for longer reasoning chains

NumbersGSM8K gains by reasoning length: +5.49% (<7 steps), +7.82% (7–9], +9.78% (≥9)

Practical UseApply stepwise evaluation especially for tasks that need many intermediate steps (multi-hop math, complex planning).

Evidence RefTable 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	85.5%	PAL + self-consistency 80.4%	+5.1pp	GSM8K	Ours-PAL 85.5% vs PAL,SC 80.4% (Table 1)	Table 1
Accuracy	64.2%	PAL,SC 54.7% / other baseline 58.6%	+9.56% (reported abstract)	AQuA	Ours-PAL 64.2% (Table 1, Abstract)	Table 1; Abstract

What To Try In 7 Days

Implement a simple multiple-choice self-eval prompt and evaluate it on a small set of chain-of-thought examples.

Run stochastic beam search with k=3–5 and n=2 rollouts (cheap setting) to compare to plain sampling/self-consistency.

Tune sampling temperature (γ) and sampling-decay (α) with validation: start γ∈[0.4,0.8], α=0.5, τ small for CoT, larger for PAL-like programs.

Optimization Features

Token Efficiency

tradeoff tuning (n, k, τ) to reduce tokens

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusYes

LicenseUnknown

Code URLs

https://guideddecoding.github.io/

Data URLs

GSM8K, AQuA, SVAMP, ASDiv, TabMWP, StrategyQA, CommonsenseQA (public benchmarks referenced in paper)

Risks & Boundaries

Limitations

Requires access to token-level likelihoods (logits), so not directly usable with some commercial APIs.

Additional compute from rollouts and evaluation can be substantial if n and k are large.

When Not To Use

APIs that don't return token logits (e.g., some ChatGPT/GPT-4 API variants).

Very low-latency or tight-cost production constraints.

Failure Modes

Evaluator gives high scores to plausible but logically flawed chains (false positives).

Evaluator is sensitive to small wording or variable-name changes and can under-score correct steps.

Core Entities

Models

Codex (code-davinci-002)Llama-2 (13B)ChatGPT (gpt-3.5-turbo)GPT-4

Metrics

Accuracytoken cost (# tokens)

Datasets

GSM8KAQuASVAMPASDivTabMWPBIG-Bench Date UnderstandingBIG-Bench Object CountingCommonsenseQAStrategyQASports Understanding (BIG-Bench)

Benchmarks

Arithmetic reasoningSymbolic reasoning (BIG-Bench subsets)Commonsense reasoning

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Self-evaluation guided decoding improves few-shot accuracy on major benchmarks with Codex backbone

Method gives better performance for longer reasoning chains

Results

What To Try In 7 Days

Optimization Features

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Make LLM judges reliable and cheaper: EIF and tuned PPI give tighter, valid evaluation with few labels

Key finding

Model rankings from LLM judges become less biased and get valid confidence intervals when you down-weight unreliable judges

Key finding

Large Llama models beat smaller ones on iCliniq medical QA; LLM judges help measure clinical quality

Key finding

A judge-free, n-gram benchmark that approximates GPT-4 judging for Japanese Q&A

Key finding

Professional multilingual TruthfulQA shows truth gaps across languages but smaller than expected

Key finding