Overview
The method is practical for batch or offline multi-step tasks when you can access token logits and accept extra compute; gains are largest on long chains and structured reasoning.
Citations10
Evidence Strength0.70
Confidence0.80
Risk Signals9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
If your product relies on long multi-step reasoning (math QA, multi-hop extraction, program generation), adding a step-level LLM evaluator can raise correctness without model fine-tuning; expect higher compute and need for token-level access.
Who Should Care
Summary TLDR
The paper adds a stepwise self-evaluation signal to multi-step decoding. At each intermediate reasoning step, the same or another LLM scores whether the step looks correct. That correctness score is combined with the LM's generation probability to guide a stochastic beam-search over chains. The approach improves few-shot reasoning accuracy on arithmetic, symbolic and some commonsense tasks (notably for longer chains) but raises compute and needs access to token logits. Code is released.
Problem Statement
Long multi-step reasoning chains let errors accumulate and make the search space huge. Existing sampling or majority-vote fixes act at the final answer level and do not steer generation step-by-step. This paper seeks a lightweight, prompt-based way to control per-step correctness during decoding.
Main Contribution
A stepwise self-evaluation constraint that scores the correctness of each intermediate reasoning step using LLM prompts (multiple-choice style).
A stochastic beam-search decoding algorithm that combines generation probability and stepwise correctness to guide search over reasoning chains.
Key Findings
Self-evaluation guided decoding improves few-shot accuracy on major benchmarks with Codex backbone
Method gives better performance for longer reasoning chains
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 85.5% | PAL + self-consistency 80.4% | +5.1pp | GSM8K | Ours-PAL 85.5% vs PAL,SC 80.4% (Table 1) | Table 1 |
| Accuracy | 64.2% | PAL,SC 54.7% / other baseline 58.6% | +9.56% (reported abstract) | AQuA | Ours-PAL 64.2% (Table 1, Abstract) | Table 1; Abstract |
What To Try In 7 Days
Implement a simple multiple-choice self-eval prompt and evaluate it on a small set of chain-of-thought examples.
Run stochastic beam search with k=3–5 and n=2 rollouts (cheap setting) to compare to plain sampling/self-consistency.
Tune sampling temperature (γ) and sampling-decay (α) with validation: start γ∈[0.4,0.8], α=0.5, τ small for CoT, larger for PAL-like programs.
Optimization Features
Token Efficiency
Reproducibility
Code URLs
Data URLs
Risks & Boundaries
Limitations
Requires access to token-level likelihoods (logits), so not directly usable with some commercial APIs.
Additional compute from rollouts and evaluation can be substantial if n and k are large.
When Not To Use
APIs that don't return token logits (e.g., some ChatGPT/GPT-4 API variants).
Very low-latency or tight-cost production constraints.
Failure Modes
Evaluator gives high scores to plausible but logically flawed chains (false positives).
Evaluator is sensitive to small wording or variable-name changes and can under-score correct steps.

