Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
10
Why It Matters For Business
If your product relies on long multi-step reasoning (math QA, multi-hop extraction, program generation), adding a step-level LLM evaluator can raise correctness without model fine-tuning; expect higher compute and need for token-level access.
Summary TLDR
The paper adds a stepwise self-evaluation signal to multi-step decoding. At each intermediate reasoning step, the same or another LLM scores whether the step looks correct. That correctness score is combined with the LM's generation probability to guide a stochastic beam-search over chains. The approach improves few-shot reasoning accuracy on arithmetic, symbolic and some commonsense tasks (notably for longer chains) but raises compute and needs access to token logits. Code is released.
Problem Statement
Long multi-step reasoning chains let errors accumulate and make the search space huge. Existing sampling or majority-vote fixes act at the final answer level and do not steer generation step-by-step. This paper seeks a lightweight, prompt-based way to control per-step correctness during decoding.
Main Contribution
A stepwise self-evaluation constraint that scores the correctness of each intermediate reasoning step using LLM prompts (multiple-choice style).
A stochastic beam-search decoding algorithm that combines generation probability and stepwise correctness to guide search over reasoning chains.
Empirical gains on arithmetic, symbolic and some commonsense benchmarks; detailed hyperparameter and cost analyses.
Practical recipes (prompts, temperature schedules) and public code to reproduce the method.
Key Findings
Self-evaluation guided decoding improves few-shot accuracy on major benchmarks with Codex backbone
Method gives better performance for longer reasoning chains
Comparable or better cost-efficiency under matched token budgets with an open model
Correctness confidence (LLM self-eval) is more discriminatory than raw LM probability for detecting logic errors
There is non-trivial compute overhead when using many rollouts per beam
Results
Accuracy
Accuracy
Accuracy
Cost-efficiency (Llama-2, GSM8K)
Who Should Care
What To Try In 7 Days
Implement a simple multiple-choice self-eval prompt and evaluate it on a small set of chain-of-thought examples.
Run stochastic beam search with k=3–5 and n=2 rollouts (cheap setting) to compare to plain sampling/self-consistency.
Tune sampling temperature (γ) and sampling-decay (α) with validation: start γ∈[0.4,0.8], α=0.5, τ small for CoT, larger for PAL-like programs.
Optimization Features
Token Efficiency
- tradeoff tuning (n, k, τ) to reduce tokens
Reproducibility
Code Urls
Data Urls
- GSM8K, AQuA, SVAMP, ASDiv, TabMWP, StrategyQA, CommonsenseQA (public benchmarks referenced in paper)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Requires access to token-level likelihoods (logits), so not directly usable with some commercial APIs.
- Additional compute from rollouts and evaluation can be substantial if n and k are large.
- Less helpful for short reasoning chains or tasks where LMs already have low-diversity high-confidence outputs.
When Not To Use
- APIs that don't return token logits (e.g., some ChatGPT/GPT-4 API variants).
- Very low-latency or tight-cost production constraints.
- Short single-step QA where stepwise calibration has little effect.
Failure Modes
- Evaluator gives high scores to plausible but logically flawed chains (false positives).
- Evaluator is sensitive to small wording or variable-name changes and can under-score correct steps.
- If the base LM is overconfident and lacks diversity, the guided search may not find better alternatives.
Core Entities
Models
- Codex (code-davinci-002)
- Llama-2 (13B)
- ChatGPT (gpt-3.5-turbo)
- GPT-4
Metrics
- Accuracy
- token cost (# tokens)
Datasets
- GSM8K
- AQuA
- SVAMP
- ASDiv
- TabMWP
- BIG-Bench Date Understanding
- BIG-Bench Object Counting
- CommonsenseQA
- StrategyQA
- Sports Understanding (BIG-Bench)
Benchmarks
- Arithmetic reasoning
- Symbolic reasoning (BIG-Bench subsets)
- Commonsense reasoning

