Train LLMs with binary feedback on every reasoning step to improve math accuracy and trustworthiness

January 18, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang

Links

Abstract / PDF

Why It Matters For Business

Stepwise binary feedback makes multi-step outputs more reliable and auditable, helping products that require trustworthy reasoning (education, tutoring, automated grading, math assistants). Expect measurable accuracy and traceability gains if you can invest in judge access and iterative finetuning.

Summary TLDR

Step-KTO is a finetuning recipe that uses binary labels for each intermediate reasoning step and for the final answer. A process reward model (PRM) labels steps as correct or incorrect, and a rule-based checker labels final answers; these signals are combined through a Kahneman–Tversky-style value function to guide updates. Results are reported on math benchmarks (MATH-500, AMC23, AIME24). On MATH-500, Step-KTO raises Pass@1 from 53.4% to 63.2% for Llama-3.1-8B-Instruct and cuts the share of correct-answer solutions that contain a flawed step from 27.3% to 19.9%. The method needs ground-truth solutions and judge access but yields more reliable, interpretable solutions.
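To make the value-function idea concrete, here is a minimal scalar sketch. It assumes the standard KTO form (a sigmoid of the policy/reference log-ratio measured against a reference point) and assumes the outcome and step terms are simply added with a hypothetical `step_weight`; this summary does not state the exact Step-KTO objective, so treat every name and default below as illustrative.

```python
import math
from typing import List, Tuple

# (log_ratio, ref_point, is_correct) for one labelled unit (a step or a final answer).
# log_ratio = log pi_theta(y|x) - log pi_ref(y|x); ref_point plays the role of KTO's z0.
Labelled = Tuple[float, float, bool]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def kto_loss(log_ratio: float, ref_point: float, desirable: bool,
             beta: float = 0.1, lam_d: float = 1.0, lam_u: float = 1.0) -> float:
    """KTO-style loss for a single binary-labelled sample (assumed form)."""
    if desirable:
        # Gains: value grows as the policy favours the sample more than the reference.
        value = lam_d * sigmoid(beta * (log_ratio - ref_point))
        return lam_d - value
    # Losses: value grows as the policy moves away from the undesirable sample.
    value = lam_u * sigmoid(beta * (ref_point - log_ratio))
    return lam_u - value


def step_kto_loss(outcome: Labelled, steps: List[Labelled],
                  step_weight: float = 1.0, **kwargs: float) -> float:
    """Outcome term plus a weighted mean of step terms (combination is an assumption)."""
    outcome_term = kto_loss(*outcome, **kwargs)
    if not steps:
        return outcome_term
    step_term = sum(kto_loss(*s, **kwargs) for s in steps) / len(steps)
    return outcome_term + step_weight * step_term
```

In a real training loop the log-ratios would come from batched model log-probabilities in a tensor library; the scalar version only shows how correct and incorrect labels pull the loss in opposite directions.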

Problem Statement

LLMs can get the right math answer for the wrong internal reasons. Current tuning often optimizes only final correctness. We need a training method that enforces correct intermediate steps as well as correct outcomes so solutions become more faithful and easier to trust.

Main Contribution

Propose Step-KTO: a finetuning objective that combines binary stepwise labels from a PRM with outcome-level binary correctness using a prospect-theory-inspired value function.

Show iterative Step-KTO training yields consistent, cumulative gains in math reasoning across multiple model sizes and datasets.

Provide empirical evidence that adding stepwise supervision reduces erroneous intermediate steps while improving final-answer accuracy on competitive math benchmarks.

Key Findings

Step-KTO increases single-run accuracy on MATH-500 for a Llama-3.1-8B-Instruct seed.

Numbers: Pass@1 53.4% → 63.2% (8B, M3)

Step-KTO improves larger models too.

Numbers: Llama-3.3-70B-Instruct: Pass@1 75.8% → 79.6% (evaluated checkpoints)

Step-KTO reduces the share of solutions that reach the correct final answer but contain step-level mistakes.

Numbers: Error-in-correct-solutions 27.3% → 19.9% (M0 → M3, 8B)

Iterative training yields steady gains over rounds.

Numbers: MATH-500 Pass@1 improves across iterations: 59.4% → 63.2% (Step-KTO M1 → M3, 8B)

Results

Pass@1 (MATH-500, Llama-3.1-8B-Instruct)

Value: 63.2%

Baseline: 53.4% (seed M0)

Pass@1 (MATH-500, Llama-3.3-70B-Instruct)

Value: 79.6%

Baseline: 75.8% (seed M0)

Stepwise error rate in correct solutions (MATH-500, 8B)

Value: 19.9% (Step-KTO M3)

Baseline: 27.3% (seed M0)

Iterative Pass@1 trend (MATH-500, 8B Step-KTO)

Value: 59.4% → 63.2% (M1 → M3)
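The reported pairs can be read either as percentage-point changes or as relative changes. The small helper below (a hypothetical name, not from the paper) does that arithmetic on the figures listed above.

```python
from typing import Tuple


def gain(baseline: float, value: float) -> Tuple[float, float]:
    """Return (percentage-point change, relative change in percent), rounded to 0.1."""
    absolute = value - baseline
    relative = absolute / baseline * 100.0
    return round(absolute, 1), round(relative, 1)


# Figures taken from the results listed above.
print(gain(53.4, 63.2))  # Pass@1, 8B, M0 -> M3
print(gain(75.8, 79.6))  # Pass@1, 70B, M0 -> evaluated checkpoint
print(gain(27.3, 19.9))  # stepwise error rate in correct solutions, 8B
```

For the 8B model this works out to +9.8 points (about 18.4% relative) in Pass@1, and the stepwise error rate falls by 7.4 points (about 27.1% relative).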

Who Should Care

What To Try In 7 Days

Set up a simple stepwise judge: prompt a strong LLM to label each reasoning step as correct/incorrect for a small math subset.
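One possible shape for such a judge is sketched below. It assumes steps are separated by blank lines and that `judge` is any callable you supply that sends a prompt to an LLM and returns its text reply; the prompt wording, the function names, and the parsing rule are all hypothetical and not taken from the paper.

```python
import re
from typing import Callable, List


def split_steps(solution: str) -> List[str]:
    """Assumption: one reasoning step per blank-line-separated paragraph."""
    return [part.strip() for part in re.split(r"\n\s*\n", solution) if part.strip()]


def build_prompt(problem: str, reference: str, steps: List[str], index: int) -> str:
    """Show the judge the problem, the reference, earlier steps, and the step to grade."""
    earlier = "\n".join(steps[:index]) or "(none)"
    return (
        f"Problem:\n{problem}\n\n"
        f"Reference solution:\n{reference}\n\n"
        f"Earlier steps:\n{earlier}\n\n"
        f"Step under review:\n{steps[index]}\n\n"
        "Reply with one word: correct or incorrect."
    )


def parse_label(reply: str) -> bool:
    """Conservative parse: anything that does not start with 'correct' counts as incorrect."""
    return reply.strip().lower().startswith("correct")


def label_steps(problem: str, reference: str, solution: str,
                judge: Callable[[str], str]) -> List[bool]:
    """Return one binary label per step, in order."""
    steps = split_steps(solution)
    return [parse_label(judge(build_prompt(problem, reference, steps, i)))
            for i in range(len(steps))]
```

Keeping the judge as a plain callable makes it easy to swap providers or to plug in a stub for testing before spending on real judge queries.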

Combine those binary step labels with a regex+sympy outcome checker, then finetune a small model for a few iterations with a loss that weights the step-level and outcome-level terms.
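A minimal outcome checker along those lines might look like the sketch below. It assumes final answers are wrapped in `\boxed{...}` and are written in plain expression syntax that `sympy.sympify` can parse (not arbitrary LaTeX); the function names are hypothetical, and the code falls back to string comparison when sympy is not installed.

```python
import re
from typing import Optional

try:
    import sympy
except ImportError:  # keep the checker usable without sympy (string match only)
    sympy = None

# Assumption: the final answer is the last \boxed{...} with no nested braces.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_answer(solution: str) -> Optional[str]:
    matches = BOXED.findall(solution)
    return matches[-1].strip() if matches else None


def answers_match(predicted: str, reference: str) -> bool:
    """Exact match after removing spaces, else symbolic equality via sympy."""
    if predicted.replace(" ", "") == reference.replace(" ", ""):
        return True
    if sympy is None:
        return False
    try:
        # sympify evaluates its input, so run it only in a sandboxed grading process.
        difference = sympy.sympify(predicted) - sympy.sympify(reference)
        return sympy.simplify(difference) == 0
    except Exception:
        # Unparseable answers are graded as incorrect rather than crashing the run.
        return False


def outcome_label(solution: str, reference: str) -> bool:
    answer = extract_answer(solution)
    return answer is not None and answers_match(answer, reference)
```

Answers with nested braces such as `\boxed{\frac{1}{2}}` are not handled by this regex and would be graded incorrect, which is one way the false negatives mentioned under Failure Modes can arise.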

Measure Pass@1 and the share of 'correct-final but flawed-step' cases; tune the step/outcome weight to balance accuracy and step fidelity.
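Both numbers are easy to compute once each sampled solution carries an outcome label and a list of step labels. The sketch below assumes one sample per problem and takes the flawed-step share over correct-answer solutions only; the `Graded` record is a hypothetical structure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Graded:
    final_correct: bool       # outcome label from the answer checker
    step_labels: List[bool]   # one label per reasoning step from the judge


def pass_at_1(records: List[Graded]) -> float:
    """Fraction of problems whose single sampled solution has a correct final answer."""
    return sum(r.final_correct for r in records) / len(records)


def flawed_correct_share(records: List[Graded]) -> float:
    """Among correct-answer solutions, the fraction with at least one incorrect step."""
    correct = [r for r in records if r.final_correct]
    if not correct:
        return 0.0
    flawed = sum(1 for r in correct if not all(r.step_labels))
    return flawed / len(correct)
```

Tracking both metrics per training round shows whether a change in the step/outcome weight trades accuracy for step fidelity or improves both.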

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on ground-truth solutions and a capable PRM; not applicable when step-by-step references are unavailable.
  • Outcome and step labels can be noisy (regex extraction edge cases, judge mistakes), which can limit final gains.
  • Requires substantial compute (authors used 64 H100 GPUs) for the reported iterative process.

When Not To Use

  • You lack verified ground-truth solutions for intermediate steps.
  • Compute or budget is too limited for iterative finetuning and judge queries.
  • Task outputs are not naturally decomposable into discrete reasoning steps.

Failure Modes

  • PRM bias or errors produce misleading step labels that degrade learning.
  • Outcome regex/sympy checks mis-evaluate valid but atypical answer formats (false negatives).
  • Model may exploit stepwise signal by repeating trivial correct-looking steps instead of substantive reasoning.

Core Entities

Models

  • Llama-3.1-8B-Instruct
  • Llama-3.1-70B-Instruct
  • Llama-3.3-70B-Instruct
  • Llama-3.1-70B-PRM
  • QwQ-32B-Preview
  • o1
  • Gemini 1.5 Pro
  • GPT-4o
  • Claude 3.5 Sonnet
  • Grok-Beta

Metrics

  • Pass@1
  • Maj@8
  • Stepwise error rate (errors in correct final answers)
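For readers unfamiliar with Maj@8: it is commonly computed by sampling eight solutions per problem and grading only the most frequent final answer. The sketch below assumes answers are already extracted and normalized to strings, with `None` for samples where no answer was found; the function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence


def majority_answer(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Most frequent non-missing answer; ties go to the answer seen first."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def maj_at_k(samples_per_problem: List[Sequence[Optional[str]]],
             references: List[str]) -> float:
    """Fraction of problems whose majority-voted answer equals the reference."""
    hits = sum(majority_answer(samples) == ref
               for samples, ref in zip(samples_per_problem, references))
    return hits / len(references)
```

With eight samples per problem this gives Maj@8; exact string equality is used here for brevity, and a symbolic comparison could be substituted.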

Datasets

  • MATH-500
  • AMC23
  • AIME24
  • NuminaMath
  • ProcessBench

Benchmarks

  • MATH-500
  • AMC23
  • AIME24
  • ProcessBench