Do multi-step math without long traces: refine compact latent anchors and stop when stable

March 16, 2026 · 7 min

Overview

Production Readiness: 0.6

Novelty Score: 0.6

Cost Impact Score: 0.7

Citation Count: 0

Authors

Disha Sheshanarayana, Rajat Subhra Pal, Manjira Sinha, Tirthankar Dasgupta

Why It Matters For Business

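AdaAnchor cuts generated output tokens by about 92–93% relative to token-level Chain-of-Thought and roughly halves the average number of silent refinement steps relative to a fixed-step budget. That lowers token billing and output bandwidth for applications that only need final answers (e.g., calculators, automated graders). Accuracy matches or slightly improves on fixed-step latent refinement, though in the reported Qwen results it remains below token-level CoT.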

Summary TLDR

AdaAnchor moves multi-step reasoning into a small set of learnable latent vectors (anchors) that the model refines silently. It stops refining each example once the anchor updates stabilize, so easy problems use fewer refinement steps. On three math benchmarks with small backbones, adaptive halting cuts average latent steps by about half and reduces generated tokens by ~92–93% versus token-level Chain-of-Thought, while matching or slightly improving accuracy versus fixed-step latent refinement.

Problem Statement

Token-level chain-of-thought helps LLMs reason but costs many output tokens and adds latency. Existing latent reasoning methods use a fixed number of silent refinement steps, which adds a hyperparameter and wastes compute on easy inputs. The paper asks: can a compact latent state be refined adaptively per example and halted once it has converged, saving both computation and output tokens?

Main Contribution

AdaAnchor: attach m learnable latent anchor vectors to the input and iteratively refine them through repeated forward passes, keeping the output answer-only.

Stability-based adaptive halting: stop refinement when the mean anchor vector change (cosine distance) stays below a threshold for s consecutive steps, enabling per-example compute allocation under a shared max budget.
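
The halting rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `refine_step` is a hypothetical stand-in for one silent forward pass that returns updated anchors, and the parameter names `tau`, `patience`, and `k_max` (for the threshold, s, and the shared maximum budget) as well as their default values are chosen here for illustration. Averaging the per-anchor cosine distance in exactly this way is also an assumption.

```python
import math
from typing import Callable, List, Tuple

Anchors = List[List[float]]  # m anchor vectors, each of dimension d


def cosine_distance(u: List[float], v: List[float]) -> float:
    """Return 1 - cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.0 if norm == 0.0 else 1.0 - dot / norm


def mean_anchor_change(prev: Anchors, curr: Anchors) -> float:
    """Mean cosine distance between corresponding anchors of two iterations."""
    return sum(cosine_distance(p, c) for p, c in zip(prev, curr)) / len(prev)


def refine_until_stable(
    anchors: Anchors,
    refine_step: Callable[[Anchors], Anchors],  # hypothetical: one silent pass
    tau: float = 1e-3,   # assumed threshold value, for illustration only
    patience: int = 2,   # "s" consecutive stable steps
    k_max: int = 8,      # shared maximum refinement budget
) -> Tuple[Anchors, int]:
    """Refine anchors until the mean change stays below tau for `patience`
    consecutive steps, or until k_max steps have been used."""
    stable = 0
    for step in range(1, k_max + 1):
        updated = refine_step(anchors)
        change = mean_anchor_change(anchors, updated)
        anchors = updated
        # Count consecutive stable steps; any large update resets the count.
        stable = stable + 1 if change < tau else 0
        if stable >= patience:
            return anchors, step
    return anchors, k_max
```

With an identity `refine_step` the loop halts after `patience` steps, because every update has zero change; with a step that never settles it runs to `k_max`. The consecutive-step counter is what keeps a single small update from ending refinement prematurely.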

Implementation recipe: freeze backbone, train only anchors + small projector and LoRA adapters; evaluate on GSM8K, SVAMP, MultiArith using two small backbones (Qwen2.5-1.5B, Llama-3.2-1B).

Empirical finding: adaptive halting reduces average latent steps ~48–61% vs fixed K and cuts generated tokens by ~92–93% vs token-level CoT while maintaining or improving accuracy in several settings.

Key Findings

Adaptive halting sharply reduces average latent refinement steps compared to a fixed K budget.

Numbers: Avg steps reduced ~48–61% (Table 2; adaptive 3.23–4.12 vs fixed 8)

Adaptive AdaAnchor can modestly improve accuracy over fixed-step latent refinement.

Numbers: Up to ~5 percentage points of absolute accuracy gain over fixed-step refinement under the same max budget (reported across datasets)

Shifting reasoning into latent anchors cuts generated tokens drastically versus token-level CoT.

Numbers: Generated tokens reduced by ~92–93% (e.g., GSM8K CoT 28.27 tokens → AdaAnchor adaptive 2.17 tokens)

Results

Accuracy

Value: Qwen adaptive 16.0%, Qwen fixed K=8 16.0%

Baseline: CoT 20.0% (Qwen)

Accuracy

Value: Qwen adaptive 55.2% vs fixed K=8 50.5%

Baseline: CoT 59.3% (Qwen)

Average output tokens

Value: CoT ~25–30 tokens → AdaAnchor adaptive ~2.1–2.8 tokens

Baseline: CoT

Average latent refinement steps

Value: Adaptive: ~3.1–4.12 steps vs Fixed K=8: 8 steps

Baseline: Fixed-step K=8

Who Should Care

What To Try In 7 Days

Add a small set (m) of learnable anchor embeddings to a frozen small backbone and train only anchors + LoRA on your dataset.

Implement the cosine-change halting rule: stop after s consecutive steps with update < τ and enforce a shared K_max.

Compare answer-only token counts and average refinement steps vs your current CoT pipeline to estimate cost savings.
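
For the comparison step, a small helper along these lines may be enough to get a first estimate. It is a sketch, not part of the paper: the function names and the dictionary keys are hypothetical, and the inputs are per-example counts you would log from your own CoT and anchor-based runs.

```python
from statistics import mean
from typing import Dict, Sequence


def percent_reduction(baseline: float, new: float) -> float:
    """Percentage by which `new` is smaller than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - new) / baseline


def compare_pipelines(
    cot_token_counts: Sequence[float],     # generated tokens per example, CoT
    anchor_token_counts: Sequence[float],  # generated tokens per example, answer-only
    anchor_step_counts: Sequence[float],   # refinement steps used per example
    k_max: int,                            # fixed-step budget to compare against
) -> Dict[str, float]:
    """Summarize token and refinement-step savings as percentages."""
    return {
        "token_reduction_pct": percent_reduction(
            mean(cot_token_counts), mean(anchor_token_counts)
        ),
        "step_reduction_pct": percent_reduction(k_max, mean(anchor_step_counts)),
    }


# Averages quoted in this summary for GSM8K: 28.27 vs 2.17 tokens, 3.23 vs 8 steps.
print(compare_pipelines([28.27], [2.17], [3.23], k_max=8))
```

Plugging in the GSM8K averages quoted in this summary gives a token reduction of about 92.3% and a step reduction of about 59.6%. Note that a lower output-token count is not the same as lower total compute: each silent refinement step is still a forward pass, so it is worth tracking both numbers.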

Optimization Features

Token Efficiency

  • reduces generated tokens by ~92–93% vs CoT

System Optimization

  • per-instance compute allocation under shared K_max

Training Optimization

  • LoRA
  • freeze backbone, train anchors and projector

Inference Optimization

  • adaptive halting to cut average refinement iterations
  • silent latent refinement to avoid token-level decoding

Reproducibility

Data Urls

  • GSM8K
  • SVAMP
  • MultiArith

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Halting uses a hand-tuned cosine-change threshold (τ) and patience (s) that may need per-deployment tuning.
  • Anchors are not directly interpretable, so you lose readable rationales for auditing or debugging.
  • Experiments run on small LMs; behavior on large production models or non-math tasks is untested.

When Not To Use

  • When you need human-readable step-by-step rationales for audits or user-facing explanations.
  • When distribution shift is likely and halting hyperparameters are not robustly validated.
  • If your deployment requires proven behavior on large models and diverse tasks (paper tested small backbones).

Failure Modes

  • Halting too early on hard or atypical inputs, producing wrong answers without additional refinement.
  • Halting too late on easy inputs if thresholds are poorly set, wasting compute.
  • Anchors converging to spurious states that do not reflect correct intermediate reasoning.
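
One way to probe the early/late halting failure modes before deployment is to log the per-step anchor change for a validation set and replay different thresholds offline. The sketch below assumes such logs exist as lists of floats (one mean cosine change per refinement step); the function names and the example values are hypothetical and not taken from the paper.

```python
from typing import Dict, Sequence


def halting_step(
    changes: Sequence[float], tau: float, patience: int, k_max: int
) -> int:
    """Replay a logged trace of per-step changes and return the step at which
    the stability rule would halt (k_max if it never triggers)."""
    stable = 0
    for step, change in enumerate(changes[:k_max], start=1):
        stable = stable + 1 if change < tau else 0
        if stable >= patience:
            return step
    return k_max


def sweep_tau(
    traces: Sequence[Sequence[float]],
    taus: Sequence[float],
    patience: int = 2,
    k_max: int = 8,
) -> Dict[float, float]:
    """Average halting step over all logged traces for each candidate tau."""
    return {
        tau: sum(halting_step(t, tau, patience, k_max) for t in traces) / len(traces)
        for tau in taus
    }


# Illustrative, made-up traces: one slow-converging example, one trivial one.
traces = [[0.3, 0.1, 0.0005, 0.0001], [0.0, 0.0]]
print(sweep_tau(traces, taus=[1e-3, 0.2]))
```

Pairing the average halting step for each candidate τ with validation accuracy at that τ would show where refinement is being cut short on hard inputs and where it runs longer than needed on easy ones.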

Core Entities

Models

  • Qwen2.5-1.5B
  • Llama-3.2-1B

Metrics

  • Accuracy
  • Average Tokens
  • Average Steps

Datasets

  • GSM8K
  • SVAMP
  • MultiArith

Benchmarks

  • GSM8K
  • SVAMP
  • MultiArith