Overview
The approach is a low-risk, inference-only change with clear gains on math benchmarks and measurable KV reductions; it requires models that expose visible chain-of-thought traces.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.75
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
You can boost reasoning accuracy and cut inference memory by changing only the decoding strategy, which makes long-form reasoning cheaper and more reliable without retraining.
Who Should Care
Summary TLDR
The paper removes redundant tokens from a model's reasoning trace at test time. It periodically injects a short summarization prompt with an end-of-thinking token (</think>), uses attention from that token to score prior tokens, segments the trace into reasoning steps, and evicts low-contribution tokens step by step from the KV cache. This plug-and-play inference change improves accuracy on math reasoning benchmarks (e.g., AMC2023: 75.0% → 82.5%) and reduces KV cache memory by about 10% (10.3% reported on AMC2023), without retraining.
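The control flow described above can be sketched as a decoding loop that prunes its cache at a fixed interval. This is a minimal illustration, not the paper's implementation: `next_token` and `prune` are hypothetical callables standing in for the model's decoding step and for the score-and-evict routine, and the cache is modeled as a plain list of token strings rather than real key/value tensors.

```python
from itertools import count
from typing import Callable, List


def decode_with_periodic_pruning(
    next_token: Callable[[List[str]], str],   # hypothetical: one decoding step
    prune: Callable[[List[str]], List[str]],  # hypothetical: score and evict
    max_new_tokens: int,
    interval: int = 200,
    stop_token: str = "</think>",
) -> List[str]:
    """Generate tokens, calling `prune` on the cache every `interval` tokens."""
    cache: List[str] = []
    for step in range(1, max_new_tokens + 1):
        token = next_token(cache)
        cache.append(token)
        if token == stop_token:
            break  # the model ended its reasoning on its own
        if step % interval == 0:
            cache = prune(cache)  # generation resumes from the pruned cache
    return cache


# Toy stand-ins: the "model" emits t0, t1, ...; "pruning" keeps the last 3 tokens.
counter = count()
result = decode_with_periodic_pruning(
    next_token=lambda cache: f"t{next(counter)}",
    prune=lambda cache: cache[-3:],
    max_new_tokens=10,
    interval=4,
)
print(result)
```

Keeping the pruning routine behind a callable makes it easy to compare eviction policies against the unpruned baseline without touching the loop itself.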
Problem Statement
Chain-of-thought reasoning traces often include repetitive or distracting intermediate text. These redundant tokens bloat KV cache memory and can distract the model, hurting final-answer quality. The paper asks whether on-the-fly eviction of low-contribution tokens can both improve reasoning accuracy and reduce memory use, without retraining.
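To make the memory argument concrete, the sketch below estimates KV cache size with the standard per-token formula (one key and one value vector per layer and KV head). The layer count, head count, head dimension, and 2-byte precision are illustrative placeholders, not values taken from the paper.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # assumption: 16-bit keys and values
) -> int:
    """Approximate KV cache size in bytes for one sequence."""
    # The factor 2 accounts for storing both a key and a value per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


# Placeholder model shape (an assumption for illustration only).
shape = dict(num_layers=28, num_kv_heads=4, head_dim=128)

full = kv_cache_bytes(seq_len=1000, **shape)
pruned = kv_cache_bytes(seq_len=900, **shape)  # 10% of tokens evicted
saving = 1 - pruned / full

print(f"full={full} bytes, pruned={pruned} bytes, saving={saving:.1%}")
```

Because the size is linear in the number of cached tokens, evicting a given fraction of tokens saves the same fraction of KV memory under these assumptions.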
Main Contribution
A test-time token-pruning method that uses a forced summarization prompt and attention to an end-of-thinking token (</think>) to score token importance.
A structure-aware, stepwise eviction policy that segments reasoning traces into steps and prioritizes pruning low-contribution steps over isolated tokens.
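The scoring idea in the first contribution can be sketched as follows. The sketch assumes the attention weights from the </think> query to each prior token are already available as one row per head, and it averages across heads; how the paper aggregates heads and layers is not stated here, so the mean is an assumption.

```python
from typing import List


def token_scores(attn_from_end_token: List[List[float]]) -> List[float]:
    """Average, over heads, the attention the end-of-thinking query
    pays to each prior token. Input shape: [num_heads][num_prior_tokens]."""
    num_heads = len(attn_from_end_token)
    num_tokens = len(attn_from_end_token[0])
    return [
        sum(head[j] for head in attn_from_end_token) / num_heads
        for j in range(num_tokens)
    ]


# Two hypothetical heads attending over four prior tokens.
attn = [
    [0.50, 0.10, 0.30, 0.10],
    [0.30, 0.10, 0.50, 0.10],
]
scores = token_scores(attn)

# Tokens ordered from least to most important; eviction candidates come first.
eviction_order = sorted(range(len(scores)), key=lambda j: scores[j])
print(scores, eviction_order)
```

Tokens 1 and 3 receive the least attention in this toy input, so they would be the first candidates for eviction.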
Key Findings
Plug-in pruning raises average accuracy for Qwen2.5-7B from 57.9% to 63.4% on six math benchmarks.
On AMC2023, Qwen2.5-7B accuracy improves from 75.0% to 82.5% while reducing KV cache size by ~10.3%.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 57.9% → 63.4% | FullKV | +5.5 pp | Table 1 (MATH, Minerva, GaoKao, AIME2024/25, AMC2023) | Average accuracy improves from 57.9% (FullKV) to 63.4% (Ours). | Table 1 |
| Accuracy | 75.0% → 82.5% | FullKV | +7.5 pp | AMC2023 | Substantial gain on competition-style problems. | Table 1 |
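The Delta column can be re-derived from the reported values. The short check below computes both the absolute gain in percentage points and the relative gain over the FullKV baseline; the relative figures are derived here and do not appear in the table.

```python
from typing import Dict, Tuple

# (FullKV baseline %, pruned %) as reported in the table above.
results: Dict[str, Tuple[float, float]] = {
    "Average (six benchmarks)": (57.9, 63.4),
    "AMC2023": (75.0, 82.5),
}


def deltas(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain in pp, relative gain in %), rounded to 0.1."""
    absolute_pp = round(ours - baseline, 1)
    relative_pct = round((ours - baseline) / baseline * 100, 1)
    return absolute_pp, relative_pct


summary = {name: deltas(*pair) for name, pair in results.items()}
for name, (pp, rel) in summary.items():
    print(f"{name}: +{pp} pp ({rel}% relative)")
```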
What To Try In 7 Days
Add a short summarization trigger (</think>) every ~200 tokens during decoding to score token importance.
Segment generated reasoning using simple markers (e.g., 'Wait', 'Alternatively', 'Thus') and compute per-step importance from attention to </think>.
Evict lowest-scoring tokens within low-importance steps under a fixed budget and resume generation; measure accuracy and KV size vs FullKV.
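The three steps above can be prototyped on plain token lists before touching a real KV cache. The sketch below is one possible reading, not the paper's exact policy: steps are split at the marker words named above, step importance is the mean token score (an assumption), and eviction walks steps from least to most important, dropping the lowest-scoring tokens first until the budget is met. The tokens and scores are made up for illustration.

```python
from typing import List, Sequence, Tuple

MARKERS = ("Wait", "Alternatively", "Thus")  # marker words named in the text


def segment_steps(
    tokens: Sequence[str], markers: Sequence[str] = MARKERS
) -> List[Tuple[int, int]]:
    """Split token indices into half-open [start, end) steps.
    A marker token opens a new step."""
    if not tokens:
        return []
    starts = [0] + [i for i, t in enumerate(tokens) if i > 0 and t in markers]
    ends = starts[1:] + [len(tokens)]
    return list(zip(starts, ends))


def select_kept(
    tokens: Sequence[str], scores: Sequence[float], budget: int
) -> List[int]:
    """Return indices of tokens to keep so that at most `budget` remain."""
    steps = segment_steps(tokens)
    # Rank steps by mean token score, least important first (assumption).
    ranked = sorted(steps, key=lambda s: sum(scores[s[0]:s[1]]) / (s[1] - s[0]))
    kept = set(range(len(tokens)))
    for start, end in ranked:
        # Inside a step, evict the lowest-scoring tokens first.
        for i in sorted(range(start, end), key=lambda i: scores[i]):
            if len(kept) <= budget:
                return sorted(kept)
            kept.discard(i)
    return sorted(kept)


tokens = ["Let", "x", "=", "2", "Wait", "hmm", "maybe", "Thus", "x", "=", "2"]
scores = [0.10, 0.12, 0.08, 0.15, 0.01, 0.02, 0.03, 0.14, 0.13, 0.11, 0.11]
kept = select_kept(tokens, scores, budget=7)
print([tokens[i] for i in kept])
```

With this toy input the whole low-scoring "Wait" step is evicted before any token from the other steps is touched, which is the behavior that distinguishes stepwise eviction from a flat top-k over tokens.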
Agent Features
Memory
Optimization Features
Token Efficiency
Infra Optimization
Inference Optimization
Risks & Boundaries
Limitations
Requires models that output visible chain-of-thought or explicit <think> markers; not applicable to opaque reasoning models.
Only applied at test time; no training-time adaptation or joint learning of pruning.
When Not To Use
The model does not expose intermediate reasoning traces.
You cannot modify the decoding pipeline or insert summarization prompts.
Failure Modes
The summarization trigger yields misleading attention scores, so crucial tokens are removed and accuracy drops.
Incorrect segmentation groups important content into low-score steps, leading to over-pruning.

