Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
You can improve reasoning accuracy and reduce inference memory by changing only the decoding strategy, which makes long-form reasoning cheaper and more reliable without retraining.
Summary TLDR
The paper removes redundant tokens from a model's reasoning trace at test time. It periodically injects a short summarization prompt ending in the end-of-thinking token (</think>), uses the attention from that token to score prior tokens, segments the trace into reasoning steps, and evicts low-contribution tokens step by step from the KV cache. This plug-and-play inference change improves accuracy on math reasoning benchmarks (e.g., AMC2023: 75.0→82.5) and reduces KV cache size (about 10% reported) without retraining.
Problem Statement
Chain-of-thought reasoning traces often include repetitive or distracting intermediate text. These redundant tokens bloat KV cache memory and can distract the model, hurting final-answer quality. The paper asks whether on-the-fly eviction of low-contribution tokens can both improve reasoning accuracy and reduce memory use, without retraining.
Main Contribution
A test-time token-pruning method that uses a forced summarization prompt and attention to an end-of-thinking token (</think>) to score token importance.
A structure-aware, stepwise eviction policy that segments reasoning traces into steps and prioritizes pruning low-contribution steps over isolated tokens.
Extensive evaluation showing accuracy gains on math reasoning benchmarks and KV cache savings, all without model retraining.
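The scoring idea in the first contribution can be illustrated with a minimal sketch. It assumes you already have, for each attention head, the weights that the </think> query position assigns to every prior token. The function name `score_tokens` and the choice to average over heads are assumptions made for illustration; this summary does not state how the paper aggregates attention.

```python
from typing import List


def score_tokens(attn_from_probe: List[List[float]]) -> List[float]:
    """Average the attention that the probe token pays to each prior token.

    attn_from_probe[h][i] is the attention weight from the probe
    (</think>) query to prior token i in head h.
    Averaging over heads is an assumption, not the paper's stated rule.
    """
    if not attn_from_probe:
        return []
    n_heads = len(attn_from_probe)
    n_tokens = len(attn_from_probe[0])
    if any(len(row) != n_tokens for row in attn_from_probe):
        raise ValueError("every head must cover the same number of tokens")
    return [
        sum(row[i] for row in attn_from_probe) / n_heads
        for i in range(n_tokens)
    ]
```

A higher score means the summarization probe relied more on that token, so it is a weaker candidate for eviction.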
Key Findings
Plug-in pruning raises average accuracy for Qwen2.5-7B from 57.9% to 63.4% on six math benchmarks.
On AMC2023, Qwen2.5-7B accuracy improves from 75.0% to 82.5% while reducing KV cache size by ~10.3%.
At 50% KV budget, the method preserves ~94% of full-KV accuracy on MATH-500.
Both components (self-summarization and step-aware eviction) are required for the best gains.
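To put the reported accuracy numbers on a common footing, the short sketch below converts a before/after pair into an absolute gain in percentage points and a relative gain. The helper name `gains` is hypothetical; the inputs are the figures quoted above.

```python
from typing import Tuple


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    points = after - before
    relative = 100.0 * points / before
    return round(points, 1), round(relative, 1)


# AMC2023, Qwen2.5-7B: 75.0% -> 82.5%
amc_points, amc_relative = gains(75.0, 82.5)
# Six-benchmark average, Qwen2.5-7B: 57.9% -> 63.4%
avg_points, avg_relative = gains(57.9, 63.4)
```

With these inputs the AMC2023 gain is 7.5 points (10.0% relative) and the six-benchmark average gain is 5.5 points (9.5% relative).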
Results
[Charts not reproduced: accuracy comparisons; KV cache size (AMC2023, Qwen2.5-7B, average tokens).]
Who Should Care
What To Try In 7 Days
Add a short summarization trigger (</think>) every ~200 tokens during decoding to score token importance.
Segment generated reasoning using simple markers (e.g., 'Wait', 'Alternatively', 'Thus') and compute per-step importance from attention to </think>.
Evict lowest-scoring tokens within low-importance steps under a fixed budget and resume generation; measure accuracy and KV size vs FullKV.
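The segmentation and per-step scoring steps above can be prototyped in a few lines. This is a sketch under stated assumptions: tokens are plain strings, a marker token opens a new step, and a step's importance is the mean of its token scores. The function names and the mean aggregation are illustrative choices, not the paper's confirmed procedure.

```python
from typing import List, Sequence, Tuple

# Marker words taken from the examples in the text; a real marker set may differ.
MARKERS = ("Wait", "Alternatively", "Thus")


def segment_steps(
    tokens: Sequence[str], markers: Sequence[str] = MARKERS
) -> List[Tuple[int, int]]:
    """Split token indices into half-open [start, end) steps.

    A marker token starts a new step (assumed convention).
    """
    spans = []
    start = 0
    for i, tok in enumerate(tokens):
        if i > start and tok.strip() in markers:
            spans.append((start, i))
            start = i
    if start < len(tokens):
        spans.append((start, len(tokens)))
    return spans


def step_importance(
    scores: Sequence[float], spans: Sequence[Tuple[int, int]]
) -> List[float]:
    """Mean token score per step (mean aggregation is an assumption)."""
    return [sum(scores[s:e]) / (e - s) for s, e in spans]
```

For example, the trace `["Let", "x", "=", "2", "Wait", "x", "=", "3", "Thus", "answer", "3"]` splits into three steps starting at "Let", "Wait", and "Thus".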
Agent Features
Memory
- KV cache eviction (short-term context compression)
Optimization Features
Token Efficiency
- per-token importance scoring via attention to </think>
- hierarchical budget allocation across reasoning steps
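One way to picture hierarchical budget allocation is a two-level scheme: first split the keep-budget across steps, then keep the top-scoring tokens inside each step. The sketch below assumes a proportional split by step score with leftover slots handed to the highest-scoring steps. Both function names and the proportional rule are assumptions made for illustration; the paper's actual allocation rule is not given in this summary.

```python
from typing import List, Sequence, Tuple


def allocate_budget(
    step_scores: Sequence[float], step_lengths: Sequence[int], budget: int
) -> List[int]:
    """Split a token keep-budget across steps (assumed: proportional to step score)."""
    n = len(step_scores)
    if n == 0:
        return []
    total = sum(step_scores)
    if total <= 0:
        # No signal: fall back to an even split.
        quotas = [min(length, budget // n) for length in step_lengths]
    else:
        quotas = [
            min(length, max(0, int(budget * s / total)))
            for s, length in zip(step_scores, step_lengths)
        ]
    # Hand any unassigned slots to the highest-scoring steps that have room.
    leftover = min(budget, sum(step_lengths)) - sum(quotas)
    for idx in sorted(range(n), key=lambda j: step_scores[j], reverse=True):
        if leftover <= 0:
            break
        extra = min(leftover, step_lengths[idx] - quotas[idx])
        quotas[idx] += extra
        leftover -= extra
    return quotas


def select_kept(
    scores: Sequence[float], spans: Sequence[Tuple[int, int]], quotas: Sequence[int]
) -> List[int]:
    """Keep the top-scoring tokens within each step, up to that step's quota."""
    kept = []
    for (start, end), quota in zip(spans, quotas):
        ranked = sorted(range(start, end), key=lambda i: scores[i], reverse=True)
        kept.extend(ranked[:quota])
    return sorted(kept)
```

Token indices not returned by `select_kept` are the ones that would be evicted from the KV cache.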
Infra Optimization
- reduces KV memory footprint (helps batch throughput and edge deployment)
Inference Optimization
- KV cache token eviction
- step-aware context compression
- periodic summarization-triggered pruning
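The periodic trigger can be sketched as a decoding loop that hands the cache to a pruning callback every fixed number of new tokens. Everything here is a stand-in: `next_token` and `prune` are hypothetical callables, the cache is a plain list rather than real key/value tensors, and the default interval of 200 mirrors the "every ~200 tokens" suggestion in the text. In a real pipeline, `prune` would inject the summarization prompt, read attention from </think>, and evict entries.

```python
from typing import Callable, List, Optional


def decode_with_periodic_pruning(
    next_token: Callable[[List[str]], Optional[str]],
    prune: Callable[[List[str]], List[str]],
    interval: int = 200,
    max_new_tokens: int = 4096,
) -> List[str]:
    """Generate tokens and call `prune` on the cache every `interval` new tokens."""
    cache: List[str] = []
    since_prune = 0
    for _ in range(max_new_tokens):
        tok = next_token(cache)  # the next token depends on the (pruned) cache
        if tok is None:          # the stub signals end of generation with None
            break
        cache.append(tok)
        since_prune += 1
        if since_prune == interval:
            cache = prune(cache)
            since_prune = 0
    return cache
```

Because generation resumes from the pruned cache, later tokens are conditioned only on what survived eviction, which is where both the memory saving and the risk of over-pruning come from.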
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires models that output visible chain-of-thought or explicit <think> markers; not applicable to opaque reasoning models.
- Only applied at test time; no training-time adaptation or joint learning of pruning.
- Most experiments focus on symbolic/math reasoning; generalization to open-domain or multimodal tasks is unproven.
- Segmentation relies on a fixed marker set and may mis-segment diverse writing styles.
When Not To Use
- The model does not expose intermediate reasoning traces.
- You cannot modify the decoding pipeline or insert summarization prompts.
- A real-time latency budget rules out periodic summarization triggers.
Failure Modes
- The summarization trigger yields misleading attention, so crucial tokens are removed and accuracy drops.
- Incorrect segmentation groups important content into low-score steps, leading to over-pruning.
- Small or low-capacity models show mixed or negative gains (see Qwen2.5-1.5B results).
Core Entities
Models
- Qwen2.5-1.5B
- Qwen2.5-7B
- Llama3.1-8B
- DeepSeek-R1-Distill family
Metrics
- Accuracy
- Average KV cache length (tokens)
- Compression ratio (%)
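The two KV metrics can be computed as shown in the sketch below. It assumes that compression ratio means the percentage of KV cache entries removed relative to the full cache; this summary does not give the paper's exact definition, so treat the formula and the helper names as assumptions. The token counts in the example are illustrative, not reported values.

```python
from typing import Sequence


def average_kv_length(lengths: Sequence[int]) -> float:
    """Mean KV cache length in tokens across evaluation examples."""
    if not lengths:
        raise ValueError("need at least one example")
    return sum(lengths) / len(lengths)


def compression_ratio(full_kv_tokens: float, pruned_kv_tokens: float) -> float:
    """Percent of KV entries removed relative to the full cache (assumed definition)."""
    if full_kv_tokens <= 0:
        raise ValueError("full_kv_tokens must be positive")
    return 100.0 * (1.0 - pruned_kv_tokens / full_kv_tokens)


# Illustrative numbers only: going from 1000 to 897 cached tokens
# is a reduction of roughly 10.3%.
example_ratio = compression_ratio(1000, 897)
```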
Datasets
- MATH-500
- Minerva Math
- GaoKao
- AIME2024
- AIME2025
- AMC2023
- GPQA Diamond
Benchmarks
- MATH-500
- Minerva
- AIME
- AMC
- GaoKao
- GPQA Diamond

