Prune redundant reasoning tokens at inference to boost accuracy and shrink KV cache

Overview

Decision Snapshot: Needs Validation

The approach is a low-risk, inference-only change with clear gains on math benchmarks and measurable KV reductions; it requires models that expose visible chain-of-thought traces.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Daewon Choi, Jimin Lee, Jihoon Tack, Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati

Why It Matters For Business

You can boost reasoning accuracy and cut inference memory by changing only the decoder strategy, making long-form reasoning cheaper and more reliable without retraining.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Data Scientist

Summary TLDR

The paper removes redundant tokens from a model's reasoning trace at test time. It injects a short 'end-of-thinking' summarization prompt periodically, uses attention from that token to score prior tokens, segments the trace into reasoning steps, and evicts low-contribution tokens stepwise from the KV cache. This plug-and-play inference change improves accuracy on math reasoning benchmarks (e.g., AMC2023: 75.0→82.5) and reduces KV memory (≈10% reported) without retraining.

Problem Statement

Chain-of-thought reasoning traces often include repetitive or distracting intermediate text. These redundant tokens bloat KV cache memory and can distract the model, hurting final-answer quality. The paper asks whether on-the-fly eviction of low-contribution tokens can both improve reasoning accuracy and reduce memory use, without retraining.

Main Contribution

A test-time token-pruning method that uses a forced summarization prompt and attention to an end-of-thinking token (</think>) to score token importance.

A structure-aware, stepwise eviction policy that segments reasoning traces into steps and prioritizes pruning low-contribution steps over isolated tokens.
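
The two contributions above can be sketched as a scoring step followed by a per-step aggregation. This is a minimal illustration, not the authors' implementation: the input layout `attn_rows[h][i]` (head `h`'s attention weight from the injected `</think>` position to prior token `i`), averaging across heads, and using the mean token score as the step score are all assumptions made for the example.

```python
from typing import List


def token_importance(attn_rows: List[List[float]]) -> List[float]:
    """Score each prior token by the attention it receives from `</think>`.

    attn_rows[h][i] is a hypothetical input: head h's attention weight from
    the end-of-thinking query position to prior token i. Averaging over heads
    is an assumption; the source does not state the aggregation.
    """
    if not attn_rows:
        return []
    n_heads = len(attn_rows)
    n_tokens = len(attn_rows[0])
    return [sum(head[i] for head in attn_rows) / n_heads for i in range(n_tokens)]


def step_importance(scores: List[float], step_ids: List[int]) -> List[float]:
    """Mean token score per reasoning step; step_ids[i] is token i's step index."""
    n_steps = max(step_ids) + 1 if step_ids else 0
    totals = [0.0] * n_steps
    counts = [0] * n_steps
    for score, sid in zip(scores, step_ids):
        totals[sid] += score
        counts[sid] += 1
    # Steps with no tokens get a score of 0.0 rather than dividing by zero.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


if __name__ == "__main__":
    attn = [[0.1, 0.5, 0.4], [0.3, 0.1, 0.6]]  # 2 heads, 3 prior tokens
    scores = token_importance(attn)
    print(scores, step_importance(scores, [0, 0, 1]))
```

Scoring at the step level first is what makes the policy structure-aware: a whole low-contribution step becomes a pruning candidate before any isolated token in a useful step is touched.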

Key Findings

Plug-in pruning raises average accuracy for Qwen2.5-7B from 57.9% to 63.4% on six math benchmarks.

Numbers: 57.9% → 63.4% average (Table 1)

Practical Use: You can improve math-answer accuracy by switching to inference-time redundant-token pruning, without retraining.

Evidence Ref: Table 1

On AMC2023, Qwen2.5-7B accuracy improves from 75.0% to 82.5% while reducing KV cache size by ~10.3%.

Numbers: 75.0% → 82.5%; KV 5004 → 4488 (~10.3% reduction) (Table 1)

Practical Use: For hard competition-style problems, pruning yields large accuracy gains and memory savings, which is useful in production where both matter.

Evidence Ref: Table 1
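
The reported deltas can be checked with a few lines of arithmetic. The sketch below only recomputes the figures quoted above; the helper name `relative_reduction` is made up for the example.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Figures quoted in the findings above (Table 1 of the paper).
kv_reduction_pct = relative_reduction(5004, 4488)  # KV cache size, AMC2023
amc_gain_pp = 82.5 - 75.0                          # AMC2023 accuracy gain, percentage points
avg_gain_pp = 63.4 - 57.9                          # six-benchmark average gain, percentage points

if __name__ == "__main__":
    print(f"KV reduction: {kv_reduction_pct:.1f}%")
    print(f"AMC2023 gain: {amc_gain_pp:.1f} pp")
    print(f"Average gain: {avg_gain_pp:.1f} pp")
```

Note that the accuracy gains are absolute percentage-point differences, while the KV figure is a relative reduction against the FullKV baseline.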

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	57.9% → 63.4%	FullKV	+5.5 pp	Table 1 (MATH, Minerva, GaoKao, AIME2024/25, AMC2023)	Average accuracy improves from 57.9% (FullKV) to 63.4% (Ours).	Table 1
Accuracy	75.0% → 82.5%	FullKV	+7.5 pp	AMC2023	Substantial gain on competition-style problems.	Table 1

What To Try In 7 Days

Add a short summarization trigger (</think>) every ~200 tokens during decoding to score token importance.

Segment generated reasoning using simple markers (e.g., 'Wait', 'Alternatively', 'Thus') and compute per-step importance from attention to </think>.

Evict lowest-scoring tokens within low-importance steps under a fixed budget and resume generation; measure accuracy and KV size vs FullKV.
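
The three steps above can be prototyped offline on token lists before touching a real KV cache. This is a minimal sketch under stated assumptions: the function names are hypothetical, per-token scores are taken as given (for example from attention to `</think>`), and the eviction order (empty the lowest-scoring step first, lowest-scoring token first within it) is a simplification of the paper's budget allocation, which the summary does not spell out.

```python
from typing import Dict, List

# Markers taken from the text above; extend for your model's phrasing.
STEP_MARKERS = ("Wait", "Alternatively", "Thus")


def segment_steps(tokens: List[str]) -> List[int]:
    """Assign a step index to every token; a marker token opens a new step."""
    step_ids: List[int] = []
    current = 0
    for i, tok in enumerate(tokens):
        if i > 0 and tok.strip(" ,.:") in STEP_MARKERS:
            current += 1
        step_ids.append(current)
    return step_ids


def select_keep(scores: List[float], step_ids: List[int], budget: int) -> List[int]:
    """Return the sorted indices of tokens to keep under a fixed budget.

    Simplified policy (an assumption, not the paper's exact rule): rank tokens
    by (mean score of their step, own score) and evict from the bottom.
    """
    n = len(scores)
    if budget >= n:
        return list(range(n))
    totals: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for score, sid in zip(scores, step_ids):
        totals[sid] = totals.get(sid, 0.0) + score
        counts[sid] = counts.get(sid, 0) + 1
    step_mean = {sid: totals[sid] / counts[sid] for sid in totals}
    order = sorted(range(n), key=lambda i: (step_mean[step_ids[i]], scores[i], i))
    evicted = set(order[: n - max(budget, 0)])
    return [i for i in range(n) if i not in evicted]


if __name__ == "__main__":
    toks = ["Let", "x", "=", "2", "Wait", "maybe", "3", "Thus", "x", "=", "2"]
    ids = segment_steps(toks)
    sc = [0.2, 0.3, 0.1, 0.4, 0.01, 0.02, 0.03, 0.3, 0.2, 0.1, 0.5]
    kept = select_keep(sc, ids, budget=8)
    print([toks[i] for i in kept])
```

In a real decoder the kept indices would be used to gather the corresponding key/value entries in every layer before generation resumes; comparing accuracy and average cache length against the FullKV baseline then gives the measurement the last step asks for.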

Agent Features

Memory

KV cache eviction (short-term context compression)

Optimization Features

Token Efficiency

per-token importance scoring via attention to </think>

hierarchical budget allocation across reasoning steps

Infra Optimization

reduces KV memory footprint (helps batch throughput and edge deployment)

Inference Optimization

KV cache token eviction

step-aware context compression

periodic summarization-triggered pruning

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://huggingface.co/datasets/AI-MO/aimo-validation-aime

https://huggingface.co/datasets/HuggingFaceH4/aime_2024

https://huggingface.co/datasets/opencompass/AIME2025

Risks & Boundaries

Limitations

Requires models that output visible chain-of-thought or explicit <think> markers; not applicable to opaque reasoning models.

Only applied at test time; no training-time adaptation or joint learning of pruning.

When Not To Use

The model does not expose intermediate reasoning traces.

You cannot modify the decoding pipeline or insert summarization prompts.

Failure Modes

The summarization trigger can yield misleading attention scores, causing crucial tokens to be evicted and accuracy to drop.

Incorrect segmentation can group important content into low-scoring steps, leading to over-pruning.

Core Entities

Models

Qwen2.5-1.5B, Qwen2.5-7B, Llama3.1-8B, DeepSeek-R1-Distill family

Metrics

Accuracy, Average KV cache length (tokens), Compression ratio (%)

Datasets

MATH-500, Minerva Math, GaoKao, AIME2024, AIME2025, AMC2023, GPQA Diamond

Benchmarks

MATH-500, Minerva, AIME, AMC, GaoKao, GPQA Diamond
