Prune redundant reasoning tokens at inference to boost accuracy and shrink KV cache

June 17, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The approach is a low-risk, inference-only change with clear gains on math benchmarks and measurable KV reductions; it requires models that expose visible chain-of-thought traces.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Daewon Choi, Jimin Lee, Jihoon Tack, Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati

Links

Abstract / PDF / Data

Why It Matters For Business

You can boost reasoning accuracy and cut inference memory by changing only the decoding strategy, making long-form reasoning cheaper and more reliable without retraining.

Who Should Care

Summary TLDR

The paper removes redundant tokens from a model's reasoning trace at test time. It injects a short 'end-of-thinking' summarization prompt periodically, uses attention from that token to score prior tokens, segments the trace into reasoning steps, and evicts low-contribution tokens stepwise from the KV cache. This plug-and-play inference change improves accuracy on math reasoning benchmarks (e.g., AMC2023: 75.0→82.5) and reduces KV memory (≈10% reported) without retraining.
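The scoring idea in the summary can be illustrated with a minimal sketch. It assumes the attention weights from the injected end-of-thinking token to every earlier token are already available as plain lists, one per head. The function names, the head-averaging rule, and the toy numbers are illustrative assumptions, not the paper's implementation.

```python
from typing import List


def token_importance(attn_from_trigger: List[List[float]]) -> List[float]:
    """Average, over heads, the attention the trigger token pays to each prior token.

    attn_from_trigger[h][i] is a hypothetical attention weight from the
    injected end-of-thinking token (as query) to prior token i in head h.
    Averaging over heads is an assumption made for this sketch.
    """
    num_heads = len(attn_from_trigger)
    num_tokens = len(attn_from_trigger[0])
    return [
        sum(head[i] for head in attn_from_trigger) / num_heads
        for i in range(num_tokens)
    ]


def keep_top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy example: two heads, four prior tokens (made-up weights).
attn = [
    [0.50, 0.10, 0.30, 0.10],
    [0.30, 0.10, 0.50, 0.10],
]
scores = token_importance(attn)
kept = keep_top_k(scores, k=2)  # tokens 1 and 3 would be evicted
```

In a real decoder the weights would come from the model's attention maps at the trigger position, and eviction would drop the corresponding key/value entries rather than list indices.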

Problem Statement

Chain-of-thought reasoning traces often include repetitive or distracting intermediate text. These redundant tokens bloat KV cache memory and can distract the model, hurting final-answer quality. The paper asks whether on-the-fly eviction of low-contribution tokens can both improve reasoning accuracy and reduce memory use, without retraining.

Main Contribution

A test-time token-pruning method that uses a forced summarization prompt and attention to an end-of-thinking token (</think>) to score token importance.

A structure-aware, stepwise eviction policy that segments reasoning traces into steps and prioritizes pruning low-contribution steps over isolated tokens.
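One way to picture a step-aware budget is the sketch below. It splits a fixed token budget across reasoning steps in proportion to each step's importance score, so low-contribution steps keep few tokens. The proportional rule, the leftover handling, and the name `allocate_budget` are assumptions chosen for illustration; the source does not specify the exact allocation formula.

```python
from typing import List


def allocate_budget(
    step_scores: List[float], step_lengths: List[int], total_budget: int
) -> List[int]:
    """Split total_budget across steps in proportion to step_scores.

    Assumes at least one positive score. Each step's share is rounded down
    and capped at the step's length; leftover slots then go to the
    highest-scoring steps that still have room.
    """
    total_score = sum(step_scores)
    budgets = [
        min(length, int(total_budget * score / total_score))
        for score, length in zip(step_scores, step_lengths)
    ]
    # Never hand out more slots than there are tokens.
    leftover = min(total_budget, sum(step_lengths)) - sum(budgets)
    order = sorted(
        range(len(step_scores)), key=lambda s: step_scores[s], reverse=True
    )
    while leftover > 0:
        for s in order:
            if leftover > 0 and budgets[s] < step_lengths[s]:
                budgets[s] += 1
                leftover -= 1
    return budgets


# Three steps with made-up scores and lengths, and a budget of 8 tokens.
budgets = allocate_budget([6.0, 1.0, 3.0], [4, 5, 3], total_budget=8)
```

Within each step, the tokens actually kept would then be the highest-scoring ones up to that step's budget.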

Key Findings

Plug-in pruning raises average accuracy for Qwen2.5-7B from 57.9% to 63.4% on six math benchmarks.

Numbers: 57.9% → 63.4% average (Table 1)

Practical Use: You can improve math answer accuracy by switching to inference-time redundant-token pruning, without retraining.

Evidence Ref: Table 1

On AMC2023, Qwen2.5-7B accuracy improves from 75.0% to 82.5% while reducing KV cache size by ~10.3%.

Numbers: 75.0% → 82.5%; KV 5004 → 4488 (~10.3%) (Table 1)

Practical Use: For hard competition-style problems, pruning yields large accuracy gains and memory savings, which is useful in production where both matter.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 57.9% → 63.4% | FullKV | +5.5 pp | Table 1 (MATH, Minerva, GaoKao, AIME2024/25, AMC2023) | Average accuracy improves from 57.9% (FullKV) to 63.4% (Ours). | Table 1
Accuracy | 75.0% → 82.5% | FullKV | +7.5 pp | AMC2023 | Substantial gain on competition-style problems. | Table 1
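The reported deltas can be re-derived with a few lines of arithmetic. This sketch assumes the KV figures quoted in Key Findings read as 5004 before pruning and 4488 after, which is the split consistent with the stated ~10.3%. The helper names are made up for the example.

```python
def pp_delta(baseline_pct: float, new_pct: float) -> float:
    """Absolute change in percentage points."""
    return new_pct - baseline_pct


def relative_reduction_pct(before: float, after: float) -> float:
    """Relative reduction, as a percentage of the starting value."""
    return (before - after) / before * 100


avg_gain = round(pp_delta(57.9, 63.4), 1)              # average over six benchmarks
amc_gain = round(pp_delta(75.0, 82.5), 1)              # AMC2023
kv_saving = round(relative_reduction_pct(5004, 4488), 1)  # assumed reading of the KV figures
```

The accuracy gains are absolute percentage points, while the KV figure is a relative reduction, so the two should not be compared directly.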

What To Try In 7 Days

Add a short summarization trigger (</think>) every ~200 tokens during decoding to score token importance.

Segment generated reasoning using simple markers (e.g., 'Wait', 'Alternatively', 'Thus') and compute per-step importance from attention to </think>.

Evict lowest-scoring tokens within low-importance steps under a fixed budget and resume generation; measure accuracy and KV size vs FullKV.
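The three steps above can be sketched end to end on toy data. This is a simplified greedy variant under stated assumptions: tokens are whitespace-split words, per-token scores are made up rather than read from attention to </think>, and step importance is the mean token score. All function names and constants are hypothetical.

```python
from typing import List

MARKERS = {"Wait", "Alternatively", "Thus"}  # markers named in the text
TRIGGER_INTERVAL = 200  # tokens between summarization triggers (the "~200" above)


def should_trigger(num_generated: int) -> bool:
    """True when it is time to inject the summarization trigger."""
    return num_generated > 0 and num_generated % TRIGGER_INTERVAL == 0


def segment_steps(tokens: List[str]) -> List[List[int]]:
    """Group token indices into steps; a new step starts at each marker word."""
    steps: List[List[int]] = []
    for i, tok in enumerate(tokens):
        if not steps or tok.strip(",.") in MARKERS:
            steps.append([])
        steps[-1].append(i)
    return steps


def evict(steps: List[List[int]], scores: List[float], budget: int) -> List[int]:
    """Return kept indices, dropping low-scoring tokens from low-importance steps first."""
    importance = [sum(scores[i] for i in step) / len(step) for step in steps]
    kept = {i for step in steps for i in step}
    for s in sorted(range(len(steps)), key=lambda s: importance[s]):
        for i in sorted(steps[s], key=lambda i: scores[i]):
            if len(kept) <= budget:
                return sorted(kept)
            kept.discard(i)
    return sorted(kept)


tokens = "Let x = 2 Wait maybe x = 3 Thus x = 2".split()
scores = [0.10, 0.09, 0.08, 0.12, 0.01, 0.01, 0.02, 0.02, 0.03,
          0.12, 0.13, 0.14, 0.13]  # made-up importance scores
steps = segment_steps(tokens)
kept = evict(steps, scores, budget=9)  # most of the "Wait ..." detour is dropped
```

To compare against FullKV, run the same prompts with eviction disabled and record accuracy and average KV length for both settings.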

Agent Features

Memory
KV cache eviction (short-term context compression)

Optimization Features

Token Efficiency
per-token importance scoring via attention to </think>
hierarchical budget allocation across reasoning steps
Infra Optimization
reduces KV memory footprint (helps batch throughput and edge deployment)
Inference Optimization
KV cache token eviction
step-aware context compression
periodic summarization-triggered pruning
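To relate cached token counts to memory, the standard KV cache size estimate is 2 (keys and values) × layers × KV heads × head dimension × bytes per element × sequence length. The sketch below applies it with hypothetical model dimensions that are not taken from the source. The sequence lengths assume the Key Findings KV figures read as 5004 and 4488 tokens.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int = 28,     # hypothetical value for illustration
    num_kv_heads: int = 4,    # hypothetical value for illustration
    head_dim: int = 128,      # hypothetical value for illustration
    bytes_per_elem: int = 2,  # 16-bit floats
) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


full = kv_cache_bytes(5004)    # assumed average FullKV length
pruned = kv_cache_bytes(4488)  # assumed average pruned length
saved_mib = (full - pruned) / 2**20
```

Because the estimate is linear in sequence length, the relative saving equals the relative reduction in cached tokens whatever the model dimensions are.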

Risks & Boundaries

Limitations

Requires models that output visible chain-of-thought or explicit <think> markers; not applicable to opaque reasoning models.

Only applied at test time; no training-time adaptation or joint learning of pruning.

When Not To Use

The model does not expose intermediate reasoning traces.

You cannot modify the decoding pipeline or insert summarization prompts.

Failure Modes

The summarization trigger can yield misleading attention scores, causing crucial tokens to be removed and accuracy to drop.

Incorrect segmentation groups important content into low-score steps, leading to over-pruning.

Core Entities

Models

Qwen2.5-1.5B, Qwen2.5-7B, Llama3.1-8B, DeepSeek-R1-Distill family

Metrics

Accuracy, Average KV cache length (tokens), Compression ratio (%)

Datasets

MATH-500, Minerva Math, GaoKao, AIME2024, AIME2025, AMC2023, GPQA Diamond

Benchmarks

MATH-500, Minerva, AIME, AMC, GaoKao, GPQA Diamond