Prune redundant reasoning tokens at inference to boost accuracy and shrink KV cache

June 17, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Daewon Choi, Jimin Lee, Jihoon Tack, Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati

Links

Abstract / PDF

Why It Matters For Business

You can boost reasoning accuracy and cut inference memory by changing only the decoding strategy, which makes long-form reasoning cheaper and more reliable without retraining.

Summary TLDR

The paper removes redundant tokens from a model's reasoning trace at test time. It injects a short 'end-of-thinking' summarization prompt periodically, uses attention from that token to score prior tokens, segments the trace into reasoning steps, and evicts low-contribution tokens stepwise from the KV cache. This plug-and-play inference change improves accuracy on math reasoning benchmarks (e.g., AMC2023: 75.0→82.5) and reduces KV memory (≈10% reported) without retraining.

Problem Statement

Chain-of-thought reasoning traces often include repetitive or distracting intermediate text. These redundant tokens bloat KV cache memory and can distract the model, hurting final-answer quality. The paper asks whether on-the-fly eviction of low-contribution tokens can both improve reasoning accuracy and reduce memory use, without retraining.

Main Contribution

A test-time token-pruning method that uses a forced summarization prompt and attention to an end-of-thinking token (</think>) to score token importance.

A structure-aware, stepwise eviction policy that segments reasoning traces into steps and prioritizes pruning low-contribution steps over isolated tokens.

Extensive evaluation showing accuracy gains on math reasoning benchmarks and KV cache savings, all without model retraining.
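The scoring idea in the first contribution can be illustrated with a minimal sketch. It assumes you already have, for each attention head, the weights that the </think> query position assigns to every earlier token. The function name `token_importance`, the nested-list layout, and the choice to average over heads are all assumptions made for illustration; this summary does not specify how the paper aggregates across heads or layers.

```python
from typing import List


def token_importance(attn: List[List[float]]) -> List[float]:
    """Average the attention that the </think> query gives each prior token.

    attn[h][t] is the (hypothetical) weight head h assigns from the
    </think> position to earlier token t. Averaging over heads is an
    assumption of this sketch, not a detail taken from the paper.
    """
    if not attn:
        return []
    num_heads = len(attn)
    num_tokens = len(attn[0])
    return [sum(head[t] for head in attn) / num_heads for t in range(num_tokens)]


# Toy input: 2 heads attending over 4 earlier tokens.
attn = [
    [0.1, 0.6, 0.2, 0.1],
    [0.3, 0.4, 0.2, 0.1],
]
scores = token_importance(attn)
# The token with the smallest score is the first eviction candidate.
weakest = min(range(len(scores)), key=lambda t: scores[t])
```

In a real pipeline the weights would come from the model's attention outputs at the step where the summarization trigger is injected.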

Key Findings

Plug-in pruning raises average accuracy for Qwen2.5-7B from 57.9% to 63.4% on six math benchmarks.

Numbers: 57.9% → 63.4% average (Table 1)

On AMC2023, Qwen2.5-7B accuracy improves from 75.0% to 82.5% while reducing KV cache size by ~10.3%.

Numbers: 75.0% → 82.5%; KV 5004 → 4488 tokens (~10.3%) (Table 1)

At 50% KV budget, the method preserves ~94% of full-KV accuracy on MATH-500.

Numbers: 40.2 / 42.6 ≈ 94.4% retained at 50% budget (Table 2)

Both components—self-summarization and step-aware eviction—are required for best gains.

Numbers: AIME2024: 40 → 46.7 with both; AMC2023: 70 → 82.5 with both (Table 3)
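The derived percentages in the findings above can be re-checked from the raw figures they quote. The helper names below are hypothetical; the inputs are the AMC2023 average KV lengths and the MATH-500 accuracies given in this section.

```python
def pct_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def pct_retained(pruned: float, full: float) -> float:
    """Share of the full-cache score that the pruned run keeps, in percent."""
    return 100.0 * pruned / full


# AMC2023, Qwen2.5-7B: average KV cache length in tokens.
kv_reduction = pct_reduction(5004, 4488)
# MATH-500: accuracy at a 50% KV budget versus FullKV.
retained = pct_retained(40.2, 42.6)

print(f"KV reduction: {kv_reduction:.1f}%")
print(f"Accuracy retained: {retained:.1f}%")
```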

Results

Accuracy (Qwen2.5-7B, average over six math benchmarks)

Value: 57.9% → 63.4%

Baseline: FullKV

Accuracy (AMC2023, Qwen2.5-7B)

Value: 75.0% → 82.5%

Baseline: FullKV

KV cache size (AMC2023, Qwen2.5-7B avg tokens)

Value: 5004 → 4488 tokens (~10.3% reduction)

Baseline: FullKV

Accuracy (MATH-500, 50% KV budget)

Value: 42.6% (FullKV) → 40.2% (Ours at 50%)

Baseline: FullKV

Who Should Care

What To Try In 7 Days

Add a short summarization trigger (</think>) every ~200 tokens during decoding to score token importance.

Segment generated reasoning using simple markers (e.g., 'Wait', 'Alternatively', 'Thus') and compute per-step importance from attention to </think>.

Evict lowest-scoring tokens within low-importance steps under a fixed budget and resume generation; measure accuracy and KV size vs FullKV.
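The three steps above can be prototyped on toy data before touching a real KV cache. The sketch below is a greedy simplification, not the paper's algorithm: it assumes per-token scores are already available, uses only the three example markers as the segmentation set, and empties the least important steps first until the budget is met. `MARKERS`, `segment_steps`, `step_means`, and `evict` are hypothetical names.

```python
from typing import List, Tuple

# Assumed marker set: the text lists these three only as examples.
MARKERS = ("Wait", "Alternatively", "Thus")


def segment_steps(tokens: List[str]) -> List[Tuple[int, int]]:
    """Split token indices into [start, end) steps, opening a step at each marker."""
    if not tokens:
        return []
    starts = [0] + [i for i, tok in enumerate(tokens)
                    if i > 0 and tok.strip(" ,.") in MARKERS]
    ends = starts[1:] + [len(tokens)]
    return list(zip(starts, ends))


def step_means(scores: List[float], steps: List[Tuple[int, int]]) -> List[float]:
    """Mean token score per step, used here as the step's importance."""
    return [sum(scores[s:e]) / (e - s) for s, e in steps]


def evict(tokens: List[str], scores: List[float], budget: int) -> List[int]:
    """Return sorted indices of the tokens to keep, at most `budget` of them."""
    steps = segment_steps(tokens)
    means = step_means(scores, steps)
    keep = set(range(len(tokens)))
    # Visit steps from least to most important.
    for k in sorted(range(len(steps)), key=lambda j: means[j]):
        s, e = steps[k]
        # Inside a step, drop the lowest-scoring tokens first.
        for i in sorted(range(s, e), key=lambda j: scores[j]):
            if len(keep) <= budget:
                return sorted(keep)
            keep.discard(i)
    return sorted(keep)


tokens = ["Let", "x", "=", "2", "Wait", "maybe", "not", "Thus", "x", "=", "2"]
scores = [0.05, 0.2, 0.1, 0.2, 0.01, 0.02, 0.02, 0.1, 0.1, 0.1, 0.1]
kept = evict(tokens, scores, budget=8)
print([tokens[i] for i in kept])
```

On this toy trace the low-scoring "Wait maybe not" detour is removed and the other two steps survive intact. When comparing against FullKV, log both accuracy and the number of kept indices per sample.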

Agent Features

Memory

  • KV cache eviction (short-term context compression)

Optimization Features

Token Efficiency

  • per-token importance scoring via attention to </think>
  • hierarchical budget allocation across reasoning steps
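One way to picture hierarchical budget allocation is a two-level split: first divide the total token budget across reasoning steps, then keep the top-scoring tokens inside each step up to its quota. The sketch below covers only the first level and assumes a proportional rule (each slot goes to the step with the highest importance per allocated token, capped at the step's length). That rule and the name `allocate_budget` are illustrative assumptions, not the paper's stated formula.

```python
from typing import List


def allocate_budget(step_lengths: List[int],
                    step_importance: List[float],
                    budget: int) -> List[int]:
    """Split `budget` token slots across steps, roughly in proportion to importance.

    Each slot goes to the step with the largest importance / (quota + 1)
    among steps that still have room. This greedy rule is an assumption
    of the sketch.
    """
    quotas = [0] * len(step_lengths)
    remaining = min(budget, sum(step_lengths))
    while remaining > 0:
        open_steps = [k for k in range(len(quotas)) if quotas[k] < step_lengths[k]]
        best = max(open_steps, key=lambda k: step_importance[k] / (quotas[k] + 1))
        quotas[best] += 1
        remaining -= 1
    return quotas


# Three steps of 4, 3 and 4 tokens; the middle step carries little attention mass.
quotas = allocate_budget([4, 3, 4], [0.55, 0.05, 0.40], budget=6)
print(quotas)
```

With these toy numbers the low-importance middle step receives no slots, which mirrors the intent of pruning weak steps before trimming strong ones.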

Infra Optimization

  • reduces KV memory footprint (helps batch throughput and edge deployment)

Inference Optimization

  • KV cache token eviction
  • step-aware context compression
  • periodic summarization-triggered pruning
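The periodic part of the pipeline amounts to a decoding loop that pauses at a fixed interval and hands the cache to a pruning routine. The sketch below simulates this with plain lists standing in for KV cache entries. `decode_with_pruning`, `next_token`, and `prune` are hypothetical callbacks: in a real system `prune` would inject the summarization trigger, read attention from </think>, and evict entries, none of which is modeled here.

```python
from typing import Callable, List


def decode_with_pruning(next_token: Callable[[List[str]], str],
                        prune: Callable[[List[str]], List[str]],
                        max_new_tokens: int,
                        interval: int = 200,
                        eos: str = "</s>") -> List[str]:
    """Generate tokens and call `prune` on the cache every `interval` steps."""
    cache: List[str] = []
    for step in range(1, max_new_tokens + 1):
        tok = next_token(cache)
        cache.append(tok)
        if tok == eos:
            break
        if step % interval == 0:
            cache = prune(cache)
    return cache


# Toy stand-ins: a counter as the "model" and a prune that keeps the newer half.
counter = iter(range(1000))


def toy_next(cache: List[str]) -> str:
    return f"tok{next(counter)}"


def toy_prune(cache: List[str]) -> List[str]:
    return cache[len(cache) // 2:]


out = decode_with_pruning(toy_next, toy_prune, max_new_tokens=10, interval=4)
print(out)
```

The interval is the main latency knob: a shorter interval prunes more often but adds more trigger passes per generated token.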

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires models that output visible chain-of-thought or explicit <think> markers; not applicable to opaque reasoning models.
  • Only applied at test time; no training-time adaptation or joint learning of pruning.
  • Most experiments focus on symbolic/math reasoning; generalization to open-domain or multimodal tasks is unproven.
  • Segmentation relies on a fixed marker set and may mis-segment diverse writing styles.

When Not To Use

  • The model does not expose intermediate reasoning traces.
  • You cannot modify the decoding pipeline or insert summarization prompts.
  • Real-time latency budget forbids periodic summarization triggers.

Failure Modes

  • Summarization trigger yields misleading attention, causing removal of crucial tokens and accuracy drop.
  • Incorrect segmentation groups important content into low-score steps, leading to over-pruning.
  • Small or low-capacity models show mixed or negative gains (see Qwen2.5-1.5B results).

Core Entities

Models

  • Qwen2.5-1.5B
  • Qwen2.5-7B
  • Llama3.1-8B
  • DeepSeek-R1-Distill family

Metrics

  • Accuracy
  • Average KV cache length (tokens)
  • Compression ratio (%)

Datasets

  • MATH-500
  • Minerva Math
  • GaoKao
  • AIME2024
  • AIME2025
  • AMC2023
  • GPQA Diamond

Benchmarks

  • MATH-500
  • Minerva
  • AIME
  • AMC
  • GaoKao
  • GPQA Diamond