Pick the layer to prune based on attention stability, so token pruning adapts to task difficulty and keeps accuracy high

January 12, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.65

Citation Count

0

Authors

Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao

Links

Abstract / PDF

Why It Matters For Business

ASL gives a simple, no-training way to improve accuracy on long-context and retrieval-heavy tasks while keeping decoding cost and memory low; expect longer prefilling but better answers for hard queries.

Summary TLDR

ASL is a training-free method that decides where to select (prune) tokens during LLM prefilling by watching how token attention ranks stabilize across layers. When token-rank variance falls below a threshold, ASL picks that layer and propagates only the chosen tokens to deeper layers (one-shot). In experiments on long-context benchmarks (InfiniteBench, RULER, NIAH) and two large models, ASL improves accuracy over fixed-layer methods while keeping decoding speed and memory similar. Trade-off: ASL usually increases prefilling time but reduces accuracy loss on hard retrieval tasks.
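The stability rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the variance formula (variance of a token's rank divided by its squared mean rank), the window size, the number of tracked tokens, and the names `token_ranks`, `relative_rank_variance`, and `pick_selection_layer` are all assumptions made for the example. The per-layer scores stand in for whatever pooled attention scores a real model would expose.

```python
from statistics import mean, pvariance

def token_ranks(scores):
    """Return 1-based ranks (1 = highest score) for each token position."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def relative_rank_variance(rank_window, tracked):
    """Assumed metric: mean over tracked tokens of var(rank) / mean(rank)^2."""
    values = []
    for t in tracked:
        series = [ranks[t] for ranks in rank_window]
        values.append(pvariance(series) / mean(series) ** 2)
    return mean(values)

def pick_selection_layer(layer_scores, tau=0.3, window=3, top_k=2):
    """Return the first layer whose recent rank variance is below tau, else None."""
    history = []
    for layer, scores in enumerate(layer_scores):
        history.append(token_ranks(scores))
        if len(history) < window:
            continue  # not enough layers observed yet
        recent = history[-window:]
        # Track the tokens ranked highest at the current layer.
        tracked = [i for i, r in enumerate(recent[-1]) if r <= top_k]
        if relative_rank_variance(recent, tracked) < tau:
            return layer
    return None

# Toy per-layer attention scores for 4 tokens (hypothetical values).
LAYER_SCORES = [
    [0.1, 0.4, 0.3, 0.2],
    [0.4, 0.1, 0.2, 0.3],
    [0.2, 0.3, 0.1, 0.4],
    [0.1, 0.3, 0.2, 0.4],
    [0.1, 0.3, 0.2, 0.4],
]

if __name__ == "__main__":
    for tau in (0.1, 0.2, 0.3):
        print(tau, pick_selection_layer(LAYER_SCORES, tau=tau))
```

On this toy input, raising τ from 0.1 to 0.3 moves the selected layer from 4 to 2: a looser threshold triggers selection earlier, a stricter one waits for the ranks to settle.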

Problem Statement

Layer-wise token pruning methods use fixed layers to select which tokens to keep in the KV cache. The best layer differs by task: selecting early hurts accuracy on hard retrieval or high-similarity tasks, while selecting late costs memory and time. There is no simple, task-aware way to pick the selection layer during inference.

Main Contribution

Identify that fixed selection layers cause large accuracy swings across tasks and that attention ranks stabilize at different depths per task.

Introduce ASL, a lightweight, training-free rule that monitors relative variance of token ranks across recent layers and triggers token selection when variance is low.

Show ASL integrates with existing KV-reduction tools (SnapKV, GemFilter) and yields better accuracy on long-context benchmarks while maintaining comparable decoding speed and memory.

Provide theoretical cost analysis and open-source implementation for reproduction.
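Once a selection layer has been chosen, the one-shot step amounts to keeping the highest-scoring tokens up to the KV budget and passing only those to deeper layers. The sketch below illustrates that idea under stated assumptions: `select_tokens` and `propagate_one_shot` are hypothetical names, the scores are placeholder per-token importance values, and hidden states are represented as plain list items rather than tensors. It does not reproduce how SnapKV, GemFilter, or ASL score tokens.

```python
def select_tokens(scores, budget):
    """Return indices of the `budget` highest-scoring tokens, in original order."""
    if budget >= len(scores):
        return list(range(len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    # Restore positional order so deeper layers see tokens in sequence.
    return sorted(top)

def propagate_one_shot(hidden_states, scores, budget):
    """Keep only the selected tokens' hidden states for all deeper layers."""
    kept = select_tokens(scores, budget)
    return kept, [hidden_states[i] for i in kept]

if __name__ == "__main__":
    hidden = ["h0", "h1", "h2", "h3", "h4"]   # stand-ins for per-token hidden states
    scores = [0.05, 0.40, 0.10, 0.30, 0.15]   # hypothetical importance scores
    print(propagate_one_shot(hidden, scores, budget=2))
```

Sorting the kept indices back into positional order is a design choice of this sketch; it keeps the surviving tokens in their original sequence.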

Key Findings

ASL improves average accuracy on long-context benchmarks versus fixed-layer selection.

Numbers: InfiniteBench avg: ASL_2pass 37.8 vs FastKV 36.4 (Llama-3.1-8B-UL, KV=2048); ASL 38.7 vs FastKV 36.8 (Full-before->2048,

ASL keeps decoding cost (TPOT) close to other pruning methods.

Numbers: TPOT ratio to FullKV at 128k: ASL ≈0.28 vs FastKV ≈0.27 (Llama-3.1-8B-UL, KV=2048)

ASL increases time-to-first-token (prefill cost) compared to fixed early selection.

Numbers: TTFT ratio to FullKV at 128k: ASL ≈0.79 vs FastKV ≈0.50 (Llama-3.1-8B-UL, KV=2048)

ASL adds negligible extra memory compared to other KV-reduction methods.

Numbers: Memory on InfiniteBench (avg): FullKV 18.6 GB vs ASL 0.3 GB (Llama-3.1-8B-UL, KV=2048)
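To see what the TTFT and TPOT ratios above mean for a whole request, a rough model is total time = TTFT + output tokens × TPOT. The sketch below applies the reported ratios to a FullKV baseline. The baseline timings passed in (10 s TTFT, 0.05 s per token) are made-up placeholders, not measurements from the paper, and the function names are hypothetical.

```python
# Reported ratios to FullKV at 128k: (TTFT ratio, TPOT ratio).
RATIOS = {
    "FullKV": (1.0, 1.0),
    "FastKV": (0.50, 0.27),
    "ASL": (0.79, 0.28),
}

def end_to_end_seconds(ttft_ratio, tpot_ratio, full_ttft_s, full_tpot_s, output_tokens):
    """Rough total latency: prefill time plus per-token decode time."""
    return ttft_ratio * full_ttft_s + output_tokens * tpot_ratio * full_tpot_s

def compare(full_ttft_s, full_tpot_s, output_tokens):
    """Estimate total latency per method from an assumed FullKV baseline."""
    return {
        name: end_to_end_seconds(t, p, full_ttft_s, full_tpot_s, output_tokens)
        for name, (t, p) in RATIOS.items()
    }

if __name__ == "__main__":
    # Placeholder baseline: 10 s prefill, 0.05 s per output token, 100 output tokens.
    for name, seconds in compare(10.0, 0.05, 100).items():
        print(f"{name}: {seconds:.2f} s")
```

Under these placeholder numbers ASL lands between FastKV and FullKV, and the gap to FastKV comes almost entirely from prefill, since the two TPOT ratios are nearly equal.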

Results

Accuracy

Value: ASL_2pass 37.8, ASL 36.7 (Llama-3.1-8B-UL, KV=2048)

Baseline: FastKV 36.4 (Llama-3.1-8B-UL, KV=2048)

Accuracy

Value: ASL 69.2 vs FastKV 60.6 (Llama-3.1-8B-UL, Full before->2048)

Baseline: FastKV 60.6

TTFT (prefill) ratio to FullKV

Value: ASL ≈0.79, FastKV ≈0.50 (Llama-3.1-8B-UL, RULER 128k)

Baseline: FullKV = 1.0

TPOT (per-output token) ratio to FullKV

Value: ASL ≈0.28, FastKV ≈0.27 (Llama-3.1-8B-UL, RULER 128k)

Baseline: FullKV = 1.0

Memory (peak KV cache)

Value: ASL 0.3 GB vs FullKV 18.6 GB (InfiniteBench avg)

Baseline: FullKV 18.6 GB
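The memory gap follows from how KV cache size scales with the number of cached tokens. The sketch below is a back-of-the-envelope calculation, not the paper's measurement method. It assumes the standard Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) with 2-byte values, and that the UL variant shares that shape; `kv_cache_bytes` and `to_gb` are hypothetical helper names.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes for keys plus values across all layers for `tokens` cached tokens.

    Default shape values are assumptions based on the Llama-3.1-8B
    architecture with fp16/bf16 storage.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * tokens

def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9

if __name__ == "__main__":
    print(f"budget 2048 tokens: {to_gb(kv_cache_bytes(2048)):.2f} GB")
    print(f"full 131072 tokens: {to_gb(kv_cache_bytes(131072)):.2f} GB")
```

Under these assumptions a 2048-token budget needs about 0.27 GB and a 131,072-token context about 17.2 GB, the same order of magnitude as the 0.3 GB and 18.6 GB figures reported above.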

Who Should Care

What To Try In 7 Days

Run ASL in prefilling together with SnapKV on a small production long-context workload and compare accuracy and TTFT to your current fixed-layer selection.

Use τ = 0.3 (authors' default) and KV budget = 2048; measure selection-layer distribution and prefilling time for 1k–100k contexts.

If you need lower TTFT, test increasing τ to force earlier selection and measure accuracy drop vs time saved.
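One way to organize the measurements suggested above is to log one record per request and summarize per context-length bucket. The sketch below assumes a hypothetical record format with the fields `context_len`, `selection_layer`, and `ttft_s`; the `summarize` function and the bucket edges are illustrative choices, not part of ASL or any existing tool.

```python
from collections import Counter, defaultdict
from statistics import mean

def summarize(records, buckets=(1_000, 10_000, 100_000)):
    """Group records by context-length bucket; report layer counts and mean TTFT."""
    layer_counts = defaultdict(Counter)
    ttfts = defaultdict(list)
    for rec in records:
        # Smallest bucket edge covering this context length; longer
        # contexts fall into the last bucket.
        bucket = next((b for b in buckets if rec["context_len"] <= b), buckets[-1])
        layer_counts[bucket][rec["selection_layer"]] += 1
        ttfts[bucket].append(rec["ttft_s"])
    return {
        b: {"layers": dict(layer_counts[b]), "mean_ttft_s": mean(ttfts[b])}
        for b in ttfts
    }

if __name__ == "__main__":
    # Hypothetical log records, for illustration only.
    records = [
        {"context_len": 800, "selection_layer": 10, "ttft_s": 0.5},
        {"context_len": 900, "selection_layer": 10, "ttft_s": 1.5},
        {"context_len": 50_000, "selection_layer": 18, "ttft_s": 4.0},
    ]
    for bucket, stats in summarize(records).items():
        print(bucket, stats)
```

Running the same summary at two values of τ gives a direct view of how far the selection layer moves and what that costs in prefill time.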

Optimization Features

Token Efficiency

  • KV cache reduction
  • top-k token selection

System Optimization

  • prefill-time monitoring
  • integration with SnapKV/GemFilter

Inference Optimization

  • layer-wise token pruning
  • adaptive selection layer
  • one-shot token propagation

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Designed and evaluated mainly for one-shot layer-wise pruning; integration with multi-shot/progressive methods is untested.
  • Only two LLMs evaluated (Llama-3.1-8B-UL and Qwen2.5-7B); results may vary on other models.
  • Less compatibility observed when combining ASL with GemFilter on Llama-3.1-8B-UL.

When Not To Use

  • When minimizing time-to-first-token is the top priority and some accuracy loss is acceptable.
  • If you already use dynamic multi-shot pruning methods and cannot modify their selection logic.
  • On models or stacks where pooled attention/rank computation adds unacceptable engineering complexity.

Failure Modes

  • Wrong threshold τ can trigger selection too early or too late, harming accuracy (authors show sensitivity across τ).
  • Tasks with extremely uniform attention across layers may never meet the variance threshold, delaying selection.
  • Model-specific implementation differences (e.g., UltraLong variants) can change timing and throughput.

Core Entities

Models

  • Llama-3.1-8B-UL
  • Qwen2.5-7B

Metrics

  • Accuracy
  • TTFT (time to first token)
  • TPOT (time per output token)
  • throughput
  • memory (GB)

Datasets

  • InfiniteBench
  • RULER
  • Needle in a Haystack (NIAH)

Benchmarks

  • InfiniteBench
  • RULER
  • NIAH