Pick the layer to prune based on attention stability, so token pruning adapts to task difficulty and keeps accuracy high

January 12, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

ASL is a lightweight, inference-only change that trades prefilling time for better accuracy on hard long-context tasks; evidence is from two models across three public benchmarks.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 70%

Novelty: 60%

Authors

Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ASL gives a simple, no-training way to improve accuracy on long-context and retrieval-heavy tasks while keeping decoding cost and memory low; expect longer prefilling but better answers for hard queries.

Who Should Care

Summary TLDR

ASL is a training-free method that decides where to select (prune) tokens during LLM prefilling by watching how token attention ranks stabilize across layers. When token-rank variance falls below a threshold, ASL picks that layer and propagates only the chosen tokens to deeper layers (one-shot). In experiments on long-context benchmarks (InfiniteBench, RULER, NIAH) and two large models, ASL improves accuracy over fixed-layer methods while keeping decoding speed and memory similar. Trade-off: ASL usually increases prefilling time but reduces accuracy loss on hard retrieval tasks.
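The one-shot selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `select_tokens` and `propagate`, the per-token scores, and the string stand-ins for hidden states are all hypothetical.

```python
def select_tokens(scores, kv_budget):
    """Return indices of the kv_budget highest-scoring tokens, in original order."""
    if kv_budget <= 0:
        return []
    # Rank token positions by score, highest first, then keep the top kv_budget.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay in sequence order.
    return sorted(ranked[:kv_budget])


def propagate(hidden_states, kept):
    """Keep only the selected token positions for the deeper layers (one-shot)."""
    return [hidden_states[i] for i in kept]


# Hypothetical per-token attention scores at the chosen selection layer.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
# Stand-ins for per-token hidden states (assumption: real ones are tensors).
hidden = ["h0", "h1", "h2", "h3", "h4"]

kept = select_tokens(scores, kv_budget=3)
pruned = propagate(hidden, kept)
```

Because selection happens once, every layer after the chosen one sees only the `pruned` positions, which is why decoding cost and memory stay close to other pruning methods.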

Problem Statement

Layer-wise token pruning methods use fixed layers to select which tokens to keep in the KV cache. The quality of that choice varies across tasks: selecting early hurts hard retrieval or high-similarity tasks, while selecting late costs memory and time. There is no simple, task-aware way to pick the selection layer during inference.

Main Contribution

Identify that fixed selection layers cause large accuracy swings across tasks and that attention ranks stabilize at different depths per task.

Introduce ASL, a lightweight, training-free rule that monitors relative variance of token ranks across recent layers and triggers token selection when variance is low.
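A rough sketch of such a trigger is shown below. The summary does not give the exact formula, so the definition of "relative variance" used here (variance of a token's rank over a window of recent layers, divided by its squared mean rank, averaged over the current top-k tokens) is an assumption, as are the function names, the window size, and the toy scores.

```python
from statistics import mean, pvariance


def token_ranks(scores):
    """Rank tokens by score: rank 1 is the highest-scoring token."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def relative_rank_variance(window, top_k):
    """Average var/mean^2 of ranks over the window, for the current top-k tokens.

    Assumption: this normalisation is illustrative, not the paper's definition.
    """
    rank_rows = [token_ranks(s) for s in window]
    current = rank_rows[-1]
    tracked = [i for i, r in enumerate(current) if r <= top_k]
    values = []
    for i in tracked:
        ranks = [row[i] for row in rank_rows]
        values.append(pvariance(ranks) / mean(ranks) ** 2)
    return mean(values)


def pick_selection_layer(per_layer_scores, tau=0.3, window=3, top_k=2):
    """Return the first layer index whose recent rank variance drops below tau."""
    for layer in range(window - 1, len(per_layer_scores)):
        recent = per_layer_scores[layer - window + 1 : layer + 1]
        if relative_rank_variance(recent, top_k) < tau:
            return layer
    return None  # threshold never met


# Toy attention scores for 5 tokens over 5 layers (hypothetical values):
# ranks shuffle in the first layers and settle from layer 1 onward.
layers = [
    [0.10, 0.50, 0.20, 0.15, 0.05],
    [0.50, 0.05, 0.10, 0.15, 0.20],
    [0.40, 0.10, 0.05, 0.15, 0.30],
    [0.45, 0.05, 0.10, 0.15, 0.25],
    [0.50, 0.05, 0.10, 0.12, 0.23],
]
chosen = pick_selection_layer(layers)
```

In this toy trace the window ending at layer 2 still contains the early rank shuffle, so the trigger fires one layer later, once the tracked tokens hold the same ranks across the whole window.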

Key Findings

ASL improves average accuracy on long-context benchmarks versus fixed-layer selection.

Numbers: InfiniteBench avg: ASL_2pass 37.8 vs FastKV 36.4 (Llama-3.1-8B-UL, KV=2048), ASL 38.7 vs FastKV 36.8 (Full-before->2048,

Practical Use: If accuracy on hard long-context tasks matters, switch to ASL (or ASL_2pass) and keep the same KV budget to recover several points of absolute accuracy on the evaluated benchmarks.

Evidence Ref: Table 2

ASL keeps decoding cost (TPOT) close to other pruning methods.

Numbers: TPOT ratio to FullKV at 128k: ASL ≈0.28 vs FastKV ≈0.27 (Llama-3.1-8B-UL, KV=2048)

Practical Use: Expect similar per-output-token speed during generation after prefilling; ASL's runtime hit is mostly in the prefilling stage, not decoding.

Evidence Ref: Table 5
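To check the prefilling-versus-decoding split on your own workload, TTFT and TPOT can be measured separately from any token stream. The helper below is a sketch under stated assumptions: `measure_latency` and `fake_generate` are hypothetical names, and the injected clock exists only so the example is deterministic.

```python
import time


def measure_latency(token_stream, clock=time.perf_counter):
    """Return (ttft, tpot) in seconds for an iterator that yields output tokens."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in token_stream:
        last = clock()
        if first is None:
            first = last  # arrival time of the first token
        count += 1
    if count == 0:
        raise ValueError("token_stream produced no tokens")
    ttft = first - start
    # TPOT averages the gaps between consecutive tokens after the first one.
    tpot = (last - first) / (count - 1) if count > 1 else 0.0
    return ttft, tpot


def fake_generate(n_tokens):
    """Stand-in for a model's streaming generation call (assumption)."""
    for i in range(n_tokens):
        yield f"tok{i}"


# Scripted clock: 2.0 s until the first token, then 0.5 s per token.
ticks = iter([0.0, 2.0, 2.5, 3.0, 3.5])
ttft, tpot = measure_latency(fake_generate(4), clock=lambda: next(ticks))
```

With a real model, pass its streaming iterator and leave `clock` at the default. A change in the selection layer should show up mainly in `ttft`, while `tpot` should stay roughly flat for a fixed KV budget.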

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | ASL_2pass 37.8, ASL 36.7 (Llama-3.1-8B-UL, KV=2048) | FastKV 36.4 (Llama-3.1-8B-UL, KV=2048) | +0.3 (ASL) to +1.4 (ASL_2pass) pts | InfiniteBench (10 tasks) | ASL_2pass and ASL outperform FastKV on average accuracy under KV=2048 | Table 2
Accuracy | ASL 69.2 (Llama-3.1-8B-UL, Full before->2048) | FastKV 60.6 | +8.6 pts | RULER (128k) | When using full KV before selection, ASL markedly improves accuracy at long contexts | Table 3
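As a quick sanity check, the deltas can be recomputed from the value and baseline figures reported in the rows above. The dictionary keys are labels chosen for this example only.

```python
# (value, baseline) pairs copied from the results rows above.
results = {
    "InfiniteBench, ASL_2pass": (37.8, 36.4),
    "InfiniteBench, ASL": (36.7, 36.4),
    "RULER 128k, ASL": (69.2, 60.6),
}

# Absolute accuracy difference in points, rounded to one decimal place.
deltas = {
    name: round(value - baseline, 1)
    for name, (value, baseline) in results.items()
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} pts")
```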

What To Try In 7 Days

Run ASL in prefilling together with SnapKV on a small production long-context workload and compare accuracy and TTFT to your current fixed-layer selection.

Use τ = 0.3 (authors' default) and KV budget = 2048; measure selection-layer distribution and prefilling time for 1k–100k contexts.

If you need lower TTFT, test increasing τ to force earlier selection and measure accuracy drop vs time saved.
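The τ experiment in the last step can be prototyped offline before touching the serving stack. The sketch below assumes you have logged one rank-variance score per layer for a prompt. The `trace` values are invented for illustration, and `selection_layer` is a hypothetical helper rather than part of any released code.

```python
def selection_layer(variance_by_layer, tau):
    """First layer whose rank-variance score is below tau, or None if none is."""
    for layer, variance in enumerate(variance_by_layer):
        if variance < tau:
            return layer
    return None


# Hypothetical per-layer variance trace for one prompt (not measured data).
trace = [0.90, 0.70, 0.55, 0.42, 0.31, 0.24, 0.12, 0.08]

# Sweep tau around the authors' default of 0.3.
sweep = {tau: selection_layer(trace, tau) for tau in (0.1, 0.3, 0.5)}

for tau, layer in sweep.items():
    print(f"tau={tau}: select at layer {layer}")
```

On this trace, raising τ from 0.3 to 0.5 moves selection two layers earlier, which is the mechanism behind the TTFT saving. The accuracy cost still has to be measured on real tasks.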

Optimization Features

Token Efficiency

KV cache reduction, top-k token selection

System Optimization

prefill-time monitoring, integration with SnapKV/GemFilter

Inference Optimization

layer-wise token pruning, adaptive selection layer, one-shot token propagation

Risks & Boundaries

Limitations

Designed and evaluated mainly for one-shot layer-wise pruning; integration with multi-shot/progressive methods is untested.

Only two LLMs evaluated (Llama-3.1-8B-UL and Qwen2.5-7B); results may vary on other models.

When Not To Use

When minimizing time-to-first-token is the top priority and some accuracy loss is acceptable.

If you already use dynamic multi-shot pruning methods and cannot modify their selection logic.

Failure Modes

Wrong threshold τ can trigger selection too early or too late, harming accuracy (authors show sensitivity across τ).

Tasks with extremely uniform attention across layers may never meet the variance threshold, delaying selection.

Core Entities

Models

Llama-3.1-8B-UL, Qwen2.5-7B

Metrics

Accuracy, TTFT (time to first token), TPOT (time per output token), throughput, memory (GB)

Datasets

InfiniteBench, RULER, Needle in a Haystack (NIAH)

Benchmarks

InfiniteBench, RULER, NIAH