Overview
ASL is a lightweight, inference-only change that trades prefilling time for better accuracy on hard long-context tasks; evidence is from two models across three public benchmarks.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 65%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
ASL gives a simple, no-training way to improve accuracy on long-context and retrieval-heavy tasks while keeping decoding cost and memory low; expect longer prefilling but better answers for hard queries.
Summary TLDR
ASL is a training-free method that decides at which layer to select (prune) tokens during LLM prefilling by watching how token attention ranks stabilize across layers. When token-rank variance falls below a threshold, ASL selects tokens at that layer and propagates only the chosen tokens to deeper layers (one-shot). In experiments on long-context benchmarks (InfiniteBench, RULER, NIAH) with two large models, ASL improves accuracy over fixed-layer methods while keeping decoding speed and memory similar. Trade-off: ASL usually increases prefilling time in exchange for smaller accuracy loss on hard retrieval tasks.
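The one-shot step described above (keep only the chosen tokens and pass them to deeper layers) can be sketched as a top-k selection under a KV budget. This is a minimal illustration, not the authors' code: the function names `select_tokens` and `prune`, and the use of a plain per-token score list, are assumptions made for the example.

```python
def select_tokens(scores, budget):
    """Return indices of the `budget` highest-scoring tokens, in original order.

    `scores` is a hypothetical per-token importance score taken at the
    selection layer (for example, aggregated attention weights).
    """
    if budget >= len(scores):
        return list(range(len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)  # restore positional order for the deeper layers


def prune(hidden_states, keep):
    """Keep only the selected token positions (one-shot: applied once)."""
    return [hidden_states[i] for i in keep]


keep = select_tokens([0.1, 0.9, 0.3, 0.7], budget=2)
pruned = prune(["t0", "t1", "t2", "t3"], keep)
```

Sorting the kept indices back into positional order is a design choice of this sketch, so that deeper layers still see tokens in their original sequence.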
Problem Statement
Layer-wise token pruning methods use fixed layers to select which tokens to keep in the KV cache. The quality of that choice varies across tasks: selecting too early hurts hard retrieval or high-similarity tasks, while selecting too late costs memory and time. There is no simple, task-aware way to pick the selection layer during inference.
Main Contribution
Identify that fixed selection layers cause large accuracy swings across tasks and that attention ranks stabilize at different depths per task.
Introduce ASL, a lightweight, training-free rule that monitors relative variance of token ranks across recent layers and triggers token selection when variance is low.
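The rule in the second contribution can be sketched as follows. This is an illustrative reading, not the paper's exact formula: the window size, the normalization by the variance of a uniformly random rank, `(n^2 - 1) / 12`, and all function names are assumptions made for the example. Only the idea (track rank variance over recent layers, trigger when it drops below τ) comes from the summary.

```python
from statistics import pvariance


def ranks(scores):
    """Rank of each token by score (0 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, token in enumerate(order):
        result[token] = position
    return result


def relative_rank_variance(rank_window):
    """Mean per-token rank variance over the window, scaled to roughly [0, 1].

    Assumed normalizer: variance of a uniformly random rank over n tokens.
    """
    n = len(rank_window[0])
    if n < 2:
        return 0.0
    per_token = [pvariance([layer[t] for layer in rank_window]) for t in range(n)]
    return (sum(per_token) / n) / ((n * n - 1) / 12)


def pick_selection_layer(layer_scores, tau=0.3, window=3):
    """Return the first layer whose recent rank variance falls below tau.

    Falls back to the last layer if the threshold is never met.
    """
    history = []
    for layer_idx, scores in enumerate(layer_scores):
        history.append(ranks(scores))
        if len(history) >= window and relative_rank_variance(history[-window:]) < tau:
            return layer_idx
    return len(layer_scores) - 1
```

With this sketch, a larger `tau` accepts more rank movement and therefore triggers at a shallower layer, while a smaller `tau` waits until ranks have settled.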
Key Findings
ASL improves average accuracy on long-context benchmarks versus fixed-layer selection.
ASL keeps decoding cost (time per output token, TPOT) close to that of other pruning methods.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | ASL_2pass 37.8, ASL 36.7 (Llama-3.1-8B-UL, KV=2048) | FastKV 36.4 (Llama-3.1-8B-UL, KV=2048) | +0.3 to +1.4 pts | InfiniteBench (10 tasks) | ASL_2pass and ASL outperform FastKV on average accuracy under KV=2048 | Table 2 |
| Accuracy | ASL 69.2 vs FastKV 60.6 (Llama-3.1-8B-UL, Full before->2048) | FastKV 60.6 | +8.6 pts | RULER (128k) | When using full KV before selection, ASL markedly improves accuracy at long contexts | Table 3 |
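The point deltas can be re-derived directly from the Value and Baseline figures reported in the table. A minimal arithmetic check follows; the dictionary layout and key names are illustrative assumptions, while the numbers are the ones listed in the rows above.

```python
# (method score, FastKV baseline) as reported in the table rows
reported = {
    "InfiniteBench / ASL_2pass": (37.8, 36.4),
    "InfiniteBench / ASL": (36.7, 36.4),
    "RULER 128k / ASL": (69.2, 60.6),
}


def delta_points(value, baseline):
    """Accuracy gain in points, rounded to one decimal like the table."""
    return round(value - baseline, 1)


deltas = {name: delta_points(v, b) for name, (v, b) in reported.items()}
for name, d in deltas.items():
    print(f"{name}: {d:+.1f} pts")
```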
What To Try In 7 Days
Run ASL in prefilling together with SnapKV on a small production long-context workload and compare accuracy and TTFT to your current fixed-layer selection.
Use τ = 0.3 (authors' default) and KV budget = 2048; measure selection-layer distribution and prefilling time for 1k–100k contexts.
If you need lower TTFT, test increasing τ to force earlier selection and measure accuracy drop vs time saved.
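One way to organize the comparison suggested above is a small sweep harness that records accuracy, wall-clock time, and the selection-layer distribution per τ. This is a sketch under stated assumptions: `run_once` is a hypothetical hook you would supply around your own inference stack, returning the layer that was selected and whether the answer was correct. Nothing here calls a real ASL or SnapKV API.

```python
import time
from collections import Counter


def sweep_tau(run_once, prompts, taus):
    """Collect per-tau accuracy, mean seconds per prompt, and layer counts.

    run_once(prompt, tau) -> (selection_layer, is_correct) is a hypothetical
    caller-supplied hook; for a TTFT measurement it should return as soon as
    the first token is produced.
    """
    report = {}
    for tau in taus:
        layers = Counter()
        correct = 0
        start = time.perf_counter()
        for prompt in prompts:
            layer, ok = run_once(prompt, tau)
            layers[layer] += 1
            correct += bool(ok)
        elapsed = time.perf_counter() - start
        report[tau] = {
            "accuracy": correct / len(prompts),
            "mean_seconds": elapsed / len(prompts),
            "selection_layers": dict(layers),
        }
    return report
```

Comparing `report[0.3]` against rows for larger τ values shows how much accuracy is given up for each reduction in time, which is the trade-off the last step asks you to measure.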
Optimization Features
Token Efficiency
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Designed and evaluated mainly for one-shot layer-wise pruning; integration with multi-shot/progressive methods is untested.
Only two LLMs evaluated (Llama-3.1-8B-UL and Qwen2.5-7B); results may vary on other models.
When Not To Use
When minimizing time-to-first-token is the top priority and some accuracy loss is acceptable.
If you already use dynamic multi-shot pruning methods and cannot modify their selection logic.
Failure Modes
Wrong threshold τ can trigger selection too early or too late, harming accuracy (authors show sensitivity across τ).
Tasks with extremely uniform attention across layers may never meet the variance threshold, delaying selection.

