Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.65
Citation Count
0
Why It Matters For Business
ASL is a simple, training-free way to improve accuracy on long-context and retrieval-heavy tasks while keeping decoding cost and memory low. Expect a longer prefill in exchange for better answers on hard queries.
Summary TLDR
ASL is a training-free method that decides at which layer to select (prune) tokens during LLM prefilling by watching how token attention ranks stabilize across layers. When token-rank variance falls below a threshold τ, ASL selects tokens at that layer and propagates only the chosen tokens to deeper layers (one-shot). In experiments on long-context benchmarks (InfiniteBench, RULER, NIAH) with two large models, ASL improves accuracy over fixed-layer methods while keeping decoding speed and memory similar. Trade-off: ASL usually increases prefilling time but loses less accuracy on hard retrieval tasks.
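The stability rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the exact variance formula, the window of three layers, the `top_k` tracking set, and the function names are all assumptions made for the example. Only the idea of triggering selection when rank variance drops below τ (default 0.3) comes from the summary. Here the per-token rank variance is normalized by the variance of a uniform rank distribution, so shuffled ranks score high and perfectly stable ranks score 0.

```python
from statistics import mean, pvariance

def token_ranks(scores):
    """Rank tokens by attention score; rank 0 is the highest-scoring token."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks

def relative_rank_variance(rank_window, top_k):
    """Assumed measure: mean rank variance of the current top-k tokens
    across the window, divided by the variance of uniform ranks."""
    latest = rank_window[-1]
    n = len(latest)
    tracked = [i for i in range(n) if latest[i] < top_k]
    per_token = [pvariance([ranks[i] for ranks in rank_window]) for i in tracked]
    uniform_var = (n * n - 1) / 12.0  # variance of ranks 0..n-1
    return mean(per_token) / uniform_var

def pick_selection_layer(layer_scores, tau=0.3, window=3, top_k=4):
    """Return the first layer whose recent rank variance falls below tau.
    Falls back to the last layer if the threshold is never met."""
    history = []
    for layer, scores in enumerate(layer_scores):
        history.append(token_ranks(scores))
        if len(history) >= window:
            if relative_rank_variance(history[-window:], top_k) < tau:
                return layer
    return len(layer_scores) - 1

# Toy per-layer attention scores for 8 tokens: ranks flip early, then settle.
descending = [8, 7, 6, 5, 4, 3, 2, 1]
ascending = [1, 2, 3, 4, 5, 6, 7, 8]
settled = [5, 1, 8, 2, 7, 3, 6, 4]
layers = [descending, ascending, descending, settled, settled, settled]

print(pick_selection_layer(layers, tau=0.3))
print(pick_selection_layer(layers, tau=0.5))
```

In this toy input a larger τ triggers selection at a shallower layer, which matches the trade-off the summary describes between prefill time and accuracy.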
Problem Statement
Layer-wise token pruning methods use fixed layers to select which tokens to keep in the KV cache. The best layer differs by task: selecting early hurts accuracy on hard retrieval or high-similarity tasks, while selecting late costs memory and time. There is no simple, task-aware way to pick the selection layer during inference.
Main Contribution
Identify that fixed selection layers cause large accuracy swings across tasks and that attention ranks stabilize at different depths per task.
Introduce ASL, a lightweight, training-free rule that monitors relative variance of token ranks across recent layers and triggers token selection when variance is low.
Show ASL integrates with existing KV-reduction tools (SnapKV, GemFilter) and yields better accuracy on long-context benchmarks while maintaining comparable decoding speed and memory.
Provide theoretical cost analysis and open-source implementation for reproduction.
Key Findings
ASL improves average accuracy on long-context benchmarks versus fixed-layer selection.
ASL keeps decoding cost (TPOT) close to other pruning methods.
ASL increases time-to-first-token (prefill cost) compared to fixed early selection.
ASL adds negligible extra memory compared to other KV-reduction methods.
Results
Accuracy
TTFT (prefill) ratio to FullKV
TPOT (per-output token) ratio to FullKV
Memory (peak KV cache)
Who Should Care
What To Try In 7 Days
Run ASL in prefilling together with SnapKV on a small production long-context workload and compare accuracy and TTFT to your current fixed-layer selection.
Use τ = 0.3 (authors' default) and KV budget = 2048; measure selection-layer distribution and prefilling time for 1k–100k contexts.
If you need lower TTFT, test increasing τ to force earlier selection and measure accuracy drop vs time saved.
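For the comparison suggested above, TTFT and TPOT can be derived from per-token arrival timestamps and then expressed as ratios to a FullKV baseline, as the Results labels do. The sketch below is a hypothetical measurement helper, not part of any ASL tooling; the function names and the timestamp-list input format are assumptions.

```python
def ttft_and_tpot(request_start, token_times):
    """Compute time-to-first-token and mean time-per-output-token.

    request_start: timestamp (seconds) when the request was submitted.
    token_times: timestamps (seconds) at which each output token arrived.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_start
    if len(token_times) > 1:
        # Average gap between consecutive tokens after the first one.
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot

def ratio_to_baseline(value, baseline):
    """Express a measurement relative to the FullKV run (1.0 means equal)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return value / baseline

# Made-up timestamps purely for illustration.
full_ttft, full_tpot = ttft_and_tpot(0.0, [4.0, 4.5, 5.0])
asl_ttft, asl_tpot = ttft_and_tpot(0.0, [2.0, 2.25, 2.5])
print(ratio_to_baseline(asl_ttft, full_ttft), ratio_to_baseline(asl_tpot, full_tpot))
```

Recording these two ratios for each τ setting and context length makes the accuracy-versus-prefill trade-off directly comparable across runs.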
Optimization Features
Token Efficiency
- KV cache reduction
- top-k token selection
System Optimization
- prefill-time monitoring
- integration with SnapKV/GemFilter
Inference Optimization
- layer-wise token pruning
- adaptive selection layer
- one-shot token propagation
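Top-k token selection followed by one-shot propagation can be illustrated with plain lists. This is a simplified sketch under stated assumptions: `scores` stands in for pooled attention scores at the selection layer, `hidden_states` for the per-token activations, and both function names are hypothetical. A real implementation would operate on tensors and on the KV cache.

```python
def select_top_k(scores, k):
    """Return indices of the k highest-scoring tokens, in original order."""
    k = min(k, len(scores))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # keep positional order for the deeper layers

def propagate(hidden_states, keep):
    """One-shot propagation: only the kept tokens continue to deeper layers."""
    return [hidden_states[i] for i in keep]

scores = [0.1, 0.9, 0.3, 0.7, 0.2]
hidden_states = ["h0", "h1", "h2", "h3", "h4"]

keep = select_top_k(scores, k=2)
print(keep)                            # [1, 3]
print(propagate(hidden_states, keep))  # ['h1', 'h3']
```

Because selection happens once, every layer after the selection layer processes only `k` tokens, which is where the decoding-time and memory savings come from.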
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Designed and evaluated mainly for one-shot layer-wise pruning; integration with multi-shot/progressive methods is untested.
- Only two LLMs evaluated (Llama-3.1-8B-UL and Qwen2.5-7B); results may vary on other models.
- ASL was observed to be less compatible with GemFilter on Llama-3.1-8B-UL.
When Not To Use
- When minimizing time-to-first-token is the top priority and some accuracy loss is acceptable.
- If you already use dynamic multi-shot pruning methods and cannot modify their selection logic.
- On models or stacks where pooled attention/rank computation adds unacceptable engineering complexity.
Failure Modes
- A poorly chosen threshold τ can trigger selection too early or too late, harming accuracy (authors show sensitivity across τ).
- Tasks with extremely uniform attention across layers may never meet the variance threshold, delaying selection.
- Model-specific implementation differences (e.g., UltraLong variants) can change timing and throughput.
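One way to bound the delayed-selection failure mode is to cap the selection layer. The cap is an assumption of this sketch and is not described in the summary as part of ASL; `variances` stands for whatever per-layer stability measure is in use, and `choose_layer` and `max_layer` are hypothetical names.

```python
def choose_layer(variances, tau, max_layer):
    """Return (layer, triggered).

    Picks the first layer before max_layer whose variance is below tau.
    If none qualifies, falls back to max_layer and reports triggered=False
    so the fallback rate can be monitored.
    """
    for layer, value in enumerate(variances[:max_layer]):
        if value < tau:
            return layer, True
    return max_layer, False

print(choose_layer([0.9, 0.8, 0.2, 0.1], tau=0.3, max_layer=3))  # (2, True)
print(choose_layer([0.9, 0.8, 0.7, 0.6], tau=0.3, max_layer=3))  # (3, False)
```

Logging how often the fallback fires gives a cheap signal that τ is set too low for a given workload.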
Core Entities
Models
- Llama-3.1-8B-UL
- Qwen2.5-7B
Metrics
- Accuracy
- TTFT (time to first token)
- TPOT (time per output token)
- throughput
- memory (GB)
Datasets
- InfiniteBench
- RULER
- Needle in a Haystack (NIAH)
Benchmarks
- InfiniteBench
- RULER
- NIAH

