Pick the layer to prune based on attention stability, so token pruning adapts to task difficulty and keeps accuracy high

Overview

Decision Snapshot: Ready For Pilot

ASL is a lightweight, inference-only change that trades prefilling time for better accuracy on hard long-context tasks; evidence is from two models across three public benchmarks.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 70%

Novelty: 60%

Authors

Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ASL gives a simple, no-training way to improve accuracy on long-context and retrieval-heavy tasks while keeping decoding cost and memory low; expect longer prefilling but better answers for hard queries.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager

Summary TLDR

ASL is a training-free method that decides where to select (prune) tokens during LLM prefilling by watching how token attention ranks stabilize across layers. When token-rank variance falls below a threshold, ASL picks that layer and propagates only the chosen tokens to deeper layers (one-shot). In experiments on long-context benchmarks (InfiniteBench, RULER, NIAH) and two large models, ASL improves accuracy over fixed-layer methods while keeping decoding speed and memory similar. Trade-off: ASL usually increases prefilling time but reduces accuracy loss on hard retrieval tasks.
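The one-shot selection step mentioned above can be pictured with a small sketch. It assumes each token already has a single importance score at the chosen layer (for example, aggregated attention); the function names `select_top_k` and `propagate` are hypothetical and are not taken from the authors' code.

```python
def select_top_k(scores, k):
    """Return indices of the k highest-scoring tokens, in original order."""
    k = min(k, len(scores))
    # Sort token indices by score, highest first, and keep the first k.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore positional order so the kept tokens stay in sequence.
    return sorted(top)


def propagate(hidden_states, keep):
    """Keep only the selected tokens' states for the deeper layers."""
    return [hidden_states[i] for i in keep]


# Toy example: 5 tokens, budget of 3.
scores = [0.1, 0.5, 0.2, 0.9, 0.3]
keep = select_top_k(scores, k=3)
kept_states = propagate(["h0", "h1", "h2", "h3", "h4"], keep)
```

In this toy run the three highest-scoring tokens (positions 1, 3 and 4) are kept, and every deeper layer would see only their states. "One-shot" here means the selection happens once, at a single layer, rather than being repeated progressively.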

Problem Statement

Layer-wise token pruning methods use fixed layers to select which tokens to keep in the KV cache. That choice varies in quality across tasks: early selection hurts hard retrieval or high-similarity tasks; late selection hurts memory/time. There is no simple, task-aware way to pick the selection layer during inference.

Main Contribution

Identify that fixed selection layers cause large accuracy swings across tasks and that attention ranks stabilize at different depths per task.

Introduce ASL, a lightweight, training-free rule that monitors relative variance of token ranks across recent layers and triggers token selection when variance is low.
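The monitoring rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the normalization (mean per-token rank variance divided by the variance of uniformly random ranks), the 3-layer window, the fallback to the last layer, and all function names are assumptions. Only the idea of comparing a relative rank variance against a threshold τ comes from the text.

```python
from statistics import mean, pvariance


def token_ranks(scores):
    """Rank tokens by score: rank 0 is the highest-scoring token."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def relative_rank_variance(rank_window):
    """Mean per-token rank variance across layers, scaled to about [0, 1].

    ASSUMPTION: normalize by (n^2 - 1) / 12, the variance of a rank drawn
    uniformly from 0..n-1, so 0 means fully stable ranks.
    """
    n = len(rank_window[0])
    if n < 2:
        return 0.0
    per_token = [pvariance([layer[t] for layer in rank_window]) for t in range(n)]
    return mean(per_token) / ((n * n - 1) / 12)


def select_layer(per_layer_scores, tau=0.3, window=3):
    """Return the first layer whose recent rank variance drops below tau."""
    history = []
    for layer, scores in enumerate(per_layer_scores):
        history.append(token_ranks(scores))
        if len(history) >= window:
            if relative_rank_variance(history[-window:]) < tau:
                return layer
    # ASSUMPTION: fall back to the last layer if ranks never stabilize.
    return len(per_layer_scores) - 1


# Toy scores for 4 tokens over 7 layers: unstable early, stable from layer 3 on.
DEMO = [
    [0.1, 0.4, 0.3, 0.2],
    [0.4, 0.1, 0.2, 0.3],
    [0.2, 0.3, 0.4, 0.1],
    [0.1, 0.2, 0.3, 0.4],
    [0.1, 0.2, 0.3, 0.4],
    [0.1, 0.2, 0.3, 0.4],
    [0.1, 0.2, 0.3, 0.4],
]
chosen = select_layer(DEMO, tau=0.3)
```

On this toy input the rule fires at layer 5, the first layer where the last three rank vectors are identical. Raising the threshold makes it fire sooner (layer 4 at τ = 0.6), which is the lever for trading accuracy against prefilling time.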

Key Findings

ASL improves average accuracy on long-context benchmarks versus fixed-layer selection.

Numbers: InfiniteBench avg: ASL_2pass 37.8 vs FastKV 36.4 (Llama-3.1-8B-UL, KV=2048), ASL 38.7 vs FastKV 36.8 (Full-before->2048,

Practical Use: If accuracy on hard long-context tasks matters, switch to ASL (or ASL_2pass) and keep the same KV budget to recover several percent absolute accuracy on evaluated benchmarks.

Evidence Ref: Table 2

ASL keeps decoding cost (TPOT) close to other pruning methods.

Numbers: TPOT ratio to FullKV at 128k: ASL ≈0.28 vs FastKV ≈0.27 (Llama-3.1-8B-UL, KV=2048)

Practical Use: Expect similar per-output-token speed during generation after prefilling; ASL's runtime hit is mostly in the prefilling stage, not decoding.

Evidence Ref: Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	ASL_2pass 37.8, ASL 36.7 (Llama-3.1-8B-UL, KV=2048)	FastKV 36.4 (Llama-3.1-8B-UL, KV=2048)	+0.3 to +1.4 pts	InfiniteBench (10 tasks)	ASL_2pass and ASL outperform FastKV on average accuracy under KV=2048	Table 2
Accuracy	ASL 69.2 (Llama-3.1-8B-UL, Full before->2048)	FastKV 60.6	+8.6 pts	RULER (128k)	When using full KV before selection, ASL markedly improves accuracy at long contexts	Table 3

What To Try In 7 Days

Run ASL in prefilling together with SnapKV on a small production long-context workload and compare accuracy and TTFT to your current fixed-layer selection.

Use τ = 0.3 (authors' default) and KV budget = 2048; measure selection-layer distribution and prefilling time for 1k–100k contexts.

If you need lower TTFT, test increasing τ to force earlier selection and measure accuracy drop vs time saved.
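A comparison harness for the steps above might look like the sketch below. Everything in it is hypothetical: each configuration is assumed to expose two callables, `first_token(prompt)` for timing and `answer(prompt)` for exact-match scoring, and wall-clock time around `first_token` stands in for TTFT. A real serving stack would report TTFT itself.

```python
import time
from statistics import median


def measure_ttft(first_token_fn, prompt, repeats=3):
    """Median wall-clock seconds for first_token_fn(prompt) to return."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        first_token_fn(prompt)
        timings.append(time.perf_counter() - start)
    return median(timings)


def compare_configs(configs, prompts, answers, repeats=3):
    """Return {name: {"accuracy": float, "ttft_s": float}} per configuration.

    configs maps a name to {"first_token": callable, "answer": callable};
    both callables are placeholders for your own inference entry points.
    """
    report = {}
    for name, cfg in configs.items():
        correct = sum(cfg["answer"](p) == a for p, a in zip(prompts, answers))
        ttfts = [measure_ttft(cfg["first_token"], p, repeats) for p in prompts]
        report[name] = {
            "accuracy": correct / len(prompts),
            "ttft_s": median(ttfts),
        }
    return report


# Stub configurations standing in for a fixed-layer baseline and ASL.
prompts = ["q1", "q2"]
answers = ["a1", "a2"]
configs = {
    "fixed_layer": {"first_token": lambda p: "a", "answer": lambda p: "a1"},
    "asl_tau_0.3": {"first_token": lambda p: "a", "answer": lambda p: "a" + p[1]},
}
report = compare_configs(configs, prompts, answers)
```

Running the same harness once per τ value (for example 0.3, then higher values) gives the accuracy-versus-TTFT curve the last step asks for.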

Optimization Features

Token Efficiency

KV cache reduction, top-k token selection

System Optimization

prefill-time monitoring, integration with SnapKV/GemFilter

Inference Optimization

layer-wise token pruning, adaptive selection layer, one-shot token propagation

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/TANIGUCHIREI/ASL

Data URLs

https://github.com/microsoft/MInference/tree/main

https://github.com/gkamradt/LLMTest_NeedleInAHaystack

Risks & Boundaries

Limitations

Designed and evaluated mainly for one-shot layer-wise pruning; integration with multi-shot/progressive methods is untested.

Only two LLMs evaluated (Llama-3.1-8B-UL and Qwen2.5-7B); results may vary on other models.

When Not To Use

When minimizing time-to-first-token is the top priority and some accuracy loss is acceptable.

If you already use dynamic multi-shot pruning methods and cannot modify their selection logic.

Failure Modes

Wrong threshold τ can trigger selection too early or too late, harming accuracy (authors show sensitivity across τ).

Tasks with extremely uniform attention across layers may never meet the variance threshold, delaying selection.

Core Entities

Models

Llama-3.1-8B-UL, Qwen2.5-7B

Metrics

Accuracy, TTFT (time to first token), TPOT (time per output token), throughput, memory (GB)

Datasets

InfiniteBench, RULER, Needle in a Haystack (NIAH)

Benchmarks

InfiniteBench, RULER, NIAH
