Make sparse attention pick just enough tokens at runtime with top-p pruning to speed long-context LLMs

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

Authors

Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao

Links

Abstract / PDF

Why It Matters For Business

Twilight reduces memory reads and latency for long-context serving, cutting compute cost and enabling larger context use-cases without retraining models.

Summary TLDR

Twilight converts fixed-budget top-k sparse attention into an adaptive top-p (nucleus) pruning wrapper. It sits on top of existing token-selection methods, keeps a conservative candidate set, then prunes to the smallest set whose estimated attention mass exceeds p. On long- and medium-context benchmarks it prunes up to 98% redundant tokens while keeping accuracy nearly unchanged, gives up to 15.8× speedup on the self-attention operator versus dense FlashAttention2, and yields ~1.4× speedup versus the same sparse base method. It uses an INT4 paged K cache and GPU-friendly topp kernels and targets serving scenarios with large batch sizes or CPU offload.

Problem Statement

Most sparse attention systems pick a fixed number of tokens (top-k) to compute attention. Fixed budgets either over-select (waste memory/bandwidth) or under-select (hurt accuracy) because attention weight shapes vary across heads, layers, and queries. Picking a single k requires offline tuning per algorithm and model and fails at runtime dynamism.

Main Contribution

Propose using top-p (topp) pruning for attention: pick the minimum tokens whose estimated attention sum ≥ p.

Design Twilight, a select-then-prune wrapper that can be applied to existing sparse attention algorithms to make budget decisions adaptive.

Implement practical system pieces: INT4 paged K cache, SpGEMV kernel, and a binary-search topp kernel, and demonstrate large speedups on serving workloads.

Key Findings

Twilight can prune most redundant tokens after a conservative selection step.

NumbersPruned up to 98% of over-selected tokens

Large operator speedups versus dense attention and gains over base sparse methods.

NumbersSelf-attention speedups: up to 15.8× vs FlashAttention2; 1.4× vs base sparse method

Accuracy is preserved in evaluated benchmarks after pruning.

NumbersNearly zero accuracy loss (<1%) on LLaMA-3.1; +5.7% avg improvement when applied to DS on Longbench

INT4 quantized K cache is a practical tradeoff for topp estimation.

NumbersINT4 maintains stability; 2-bit drops cumulative weight significantly (Figure 6)

Results

Max self-attention operator speedup

Value15.8×

BaselineFlashAttention2

Speedup over same base sparse method

Value1.4×

BaselineQuest or FlashInfer without Twilight

End-to-end decoding speedup (best reported)

Value3.9×

BaselineFlashInfer

Maximum pruning of selected tokens

Value98% pruned

Baselineoriginal base selection

Accuracy

ValueNearly zero loss

Baselinefull attention / base sparse

INT4 quantization stability

ValueINT4 holds attention sum well; 2-bit degrades

Baseline2-bit and 8-bit alternatives

Who Should Care

CtoProduct ManagerMl EngineerEngineering Lead

What To Try In 7 Days

Wrap an existing sparse attention implementation (Quest or FlashInfer) with the select-then-prune step and run end-to-end tests.

Calibrate topp threshold p on a small validation set (start at p≈0.85) and measure TPOT and perplexity.

Add an INT4 paged K cache and run SpGEMV kernels to measure bandwidth savings and latency.

Optimization Features

Token Efficiency

Head/group-wise varlen budgets via topp
Prune to cumulative attention mass p

Infra Optimization

Avoids loading full KV for most tokens
Better gains when KV offload from CPU dominates

System Optimization

Paged INT4 cache layout to align with KV pages
Load-balancing by flattening head dimension
Kernel fusion for elementwise ops in topp

Inference Optimization

Adaptive top-p (topp) token pruning
Hierarchical select-then-prune pipeline
INT4 paged K cache to reduce memory bandwidth
GPU-friendly binary-search topp kernel and SpGEMV

Reproducibility

Code Urls

https://github.com/tsinghua-ideal/Twilight

Code Available

Open Source Status

Risks & Boundaries

Limitations

Twilight adds an estimation/pruner overhead that can offset gains for single-request or tiny-batch cases.
Head-wise dynamism conflicts with Group Query Attention (GQA), requiring group-level compromises.
Maintains an extra INT4 K cache, adding ~1/8 extra KV memory in some deployments.
Topp requires higher numeric precision than top-k, so overly aggressive quantization (e.g., 2-bit) harms stability.

When Not To Use

Low-latency single-request workloads where pruner overhead dominates.
Models or kernels that cannot provide per-head attention weight estimates.
Scenarios where the system cannot afford the extra INT4 cache memory.

Failure Modes

Estimation overhead outweighs sparse-attention savings on small batches.
Too-low quantization (≤2-bit) underestimates attention mass and breaks pruning.
GQA architectures force group budgets that reduce head-wise pruning benefits.
Mis-tuned p can under- or over-prune and reduce accuracy or waste work.

Core Entities

Models

Longchat-7B-v1.5-32k
LLaMA2-7B-Chat
LLaMA-3.1-8B-Instruct

Metrics

self-attention speedup
end-to-end decoding speedup
perplexity
average benchmark score
TPOT (time per output token)

Datasets

Longbench
RULER
GSM8K
COQA
PG-19

Benchmarks

Longbench
RULER
PG-19
GSM8K
COQA