Make sparse attention pick just enough tokens at runtime with top-p pruning to speed long-context LLMs

February 4, 20258 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao

Links

Abstract / PDF

Why It Matters For Business

Twilight reduces memory reads and latency for long-context serving, cutting compute cost and enabling larger context use-cases without retraining models.

Summary TLDR

Twilight converts fixed-budget top-k sparse attention into an adaptive top-p (nucleus) pruning wrapper. It sits on top of existing token-selection methods, keeps a conservative candidate set, then prunes to the smallest set whose estimated attention mass exceeds p. On long- and medium-context benchmarks it prunes up to 98% redundant tokens while keeping accuracy nearly unchanged, gives up to 15.8× speedup on the self-attention operator versus dense FlashAttention2, and yields ~1.4× speedup versus the same sparse base method. It uses an INT4 paged K cache and GPU-friendly topp kernels and targets serving scenarios with large batch sizes or CPU offload.

Problem Statement

Most sparse attention systems pick a fixed number of tokens (top-k) to compute attention. Fixed budgets either over-select (waste memory/bandwidth) or under-select (hurt accuracy) because attention weight shapes vary across heads, layers, and queries. Picking a single k requires offline tuning per algorithm and model and fails at runtime dynamism.

Main Contribution

Propose using top-p (topp) pruning for attention: pick the minimum tokens whose estimated attention sum ≥ p.

Design Twilight, a select-then-prune wrapper that can be applied to existing sparse attention algorithms to make budget decisions adaptive.

Implement practical system pieces: INT4 paged K cache, SpGEMV kernel, and a binary-search topp kernel, and demonstrate large speedups on serving workloads.

Key Findings

Twilight can prune most redundant tokens after a conservative selection step.

NumbersPruned up to 98% of over-selected tokens

Large operator speedups versus dense attention and gains over base sparse methods.

NumbersSelf-attention speedups: up to 15.8× vs FlashAttention2; 1.4× vs base sparse method

Accuracy is preserved in evaluated benchmarks after pruning.

NumbersNearly zero accuracy loss (<1%) on LLaMA-3.1; +5.7% avg improvement when applied to DS on Longbench

INT4 quantized K cache is a practical tradeoff for topp estimation.

NumbersINT4 maintains stability; 2-bit drops cumulative weight significantly (Figure 6)

Results

Max self-attention operator speedup

Value15.8×

BaselineFlashAttention2

Speedup over same base sparse method

Value1.4×

BaselineQuest or FlashInfer without Twilight

End-to-end decoding speedup (best reported)

Value3.9×

BaselineFlashInfer

Maximum pruning of selected tokens

Value98% pruned

Baselineoriginal base selection

Accuracy

ValueNearly zero loss

Baselinefull attention / base sparse

INT4 quantization stability

ValueINT4 holds attention sum well; 2-bit degrades

Baseline2-bit and 8-bit alternatives

Who Should Care

What To Try In 7 Days

Wrap an existing sparse attention implementation (Quest or FlashInfer) with the select-then-prune step and run end-to-end tests.

Calibrate topp threshold p on a small validation set (start at p≈0.85) and measure TPOT and perplexity.

Add an INT4 paged K cache and run SpGEMV kernels to measure bandwidth savings and latency.

Optimization Features

Token Efficiency

  • Head/group-wise varlen budgets via topp
  • Prune to cumulative attention mass p

Infra Optimization

  • Avoids loading full KV for most tokens
  • Better gains when KV offload from CPU dominates

System Optimization

  • Paged INT4 cache layout to align with KV pages
  • Load-balancing by flattening head dimension
  • Kernel fusion for elementwise ops in topp

Inference Optimization

  • Adaptive top-p (topp) token pruning
  • Hierarchical select-then-prune pipeline
  • INT4 paged K cache to reduce memory bandwidth
  • GPU-friendly binary-search topp kernel and SpGEMV

Reproducibility

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Twilight adds an estimation/pruner overhead that can offset gains for single-request or tiny-batch cases.
  • Head-wise dynamism conflicts with Group Query Attention (GQA), requiring group-level compromises.
  • Maintains an extra INT4 K cache, adding ~1/8 extra KV memory in some deployments.
  • Topp requires higher numeric precision than top-k, so overly aggressive quantization (e.g., 2-bit) harms stability.

When Not To Use

  • Low-latency single-request workloads where pruner overhead dominates.
  • Models or kernels that cannot provide per-head attention weight estimates.
  • Scenarios where the system cannot afford the extra INT4 cache memory.

Failure Modes

  • Estimation overhead outweighs sparse-attention savings on small batches.
  • Too-low quantization (≤2-bit) underestimates attention mass and breaks pruning.
  • GQA architectures force group budgets that reduce head-wise pruning benefits.
  • Mis-tuned p can under- or over-prune and reduce accuracy or waste work.

Core Entities

Models

  • Longchat-7B-v1.5-32k
  • LLaMA2-7B-Chat
  • LLaMA-3.1-8B-Instruct

Metrics

  • self-attention speedup
  • end-to-end decoding speedup
  • perplexity
  • average benchmark score
  • TPOT (time per output token)

Datasets

  • Longbench
  • RULER
  • GSM8K
  • COQA
  • PG-19

Benchmarks

  • Longbench
  • RULER
  • PG-19
  • GSM8K
  • COQA