Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Twilight reduces memory reads and latency for long-context serving, cutting compute cost and enabling larger context use-cases without retraining models.
Summary TLDR
Twilight converts fixed-budget top-k sparse attention into an adaptive top-p (nucleus) pruning wrapper. It sits on top of existing token-selection methods, keeps a conservative candidate set, then prunes to the smallest set whose estimated attention mass exceeds p. On long- and medium-context benchmarks it prunes up to 98% redundant tokens while keeping accuracy nearly unchanged, gives up to 15.8× speedup on the self-attention operator versus dense FlashAttention2, and yields ~1.4× speedup versus the same sparse base method. It uses an INT4 paged K cache and GPU-friendly topp kernels and targets serving scenarios with large batch sizes or CPU offload.
Problem Statement
Most sparse attention systems pick a fixed number of tokens (top-k) to compute attention. Fixed budgets either over-select (waste memory/bandwidth) or under-select (hurt accuracy) because attention weight shapes vary across heads, layers, and queries. Picking a single k requires offline tuning per algorithm and model and fails at runtime dynamism.
Main Contribution
Propose using top-p (topp) pruning for attention: pick the minimum tokens whose estimated attention sum ≥ p.
Design Twilight, a select-then-prune wrapper that can be applied to existing sparse attention algorithms to make budget decisions adaptive.
Implement practical system pieces: INT4 paged K cache, SpGEMV kernel, and a binary-search topp kernel, and demonstrate large speedups on serving workloads.
Key Findings
Twilight can prune most redundant tokens after a conservative selection step.
Large operator speedups versus dense attention and gains over base sparse methods.
Accuracy is preserved in evaluated benchmarks after pruning.
INT4 quantized K cache is a practical tradeoff for topp estimation.
Results
Max self-attention operator speedup
Speedup over same base sparse method
End-to-end decoding speedup (best reported)
Maximum pruning of selected tokens
Accuracy
INT4 quantization stability
Who Should Care
What To Try In 7 Days
Wrap an existing sparse attention implementation (Quest or FlashInfer) with the select-then-prune step and run end-to-end tests.
Calibrate topp threshold p on a small validation set (start at p≈0.85) and measure TPOT and perplexity.
Add an INT4 paged K cache and run SpGEMV kernels to measure bandwidth savings and latency.
Optimization Features
Token Efficiency
- Head/group-wise varlen budgets via topp
- Prune to cumulative attention mass p
Infra Optimization
- Avoids loading full KV for most tokens
- Better gains when KV offload from CPU dominates
System Optimization
- Paged INT4 cache layout to align with KV pages
- Load-balancing by flattening head dimension
- Kernel fusion for elementwise ops in topp
Inference Optimization
- Adaptive top-p (topp) token pruning
- Hierarchical select-then-prune pipeline
- INT4 paged K cache to reduce memory bandwidth
- GPU-friendly binary-search topp kernel and SpGEMV
Reproducibility
Code Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Twilight adds an estimation/pruner overhead that can offset gains for single-request or tiny-batch cases.
- Head-wise dynamism conflicts with Group Query Attention (GQA), requiring group-level compromises.
- Maintains an extra INT4 K cache, adding ~1/8 extra KV memory in some deployments.
- Topp requires higher numeric precision than top-k, so overly aggressive quantization (e.g., 2-bit) harms stability.
When Not To Use
- Low-latency single-request workloads where pruner overhead dominates.
- Models or kernels that cannot provide per-head attention weight estimates.
- Scenarios where the system cannot afford the extra INT4 cache memory.
Failure Modes
- Estimation overhead outweighs sparse-attention savings on small batches.
- Too-low quantization (≤2-bit) underestimates attention mass and breaks pruning.
- GQA architectures force group budgets that reduce head-wise pruning benefits.
- Mis-tuned p can under- or over-prune and reduce accuracy or waste work.
Core Entities
Models
- Longchat-7B-v1.5-32k
- LLaMA2-7B-Chat
- LLaMA-3.1-8B-Instruct
Metrics
- self-attention speedup
- end-to-end decoding speedup
- perplexity
- average benchmark score
- TPOT (time per output token)
Datasets
- Longbench
- RULER
- GSM8K
- COQA
- PG-19
Benchmarks
- Longbench
- RULER
- PG-19
- GSM8K
- COQA

