Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
If you serve large LLMs on GPUs, Double Sparsity cuts KV-cache bandwidth and GPU memory use, delivering multi× attention speedups and up to ~2× end-to-end throughput without retraining.
Summary TLDR
Double Sparsity is a post-training method that combines token sparsity (use only important tokens) with channel sparsity (use a small set of feature channels) to avoid full attention computation and reduce KV-cache traffic. Using offline calibration and a small 4-bit label cache, it keeps the full KV cache but reads only important parts, achieving ~1/16 token+channel sparsity with negligible accuracy loss on tests. On GPUs it speeds up attention kernels by ~4–16× and end-to-end inference up to ~1.9×; offloading mode reduces GPU KV memory to 1/16 and gives up to ~16× throughput vs FlexGen for very long contexts. Code is public.
Problem Statement
Inference is memory-bound because KV cache accesses dominate decoding cost at long contexts. Existing post-training sparse attention either loses accuracy or fails to get wall-clock speedups due to poor token selection or non-contiguous memory access. The problem: pick important tokens cheaply at runtime so attention reads far less KV cache and runs faster without retraining.
Main Contribution
Double Sparsity: combine token sparsity and channel sparsity to select important tokens cheaply at runtime.
Offline calibration to find stable 'outlier' channels and a 4-bit label cache for contiguous reads and low bandwidth use.
Double Sparsity-Offload: store full KV on CPU and prefetch only important tokens to GPU using a double buffer, cutting GPU KV memory to 1/16.
Extensive experiments on Llama/Mistral/Mixtral variants showing minimal perplexity loss at 1/16 sparsity and large speedups on A10G/A100.
Key Findings
Double Sparsity keeps accuracy nearly unchanged at a combined token+channel sparsity of 1/16.
Attention operator latency falls dramatically with Double Sparsity.
End-to-end decoding throughput improves with minimal engineering changes.
Offloading mode reduces GPU KV memory and scales to extreme lengths.
Results
Perplexity (Llama-2-7B)
Attention operator speedup
End-to-end throughput (Llama-2-7B)
Offload throughput vs FlexGen (very long context)
GPU KV cache memory
Who Should Care
What To Try In 7 Days
Run offline calibration on a small validation set to get outlier channels.
Implement the label cache and test 1/16 sparsity on your Llama-2-7B workload; compare perplexity and tokens/s.
If GPU memory is tight, test Double Sparsity-Offload with double buffering on a long-context workload.
Optimization Features
Token Efficiency
- compute attention on ~1/16 tokens at runtime
Infra Optimization
- Triton kernels for top-k attention
- uses asynchronous CUDA streams and DGL gather kernel
Model Optimization
- post-training sparse attention
- channel sparsity (offline-calibrated outlier channels)
System Optimization
- contiguous memory layout via label cache
- overlap compute with async CPU→GPU copies (double buffer)
Training Optimization
- none (post-training, no retraining required)
Inference Optimization
- token sparsity (top-k token attention)
- label cache (4-bit) for contiguous reads
- KV offload with double-buffer prefetch
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Performance depends on quality of offline calibration and layer embedding similarity.
- Small-batch or very short-context workloads get smaller speedups due to kernel launch overheads.
- Some architectures (noted GQA variants) are incompatible with certain outlier choices.
When Not To Use
- When you need zero-perplexity change and cannot accept any accuracy delta beyond strict baselines.
- For tiny batch or very short contexts where kernel launch dominates runtime.
- On architectures where outlier-channel selection is ineffective or unsupported.
Failure Modes
- Accuracy drops sharply when pushing sparsity beyond 1/16 (e.g., 1/32 shows large perplexity rise).
- Insufficient overlap of communication and compute can negate offload gains.
- Calibration mismatch for early or last layers reduces token selection quality.
Core Entities
Models
- Llama-2-7B
- Llama-2-70B
- Llama-7B
- Mistral-7B
- Mixtral-8x7B
- Vicuna-7B-16K
Metrics
- perplexity
- attention operator speedup
- end-to-end throughput (tokens/s)
- GPU KV memory reduction
Datasets
- Wiki-2
- MultifieldQA
- GovReport
- TriviaQA
- MMLU
- The Pile (validation)
Benchmarks
- wiki-2 perplexity
- key-value retrieval
- long context benchmarks
- MMLU

