Cut KV-cache accesses and speed up LLM decoding 2×–16× with post-training 'Double Sparsity'.

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

Authors

Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng

Links

Abstract / PDF

Why It Matters For Business

If you serve large LLMs on GPUs, Double Sparsity cuts KV-cache bandwidth and GPU memory use, delivering multi× attention speedups and up to ~2× end-to-end throughput without retraining.

Summary TLDR

Double Sparsity is a post-training method that combines token sparsity (use only important tokens) with channel sparsity (use a small set of feature channels) to avoid full attention computation and reduce KV-cache traffic. Using offline calibration and a small 4-bit label cache, it keeps the full KV cache but reads only important parts, achieving ~1/16 token+channel sparsity with negligible accuracy loss on tests. On GPUs it speeds up attention kernels by ~4–16× and end-to-end inference up to ~1.9×; offloading mode reduces GPU KV memory to 1/16 and gives up to ~16× throughput vs FlexGen for very long contexts. Code is public.

Problem Statement

Inference is memory-bound because KV cache accesses dominate decoding cost at long contexts. Existing post-training sparse attention either loses accuracy or fails to get wall-clock speedups due to poor token selection or non-contiguous memory access. The problem: pick important tokens cheaply at runtime so attention reads far less KV cache and runs faster without retraining.

Main Contribution

Double Sparsity: combine token sparsity and channel sparsity to select important tokens cheaply at runtime.

Offline calibration to find stable 'outlier' channels and a 4-bit label cache for contiguous reads and low bandwidth use.

Double Sparsity-Offload: store full KV on CPU and prefetch only important tokens to GPU using a double buffer, cutting GPU KV memory to 1/16.

Extensive experiments on Llama/Mistral/Mixtral variants showing minimal perplexity loss at 1/16 sparsity and large speedups on A10G/A100.

Key Findings

Double Sparsity keeps accuracy nearly unchanged at a combined token+channel sparsity of 1/16.

NumbersLlama-2-7B perplexity 5.47 → 5.76 at 1/16

Attention operator latency falls dramatically with Double Sparsity.

Numbersattention speedup ranges ~4×–16× depending on GPU/batch/seq

End-to-end decoding throughput improves with minimal engineering changes.

Numbersend-to-end speedup up to 1.9× on A100

Offloading mode reduces GPU KV memory and scales to extreme lengths.

NumbersGPU KV cache usage reduced to 1/16; 16.3× throughput vs FlexGen at 256K sequence

Results

Perplexity (Llama-2-7B)

Value5.47 → 5.76 at 1/16 sparsity

Baseline5.47 (original)

Attention operator speedup

Value≈4×–16×

Baselinescaled_dot_product_attention

End-to-end throughput (Llama-2-7B)

Valueup to 1.9× tokens/s

Baselinegpt-fast

Offload throughput vs FlexGen (very long context)

Valueup to 16.3×

BaselineFlexGen Offload

GPU KV cache memory

Valuereduced to 1/16

Baselinefull KV cache

Who Should Care

CtoMl EngineerEngineering LeadProduct ManagerFounder

What To Try In 7 Days

Run offline calibration on a small validation set to get outlier channels.

Implement the label cache and test 1/16 sparsity on your Llama-2-7B workload; compare perplexity and tokens/s.

If GPU memory is tight, test Double Sparsity-Offload with double buffering on a long-context workload.

Optimization Features

Token Efficiency

compute attention on ~1/16 tokens at runtime

Infra Optimization

Triton kernels for top-k attention
uses asynchronous CUDA streams and DGL gather kernel

Model Optimization

post-training sparse attention
channel sparsity (offline-calibrated outlier channels)

System Optimization

contiguous memory layout via label cache
overlap compute with async CPU→GPU copies (double buffer)

Training Optimization

none (post-training, no retraining required)

Inference Optimization

token sparsity (top-k token attention)
label cache (4-bit) for contiguous reads
KV offload with double-buffer prefetch

Reproducibility

Code Urls

https://github.com/andy-yang-1/DoubleSparse

Code Available

Data Available

Open Source Status

partial

Risks & Boundaries

Limitations

Performance depends on quality of offline calibration and layer embedding similarity.
Small-batch or very short-context workloads get smaller speedups due to kernel launch overheads.
Some architectures (noted GQA variants) are incompatible with certain outlier choices.

When Not To Use

When you need zero-perplexity change and cannot accept any accuracy delta beyond strict baselines.
For tiny batch or very short contexts where kernel launch dominates runtime.
On architectures where outlier-channel selection is ineffective or unsupported.

Failure Modes

Accuracy drops sharply when pushing sparsity beyond 1/16 (e.g., 1/32 shows large perplexity rise).
Insufficient overlap of communication and compute can negate offload gains.
Calibration mismatch for early or last layers reduces token selection quality.

Core Entities

Models

Llama-2-7B
Llama-2-70B
Llama-7B
Mistral-7B
Mixtral-8x7B
Vicuna-7B-16K

Metrics

perplexity
attention operator speedup
end-to-end throughput (tokens/s)
GPU KV memory reduction

Datasets

Wiki-2
MultifieldQA
GovReport
TriviaQA
MMLU
The Pile (validation)

Benchmarks

wiki-2 perplexity
key-value retrieval
long context benchmarks
MMLU