Cut KV-cache accesses and speed up LLM decoding 2×–16× with post-training 'Double Sparsity'.

August 11, 20247 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

1

Authors

Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng

Links

Abstract / PDF

Why It Matters For Business

If you serve large LLMs on GPUs, Double Sparsity cuts KV-cache bandwidth and GPU memory use, delivering multi× attention speedups and up to ~2× end-to-end throughput without retraining.

Summary TLDR

Double Sparsity is a post-training method that combines token sparsity (use only important tokens) with channel sparsity (use a small set of feature channels) to avoid full attention computation and reduce KV-cache traffic. Using offline calibration and a small 4-bit label cache, it keeps the full KV cache but reads only important parts, achieving ~1/16 token+channel sparsity with negligible accuracy loss on tests. On GPUs it speeds up attention kernels by ~4–16× and end-to-end inference up to ~1.9×; offloading mode reduces GPU KV memory to 1/16 and gives up to ~16× throughput vs FlexGen for very long contexts. Code is public.

Problem Statement

Inference is memory-bound because KV cache accesses dominate decoding cost at long contexts. Existing post-training sparse attention either loses accuracy or fails to get wall-clock speedups due to poor token selection or non-contiguous memory access. The problem: pick important tokens cheaply at runtime so attention reads far less KV cache and runs faster without retraining.

Main Contribution

Double Sparsity: combine token sparsity and channel sparsity to select important tokens cheaply at runtime.

Offline calibration to find stable 'outlier' channels and a 4-bit label cache for contiguous reads and low bandwidth use.

Double Sparsity-Offload: store full KV on CPU and prefetch only important tokens to GPU using a double buffer, cutting GPU KV memory to 1/16.

Extensive experiments on Llama/Mistral/Mixtral variants showing minimal perplexity loss at 1/16 sparsity and large speedups on A10G/A100.

Key Findings

Double Sparsity keeps accuracy nearly unchanged at a combined token+channel sparsity of 1/16.

NumbersLlama-2-7B perplexity 5.47 → 5.76 at 1/16

Attention operator latency falls dramatically with Double Sparsity.

Numbersattention speedup ranges ~4×–16× depending on GPU/batch/seq

End-to-end decoding throughput improves with minimal engineering changes.

Numbersend-to-end speedup up to 1.9× on A100

Offloading mode reduces GPU KV memory and scales to extreme lengths.

NumbersGPU KV cache usage reduced to 1/16; 16.3× throughput vs FlexGen at 256K sequence

Results

Perplexity (Llama-2-7B)

Value5.47 → 5.76 at 1/16 sparsity

Baseline5.47 (original)

Attention operator speedup

Value≈4×–16×

Baselinescaled_dot_product_attention

End-to-end throughput (Llama-2-7B)

Valueup to 1.9× tokens/s

Baselinegpt-fast

Offload throughput vs FlexGen (very long context)

Valueup to 16.3×

BaselineFlexGen Offload

GPU KV cache memory

Valuereduced to 1/16

Baselinefull KV cache

Who Should Care

What To Try In 7 Days

Run offline calibration on a small validation set to get outlier channels.

Implement the label cache and test 1/16 sparsity on your Llama-2-7B workload; compare perplexity and tokens/s.

If GPU memory is tight, test Double Sparsity-Offload with double buffering on a long-context workload.

Optimization Features

Token Efficiency

  • compute attention on ~1/16 tokens at runtime

Infra Optimization

  • Triton kernels for top-k attention
  • uses asynchronous CUDA streams and DGL gather kernel

Model Optimization

  • post-training sparse attention
  • channel sparsity (offline-calibrated outlier channels)

System Optimization

  • contiguous memory layout via label cache
  • overlap compute with async CPU→GPU copies (double buffer)

Training Optimization

  • none (post-training, no retraining required)

Inference Optimization

  • token sparsity (top-k token attention)
  • label cache (4-bit) for contiguous reads
  • KV offload with double-buffer prefetch

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Performance depends on quality of offline calibration and layer embedding similarity.
  • Small-batch or very short-context workloads get smaller speedups due to kernel launch overheads.
  • Some architectures (noted GQA variants) are incompatible with certain outlier choices.

When Not To Use

  • When you need zero-perplexity change and cannot accept any accuracy delta beyond strict baselines.
  • For tiny batch or very short contexts where kernel launch dominates runtime.
  • On architectures where outlier-channel selection is ineffective or unsupported.

Failure Modes

  • Accuracy drops sharply when pushing sparsity beyond 1/16 (e.g., 1/32 shows large perplexity rise).
  • Insufficient overlap of communication and compute can negate offload gains.
  • Calibration mismatch for early or last layers reduces token selection quality.

Core Entities

Models

  • Llama-2-7B
  • Llama-2-70B
  • Llama-7B
  • Mistral-7B
  • Mixtral-8x7B
  • Vicuna-7B-16K

Metrics

  • perplexity
  • attention operator speedup
  • end-to-end throughput (tokens/s)
  • GPU KV memory reduction

Datasets

  • Wiki-2
  • MultifieldQA
  • GovReport
  • TriviaQA
  • MMLU
  • The Pile (validation)

Benchmarks

  • wiki-2 perplexity
  • key-value retrieval
  • long context benchmarks
  • MMLU