ChunkAttention: share KV cache by chunking prompt prefixes to speed self-attention 3.2–4.8×

February 23, 2024 · 7 min read

Overview

Production Readiness

0.85

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

1

Authors

Lu Ye, Ze Tao, Yong Huang, Yang Li

Why It Matters For Business

If many requests reuse the same system prompt, ChunkAttention speeds up the self-attention kernel 3.2–4.8× and cuts peak KV cache memory by 70–90%, letting you serve more users from the same GPUs or reduce cloud costs.

Summary TLDR

ChunkAttention reorganizes the KV cache into a prefix tree of small chunks so that requests sharing the same starting prompt can reuse keys/values in memory. It adds a two-phase attention kernel (chunk-first, then sequence-first) tuned for that storage layout. On A100 GPUs it speeds up the self-attention kernel 3.2–4.8× for shared system prompts of 1K–4K tokens and cuts KV cache memory by ~70–90% end-to-end. There is no performance regression when no prefix is shared. The code is public.

Problem Statement

Self-attention during inference is memory-bound: the KV cache grows with context length, and memory is wasted when many requests store their own copy of the same system prompt. This limits batch size and throughput (e.g., the GPT-3 KV cache takes ~4.5 MB per token, so an 8×A100 server holds only ~70k tokens). Existing fixes are static or wasteful. What is needed is a runtime method to detect and share identical prompt prefixes, plus an attention kernel that benefits from that sharing.
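The per-token and per-server figures above can be checked with a little arithmetic. The sketch below assumes the GPT-3 175B shape (96 layers, hidden size 12288), FP16 storage for both weights and cache, and 80 GiB of memory per A100; it ignores activations and other runtime overhead, so treat the result as a rough upper bound rather than a measured value.

```python
# Back-of-the-envelope KV cache sizing.
# Assumptions (not from the summary): GPT-3 175B shape, FP16, 80 GiB per GPU.

def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache one token: one key and one value vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value

def max_cached_tokens(total_gpu_bytes: float, weight_bytes: float, per_token_bytes: int) -> int:
    """Tokens that fit in the memory left over after loading the weights."""
    return int((total_gpu_bytes - weight_bytes) // per_token_bytes)

per_token = kv_bytes_per_token(num_layers=96, hidden_size=12288)
total = 8 * 80 * 2**30          # eight GPUs with 80 GiB each
weights = 175e9 * 2             # 175B parameters at 2 bytes each
capacity = max_cached_tokens(total, weights, per_token)

print(f"KV cache per token: {per_token / 2**20:.1f} MiB")
print(f"Approximate token capacity: {capacity:,}")
```

Under these assumptions the per-token cost comes out at 4.5 MiB and the capacity lands in the same range as the ~70k tokens quoted above.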

Main Contribution

Prefix-aware KV cache (PAKV): slice keys/values into fixed-size chunks and store them in a prefix tree so identical prompt prefixes can share memory at runtime.
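To make the prefix-tree idea concrete, here is a minimal sketch of a chunked trie over token IDs. It is not the authors' implementation: the class and field names (`ChunkNode`, `PrefixTreeKVCache`, `ref_count`) are hypothetical, the key/value tensors are replaced by placeholder strings, and only prompt insertion is shown (no decoding or eviction). It also assumes that only completely filled chunks are shared, so a partial trailing chunk always stays private to its sequence.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkNode:
    tokens: tuple                                  # token ids covered by this chunk
    kv: list                                       # stand-in for the chunk's key/value tensors
    children: dict = field(default_factory=dict)   # full-chunk tokens -> ChunkNode
    ref_count: int = 0                             # sequences whose path uses this chunk

class PrefixTreeKVCache:
    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        self.root = ChunkNode(tokens=(), kv=[])
        self.allocated_chunks = 0

    def insert(self, token_ids):
        """Insert a prompt; return (path of chunks, number of tokens reused)."""
        node, path, reused = self.root, [], 0
        for start in range(0, len(token_ids), self.chunk_size):
            key = tuple(token_ids[start:start + self.chunk_size])
            is_full = len(key) == self.chunk_size
            # Only full chunks are looked up, so partial tails are never shared.
            child = node.children.get(key) if is_full else None
            if child is None:
                child = ChunkNode(tokens=key, kv=[f"kv({t})" for t in key])
                self.allocated_chunks += 1
                if is_full:
                    node.children[key] = child
            else:
                reused += len(key)
            child.ref_count += 1
            path.append(child)
            node = child
        return path, reused

# Two requests with the same 8-token system prompt and different user input.
cache = PrefixTreeKVCache(chunk_size=4)
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
path1, reused1 = cache.insert(system_prompt + [10, 11])
path2, reused2 = cache.insert(system_prompt + [20, 21, 22])
print(reused1, reused2, cache.allocated_chunks)
```

The second request walks the two chunks created by the first one and only allocates a new chunk for its own tail, which is where the memory saving comes from.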

Two-phase partition (TPP) attention kernel: a chunk-first phase batches queries over shared chunks, then a sequence-first phase completes per-sequence work to improve data locality.
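The arithmetic behind a two-phase split can be sketched in plain Python for a single attention head and one decoding step. This is a sequential illustration only, assuming partial results are merged with the standard numerically stable softmax rescaling; the real kernel is a parallel CUDA implementation, and the function names here are hypothetical. Each sequence is assumed to attend to at least one chunk.

```python
import math

def partial_attention(q, keys, values):
    """One query over one chunk: (max score, sum of exp weights, unnormalized output)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return m, sum(weights), out

def merge(a, b):
    """Combine two partial results by rescaling both to a common maximum."""
    (ma, la, oa), (mb, lb, ob) = a, b
    m = max(ma, mb)
    ca, cb = math.exp(ma - m), math.exp(mb - m)
    return m, ca * la + cb * lb, [ca * x + cb * y for x, y in zip(oa, ob)]

def two_phase_attention(queries, shared_chunks, private_chunks):
    # Phase 1 (chunk-first): visit each shared chunk once and serve every query from it.
    partials = [None] * len(queries)
    for keys, values in shared_chunks:
        for i, q in enumerate(queries):
            p = partial_attention(q, keys, values)
            partials[i] = p if partials[i] is None else merge(partials[i], p)
    # Phase 2 (sequence-first): each sequence finishes over its own chunks.
    outputs = []
    for i, q in enumerate(queries):
        acc = partials[i]
        for keys, values in private_chunks[i]:
            p = partial_attention(q, keys, values)
            acc = p if acc is None else merge(acc, p)
        _, exp_sum, out = acc
        outputs.append([x / exp_sum for x in out])
    return outputs

def reference_attention(q, keys, values):
    """Plain softmax attention over all keys at once, for comparison."""
    _, exp_sum, out = partial_attention(q, keys, values)
    return [x / exp_sum for x in out]
```

`shared_chunks` is a list of `(keys, values)` pairs common to all sequences, and `private_chunks[i]` holds the chunks owned by sequence `i`. The result should match `reference_attention` run over each sequence's concatenated keys and values, up to floating-point error.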

A full implementation (ChunkAttn / ChunkLlama) with CUDA kernels and system-level optimizations (lazy context copy, pool allocator).

Empirical evaluation showing large kernel speedups and significant KV memory reductions under realistic shared-prompt workloads.

Key Findings

Self-attention kernel becomes 3.2–4.8× faster on A100 when many requests share long prompt prefixes.

Numbers: kernel speedup 3.2–4.8× (ns=1024..4096, where ns is the number of shared prompt tokens)

End-to-end serving reduced KV cache memory by about 70–90% when long prefixes are shared.

Numbers: peak KV cache reduced 70%–90% (Table 4)
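A simple model shows why savings of this size are plausible. The sketch below is an idealized estimate, not a reproduction of Table 4: it assumes every sequence in a batch shares the same prefix, ignores chunk padding and requests arriving or finishing at different times, and uses a batch size of 32 purely as an illustrative assumption.

```python
def kv_tokens_stored(batch_size: int, shared_tokens: int, unique_tokens: int,
                     share_prefix: bool) -> int:
    """Number of token slots held in the KV cache for one batch."""
    if share_prefix:
        # The shared prefix is stored once; each sequence keeps only its own tokens.
        return shared_tokens + batch_size * unique_tokens
    return batch_size * (shared_tokens + unique_tokens)

def memory_reduction(batch_size: int, shared_tokens: int, unique_tokens: int) -> float:
    """Fraction of KV cache saved by storing the shared prefix only once."""
    baseline = kv_tokens_stored(batch_size, shared_tokens, unique_tokens, False)
    shared = kv_tokens_stored(batch_size, shared_tokens, unique_tokens, True)
    return 1 - shared / baseline

# Hypothetical workload: 32 sequences, 2048 shared tokens, 512 unique tokens each.
saving = memory_reduction(batch_size=32, shared_tokens=2048, unique_tokens=512)
print(f"Estimated KV cache saving: {saving:.1%}")
```

The saving grows with batch size and with the ratio of shared to unique tokens, and it is zero for a batch of one.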

ChunkAttention causes no performance regression when no tokens are shared.

Numbers: comparable throughput to SOTA when ns=0 (Table 3, end-to-end results)

Throughput advantage decreases as sequences diverge during decoding but remains significant for many completion tokens.

Numbers: speedup drops from 3.6× at nc=512 to 2.3× at nc=2048 (ns=2048, where nc is the number of completion tokens)

Results

Self-attention kernel speedup

Value: 3.2–4.8×

Baseline: PagedAttention / FlashAttn implementations

End-to-end throughput (token rate)

Value: 1.6× at ns=1024; 2.3× at ns=2048

Baseline: vLLM

Peak KV cache memory

Value: reduced by 70%–90%

Baseline: vLLM

Normalized latency

Value: maintains <40 ms/token under tested RPS with shared prefixes

Baseline: vLLM / TGI

Who Should Care

What To Try In 7 Days

Measure how often your requests start with an identical system prompt, and how long that shared prompt is in tokens.

Run the ChunkAttention microkernel on a dev A100 with chunk size c=64 as a drop-in replacement to check the kernel speedup.

Integrate prefix-aware KV cache into a test endpoint and compare peak KV memory and normalized latency versus vLLM/TGI.
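For the first step, a short script over logged requests is usually enough. The sketch below assumes you already have each request as a list of token IDs from whatever tokenizer your model uses; the function name and the report fields are hypothetical.

```python
from collections import Counter

def prefix_share_report(tokenized_requests, prefix_len):
    """Count how many requests start with each distinct prefix of `prefix_len` tokens."""
    counts = Counter(
        tuple(req[:prefix_len])
        for req in tokenized_requests
        if len(req) >= prefix_len      # shorter requests cannot share a full prefix
    )
    total = len(tokenized_requests)
    # A request "shares" its prefix if at least one other request starts the same way.
    sharing = sum(c for c in counts.values() if c > 1)
    return {
        "requests": total,
        "sharing_fraction": sharing / total if total else 0.0,
        "top_prefixes": counts.most_common(3),
    }

# Toy log: three requests share a 3-token prefix, two do not.
requests = [[1, 2, 3, 9], [1, 2, 3, 8, 7], [1, 2, 3, 4], [5, 6, 7, 8], [1, 2]]
report = prefix_share_report(requests, prefix_len=3)
print(report)
```

Running it for several values of `prefix_len` (for example 256, 1024, 2048) gives a quick picture of whether your traffic resembles the shared-prompt workloads where the reported gains apply.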

Optimization Features

Token Efficiency

  • shared KV storage reduces per-token memory

Infra Optimization

  • A100-specific tuning that exploits tensor cores

System Optimization

  • prefix-tree KV layout
  • pool-based chunk allocator
  • lazy CPU→GPU context copy
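As an illustration of the pool-based allocator idea, the sketch below hands out fixed-size chunk slots from a preallocated free list, so allocating or freeing a chunk is a list operation rather than a fresh memory allocation. The class name and interface are hypothetical, and a real implementation would manage GPU buffers rather than integer IDs.

```python
class ChunkPool:
    """Fixed-capacity pool of chunk slots, identified by integer ids."""

    def __init__(self, num_chunks: int):
        # All slots are reserved up front; the free list tracks unused ones.
        self.free_ids = list(range(num_chunks))

    def alloc(self) -> int:
        """Take a free slot, or fail if the pool is exhausted."""
        if not self.free_ids:
            raise MemoryError("chunk pool exhausted")
        return self.free_ids.pop()

    def free(self, chunk_id: int) -> None:
        """Return a slot so a later allocation can reuse it."""
        self.free_ids.append(chunk_id)
```

Because every chunk has the same size, any freed slot can serve any later request, which avoids fragmentation as sequences join and leave the batch.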

Inference Optimization

  • two-phase partition kernel
  • batching queries for shared chunks
  • GPU-tuned CUDA implementation

Reproducibility

License

  • Apache-2.0

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Shared prompt must appear at the start of sequences to get memory sharing.
  • Gains shrink as sequences diverge during decoding.
  • Performance tuned for A100 and common head/dim configs; other hardware needs retuning.
  • If applications move to per-app fine-tuned models, shared-prompt opportunities decline.

When Not To Use

  • If the system prompt is not placed at the start of the sequence or differs across requests.
  • If you fine-tune and deploy separate model instances per application.
  • On hardware/configurations where the CUDA kernel has not been tuned.

Failure Modes

  • Little or no speedup when few requests share prefixes or the prompt position varies.
  • CPU-to-GPU context copy or prefix-tree updates may add overhead if the tree changes every iteration.
  • Atomic or serialized reductions needed for some host configs could reduce parallelism.

Core Entities

Models

  • Llama2-7B
  • OpenLlama2-7B

Metrics

  • kernel latency (µs)
  • token rate (tokens/s)
  • normalized latency (ms/token)
  • peak KV cache memory (GB)

Datasets

  • ScienceQA
  • TabMWP
  • Chameleon prompts (example workloads referenced)

Benchmarks

  • microkernel throughput tests
  • end-to-end GPT-style serving workload