ChunkAttention: share KV cache by chunking prompt prefixes to speed self-attention 3.2–4.8×

February 23, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Solid systems idea with open-source code and A100-validated speedups. Gains rely on workloads with shared prefix prompts and require per-hardware tuning.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Yes

License: Apache-2.0

At A Glance

Cost impact: 80%

Production readiness: 85%

Novelty: 70%

Authors

Lu Ye, Ze Tao, Yong Huang, Yang Li

Links

Abstract / PDF / Code

Why It Matters For Business

If many requests reuse the same system prompt, ChunkAttention speeds up the self-attention kernel 3.2–4.8× and cuts peak KV cache memory by roughly 70–90%, letting you serve more users from the same GPUs or reduce cloud costs.

Summary TLDR

ChunkAttention reorganizes the KV cache into a prefix tree of small chunks so that requests sharing the same starting prompt can reuse keys/values in memory. It adds a two-phase attention kernel (chunk-first, then sequence-first) tuned for that storage layout. On A100 GPUs it speeds up the self-attention kernel 3.2–4.8× for shared system prompts of 1K–4K tokens and cuts KV cache memory by ~70–90% end-to-end. It shows no regression when requests share no prefix. Code is public.

Problem Statement

Self-attention during inference is memory-bound: the KV cache grows with context length, and memory is wasted when many requests each store their own copy of the same system prompt. This limits batch size and throughput (e.g., the GPT-3 KV cache takes ~4.5 MB per token, so an 8×A100 server holds only ~70k tokens). Existing fixes are static or wasteful. We need a runtime method to detect and share identical prompt prefixes, plus an attention kernel that benefits from that sharing.
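The per-token and per-server figures above can be roughly reproduced with simple arithmetic. The sketch below assumes the published GPT-3 175B shape (96 layers, hidden size 12288), FP16 storage, and 80 GiB per A100. It ignores activations and framework overhead, so treat it as an estimate rather than a measurement.

```python
# Back-of-the-envelope KV cache sizing.
# Assumptions (not from the summary): GPT-3 175B shape, FP16, 80 GiB per GPU.
GIB = 1024 ** 3
MIB = 1024 ** 2


def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes of keys and values stored for one token across all layers."""
    return 2 * num_layers * hidden_size * bytes_per_value  # 2 = one key + one value


def max_cached_tokens(total_gpu_bytes: int, weight_bytes: int, per_token_bytes: int) -> int:
    """Tokens that fit in the memory left over after the model weights."""
    return (total_gpu_bytes - weight_bytes) // per_token_bytes


per_token = kv_bytes_per_token(num_layers=96, hidden_size=12288)
weights = 175_000_000_000 * 2  # 175B parameters stored in FP16
capacity = max_cached_tokens(8 * 80 * GIB, weights, per_token)

print(f"KV cache per token: {per_token / MIB:.1f} MiB")
print(f"Tokens that fit on 8 GPUs: {capacity:,}")
```

Under these assumptions the per-token cost comes out at 4.5 MiB and the capacity lands in the low 70k range, consistent with the ~70k figure quoted above.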

Main Contribution

Prefix-aware KV cache (PAKV): slice keys/values into fixed-size chunks and store them in a prefix tree so identical prompt prefixes can share memory at runtime.
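To make the prefix-tree idea concrete, here is a minimal CPU-only sketch. The class and field names (`ChunkNode`, `PrefixTree`, `ref_count`) are hypothetical and not taken from the ChunkAttention code. Nodes hold only token IDs, whereas a real implementation would also store the key/value tensors for each chunk. The sketch also assumes that only full chunks are shared and that a partially filled tail chunk stays private to its sequence.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ChunkNode:
    """One fixed-size chunk of tokens (a real node would also hold K/V tensors)."""
    tokens: Tuple[int, ...]
    children: Dict[Tuple[int, ...], "ChunkNode"] = field(default_factory=dict)
    private: List["ChunkNode"] = field(default_factory=list)  # unshared partial tails
    ref_count: int = 0  # number of sequences whose path passes through this chunk


class PrefixTree:
    def __init__(self, chunk_size: int = 4) -> None:
        self.chunk_size = chunk_size
        self.root = ChunkNode(tokens=())
        self.num_chunks = 0  # chunks actually allocated

    def insert(self, token_ids: List[int]) -> int:
        """Insert a sequence and return how many chunks were reused."""
        node, reused = self.root, 0
        for start in range(0, len(token_ids), self.chunk_size):
            key = tuple(token_ids[start:start + self.chunk_size])
            is_full = len(key) == self.chunk_size
            child = node.children.get(key) if is_full else None
            if child is None:
                # No identical full chunk on this path yet: allocate a new one.
                child = ChunkNode(tokens=key)
                self.num_chunks += 1
                if is_full:
                    node.children[key] = child
                else:
                    node.private.append(child)
            else:
                reused += 1
            child.ref_count += 1
            node = child
        return reused


tree = PrefixTree(chunk_size=4)
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]  # two full chunks
reused_first = tree.insert(system_prompt + [10, 11])
reused_second = tree.insert(system_prompt + [20, 21, 22])
print(reused_first, reused_second, tree.num_chunks)
```

The second request reuses both system-prompt chunks and allocates only its own tail, which is the memory-sharing effect described above.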

Two-phase partition (TPP) attention kernel: a chunk-first phase batches queries over shared chunks, then a sequence-first phase completes per-sequence work to improve data locality.
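The two phases can be illustrated with a small pure-Python sketch for a single attention head. This is not the authors' CUDA kernel. It only shows the general pattern of computing partial softmax results per chunk and merging them with log-sum-exp rescaling, which is an assumption about how partial results are combined. All function names are hypothetical, and each query is assumed to see at least one chunk.

```python
import math
from typing import List, Optional, Tuple

Vec = List[float]
Chunk = Tuple[List[Vec], List[Vec]]   # (keys, values) for one chunk
Partial = Tuple[float, float, Vec]    # (max score, sum of exp, weighted values)


def partial_attention(q: Vec, keys: List[Vec], values: List[Vec]) -> Partial:
    """Unnormalized attention of one query over one chunk."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return m, sum(weights), out


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two partial results by rescaling both to a common maximum."""
    (ma, sa, oa), (mb, sb, ob) = a, b
    m = max(ma, mb)
    ca, cb = math.exp(ma - m), math.exp(mb - m)
    return m, ca * sa + cb * sb, [ca * x + cb * y for x, y in zip(oa, ob)]


def two_phase_attention(queries: List[Vec], shared_chunks: List[Chunk],
                        private_chunks: List[List[Chunk]]) -> List[Vec]:
    """queries[i] attends to every shared chunk plus private_chunks[i]."""
    # Phase 1 (chunk-first): visit each shared chunk once, serve all queries from it.
    partials: List[Optional[Partial]] = [None] * len(queries)
    for keys, values in shared_chunks:
        for i, q in enumerate(queries):
            p = partial_attention(q, keys, values)
            partials[i] = p if partials[i] is None else merge(partials[i], p)
    # Phase 2 (sequence-first): each sequence walks its own chunks, then normalizes.
    outputs = []
    for i, q in enumerate(queries):
        acc = partials[i]
        for keys, values in private_chunks[i]:
            p = partial_attention(q, keys, values)
            acc = p if acc is None else merge(acc, p)
        _, denom, out = acc
        outputs.append([x / denom for x in out])
    return outputs


def naive_attention(q: Vec, keys: List[Vec], values: List[Vec]) -> Vec:
    """Reference: plain softmax attention over all keys at once."""
    _, denom, out = partial_attention(q, keys, values)
    return [x / denom for x in out]


shared = [([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])]
private = [
    [([[0.5, 0.5]], [[5.0, 6.0]])],
    [([[-1.0, 2.0]], [[7.0, 8.0]])],
]
queries = [[0.3, -0.2], [1.5, 0.7]]
outputs = two_phase_attention(queries, shared, private)
print(outputs)
```

The point of the chunk-first loop is that each shared chunk is read once for the whole batch instead of once per sequence. On a GPU that inner loop would be a batched matrix product rather than a Python loop.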

Key Findings

Self-attention kernel becomes 3.2–4.8× faster on A100 when many requests share long prompt prefixes.

Numbers: kernel speedup 3.2–4.8× (ns=1024..4096)

Practical Use: If your workload uses long shared system prompts, enable ChunkAttention to cut attention latency several-fold.

Evidence Ref: Abstract, Table 3, Figure 3

End-to-end serving reduced KV cache memory by about 70–90% when long prefixes are shared.

Numbers: peak KV cache reduced 70%–90% (Table 4)

Practical Use: You can host many more concurrent sequences or lower GPU memory needs when system prompts are reused.

Evidence Ref: Table 4 (KV cache GB numbers)
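A rough way to see where savings of this size come from is to count the token slots the cache must hold with and without sharing. The batch size and token counts below are hypothetical illustration values, not figures from the paper. The model also ignores chunk granularity and how memory varies over the course of decoding.

```python
def kv_tokens_stored(batch_size: int, shared_tokens: int,
                     private_tokens: int, share_prefix: bool) -> int:
    """Token slots the KV cache must hold for one batch."""
    if share_prefix:
        # The shared prefix is stored once; only private tokens scale with the batch.
        return shared_tokens + batch_size * private_tokens
    return batch_size * (shared_tokens + private_tokens)


def savings(batch_size: int, shared_tokens: int, private_tokens: int) -> float:
    """Fraction of KV cache slots saved by sharing the prefix."""
    base = kv_tokens_stored(batch_size, shared_tokens, private_tokens, False)
    shared = kv_tokens_stored(batch_size, shared_tokens, private_tokens, True)
    return 1.0 - shared / base


# Hypothetical workload: 32 sequences, 2048 shared tokens, 512 private tokens each.
print(f"{savings(32, 2048, 512):.1%}")
```

With a single sequence the estimate is zero, and it grows as the batch size or the shared share of each sequence increases.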

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Self-attention kernel speedup | 3.2–4.8× | PagedAttention / FlashAttn implementations | up to 4.8× faster (ns=1024..4096) | microkernel (A100, c=64, b=32) | Table 3, Figure 3 | Table 3
End-to-end throughput (token rate) | 1.6× at ns=1024; 2.3× at ns=2048 | vLLM | 1.6× and 2.3× improvements | ChunkLlama vs vLLM (OpenLlama2-7B, FP16) | End-to-end evaluation, Figure 5 and text | End-to-end evaluation section

What To Try In 7 Days

Measure how often system prompts are identical across your requests, and how many tokens the shared prompt spans.

Run ChunkAttention microkernel on a dev A100 with c=64 as a drop-in to check kernel speedup.

Integrate prefix-aware KV cache into a test endpoint and compare peak KV memory and normalized latency versus vLLM/TGI.
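The first step above can be sketched with a small script that groups requests by their leading tokens. The function name and the whitespace "tokenizer" are hypothetical stand-ins. In practice you would tokenize with the same tokenizer your model uses and feed in real request logs.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def shared_prefix_stats(requests: Sequence[Sequence[Hashable]],
                        prefix_len: int) -> Dict[str, float]:
    """Group tokenized requests by their first `prefix_len` tokens."""
    counts = Counter(tuple(r[:prefix_len]) for r in requests if len(r) >= prefix_len)
    if not counts:
        return {"distinct_prefixes": 0, "top_prefix_share": 0.0}
    top_count = counts.most_common(1)[0][1]
    return {
        "distinct_prefixes": len(counts),
        # Fraction of all requests that start with the most common prefix.
        "top_prefix_share": top_count / len(requests),
    }


# Toy data: whitespace splitting stands in for a real tokenizer (assumption).
SYSTEM = "You are a helpful assistant"
raw_requests = [
    SYSTEM + " what is 2+2",
    SYSTEM + " summarize this article",
    "Translate this sentence into French please",
]
tokenized = [text.split() for text in raw_requests]
stats = shared_prefix_stats(tokenized, prefix_len=len(SYSTEM.split()))
print(stats)
```

A high `top_prefix_share` at a prefix length in the thousands of tokens suggests the workload resembles the shared-system-prompt setting in which the reported gains were measured.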

Optimization Features

Token Efficiency
shared KV storage reduces per-token memory
Infra Optimization
A100-specific tuning, exploit tensor cores
System Optimization
prefix-tree KV layout, pool-based chunk allocator, lazy CPU→GPU context copy
Inference Optimization
two-phase partition kernel, batching queries for shared chunks, GPU-tuned CUDA implementation

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Yes
License: Apache-2.0

Risks & Boundaries

Limitations

Shared prompt must appear at the start of sequences to get memory sharing.

Gains shrink as sequences diverge during decoding.

When Not To Use

If system prompt is not placed at sequence start or differs across requests.

If you fine-tune and deploy separate model instances per application.

Failure Modes

Little or no speedup when few requests share prefixes or prompt position varies.

CPU-to-GPU context copy or prefix-tree updates may add overhead if tree changes every iteration.

Core Entities

Models

Llama2-7B, OpenLlama2-7B

Metrics

kernel latency (µs), token rate (tokens/s), normalized latency (ms/token), peak KV cache memory (GB)

Datasets

ScienceQA, TabMWP, Chameleon prompts (example workloads referenced)

Benchmarks

microkernel throughput tests, end-to-end GPT-style serving workload