ChunkAttention: share KV cache by chunking prompt prefixes to speed self-attention 3.2–4.8×

Overview

Decision Snapshot: Ready For Pilot

Solid systems idea with open-source code and A100-validated speedups. Gains rely on workloads with shared prefix prompts and require per-hardware tuning.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Yes

License: Apache-2.0

At A Glance

Cost impact: 80%

Production readiness: 85%

Novelty: 70%

Authors

Lu Ye, Ze Tao, Yong Huang, Yang Li

Links

Abstract / PDF / Code

Why It Matters For Business

If many requests reuse the same system prompt, ChunkAttention can cut self-attention kernel latency by 3.2–4.8× and KV cache memory by roughly 70–90%, letting you serve more users from the same GPUs or reduce cloud costs.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Founder

Summary TLDR

ChunkAttention reorganizes the KV cache into a prefix tree of small chunks so that requests sharing the same starting prompt can reuse keys/values in memory. It adds a two-phase attention kernel (chunk-first, then sequence-first) tuned for that layout. On A100 GPUs it speeds up the self-attention kernel 3.2–4.8× for shared system prompts of 1K–4K tokens and cuts KV cache memory by ~70–90% end-to-end. There is no regression when requests share no prefix. The code is public.
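The prefix-tree layout summarized above can be illustrated with a small CPU-only sketch. It is not the project's implementation: the names `ChunkNode` and `PrefixTreeKV`, the tiny chunk size, and the use of token IDs as stand-ins for key/value tensors are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # hypothetical toy value; the reported experiments use c=64


@dataclass
class ChunkNode:
    """One chunk of tokens; a real cache would hold K/V tensors here."""
    tokens: Tuple[int, ...]
    children: Dict[Tuple[int, ...], "ChunkNode"] = field(default_factory=dict)


class PrefixTreeKV:
    """Toy prefix tree: sequences with identical leading chunks share nodes."""

    def __init__(self) -> None:
        self.root = ChunkNode(tokens=())
        self.allocated_chunks = 0  # chunks actually stored

    def insert(self, tokens: List[int]) -> int:
        """Insert a sequence and return how many chunks were reused."""
        node, reused = self.root, 0
        for i in range(0, len(tokens), CHUNK_SIZE):
            chunk = tuple(tokens[i:i + CHUNK_SIZE])
            full = len(chunk) == CHUNK_SIZE
            # Assumption: only full chunks are shareable; a partial tail stays private.
            child = node.children.get(chunk) if full else None
            if child is None:
                child = ChunkNode(tokens=chunk)
                self.allocated_chunks += 1
                if full:
                    node.children[chunk] = child
            else:
                reused += 1
            node = child
        return reused


tree = PrefixTreeKV()
system_prompt = list(range(8))                 # two full chunks
tree.insert(system_prompt + [100, 101, 102])   # first request allocates everything
reused = tree.insert(system_prompt + [200, 201, 202, 203, 204])
print(reused, tree.allocated_chunks)
```

In this toy run the second request reuses both system-prompt chunks and allocates only its own suffix, which is the memory-sharing effect the summary describes.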

Problem Statement

Self-attention during inference is memory-bound because the KV cache grows with context length and wastes memory when many requests share the same system prompt. This limits batch size and throughput (e.g., GPT-3's KV cache takes ~4.5 MB per token, so an 8×A100 server holds only ~70k tokens). Existing fixes are static or wasteful. We need a runtime method to detect and share identical prompt prefixes and an attention kernel that benefits from that sharing.
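The per-token figure can be checked with a few lines of arithmetic. This is a rough sketch, not the paper's calculation: it uses the published GPT-3 175B shape (96 layers, hidden size 12288) and assumes FP16 storage, 80 GiB per GPU, and no memory spent on activations or fragmentation. The helper names are hypothetical.

```python
def kv_bytes_per_token(n_layers: int, d_model: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * n_layers * d_model * bytes_per_elem


def max_cached_tokens(total_gpu_bytes: int, weight_bytes: int, per_token: int) -> int:
    """Tokens that fit once weights are resident (ignores activations and fragmentation)."""
    return (total_gpu_bytes - weight_bytes) // per_token


per_token = kv_bytes_per_token(n_layers=96, d_model=12288)  # GPT-3 175B, FP16
total = 8 * 80 * 2**30        # assumption: 8 GPUs with 80 GiB each
weights = 175 * 10**9 * 2     # assumption: 175B parameters stored in FP16
capacity = max_cached_tokens(total, weights, per_token)
print(f"{per_token / 2**20:.1f} MiB per token, about {capacity:,} tokens")
```

Under these assumptions the result lands near the ~4.5 MB per token and ~70k token figures quoted above; different overhead assumptions would shift the capacity estimate.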

Main Contribution

Prefix-aware KV cache (PAKV): slice keys/values into fixed-size chunks and store them in a prefix tree so identical prompt prefixes can share memory at runtime.

Two-phase partition (TPP) attention kernel: a chunk-first phase batches queries over shared chunks, then a sequence-first phase completes per-sequence work to improve data locality.
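The two phases can be sketched numerically in plain Python. This is a minimal illustration of the idea, assuming each phase produces unnormalized partial softmax results that are merged with log-sum-exp rescaling; the function names are hypothetical, every sequence is assumed to have at least one chunk, and nothing here reflects the actual CUDA kernel.

```python
import math
from typing import List, Optional, Tuple

Vec = List[float]
Chunk = Tuple[List[Vec], List[Vec]]   # (keys, values) for one chunk
Partial = Tuple[float, float, Vec]    # (running max, denominator, unnormalized output)


def partial_attn(q: Vec, keys: List[Vec], values: List[Vec]) -> Partial:
    """Attention of one query over one chunk, kept unnormalized for later merging."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return m, sum(weights), out


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two partial results by rescaling both to a common maximum."""
    m = max(a[0], b[0])
    sa, sb = math.exp(a[0] - m), math.exp(b[0] - m)
    return m, a[1] * sa + b[1] * sb, [x * sa + y * sb for x, y in zip(a[2], b[2])]


def two_phase_attention(queries: List[Vec], shared_chunks: List[Chunk],
                        private_chunks: List[List[Chunk]]) -> List[Vec]:
    """queries[i] belongs to sequence i; shared_chunks are used by every sequence;
    private_chunks[i] are owned by sequence i alone."""
    # Chunk-first: visit each shared chunk once and serve all queries from it.
    partials: List[Optional[Partial]] = [None] * len(queries)
    for keys, values in shared_chunks:
        for i, q in enumerate(queries):
            p = partial_attn(q, keys, values)
            partials[i] = p if partials[i] is None else merge(partials[i], p)
    # Sequence-first: each sequence finishes over its own chunks, then normalizes.
    outputs = []
    for i, q in enumerate(queries):
        acc = partials[i]
        for keys, values in private_chunks[i]:
            p = partial_attn(q, keys, values)
            acc = p if acc is None else merge(acc, p)
        outputs.append([x / acc[1] for x in acc[2]])
    return outputs
```

Because the merge is exact, the result should match ordinary softmax attention over the concatenated keys and values; the benefit claimed for the real kernel comes from loading each shared chunk once for many queries, which this CPU sketch only mimics in loop order.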

Key Findings

Self-attention kernel becomes 3.2–4.8× faster on A100 when many requests share long prompt prefixes.

Numbers: kernel speedup 3.2–4.8× (ns=1024..4096)

Practical Use: If your workload uses long shared system prompts, enable ChunkAttention to cut self-attention kernel latency severalfold.

Evidence Ref: Abstract, Table 3, Figure 3

End-to-end serving reduced KV cache memory by about 70–90% when long prefixes are shared.

Numbers: peak KV cache reduced 70%–90% (Table 4)

Practical Use: You can host many more concurrent sequences or lower GPU memory needs when system prompts are reused.

Evidence Ref: Table 4 (KV cache GB numbers)
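A back-of-the-envelope model shows why sharing a long prefix saves this much memory. It is an idealized sketch, not a reproduction of Table 4: it counts token slots only, and the parameter name `n_private` (tokens unique to each sequence) and the example values are assumptions.

```python
def kv_tokens_stored(batch: int, n_shared: int, n_private: int, share_prefix: bool) -> int:
    """Token slots held in the KV cache for a batch with a common prefix."""
    if share_prefix:
        return n_shared + batch * n_private   # prefix stored once
    return batch * (n_shared + n_private)     # prefix duplicated per sequence


def fraction_saved(batch: int, n_shared: int, n_private: int) -> float:
    """Ideal fraction of KV cache slots saved by storing the prefix once."""
    baseline = kv_tokens_stored(batch, n_shared, n_private, share_prefix=False)
    shared = kv_tokens_stored(batch, n_shared, n_private, share_prefix=True)
    return 1.0 - shared / baseline


# Hypothetical workload: 32 sequences, 2048 shared prompt tokens, 512 private tokens each.
print(f"{fraction_saved(32, 2048, 512):.1%}")
```

The saving grows with batch size and with the ratio of shared to private tokens, and drops to zero when only one sequence uses the prefix.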

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Self-attention kernel speedup	3.2–4.8×	PagedAttention / FlashAttn implementations	up to 4.8× faster (ns=1024..4096)	microkernel (A100, c=64, b=32)	Table 3, Figure 3	Table 3
End-to-end throughput (token rate)	1.6× at ns=1024; 2.3× at ns=2048	vLLM	1.6× and 2.3× improvements	ChunkLlama vs vLLM (OpenLlama2-7B, FP16)	End-to-end evaluation, Figure 5 and text	End-to-end evaluation section

What To Try In 7 Days

Measure how often requests share an identical system prompt, and how long that shared prefix is in tokens.

Run the ChunkAttention microkernel on a development A100 with chunk size c=64 to check the kernel speedup at your prompt lengths.

Integrate prefix-aware KV cache into a test endpoint and compare peak KV memory and normalized latency versus vLLM/TGI.
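The first step above can be prototyped offline on logged requests. The following is a rough sketch, assuming the requests are already tokenized into lists of integer IDs; the function names and the `probe_len` parameter are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens on which a and b agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_sharing_report(requests: List[Sequence[int]], probe_len: int) -> Tuple[float, int]:
    """Return (fraction of requests in the largest group sharing their first
    probe_len tokens, length of the prefix common to that whole group)."""
    groups = Counter(tuple(r[:probe_len]) for r in requests if len(r) >= probe_len)
    if not groups:
        return 0.0, 0
    head, count = groups.most_common(1)[0]
    members = [r for r in requests if tuple(r[:probe_len]) == head]
    shared = min(common_prefix_len(members[0], m) for m in members)
    return count / len(requests), shared


# Toy log: three requests start with the same tokens, one does not.
log = [[1, 2, 3, 4, 9], [1, 2, 3, 4, 8, 7], [1, 2, 3, 5], [6, 6, 6, 6]]
print(prefix_sharing_report(log, probe_len=3))
```

A high fraction combined with a shared length in the 1K–4K token range would suggest the workload resembles the setting where the reported gains were measured.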

Optimization Features

Token Efficiency

shared KV storage reduces per-token memory

Infra Optimization

A100-specific tuning, exploit tensor cores

System Optimization

prefix-tree KV layout, pool-based chunk allocator, lazy CPU→GPU context copy

Inference Optimization

two-phase partition kernel, batching queries for shared chunks, GPU-tuned CUDA implementation

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Yes

License: Apache-2.0

Code URLs

https://github.com/microsoft/chunk-attention

Risks & Boundaries

Limitations

Shared prompt must appear at the start of sequences to get memory sharing.

Gains shrink as sequences diverge during decoding.

When Not To Use

If system prompt is not placed at sequence start or differs across requests.

If you fine-tune and deploy separate model instances per application.

Failure Modes

Little or no speedup when few requests share prefixes or prompt position varies.

CPU-to-GPU context copy or prefix-tree updates may add overhead if tree changes every iteration.

Core Entities

Models

Llama2-7B, OpenLlama2-7B

Metrics

kernel latency (µs), token rate (tokens/s), normalized latency (ms/token), peak KV cache memory (GB)

Datasets

ScienceQA, TabMWP, Chameleon prompts (example workloads referenced)

Benchmarks

microkernel throughput tests, end-to-end GPT-style serving workload
