Overview
Production Readiness
0.85
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
If many requests reuse the same system prompt, ChunkAttention cuts attention latency and KV memory dramatically, letting you serve more users from the same GPUs or reduce cloud costs.
Summary TLDR
ChunkAttention reorganizes KV cache into a prefix tree of small chunks so requests that share the same starting prompt can reuse keys/values in memory. It adds a two-phase attention kernel (chunk-first + sequence-first) tuned for that storage. On A100 GPUs it speeds up the self-attention kernel 3.2–4.8× for shared system prompts of 1K–4K tokens and cuts KV cache memory by ~70–90% end-to-end. No regression when no shared prefix. Code is public.
Problem Statement
Self-attention during inference is memory-bound because KV cache grows with context length and wastes memory when many requests share the same system prompt. This limits batch size and throughput (e.g., GPT-3 KV cache per token ~4.5MB, an 8×A100 server holds ~70k tokens). Existing fixes are static or wasteful. We need a runtime method to detect and share identical prompt prefixes and an attention kernel that benefits from that sharing.
Main Contribution
Prefix-aware KV cache (PAKV): slice keys/values into fixed-size chunks and store them in a prefix tree so identical prompt prefixes can share memory at runtime.
Two-phase partition (TPP) attention kernel: a chunk-first phase batches queries over shared chunks, then a sequence-first phase completes per-sequence work to improve data locality.
A full implementation (ChunkAttn / ChunkLlama) with CUDA kernels and system-level optimizations (lazy context copy, pool allocator).
Empirical evaluation showing large kernel speedups and significant KV memory reductions under realistic shared-prompt workloads.
Key Findings
Self-attention kernel becomes 3.2–4.8× faster on A100 when many requests share long prompt prefixes.
End-to-end serving reduced KV cache memory by about 70–90% when long prefixes are shared.
ChunkAttention causes no performance regression when no tokens are shared.
Throughput advantage decreases as sequences diverge during decoding but remains significant for many completion tokens.
Results
Self-attention kernel speedup
End-to-end throughput (token rate)
Peak KV cache memory
Normalized latency
Who Should Care
What To Try In 7 Days
Measure how often system prompts are identical in your requests (tokenized count).
Run ChunkAttention microkernel on a dev A100 with c=64 as a drop-in to check kernel speedup.
Integrate prefix-aware KV cache into a test endpoint and compare peak KV memory and normalized latency versus vLLM/TGI.
Optimization Features
Token Efficiency
- shared KV storage reduces per-token memory
Infra Optimization
- A100-specific tuning, exploit tensor cores
System Optimization
- prefix-tree KV layout
- pool-based chunk allocator
- lazy CPU→GPU context copy
Inference Optimization
- two-phase partition kernel
- batching queries for shared chunks
- GPU-tuned CUDA implementation
Reproducibility
License
- Apache-2.0
Code Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Shared prompt must appear at the start of sequences to get memory sharing.
- Gains shrink as sequences diverge during decoding.
- Performance tuned for A100 and common head/dim configs; other hardware needs retuning.
- If applications move to per-app fine-tuned models, shared-prompt opportunities decline.
When Not To Use
- If system prompt is not placed at sequence start or differs across requests.
- If you fine-tune and deploy separate model instances per application.
- On hardware/configurations where the CUDA kernel has not been tuned.
Failure Modes
- Little or no speedup when few requests share prefixes or prompt position varies.
- CPU-to-GPU context copy or prefix-tree updates may add overhead if tree changes every iteration.
- Atomic or serialized reductions needed for some host configs could reduce parallelism.
Core Entities
Models
- Llama2-7B
- OpenLlama2-7B
Metrics
- kernel latency (µs)
- token rate (tokens/s)
- normalized latency (ms/token)
- peak KV cache memory (GB)
Datasets
- ScienceQA
- TabMWP
- Chameleon prompts (example workloads referenced)
Benchmarks
- microkernel throughput tests
- end-to-end GPT-style serving workload

