Overview
Production Readiness: 0.8
Novelty Score: 0.6
Cost Impact Score: 0.85
Citation Count: 1
Why It Matters For Business
If your product uses chat or multi-turn flows, caching KV states and overlapping cache IO with GPU work can substantially cut time-to-first-token and cloud GPU cost; the paper reports reductions of up to 87% and 70%, respectively.
Summary TLDR
CachedAttention saves and reuses transformer key-value (KV) caches across multi-turn chat sessions by storing them in a tiered AttentionStore (DRAM and SSD). It overlaps IO with GPU work (layer-wise pre-loading and asynchronous saving), uses scheduler-aware prefetch and eviction, and decouples positional encoding so caches remain valid after truncation. On ShareGPT workloads with open models (LLaMA, Falcon, Mistral), it cut time-to-first-token by up to 87%, raised prefilling throughput by up to 7.8×, and reduced end-to-end cost by up to 70% versus recomputation.
Problem Statement
Serving multi-turn chat is inefficient because GPUs repeatedly recompute large KV caches for historical tokens. ShareGPT analysis shows that 73% of conversations are multi-turn and that up to 99% of a new turn's prefilling cost can come from recomputing historical tokens. This wastes GPU time, HBM space, and cloud spend.
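To make the recompute overhead concrete, here is a minimal sketch that computes, per turn, what fraction of the prefilled tokens are history. The function name and the 200-tokens-per-turn session are hypothetical, and token share is only a rough proxy for cost, since attention cost grows faster than linearly with context length.

```python
def historical_prefill_share(turn_lengths):
    """Return, for each turn, the fraction of prefilled tokens that are history.

    turn_lengths[i] is the number of new tokens that turn i adds to the
    conversation (hypothetical input, e.g. taken from request logs).
    """
    shares = []
    history = 0
    for new_tokens in turn_lengths:
        total = history + new_tokens
        shares.append(history / total if total else 0.0)
        history = total  # without caching, all of this is recomputed next turn
    return shares


# Hypothetical session: ten turns, each adding 200 tokens.
shares = historical_prefill_share([200] * 10)
print([round(s, 2) for s in shares])
```

The share climbs toward 1 as the conversation grows, which is why long sessions benefit most from reusing saved KV caches.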
Main Contribution
CachedAttention: an attention variant that reuses saved KV caches across conversation turns to avoid repeated recomputation.
AttentionStore: a hierarchical KV cache (HBM buffer, host DRAM, SSD) with scheduler-aware prefetch and eviction.
Overlapped IO and compute: layer-wise pre-loading and asynchronous saving hide KV transfer latency.
Decoupled positional encoding for KV caches so truncation keeps saved caches valid and avoids recompute.
Comprehensive experiments on ShareGPT with LLaMA/Falcon/Mistral showing large latency, throughput, and cost gains.
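The scheduler-aware prefetch and eviction idea can be sketched with a toy two-tier store. This is an illustration under simplifying assumptions, not the paper's implementation: `TieredKVStore`, its slot-based capacity, and the plain-list job queue are all hypothetical, and each session is assumed to appear at most once in the queue.

```python
class TieredKVStore:
    """Toy two-tier KV store: a small 'dram' tier backed by an 'ssd' tier."""

    def __init__(self, dram_slots):
        self.dram_slots = dram_slots
        self.dram = {}  # session_id -> KV payload (fast tier)
        self.ssd = {}   # session_id -> KV payload (slow tier)

    @staticmethod
    def _next_use(session_id, queue):
        # Position of the session's next job; unqueued sessions sort last.
        return queue.index(session_id) if session_id in queue else len(queue)

    def _make_room(self, queue):
        # Scheduler-aware eviction: demote the session needed furthest in
        # the future (or not queued at all) from DRAM to SSD.
        while len(self.dram) >= self.dram_slots:
            victim = max(self.dram, key=lambda s: self._next_use(s, queue))
            self.ssd[victim] = self.dram.pop(victim)

    def save(self, session_id, kv, queue):
        self.ssd.pop(session_id, None)
        self.dram.pop(session_id, None)
        self._make_room(queue)
        self.dram[session_id] = kv

    def prefetch(self, queue, window):
        # Promote SSD-resident sessions in the look-ahead window. The window
        # is clamped to the DRAM capacity so a promotion never evicts a
        # session that is needed sooner.
        for session_id in queue[:min(window, self.dram_slots)]:
            if session_id in self.ssd:
                self.save(session_id, self.ssd[session_id], queue)


store = TieredKVStore(dram_slots=2)
queue = ["b", "c"]               # upcoming jobs, soonest first
store.save("a", "kv-a", queue)
store.save("b", "kv-b", queue)
store.save("c", "kv-c", queue)   # DRAM is full: "a" is not queued, so it is demoted
```

Using the job queue rather than recency (as plain LRU would) lets the store keep sessions that are about to run, even if they were last touched long ago.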
Key Findings
Time-to-first-token (TTFT) drops sharply on KV cache hits (by up to 87% in the reported experiments)
Prefilling throughput increases by several× (up to 7.8×) because only new tokens are prefilled
End-to-end inference cost falls (by up to 70%) because GPU hours drop
AttentionStore can achieve high cache hit rates across models
Decoupling positional encoding preserves model quality after truncation
Results
Time-to-first-token (TTFT)
Prefilling throughput
End-to-end inference cost
AttentionStore cache hit rate
GPU time speedup
Perplexity (PPL) after truncation
Who Should Care
What To Try In 7 Days
Measure the prefilling share of request latency in your logs (does recomputing historical tokens dominate?).
Prototype saving KV cache for active sessions to host DRAM and test simple reuse for a sample workload.
Add a read buffer and layer-wise preloading to hide KV load latency for cached sessions.
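Before building the last step, a back-of-the-envelope timing model can show how much load latency layer-wise pre-loading might hide. The sketch below assumes a constant per-layer load and compute time, a load stream that starts when the job starts, and a read buffer that already holds the first `preloaded_layers` layers; `prefill_time` and its parameters are hypothetical names, and the millisecond figures are made up.

```python
def prefill_time(num_layers, load_ms, compute_ms, overlap, preloaded_layers=0):
    """Estimate the time to run all layers when each layer's KV must be
    resident before that layer computes."""
    if not overlap:
        # Serial baseline: load a layer's KV, compute it, then repeat.
        return num_layers * (load_ms + compute_ms)
    load_done = 0.0     # time at which the load stream has delivered this layer
    compute_done = 0.0  # time at which the compute stream has finished this layer
    for layer in range(num_layers):
        if layer >= preloaded_layers:
            load_done += load_ms  # remaining loads run back to back
        # A layer starts once its KV has arrived and the previous layer is done.
        compute_done = max(compute_done, load_done) + compute_ms
    return compute_done


serial = prefill_time(4, load_ms=2.0, compute_ms=3.0, overlap=False)
pipelined = prefill_time(4, load_ms=2.0, compute_ms=3.0, overlap=True)
buffered = prefill_time(4, load_ms=3.0, compute_ms=2.0, overlap=True,
                        preloaded_layers=2)
print(serial, pipelined, buffered)
```

In this model, overlap alone hides the loads when loading is faster than compute; when loading is slower, pre-filling the read buffer with the first few layers is what keeps the GPU from stalling.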
Optimization Features
Token Efficiency
- support for KV truncation without recompute
Infra Optimization
- move cold KV caches to SSD to trade storage cost for GPU compute
System Optimization
- hierarchical AttentionStore (DRAM + SSD)
- HBM read/write buffers to overlap IO
Inference Optimization
- KV cache reuse across turns
- layer-wise pre-loading
- asynchronous KV saving
- scheduler-aware prefetch/evict
- decoupled positional encoding (for models with relative position encodings, RPE)
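One way to read the decoupled positional encoding item: store keys without positions applied, then apply positions when the cache is loaded, so surviving tokens can be re-indexed after truncation. The sketch below uses a RoPE-style rotation as a stand-in relative encoding and toy 4-dimensional vectors; all names and values are illustrative assumptions, not the paper's code. It also ignores that deeper-layer keys depend on the dropped context.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles (RoPE-style)."""
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = pos * base ** (-i / half)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Position-free keys as they would sit in the cache (toy values).
raw_keys = [[0.1 * (t + 1), -0.2, 0.3, 0.05 * t] for t in range(6)]
query = [0.4, 0.1, -0.3, 0.2]
drop = 2  # truncate the two oldest tokens

# Reference: scores for the surviving tokens at their original positions.
orig_q = rope(query, len(raw_keys))
orig_scores = [dot(orig_q, rope(k, p)) for p, k in enumerate(raw_keys)][drop:]

# Decoupled cache: re-index survivors from 0 and apply positions at load time.
kept = raw_keys[drop:]
new_q = rope(query, len(kept))
new_scores = [dot(new_q, rope(k, p)) for p, k in enumerate(kept)]

# Coupled cache: keys keep stale positions while the query uses the new index.
stale_scores = [dot(new_q, rope(k, p)) for p, k in enumerate(raw_keys)][drop:]
```

Because the rotation makes scores depend only on relative offsets, the re-indexed scores match the reference, while the stale-position scores do not.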
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires models that use relative position encodings (RPE), or changes to model internals to decouple positions.
- Needs extra host DRAM and SSD capacity; storage cost and management matter for large models.
- Effectiveness depends on session reuse and KV size; large-KV models need proportionally more storage.
- Scheduler integration and correct prefetch window tuning are needed to avoid wasted IO.
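To size the extra DRAM and SSD capacity, a rough per-token KV footprint is 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The sketch below applies that formula; the layer and head counts are assumptions taken from the publicly released model configurations rather than from this summary, fp16 storage is assumed, and the function names are hypothetical.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def session_gib(bytes_per_token, tokens):
    """KV footprint of one session, in GiB."""
    return bytes_per_token * tokens / 2**30


# (layers, kv_heads, head_dim): assumed from public model configs.
configs = {
    "LLaMA-2 13B": (40, 40, 128),  # full multi-head attention
    "LLaMA-2 70B": (80, 8, 128),   # grouped-query attention
    "Mistral-7B": (32, 8, 128),    # grouped-query attention
}

per_token = {name: kv_bytes_per_token(*cfg) for name, cfg in configs.items()}
for name, size in per_token.items():
    print(f"{name}: {size / 1024:.0f} KiB/token, "
          f"{session_gib(size, 2048):.2f} GiB per 2048-token session")
```

Models with grouped-query attention store far less per token, so the same DRAM and SSD budget holds many more of their sessions.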
When Not To Use
- Workloads dominated by single-turn requests with little session reuse.
- Environments without host DRAM/SSD access or with extremely tight end-to-end tail-latency SLAs that cannot tolerate prefetch variability.
- Small models where KV recompute is already cheap and HBM capacity suffices.
Failure Modes
- Low cache hit rate when distinct-session arrival rate is high, reducing benefits.
- Insufficient read/write buffer sizes can make prefetching fail and block GPU execution.
- Model or tokenizer changes that mismatch stored KV format can corrupt reuse.
- Disk/DRAM IO spikes can degrade latency if eviction/prefetch logic misbehaves.
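One possible guard against the format-mismatch failure mode is to tag every stored cache with a fingerprint of whatever determines how its bytes are interpreted, and to refuse reuse when the fingerprint differs. The field names and the `kv_format_fingerprint` helper below are hypothetical; the source does not describe such a mechanism.

```python
import hashlib
import json


def kv_format_fingerprint(model_id, tokenizer_rev, dtype, layout_version):
    """Hash the fields that change how stored KV bytes must be interpreted."""
    fields = {
        "model_id": model_id,
        "tokenizer_rev": tokenizer_rev,
        "dtype": dtype,
        "layout_version": layout_version,
    }
    blob = json.dumps(fields, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


def can_reuse(stored_fingerprint, current_fingerprint):
    # On a mismatch, fall back to recomputation instead of loading the cache.
    return stored_fingerprint == current_fingerprint


# Hypothetical identifiers for illustration only.
saved = kv_format_fingerprint("llama-2-13b", "tok-v1", "float16", 1)
same = kv_format_fingerprint("llama-2-13b", "tok-v1", "float16", 1)
changed = kv_format_fingerprint("llama-2-13b", "tok-v2", "float16", 1)
print(can_reuse(saved, same), can_reuse(saved, changed))
```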
Core Entities
Models
- LLaMA-1 65B
- LLaMA-2 13B
- LLaMA-2 70B
- Falcon-40B
- Mistral-7B
Metrics
- Time-to-first-token (TTFT)
- Prefilling throughput
- GPU time
- Cache hit rate
- Perplexity (PPL)
- Accuracy
Datasets
- ShareGPT
Benchmarks
- MMLU
- LongEval
- PIQA
- WikiText-2
- C4
- PTB
Context Entities
Datasets
- ShareGPT (HuggingFace dataset)
Benchmarks
- MMLU
- LongEval
- PIQA

