Overview
The design integrates with common serving stacks (PyTorch/Transformers) and uses practical storage tiers; main risks are IO bandwidth and integration complexity with schedulers.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 6/6
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 85%
Production readiness: 80%
Novelty: 60%
Why It Matters For Business
If your product uses chat or multi-turn flows, caching KV states and overlapping cache IO with GPU work can cut time-to-first-token by up to 87% and end-to-end serving cost by up to 70% compared with recomputing the history on every turn.
Who Should Care
Summary TLDR
CachedAttention saves and reuses transformer key-value (KV) caches across multi-turn chat sessions by storing them in a tiered AttentionStore (host DRAM and SSD). It overlaps IO with GPU work (layer-wise pre-loading and asynchronous saving), uses scheduler-aware prefetch and eviction, and decouples positional encoding from the cached KV so caches remain valid after context truncation. On ShareGPT workloads and open models (LLaMA, Falcon, Mistral) it cut time-to-first-token by up to 87%, raised prefilling throughput by up to 7.8×, and reduced end-to-end cost by up to 70% versus recomputation.
Problem Statement
Serving multi-turn chat is inefficient because GPUs repeatedly recompute large KV caches for historical tokens. ShareGPT analysis shows that 73% of conversations are multi-turn and that up to 99% of a new turn's prefilling cost can come from recomputing historical tokens. This wastes GPU time, HBM space, and cloud spend.
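To make the recomputation share concrete, the sketch below estimates, turn by turn, what fraction of prefilled tokens are historical. It assumes prefill cost scales linearly with token count (a simplification, since attention cost grows faster than linearly with context length). The function name and the token counts are illustrative and do not come from the paper.

```python
def historical_prefill_share(new_tokens_per_turn):
    """Return, for each turn, the fraction of prefilled tokens that are historical.

    new_tokens_per_turn: tokens added to the context at each turn.
    Assumption: without caching, turn i re-prefills every token from earlier turns.
    """
    shares = []
    history = 0
    for new_tokens in new_tokens_per_turn:
        total = history + new_tokens
        shares.append(history / total if total else 0.0)
        history = total  # everything seen so far becomes history for the next turn
    return shares


# Hypothetical conversation: four turns of growing length.
shares = historical_prefill_share([200, 300, 500, 1000])
print(shares)
```

The share climbs as a session gets longer, which is why long multi-turn sessions benefit most from reuse.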
Main Contribution
CachedAttention: an attention variant that reuses saved KV caches across conversation turns to avoid repeated recomputation.
AttentionStore: a hierarchical KV cache (HBM buffer, host DRAM, SSD) with scheduler-aware prefetch and eviction.
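The tiering and scheduler-aware eviction idea can be sketched in a few lines. This is a toy model under stated assumptions, not the paper's implementation: two plain dicts stand in for host DRAM and SSD, capacity is counted in sessions rather than bytes, and `TieredKVStore`, `save`, `prefetch`, `load`, and the `upcoming` queue argument are hypothetical names. When the fast tier overflows, the session demoted is the one the scheduler will need latest (or not at all).

```python
class TieredKVStore:
    """Toy two-tier KV store: `fast` stands in for host DRAM, `slow` for SSD."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity  # max sessions held in the fast tier
        self.fast = {}
        self.slow = {}

    def save(self, session_id, kv, upcoming=()):
        """Store a session's KV in the fast tier, demoting others if needed."""
        self.slow.pop(session_id, None)
        self.fast[session_id] = kv
        self._demote(upcoming)

    def prefetch(self, session_id, upcoming=()):
        """Promote a session to the fast tier before its job is scheduled."""
        if session_id in self.slow:
            self.fast[session_id] = self.slow.pop(session_id)
            self._demote(upcoming, protect=session_id)

    def load(self, session_id):
        """Return (kv, tier), where tier is 'fast', 'slow', or 'miss'."""
        if session_id in self.fast:
            return self.fast[session_id], "fast"
        if session_id in self.slow:
            return self.slow[session_id], "slow"
        return None, "miss"

    def _demote(self, upcoming, protect=None):
        queue = list(upcoming)

        def next_use(sid):
            # Sessions absent from the scheduler queue are the best victims.
            return queue.index(sid) if sid in queue else float("inf")

        while len(self.fast) > self.capacity:
            candidates = [s for s in self.fast if s != protect]
            victim = max(candidates, key=next_use)
            self.slow[victim] = self.fast.pop(victim)


store = TieredKVStore(capacity=2)
store.save("a", "kv_a")
store.save("b", "kv_b")
store.save("c", "kv_c", upcoming=["a", "c"])  # "b" is not queued, so it is demoted
store.prefetch("b", upcoming=["b", "c"])      # "a" is no longer queued, so it is demoted
```

Compared with plain LRU, this policy uses the job queue as a look-ahead window, so a session that was used long ago but is about to run again stays in the fast tier.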
Key Findings
Time-to-first-token (TTFT) drops by 85%–87% versus recomputation when the KV cache is hit (Figure 14)
Prefilling throughput increases 2.6×–7.8× because only new tokens are prefilled (Figure 15)
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Time-to-first-token (TTFT) | 85%–87% reduction | recomputation | -85% to -87% | ShareGPT workload; LLaMA-13B/65B/70B, Falcon-40B (Figure 14) | Figure 14 | Figure 14 |
| Prefilling throughput | 2.6×–7.8× | recomputation | +2.6× to +7.8× | ShareGPT workload; various models (Figure 15) | Figure 15 | Figure 15 |
What To Try In 7 Days
Measure prefilling share of request latency on your logs (is historical token recompute dominant?).
Prototype saving KV cache for active sessions to host DRAM and test simple reuse for a sample workload.
Add a read buffer and layer-wise preloading to hide KV load latency for cached sessions.
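The read-buffer and layer-wise preloading step can be prototyped without a GPU to check the control flow first. The sketch below is a minimal stand-in, assuming a background thread plays the role of the IO path and a bounded queue plays the role of the read buffer; `run_with_layerwise_preload`, `load_layer_kv`, `compute_layer`, and `buffer_layers` are hypothetical names. It has no error handling: if the loader raises, the consumer would block.

```python
import queue
import threading


def run_with_layerwise_preload(num_layers, load_layer_kv, compute_layer, buffer_layers=2):
    """Overlap per-layer KV loading with per-layer compute.

    load_layer_kv(i)     -> KV for layer i (stands in for a host-to-GPU copy)
    compute_layer(i, kv) -> output of layer i (stands in for the GPU kernel)
    buffer_layers        -> read-buffer size: how many layers the loader may run ahead
    """
    read_buffer = queue.Queue(maxsize=buffer_layers)

    def loader():
        for i in range(num_layers):
            # Blocks when the buffer is full, so memory use stays bounded.
            read_buffer.put((i, load_layer_kv(i)))

    thread = threading.Thread(target=loader, daemon=True)
    thread.start()

    outputs = []
    for _ in range(num_layers):
        # Blocks only if loading has fallen behind compute.
        i, kv = read_buffer.get()
        outputs.append(compute_layer(i, kv))
    thread.join()
    return outputs


# Toy usage: string "KV" per layer, and a compute step that just tags it.
outputs = run_with_layerwise_preload(
    num_layers=4,
    load_layer_kv=lambda i: f"kv{i}",
    compute_layer=lambda i, kv: (i, kv),
)
```

A larger `buffer_layers` hides more load latency at the cost of more buffer memory. In a PyTorch prototype the same pattern would more likely use pinned host tensors copied with `non_blocking=True` on a separate CUDA stream, but that is a suggestion rather than something the summary specifies.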
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Requires models that use relative position encodings (RPE), or changes to model internals so that positions are decoupled from the cached KV.
Needs extra host DRAM and SSD capacity; storage cost and management matter for large models.
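The position-encoding limitation above is easier to see with a toy example. The sketch below uses a 2-D rotary-style rotation as a stand-in for a relative position encoding; the rotation, the `theta` value, and all names are assumptions for illustration, and the paper's actual mechanism may differ. Keys cached with positions baked in no longer match after the oldest tokens are truncated, whereas keys cached position-free can have fresh positions applied at load time.

```python
import math


def rotate(vec, position, theta=0.1):
    """Rotate a 2-D key by angle position * theta (rotary-style toy encoding)."""
    x, y = vec
    angle = position * theta
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    )


def apply_positions(keys):
    """Encode each key with its index in the current context."""
    return [rotate(key, pos) for pos, key in enumerate(keys)]


raw_keys = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0), (1.0, 1.0)]  # position-free keys

# Coupled cache: positions are baked in when the cache is saved.
coupled_cache = apply_positions(raw_keys)

# The context overflows and the two oldest tokens are truncated.
kept_coupled = coupled_cache[2:]
kept_raw = raw_keys[2:]

# What recomputation on the truncated context would produce (positions restart at 0).
recomputed = apply_positions(kept_raw)

# Decoupled cache: store raw keys, apply positions when loading.
reused = apply_positions(kept_raw)

print(kept_coupled == recomputed)  # stale positions, cache is invalid
print(reused == recomputed)        # positions applied at load time, cache is valid
```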
When Not To Use
Workloads dominated by single-turn requests with little session reuse.
Environments without host DRAM/SSD access or with extremely tight end-to-end tail-latency SLAs that cannot tolerate prefetch variability.
Failure Modes
Low cache hit rate when distinct-session arrival rate is high, reducing benefits.
Insufficient read/write buffer sizes can make prefetching fail and block GPU execution.

