Save and reuse attention KV caches across turns to cut LLM serving latency and cloud cost

March 23, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The design integrates with common serving stacks (PyTorch/Transformers) and uses practical storage tiers; main risks are IO bandwidth and integration complexity with schedulers.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 85%

Production readiness: 80%

Novelty: 60%

Authors

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo


Why It Matters For Business

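If your product uses chat or multi-turn flows, caching KV states and overlapping cache IO with GPU work can cut both latency and cloud GPU cost: the paper reports up to 87% lower time-to-first-token and up to 70% lower end-to-end cost versus recomputation.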

Summary TLDR

CachedAttention saves and reuses transformer key-value (KV) caches across multi-turn chat sessions by storing them in a tiered AttentionStore (DRAM and SSD). It overlaps IO with GPU work (layer-wise pre-loading and asynchronous saving), uses scheduler-aware prefetch/eviction, and decouples positional encoding so caches remain valid after truncation. On ShareGPT workloads and open models (LLaMA, Falcon, Mistral) it cut time-to-first-token by up to 87%, raised prefilling throughput up to 7.8×, and reduced end-to-end cost up to 70% versus recomputation.
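To make the reuse idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the scalar "tokens", the projection weights, the `kv_store` dictionary, and the function names are all hypothetical stand-ins, and the toy has a single layer with no positional encoding. It only illustrates that appending KV for new tokens to a saved cache gives the same attention result as recomputing the whole history, while computing far fewer tokens.

```python
import math

# Toy projection weights; hypothetical values chosen only for illustration.
W_Q, W_K, W_V = [0.2, 0.1], [0.1, -0.2], [0.3, 0.05]

def project(token, weights):
    """Map a scalar toy token to a small vector."""
    return [w * token for w in weights]

def compute_kv(tokens):
    """The expensive step that reuse avoids repeating for history tokens."""
    return [(project(t, W_K), project(t, W_V)) for t in tokens]

def attend(query_token, kv):
    """Single-head softmax attention of one query over all cached keys."""
    q = project(query_token, W_Q)
    scores = [sum(a * b for a, b in zip(q, k)) for k, _ in kv]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, (_, v) in zip(weights, kv)) / total
            for d in range(len(W_V))]

kv_store = {}  # hypothetical: session_id -> saved KV (stand-in for DRAM/SSD)

def prefill_with_reuse(session_id, new_tokens):
    """Compute KV only for the new tokens and append to the saved cache."""
    kv = kv_store.get(session_id, []) + compute_kv(new_tokens)
    kv_store[session_id] = kv
    return kv, len(new_tokens)

def prefill_recompute(history, new_tokens):
    """Baseline: recompute KV for the full conversation on every turn."""
    tokens = history + new_tokens
    return compute_kv(tokens), len(tokens)

turn1, turn2 = [1, 2, 3], [4, 5]
prefill_with_reuse("session-a", turn1)
kv_reused, computed_reuse = prefill_with_reuse("session-a", turn2)
kv_full, computed_full = prefill_recompute(turn1, turn2)
```

In this toy, the second turn computes KV for 2 tokens with reuse and 5 tokens without, and both paths yield identical caches.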

Problem Statement

Serving multi-turn chat is inefficient because GPUs repeatedly recompute large KV caches for historical tokens. ShareGPT analysis shows that 73% of conversations are multi-turn and that up to 99% of a new turn's prefilling cost can come from recomputing historical tokens. This wastes GPU time, HBM space, and cloud money.
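The arithmetic behind that share is easy to check on your own traffic. The sketch below assumes each turn is logged as a hypothetical `(prompt_tokens, response_tokens)` pair and that, without caching, every turn prefills the full history plus the new prompt. The turn lengths are made-up illustrative numbers, not figures from the paper.

```python
def historical_prefill_share(turns):
    """Per turn, the fraction of prefilled tokens that are history.

    `turns` is a list of (prompt_tokens, response_tokens) pairs.
    Assumes full recomputation: each turn prefills history + new prompt.
    """
    shares = []
    history = 0
    for prompt_tokens, response_tokens in turns:
        prefilled = history + prompt_tokens
        shares.append(history / prefilled)
        # Both the prompt and the generated reply become history.
        history = prefilled + response_tokens
    return shares

# Illustrative conversation; values are invented for the example.
example_turns = [(50, 200), (40, 300), (30, 250)]
shares = historical_prefill_share(example_turns)
```

Even in this short three-turn example the historical share rises quickly with each turn, which is the effect the ShareGPT analysis quantifies.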

Main Contribution

CachedAttention: an attention variant that reuses saved KV caches across conversation turns to avoid repeated recomputation.

AttentionStore: a hierarchical KV cache (HBM buffer, host DRAM, SSD) with scheduler-aware prefetch and eviction.
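One plausible reading of "scheduler-aware prefetch and eviction" is sketched below: the store looks at the scheduler's queue of upcoming sessions, pulls those sessions' caches up from SSD, and demotes the DRAM-resident cache whose next use is furthest away (or absent). The class name, the slot-based capacity, and the exact policy are assumptions made for illustration; the source does not spell out the policy at this level of detail, and the HBM buffer tier is omitted.

```python
class TieredKVStore:
    """Toy two-tier store: a small 'DRAM' dict backed by an 'SSD' dict."""

    def __init__(self, dram_slots):
        assert dram_slots >= 1
        self.dram_slots = dram_slots
        self.dram = {}
        self.ssd = {}

    def _next_use(self, session_id, queue):
        """Position of the session in the job queue, or infinity if absent."""
        return queue.index(session_id) if session_id in queue else float("inf")

    def _evict_one(self, queue, protect):
        """Demote the DRAM entry needed latest, never the protected one."""
        victim = max((s for s in self.dram if s != protect),
                     key=lambda s: self._next_use(s, queue))
        self.ssd[victim] = self.dram.pop(victim)

    def _admit(self, session_id, kv, queue):
        self.dram[session_id] = kv
        while len(self.dram) > self.dram_slots:
            self._evict_one(queue, protect=session_id)

    def save(self, session_id, kv, queue):
        """Save a session's KV after a turn, evicting with queue look-ahead."""
        self.ssd.pop(session_id, None)
        self._admit(session_id, kv, queue)

    def prefetch(self, queue):
        """Promote caches for the next few queued sessions from SSD."""
        for session_id in queue[: self.dram_slots]:
            if session_id in self.ssd:
                self._admit(session_id, self.ssd.pop(session_id), queue)

    def tier_of(self, session_id):
        if session_id in self.dram:
            return "dram"
        return "ssd" if session_id in self.ssd else "miss"

store = TieredKVStore(dram_slots=2)
upcoming = ["b", "c"]              # scheduler queue: b runs next, then c
for sid in ["a", "b", "c"]:
    store.save(sid, f"kv_{sid}", upcoming)
tiers_after_save = {s: store.tier_of(s) for s in "abc"}
store.prefetch(["a", "b"])         # queue changes: a is needed soon
tiers_after_prefetch = {s: store.tier_of(s) for s in "abc"}
```

Here session `a` is demoted first because it is not in the queue, then promoted again once the queue shows it is about to run.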

Key Findings

Time-to-first-token (TTFT) drops dramatically when cached KV hits occur

Numbers: TTFT reduced by up to 87% (Figure 14)

Practical Use: If your workload reuses sessions, expect users to see first output much faster; prioritize caching for chat apps.

Evidence Ref: Figure 14

Prefilling throughput increases severalfold because only new tokens are prefilled

Numbers: Prefill throughput improved up to 7.8× (Figure 15)

Practical Use: You can serve more prompt-prefill work per GPU, which reduces scaling needs for bursty multi-turn traffic.

Evidence Ref: Figure 15

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Time-to-first-token (TTFT) | 85%–87% reduction | recomputation | -85% to -87% | ShareGPT workload; LLaMA-13B/65B/70B, Falcon-40B (Figure 14) | Figure 14 | Figure 14
Prefilling throughput | 2.6×–7.8× | recomputation | +2.6× to +7.8× | ShareGPT workload; various models (Figure 15) | Figure 15 | Figure 15

What To Try In 7 Days

Measure prefilling share of request latency on your logs (is historical token recompute dominant?).

Prototype saving KV cache for active sessions to host DRAM and test simple reuse for a sample workload.

Add a read buffer and layer-wise preloading to hide KV load latency for cached sessions.
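The last step, layer-wise preloading, can be sketched as follows. This is a simplified illustration using a background thread and a one-layer look-ahead; `load_layer_kv` and `compute_layer` are hypothetical stand-ins for a host-to-GPU copy and a transformer layer. In a real PyTorch stack the overlap would more likely use a separate CUDA stream with pinned host memory, and a larger read buffer would allow loading several layers ahead.

```python
from concurrent.futures import ThreadPoolExecutor

def load_layer_kv(host_cache, layer):
    """Stand-in for copying one layer's saved KV from host memory to the GPU."""
    return list(host_cache[layer])

def compute_layer(layer, kv, hidden):
    """Stand-in for one transformer layer that consumes the preloaded KV."""
    return hidden + sum(kv)

def forward_with_preloading(host_cache, hidden):
    """Run all layers, loading layer i+1's KV while layer i computes."""
    num_layers = len(host_cache)
    if num_layers == 0:
        return hidden
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load_layer_kv, host_cache, 0)
        for layer in range(num_layers):
            kv = pending.result()  # blocks only if the load is not done yet
            if layer + 1 < num_layers:
                # Kick off the next load before starting this layer's compute.
                pending = io_pool.submit(load_layer_kv, host_cache, layer + 1)
            hidden = compute_layer(layer, kv, hidden)
    return hidden

toy_cache = [[1, 2], [3, 4], [5, 6]]   # one small KV list per layer
result = forward_with_preloading(toy_cache, 0)
```

If the load for a layer finishes before the previous layer's compute does, `pending.result()` returns immediately and the IO is fully hidden; otherwise the GPU-side loop waits, which is the blocking case listed under Failure Modes.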

Optimization Features

Token Efficiency: support for KV truncation without recompute
Infra Optimization: move cold KV caches to SSD to trade storage cost for GPU compute
System Optimization: hierarchical AttentionStore (DRAM + SSD); HBM read/write buffers to overlap IO
Inference Optimization: KV cache reuse across turns; layer-wise pre-loading; asynchronous KV saving; scheduler-aware prefetch/evict; decoupled positional encoding (RPE)
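Why decoupling positions keeps a cache valid after truncation can be shown with a toy example. The sketch assumes a RoPE-style 2D rotation as a stand-in for the model's relative position encoding; `THETA`, `raw_key`, and the token values are hypothetical. Keys saved without positions can be re-indexed after old tokens are dropped, whereas keys saved with positions baked in still carry their old indices.

```python
import math

THETA = 0.5  # hypothetical rotation step per position

def rotate(vec, angle):
    """Rotate a 2D vector, as a RoPE-style encoding does per position."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def raw_key(token):
    """Toy position-free key for a scalar token."""
    return (0.1 * token, 1.0)

def apply_positions(raw_keys):
    """Embed positions 0..n-1 into position-free keys at use time."""
    return [rotate(k, pos * THETA) for pos, k in enumerate(raw_keys)]

tokens = [3, 1, 4, 1, 5, 9]
decoupled_cache = [raw_key(t) for t in tokens]    # saved without positions
coupled_cache = apply_positions(decoupled_cache)  # saved with positions

# Truncate the two oldest tokens; the rest now sit at positions 0..3.
kept = tokens[2:]
reference = apply_positions([raw_key(t) for t in kept])  # full recompute
from_decoupled = apply_positions(decoupled_cache[2:])    # reuse, re-indexed
from_coupled = coupled_cache[2:]                         # reuse, stale indices
```

The re-indexed decoupled cache matches a full recompute of the truncated context, while the coupled cache does not. How much such a mismatch affects output quality for a given model is outside the scope of this toy.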

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires models that use relative position encodings (RPE), or changes to model internals to decouple positions from the cached KV.

Needs extra host DRAM and SSD capacity; storage cost and management matter for large models.

When Not To Use

Workloads dominated by single-turn requests with little session reuse.

Environments without host DRAM/SSD access or with extremely tight end-to-end tail-latency SLAs that cannot tolerate prefetch variability.

Failure Modes

Low cache hit rate when distinct-session arrival rate is high, reducing benefits.

Insufficient read/write buffer sizes can make prefetching fail and block GPU execution.

Core Entities

Models

LLaMA-1 65B, LLaMA-2 13B, LLaMA-2 70B, Falcon-40B, Mistral-7B

Metrics

Time-to-first-token (TTFT), Prefilling throughput, GPU time, Cache hit rate, Perplexity (PPL), Accuracy

Datasets

ShareGPT

Benchmarks

MMLU, LongEval, PIQA, WikiText-2, C4, PTB

Context Entities

Datasets

ShareGPT (HuggingFace dataset)

Benchmarks

MMLU, LongEval, PIQA