Save and reuse attention KV caches across turns to cut LLM serving latency and cloud cost

Overview

Decision Snapshot: Needs Validation

The design integrates with common serving stacks (PyTorch/Transformers) and uses practical storage tiers; main risks are IO bandwidth and integration complexity with schedulers.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 85%

Production readiness: 80%

Novelty: 60%

Authors

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo

Links

Abstract / PDF / Data

Why It Matters For Business

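If your product uses chat or multi-turn flows, caching KV states and overlapping cache IO with GPU work can cut latency and cloud GPU cost substantially: the reported experiments show up to 87% lower time-to-first-token and up to 70% lower end-to-end cost versus recomputation.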

Who Should Care

CTO, Product Manager, Engineering Lead, ML Engineer, Data Scientist

Summary TLDR

CachedAttention saves and reuses transformer key-value (KV) caches across multi-turn chat sessions by storing them in a tiered AttentionStore (DRAM and SSD). It overlaps IO with GPU work (layer-wise pre-loading and asynchronous saving), uses scheduler-aware prefetch/eviction, and decouples positional encoding so caches remain valid after truncation. On ShareGPT workloads and open models (LLaMA, Falcon, Mistral) it cut time-to-first-token by up to 87%, raised prefilling throughput up to 7.8×, and reduced end-to-end cost up to 70% versus recomputation.

Problem Statement

Serving multi-turn chat is inefficient because GPUs repeatedly recompute large KV caches for historical tokens. Analysis of ShareGPT shows that 73% of conversations are multi-turn and that up to 99% of a new turn's prefilling cost can come from recomputing historical tokens. This wastes GPU time, HBM space, and cloud spend.
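To make the recomputation overhead concrete, here is a minimal sketch that estimates, for each turn, what fraction of the prefilled tokens are history. The function name and the session lengths are hypothetical, and token count is used as a rough stand-in for prefilling cost (attention cost grows faster than linearly with length, so this likely understates the effect).

```python
def historical_share(turn_lengths):
    """Return, per turn, the fraction of prefill tokens that are history.

    turn_lengths[i] is the number of new tokens added at turn i (assumed > 0).
    Without KV reuse, turn i must prefill every token from turns 0..i.
    """
    shares = []
    history = 0
    for new_tokens in turn_lengths:
        total = history + new_tokens
        shares.append(history / total)
        history = total
    return shares


# Hypothetical session: 400 new tokens per turn over 5 turns.
shares = historical_share([400] * 5)
print([round(s, 2) for s in shares])  # [0.0, 0.5, 0.67, 0.75, 0.8]
```

Even in this short session, four fifths of the fifth turn's prefill is spent on tokens the model has already processed; longer sessions push the share higher.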

Main Contribution

CachedAttention: an attention variant that reuses saved KV caches across conversation turns to avoid repeated recomputation.

AttentionStore: a hierarchical KV cache (HBM buffer, host DRAM, SSD) with scheduler-aware prefetch and eviction.
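The tiering idea can be sketched as a toy two-level store. This is not the authors' implementation: the class name, the method names, and the specific eviction rule (demote the session whose next use in the scheduler's queue is furthest away, or that is not queued at all) are assumptions chosen for illustration, and plain dicts stand in for host DRAM and SSD.

```python
class TieredKVStore:
    """Toy two-tier store: a small 'DRAM' dict backed by a larger 'SSD' dict."""

    def __init__(self, dram_capacity):
        self.dram_capacity = dram_capacity
        self.dram = {}  # session_id -> KV payload (fast tier)
        self.ssd = {}   # session_id -> KV payload (slow tier)

    def save(self, session_id, kv, upcoming):
        """Store a session's KV in DRAM, demoting other sessions if it is full."""
        self.ssd.pop(session_id, None)
        self.dram[session_id] = kv
        self._evict(upcoming)

    def prefetch(self, upcoming):
        """Promote sessions the scheduler will run soon from SSD to DRAM."""
        for session_id in upcoming[: self.dram_capacity]:
            if session_id in self.ssd:
                self.dram[session_id] = self.ssd.pop(session_id)
        self._evict(upcoming)

    def load(self, session_id):
        """Return (kv, tier), or (None, 'miss') if the session is not cached."""
        if session_id in self.dram:
            return self.dram[session_id], "dram"
        if session_id in self.ssd:
            return self.ssd[session_id], "ssd"
        return None, "miss"

    def _evict(self, upcoming):
        # Scheduler-aware victim choice: sessions absent from the queue go
        # first, then the session whose next use is furthest away.
        def next_use(session_id):
            if session_id in upcoming:
                return upcoming.index(session_id)
            return len(upcoming)

        while len(self.dram) > self.dram_capacity:
            victim = max(self.dram, key=next_use)
            self.ssd[victim] = self.dram.pop(victim)


store = TieredKVStore(dram_capacity=2)
queue = ["b", "c"]              # hypothetical scheduler look-ahead
store.save("a", "kv-a", queue)
store.save("b", "kv-b", queue)
store.save("c", "kv-c", queue)  # DRAM is full; "a" is not queued, so it is demoted
print(store.load("a")[1], store.load("b")[1])  # ssd dram
```

Using the job queue instead of plain recency means a session that is about to run is kept in the fast tier even if it has been idle for a while.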

Key Findings

Time-to-first-token (TTFT) drops dramatically when cached KV hits occur

Numbers: TTFT reduced by up to 87% (Figure 14)

Practical Use: If your workload reuses sessions, expect users to see first output much faster; prioritize caching for chat apps.

Evidence Ref: Figure 14

Prefilling throughput increases several-fold because only new tokens are prefilled

Numbers: Prefill throughput improved up to 7.8× (Figure 15)

Practical Use: You can serve more prompt-prefill work per GPU and reduce scaling needs for bursty multi-turn traffic.

Evidence Ref: Figure 15

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Time-to-first-token (TTFT)	85%–87% reduction	recomputation	-85% to -87%	ShareGPT workload; LLaMA-13B/65B/70B, Falcon-40B (Figure 14)	Figure 14	Figure 14
Prefilling throughput	2.6×–7.8×	recomputation	+2.6× to +7.8×	ShareGPT workload; various models (Figure 15)	Figure 15	Figure 15

What To Try In 7 Days

Measure the prefilling share of request latency in your logs (does recomputing historical tokens dominate?).

Prototype saving KV cache for active sessions to host DRAM and test simple reuse for a sample workload.

Add a read buffer and layer-wise preloading to hide KV load latency for cached sessions.
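The layer-wise preloading step can be prototyped without a GPU to check the control flow. The sketch below is an assumption-laden stand-in: `load_layer_kv` and `compute_layer` are hypothetical placeholders that sleep instead of doing real IO or attention, and the look-ahead is a single layer, whereas a larger read buffer would let loading start further ahead.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def load_layer_kv(layer):
    """Stand-in for copying one layer's saved KV from host memory to the GPU."""
    time.sleep(0.01)  # simulated IO latency
    return f"kv[{layer}]"


def compute_layer(layer, kv):
    """Stand-in for running one layer's attention with its KV in place."""
    time.sleep(0.01)  # simulated GPU compute
    return (layer, kv)


def prefill_with_preloading(num_layers):
    """Run layers in order while the next layer's KV loads in the background."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load_layer_kv, 0)
        for layer in range(num_layers):
            kv = pending.result()  # blocks only if IO is behind compute
            if layer + 1 < num_layers:
                pending = io_pool.submit(load_layer_kv, layer + 1)
            outputs.append(compute_layer(layer, kv))
    return outputs


print(prefill_with_preloading(3))
```

Only the first layer's load sits on the critical path here; every later load overlaps with the previous layer's compute. If loads are slower than compute, `pending.result()` stalls the loop, which is the buffer-related failure mode to watch for.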

Optimization Features

Token Efficiency

support for KV truncation without recompute
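A small numeric sketch shows why decoupling positions matters for truncation. It assumes a rotary-style encoding reduced to a single 2-D pair with an arbitrary frequency (`theta=0.5`); real models apply this per head across many dimension pairs, and all names and values here are hypothetical. It checks only the positional component of the cached keys.

```python
import math


def rotate(vec, position, theta=0.5):
    """Apply a 2-D rotary position encoding to one (x, y) pair."""
    x, y = vec
    angle = position * theta
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


def encode(raw_keys):
    """Attach positions 0..n-1 to position-free keys at attention time."""
    return [rotate(k, pos) for pos, k in enumerate(raw_keys)]


# Hypothetical position-free keys for a 4-token history.
raw_keys = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]

# Option A: cache keys with positions baked in, then drop the 2 oldest tokens.
baked = encode(raw_keys)[2:]      # still carries positions 2 and 3

# Option B: cache position-free keys, truncate, then encode on load.
decoupled = encode(raw_keys[2:])  # carries positions 0 and 1

# Positions a fresh prefill of the truncated context would assign.
expected = [rotate(raw_keys[2], 0), rotate(raw_keys[3], 1)]

print(decoupled == expected)  # True
print(baked == expected)      # False
```

With positions baked in, the surviving entries keep their old offsets after truncation and no longer match the shortened context; storing position-free keys lets the positions be reassigned when the cache is loaded.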

Infra Optimization

move cold KV caches to SSD to trade storage cost for GPU compute

System Optimization

hierarchical AttentionStore (DRAM + SSD)

HBM read/write buffers to overlap IO

Inference Optimization

KV cache reuse across turns

layer-wise pre-loading

asynchronous KV saving

scheduler-aware prefetch/evict

decoupled positional encoding (RPE)

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://huggingface.co/datasets/philschmid/sharegpt-raw

Risks & Boundaries

Limitations

Requires models that use relative position encodings (RPE), or changes to model internals so that positions can be decoupled from the cached KV.

Needs extra host DRAM and SSD capacity; storage cost and management matter for large models.

When Not To Use

Workloads dominated by single-turn requests with little session reuse.

Environments without host DRAM/SSD access or with extremely tight end-to-end tail-latency SLAs that cannot tolerate prefetch variability.

Failure Modes

Low cache hit rate when distinct-session arrival rate is high, reducing benefits.

Insufficient read/write buffer sizes can make prefetching fail and block GPU execution.

Core Entities

Models

LLaMA-1 65B, LLaMA-2 13B, LLaMA-2 70B, Falcon-40B, Mistral-7B

Metrics

Time-to-first-token (TTFT), Prefilling throughput, GPU time, Cache hit rate, Perplexity (PPL), Accuracy

Datasets

ShareGPT

Benchmarks

MMLU, LongEval, PIQA, WikiText-2, C4, PTB

Context Entities

Datasets

ShareGPT (HuggingFace dataset)

Benchmarks

MMLU, LongEval, PIQA

