Save and reuse attention KV caches across turns to cut LLM serving latency and cloud cost

March 23, 2024 · 7 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.85

Citation Count

1

Authors

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo

Why It Matters For Business

If your product uses chat or multi-turn flows, caching KV states and overlapping cache IO with GPU work can cut latency and cloud GPU costs dramatically.

Summary TLDR

CachedAttention saves and reuses transformer key-value (KV) caches across multi-turn chat sessions by storing them in a tiered AttentionStore (host DRAM and SSD). It overlaps IO with GPU work (layer-wise pre-loading and asynchronous saving), uses scheduler-aware prefetch and eviction, and decouples positional encoding from the saved KV so caches remain valid after truncation. On ShareGPT workloads and open models (LLaMA, Falcon, Mistral), it cut time-to-first-token by up to 87%, raised prefilling throughput by up to 7.8×, and reduced end-to-end cost by up to 70% versus recomputation.

Problem Statement

Serving multi-turn chat is inefficient because GPUs recompute the KV caches of historical tokens on every new turn. ShareGPT analysis shows that 73% of conversations are multi-turn and that up to 99% of a new turn's prefilling cost can come from recomputing historical tokens. This wastes GPU time, HBM space, and cloud money.
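To make the recomputation overhead concrete, here is a minimal arithmetic sketch. It assumes a simplified cost model in which each turn adds a fixed number of tokens to the context and prefill cost is counted in tokens; the function names and the 100-token turns are illustrative, not taken from the paper.

```python
def prefill_tokens(turn_lengths):
    """Return (recompute_total, reuse_total) prefilled tokens over one session.

    turn_lengths[i] is the number of tokens turn i adds to the context
    (an illustrative simplification, not the paper's cost model).
    """
    recompute_total = 0
    reuse_total = 0
    history = 0
    for new_tokens in turn_lengths:
        recompute_total += history + new_tokens  # history is prefilled again
        reuse_total += new_tokens                # history comes from the cache
        history += new_tokens
    return recompute_total, reuse_total


def historical_share(turn_lengths):
    """Fraction of the last turn's prefill that consists of historical tokens."""
    history = sum(turn_lengths[:-1])
    return history / (history + turn_lengths[-1])


recompute, reuse = prefill_tokens([100, 100, 100, 100])
print(recompute, reuse)                         # 1000 400
print(historical_share([100, 100, 100, 100]))   # 0.75
```

Even in this four-turn toy session, three quarters of the last turn's prefill is history; longer sessions push that share toward the figures reported above.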

Main Contribution

CachedAttention: an attention variant that reuses saved KV caches across conversation turns to avoid repeated recomputation.

AttentionStore: a hierarchical KV cache (HBM buffer, host DRAM, SSD) with scheduler-aware prefetch and eviction.

Overlap IO and compute via layer-wise pre-loading and asynchronous saving to hide KV transfer latency.

Decoupled positional encoding for KV caches so truncation keeps saved caches valid and avoids recompute.

Comprehensive experiments on ShareGPT with LLaMA/Falcon/Mistral showing large latency, throughput, and cost gains.
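The decoupled positional encoding idea can be illustrated with a toy sketch. It assumes a one-pair rotary-style encoding on 2-D key vectors and only models the positional rotation of keys, not a full transformer layer; `rope`, `load_decoupled`, and the sample vectors are hypothetical names and values chosen for illustration.

```python
import math


def rope(vec, pos, theta=0.1):
    """Rotate a 2-D vector by pos * theta (a one-pair toy rotary encoding)."""
    x, y = vec
    a = pos * theta
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))


# Keys before any positional encoding is applied.
raw_keys = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0), (1.0, 1.0)]

# Coupled cache: positions are baked into the keys at save time.
coupled = [rope(k, p) for p, k in enumerate(raw_keys)]


def load_decoupled(saved_raw_keys):
    """Decoupled cache: positions are embedded only when the keys are loaded."""
    return [rope(k, p) for p, k in enumerate(saved_raw_keys)]


# Truncate the two oldest tokens; the survivors should now sit at positions 0 and 1.
reference = [rope(raw_keys[2], 0), rope(raw_keys[3], 1)]
print(load_decoupled(raw_keys[2:]) == reference)  # True
print(coupled[2:] == reference)                   # False: still encodes positions 2 and 3
```

Saving the keys without positions lets the loader assign fresh positions after truncation, whereas the coupled cache keeps stale ones.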

Key Findings

Time-to-first-token (TTFT) drops dramatically when cached KV hits occur

Numbers: TTFT reduced by up to 87% (Figure 14)

Prefilling throughput increases severalfold because only new tokens are prefilled

Numbers: Prefill throughput improved by up to 7.8× (Figure 15)

End-to-end inference cost falls because GPU hours drop

Numbers: Total cost reduced by up to 70% (Figure 17)

AttentionStore can achieve high cache hit rates across models

Numbers: Hit rates ≈ 71%–90% depending on model and storage (Figure 13)

Decoupling positional encoding preserves model quality after truncation

Numbers: Perplexity difference between CA and recomputation < 0.02; NKVT PPL >> 1000 (Table 1)

Results

Time-to-first-token (TTFT)

Value: 85%–87% reduction

Baseline: recomputation

Prefilling throughput

Value: 2.6×–7.8×

Baseline: recomputation

End-to-end inference cost

Value: 43%–70% reduction

Baseline: recomputation (AWS pricing)

AttentionStore cache hit rate

Value: 71%–90%

Baseline: no stored KV (recompute)

GPU time speedup

Value: 1.9×–4.0×

Baseline: recomputation

Perplexity (PPL) after truncation

Value: CA ≈ recompute (difference < 0.02)

Baseline: token truncation with recompute

Who Should Care

What To Try In 7 Days

Measure prefilling share of request latency on your logs (is historical token recompute dominant?).

Prototype saving KV cache for active sessions to host DRAM and test simple reuse for a sample workload.

Add a read buffer and layer-wise preloading to hide KV load latency for cached sessions.
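Before building the layer-wise preloading step, a back-of-the-envelope timeline can show how much load latency it might hide. The sketch below assumes constant per-layer load and compute times and a read buffer large enough to hold every preloaded layer; the function names and timings are hypothetical.

```python
def serial_time(num_layers, t_load, t_compute):
    """Load each layer's KV, then compute on it, with no overlap."""
    return num_layers * (t_load + t_compute)


def pipelined_time(num_layers, t_load, t_compute):
    """Layer-wise pre-loading: layer i+1's KV loads while layer i computes."""
    load_done = 0.0
    compute_done = 0.0
    for _ in range(num_layers):
        load_done += t_load                    # loads run back to back on the IO path
        start = max(load_done, compute_done)   # need this layer's KV and the previous layer's output
        compute_done = start + t_compute
    return compute_done


print(serial_time(4, 2, 3))      # 20
print(pipelined_time(4, 2, 3))   # 14.0 (compute-bound: only the first load is exposed)
print(pipelined_time(4, 3, 2))   # 14.0 (IO-bound: only the last compute is exposed)
```

When loading is slower than compute, the pipeline is IO-bound, so a prototype would also need to start loading earlier (for example from a prefilled read buffer) to hide the remaining gap.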

Optimization Features

Token Efficiency

  • support for KV truncation without recompute

Infra Optimization

  • move cold KV caches to SSD to trade storage cost for GPU compute

System Optimization

  • hierarchical AttentionStore (DRAM + SSD)
  • HBM read/write buffers to overlap IO
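A hierarchical store of this kind can be sketched with two ordered maps. The class below is a toy that assumes plain least-recently-used demotion from DRAM to SSD and counts capacity in sessions rather than bytes; `TieredKVStore` and its methods are hypothetical and do not reflect the AttentionStore implementation.

```python
from collections import OrderedDict


class TieredKVStore:
    """Toy two-tier store: a small fast tier (DRAM) backed by a larger slow tier (SSD)."""

    def __init__(self, dram_slots, ssd_slots):
        self.dram_slots = dram_slots
        self.ssd_slots = ssd_slots
        self.dram = OrderedDict()  # session_id -> KV blob, least recently used first
        self.ssd = OrderedDict()

    def put(self, session_id, kv):
        self.ssd.pop(session_id, None)      # keep at most one copy per session
        self.dram[session_id] = kv
        self.dram.move_to_end(session_id)
        self._spill()

    def get(self, session_id):
        """Return (kv, tier) on a hit or (None, None) on a miss."""
        if session_id in self.dram:
            self.dram.move_to_end(session_id)
            return self.dram[session_id], "dram"
        if session_id in self.ssd:
            kv = self.ssd.pop(session_id)
            self.put(session_id, kv)        # promote to DRAM on access
            return kv, "ssd"
        return None, None

    def _spill(self):
        while len(self.dram) > self.dram_slots:
            sid, kv = self.dram.popitem(last=False)  # demote the LRU entry to SSD
            self.ssd[sid] = kv
        while len(self.ssd) > self.ssd_slots:
            self.ssd.popitem(last=False)             # drop the oldest SSD entry


store = TieredKVStore(dram_slots=1, ssd_slots=1)
store.put("a", b"kv-a")
store.put("b", b"kv-b")   # "a" is demoted to SSD
store.put("c", b"kv-c")   # "b" is demoted, "a" falls out of SSD
print(store.get("c")[1], store.get("b")[1], store.get("a")[1])  # dram ssd None
```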

Inference Optimization

  • KV cache reuse across turns
  • layer-wise pre-loading
  • asynchronous KV saving
  • scheduler-aware prefetch/evict
  • decoupled positional encoding (RPE)
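The scheduler-aware prefetch/evict item can be sketched as two small policies driven by the job queue. This is a simplified illustration that assumes the scheduler exposes the session ids of waiting jobs in execution order; `pick_victim` and `prefetch_order` are hypothetical helpers, not the paper's exact algorithm.

```python
def pick_victim(cached_sessions, job_queue):
    """Choose which cached session to evict, given the scheduler's pending jobs.

    cached_sessions: non-empty list of session ids currently held in the cache.
    job_queue: session ids of waiting jobs, next to run first.
    """
    next_use = {}
    for position, session_id in enumerate(job_queue):
        next_use.setdefault(session_id, position)  # keep the earliest pending job
    # Sessions with no pending job count as "needed infinitely far in the future".
    return max(cached_sessions, key=lambda s: next_use.get(s, float("inf")))


def prefetch_order(job_queue, window):
    """Sessions to prefetch into the fast tier: the first `window` distinct ones."""
    seen = []
    for session_id in job_queue:
        if len(seen) >= window:
            break
        if session_id not in seen:
            seen.append(session_id)
    return seen


print(pick_victim(["a", "b", "c"], ["b", "a", "b"]))   # c (no pending job)
print(pick_victim(["a", "b"], ["b", "a"]))             # a (needed last)
print(prefetch_order(["b", "a", "b", "c"], window=2))  # ['b', 'a']
```

Using the queue instead of recency alone keeps soon-to-run sessions in the fast tier even if they were idle for a while.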

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires models that use relative position encodings (RPE), or changes to model internals so that positions can be decoupled from saved KV.
  • Needs extra host DRAM and SSD capacity; storage cost and management matter for large models.
  • Effectiveness depends on session reuse and KV size; large-KV models need proportionally more storage.
  • Scheduler integration and correct prefetch window tuning are needed to avoid wasted IO.
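To size the extra DRAM and SSD, a rough per-token KV footprint helps. The sketch below assumes fp16 values and standard multi-head attention; the LLaMA-2 13B shape used (40 layers, 40 KV heads, head dimension 128) is the publicly documented configuration, while the helper names and the 2,048-token session are illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def session_kv_gib(tokens, **model_shape):
    """KV footprint of one session in GiB."""
    return tokens * kv_bytes_per_token(**model_shape) / 2**30


llama2_13b = dict(num_layers=40, num_kv_heads=40, head_dim=128)
print(kv_bytes_per_token(**llama2_13b))     # 819200
print(session_kv_gib(2048, **llama2_13b))   # 1.5625
```

Multiplying the per-session figure by the number of sessions you want to keep warm gives a first estimate of the DRAM and SSD capacity to provision.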

When Not To Use

  • Workloads dominated by single-turn requests with little session reuse.
  • Environments without host DRAM/SSD access or with extremely tight end-to-end tail-latency SLAs that cannot tolerate prefetch variability.
  • Small models where KV recompute is already cheap and HBM capacity suffices.

Failure Modes

  • Low cache hit rate when distinct-session arrival rate is high, reducing benefits.
  • Insufficient read/write buffer sizes can make prefetching fail and block GPU execution.
  • Model or tokenizer changes that mismatch stored KV format can corrupt reuse.
  • Disk/DRAM IO spikes can degrade latency if eviction/prefetch logic misbehaves.
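One way to guard against the format-mismatch failure mode is to tag each saved cache with a fingerprint of the serving configuration and treat any mismatch as a miss. This is a minimal sketch of that idea, not something described in the paper; the field names, revision strings, and entry layout are assumptions.

```python
import hashlib
import json


def kv_fingerprint(model_name, model_revision, tokenizer_revision, dtype):
    """Hash the settings that determine whether a saved KV cache is reusable."""
    payload = json.dumps(
        {
            "model": model_name,
            "model_rev": model_revision,
            "tokenizer_rev": tokenizer_revision,
            "dtype": dtype,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_if_compatible(entry, current_fingerprint):
    """Return the saved KV only if the same serving configuration produced it."""
    if entry["fingerprint"] != current_fingerprint:
        return None  # treat as a cache miss and recompute
    return entry["kv"]


old = kv_fingerprint("example-model", "rev-1", "tok-1", "float16")
new = kv_fingerprint("example-model", "rev-2", "tok-1", "float16")
entry = {"fingerprint": old, "kv": b"saved-kv"}
print(load_if_compatible(entry, old))   # b'saved-kv'
print(load_if_compatible(entry, new))   # None
```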

Core Entities

Models

  • LLaMA-1 65B
  • LLaMA-2 13B
  • LLaMA-2 70B
  • Falcon-40B
  • Mistral-7B

Metrics

  • Time-to-first-token (TTFT)
  • Prefilling throughput
  • GPU time
  • Cache hit rate
  • Perplexity (PPL)
  • Accuracy

Datasets

  • ShareGPT

Benchmarks

  • MMLU
  • LongEval
  • PIQA
  • WikiText-2
  • C4
  • PTB

Context Entities

Datasets

  • ShareGPT (HuggingFace dataset)

Benchmarks

  • MMLU
  • LongEval
  • PIQA