Restore LLM context faster by saving hidden states (half the IO, much less recompute)

October 7, 2024 · 8 min

Overview

Production Readiness

0.8

Novelty Score

0.65

Cost Impact Score

0.7

Citation Count

0

Authors

Shiwei Gao, Youmin Chen, Jiwu Shu

Links

Abstract / PDF

Why It Matters For Business

Stateful LLM services suffer long cold-start latencies when a session's context is evicted from GPU memory. HCache reduces first-response latency and host storage needs by saving a smaller representation that is cheap to project back into the KV cache. That improves user experience for chatbots and RAG apps, lowers storage costs, and eases I/O bottlenecks.

Summary TLDR

HCache speeds up restoring LLM conversation or long-context state by saving intermediate hidden states instead of full KV cache or raw tokens. Hidden states are ~2x smaller than KV cache and can be projected into KV cache with cheap matrix ops. Combined with a scheduler that mixes methods and a chunked storage layout, HCache cuts first-token latency (TTFT) and storage needs vs. KV offload and vs. recomputation, while adding negligible token-generation overhead.
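To make the "cheap matrix ops" concrete, here is a minimal sketch of rebuilding one layer's keys and values from saved hidden states. It assumes the simplest possible case: the function names and the toy weights `w_k` and `w_v` are hypothetical, and the normalization and positional encoding that real models typically apply around this step are left out. In a real system this would be a GPU GEMM rather than nested Python lists.

```python
def matmul(a, b):
    """Multiply an (n x d) matrix by a (d x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def restore_kv_from_hidden(hidden, w_k, w_v):
    """Rebuild one layer's K and V from that layer's saved hidden states.

    hidden: n_tokens x d hidden states (assumed already normalized)
    w_k, w_v: d x d projection weights (hypothetical toy values below)
    """
    return matmul(hidden, w_k), matmul(hidden, w_v)


hidden = [[1.0, 2.0], [3.0, 4.0]]   # two tokens, d = 2
w_k = [[1.0, 0.0], [0.0, 1.0]]      # identity, so K equals the hidden states
w_v = [[0.0, 1.0], [1.0, 0.0]]      # swaps the two columns
keys, values = restore_kv_from_hidden(hidden, w_k, w_v)
print(keys)    # [[1.0, 2.0], [3.0, 4.0]]
print(values)  # [[2.0, 1.0], [4.0, 3.0]]
```

The point of the sketch is that restoration needs only two matrix multiplications per layer, with no attention or feed-forward computation.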

Problem Statement

Stateful LLM workloads (multi-round chat, RAG) need to restore context (the KV cache) when GPU memory is limited. The existing choices, recomputing from tokens or offloading the KV cache to host storage, incur either heavy computation or heavy I/O, and so cause high latency or storage cost. The paper asks: can state be restored faster from an intermediate representation that trades some compute for less I/O?

Main Contribution

HCache: store per-layer hidden states (intermediate activations) to restore KV cache with cheaper projection ops instead of full recompute or whole-KV transfer.

Bubble-free restoration scheduler: profile hardware and split layers to mix hidden-state restoration with token recompute or KV offload so IO and compute finish together.

Chunk-based storage + two-stage saving: store layer chunks (64-token chunks, round-robin across SSDs) and snapshot hidden states to host DRAM before flushing to avoid small random writes and stalls.

Implementation and evaluation on Llama2-7B/13B and OPT-30B with DeepSpeed-MII as the baseline, showing consistent improvements in TTFT, restoration speed, and storage across GPU and SSD configurations.

Key Findings

Saving hidden states halves the I/O size compared with offloading KV cache.

Numbers: hidden states = 0.5× KV cache
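A rough back-of-the-envelope check of the 0.5× figure, as a sketch under stated assumptions: standard multi-head attention where K and V each have the same width as the hidden state, 2-byte (fp16) values, and illustrative Llama2-7B-like dimensions (hidden size 4096, 32 layers). The function name is hypothetical.

```python
def bytes_per_token(hidden_dim, num_layers, bytes_per_value=2):
    """Per-token footprint across all layers, assuming K and V are each hidden_dim wide."""
    hidden = hidden_dim * bytes_per_value * num_layers  # one vector per layer
    kv = 2 * hidden                                     # one K plus one V vector per layer
    return hidden, kv


hidden_bytes, kv_bytes = bytes_per_token(hidden_dim=4096, num_layers=32)
print(hidden_bytes // 1024, "KiB hidden vs", kv_bytes // 1024, "KiB KV per token")
# 256 KiB hidden vs 512 KiB KV per token
```

Models whose K and V are narrower than the hidden state (for example with grouped-query attention) would not follow this 2:1 ratio.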

Projecting hidden states into KV cache needs much less compute than full token recomputation.

Numbers: recompute-from-hidden ≥6× faster than recompute-from-tokens
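One way to see where a factor of at least 6 could come from is a simple per-layer FLOP count. This is an estimate under assumptions, not necessarily the paper's own derivation: a textbook transformer layer with four d×d attention projections, a feed-forward block of width 4d, and 2 FLOPs per multiply-add. The function names are hypothetical.

```python
def projection_flops(n, d):
    """Restore from hidden states: only the K and V projections (two n x d by d x d GEMMs)."""
    return 2 * (2 * n * d * d)


def recompute_flops(n, d, ffn_mult=4):
    """Recompute from tokens: the whole layer must run."""
    attn_proj = 4 * (2 * n * d * d)        # Q, K, V and output projections
    attn_scores = 2 * (2 * n * n * d)      # QK^T and attention-weighted sum of V
    ffn = 2 * (2 * n * d * ffn_mult * d)   # up and down projections
    return attn_proj + attn_scores + ffn


n, d = 1024, 4096
ratio = recompute_flops(n, d) / projection_flops(n, d)
print(ratio)  # 6.25, i.e. 6 + n/d
```

Under these assumptions the ratio is 6 + n/d, so it never drops below 6 and grows with context length.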

HCache reduces Time To First Token (TTFT) vs. KV offload and vs. token recomputation in evaluated traces.

Numbers: TTFT improved up to 1.93× vs KV offload; up to 5.73× vs recomputation

HCache lowers per-token storage need compared with KV offload.

Numbers: per-token storage 1.92–2.40× smaller than KV offload

Restoration speed depends on hardware balance; HCache yields consistent gains across platforms.

Numbers: restoration speed 1.33–2.66× vs KV offload (varies by GPU/SSD); 5.04–9.05× vs recomputation on some GPUs

HCache adds minimal decoding overhead (TBT) when implemented with two-stage saving.

Numbers: TBT overhead ≤ ~4% vs ideal

Results

TTFT (Time To First Token)

Value: up to 1.93× faster vs KV offload; up to 5.73× faster vs recomputation

Baseline: KV offload; token recomputation

Per-token storage

Value: HCache needs 1.92–2.40× less space per token

Baseline: KV offload

Restoration speed (throughput measured as restored tokens / time)

Value: 1.33–2.66× faster vs KV offload across hardware; 5.04–9.05× faster vs recomputation on some GPUs

Baseline: KV offload; token recomputation

Decoding overhead (TBT)

Value: ≤ ~4% overhead vs ideal decoding

Baseline: ideal case with all KV cached on GPU

Who Should Care

What To Try In 7 Days

Profile your serving hardware (GPU FLOPS vs SSD/PCIe bandwidth) to see if HCache fits your balance.

Instrument current TTFT and GPU cache miss rates on multi-round sessions or RAG workloads.

Prototype saving per-layer hidden states for a small model (7B) using the paper's two-stage snapshot to host DRAM and measure TTFT/TBT changes.
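For the instrumentation step, a minimal sketch of deriving TTFT and mean TBT from per-token timestamps. The function name and the sample timestamps are hypothetical; in a real serving stack the timestamps might come from `time.perf_counter()` calls around the request and each emitted token.

```python
def latency_metrics(request_start, token_times):
    """Return (TTFT, mean TBT) in the same time unit as the inputs.

    request_start: time the request arrived
    token_times: emission time of each generated token, in order
    """
    ttft = token_times[0] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_tbt = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_tbt


# Hypothetical trace: request at t=10.0 s, four tokens emitted afterwards.
ttft, mean_tbt = latency_metrics(10.0, [10.5, 10.75, 11.0, 11.25])
print(ttft, mean_tbt)  # 0.5 0.25
```

Recording these two numbers before and after a prototype change gives a like-for-like comparison on the same trace.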

Agent Features

Memory

  • short-term hidden-state storage

Optimization Features

Infra Optimization

  • aggregated SSD bandwidth via round-robin chunk placement
  • multi-GPU shard reads + allgather to parallelize reads
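The round-robin placement can be sketched as follows for a single layer's hidden states. The 64-token chunk size comes from the summary above; the function name, the tuple layout, and the 300-token, 4-SSD example are illustrative assumptions rather than the paper's on-disk format.

```python
def place_chunks(num_tokens, num_ssds, chunk_tokens=64):
    """Split a token range into fixed-size chunks and assign them to SSDs round-robin.

    Returns a list of (chunk_id, ssd_id, first_token, end_token_exclusive).
    """
    placements = []
    for start in range(0, num_tokens, chunk_tokens):
        chunk_id = start // chunk_tokens
        end = min(start + chunk_tokens, num_tokens)
        placements.append((chunk_id, chunk_id % num_ssds, start, end))
    return placements


layout = place_chunks(num_tokens=300, num_ssds=4)
for chunk in layout:
    print(chunk)
# (0, 0, 0, 64)
# (1, 1, 64, 128)
# (2, 2, 128, 192)
# (3, 3, 192, 256)
# (4, 0, 256, 300)
```

Because consecutive chunks land on different devices, a sequential restore can read from all SSDs at once and approach their combined bandwidth.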

Model Optimization

  • Accuracy

System Optimization

  • chunk-based storage layout (64-token chunks across SSDs)
  • two-stage snapshot to host DRAM to avoid small writes
  • GPUDirect Storage via SPDK + GDRCopy
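The two-stage saving idea can be illustrated with a small producer/consumer sketch: the decoding path only hands each hidden state to an in-memory queue (stage 1), and a background thread groups them into 64-token chunks before writing (stage 2). Everything here is a stand-in under assumptions: `TwoStageSaver` and `write_chunk` are hypothetical names, a Python queue plays the role of the host DRAM buffer, and a list append plays the role of the SSD write.

```python
import queue
import threading

CHUNK_TOKENS = 64


class TwoStageSaver:
    def __init__(self, write_chunk):
        self._pending = queue.Queue()      # stands in for the host DRAM staging buffer
        self._write_chunk = write_chunk    # stands in for one large sequential SSD write
        self._worker = threading.Thread(target=self._flush_loop, daemon=True)
        self._worker.start()

    def snapshot(self, hidden_state):
        """Stage 1: cheap hand-off on the decoding path; does not wait for storage."""
        self._pending.put(hidden_state)

    def close(self):
        """Signal the end of the stream and wait for the remaining data to be flushed."""
        self._pending.put(None)
        self._worker.join()

    def _flush_loop(self):
        """Stage 2: batch snapshots into full chunks so storage sees large writes."""
        chunk = []
        while True:
            item = self._pending.get()
            if item is None:
                break
            chunk.append(item)
            if len(chunk) == CHUNK_TOKENS:
                self._write_chunk(chunk)
                chunk = []
        if chunk:                          # flush the final partial chunk
            self._write_chunk(chunk)


written = []                               # records the size of each simulated write
saver = TwoStageSaver(lambda chunk: written.append(len(chunk)))
for token_id in range(150):
    saver.snapshot([float(token_id)])
saver.close()
print(written)  # [64, 64, 22]
```

The decoding loop issues 150 tiny hand-offs, while the simulated storage sees only three writes.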

Inference Optimization

  • KV cache restoration via hidden-state projection
  • bubble-free scheduler to balance compute and IO
  • layer-wise partitioning for optimized GEMM sizes
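The layer-splitting idea behind the bubble-free scheduler can be sketched as a small search: restore some layers from hidden states and the rest with a complementary method, choosing the split so that total I/O time and total compute time finish as close together as possible. This is a simplified model, not the paper's algorithm: it assumes I/O and compute overlap perfectly, per-layer costs are constant, and the per-layer timings (in milliseconds) are made-up profiling numbers. The function name is hypothetical.

```python
def best_split(num_layers, hidden_io, hidden_compute, other_io, other_compute):
    """Pick how many layers to restore from hidden states.

    Returns (layers_from_hidden, makespan), where makespan is the larger of the
    total I/O time and total compute time under perfect overlap.
    """
    best = None
    for k in range(num_layers + 1):
        io_time = k * hidden_io + (num_layers - k) * other_io
        compute_time = k * hidden_compute + (num_layers - k) * other_compute
        makespan = max(io_time, compute_time)
        if best is None or makespan < best[1]:
            best = (k, makespan)
    return best


# I/O-bound platform: pair hidden states with token recomputation (no I/O, heavy compute).
io_bound = best_split(32, hidden_io=2.0, hidden_compute=1.0, other_io=0.0, other_compute=6.0)
# Compute-bound platform: pair hidden states with KV offload (twice the I/O, no compute).
compute_bound = best_split(32, hidden_io=1.0, hidden_compute=2.0, other_io=2.0, other_compute=0.0)
print(io_bound)       # (28, 56.0)
print(compute_bound)  # (21, 43.0)
```

In both toy cases the mixed plan beats restoring all 32 layers from hidden states alone, which would take 64 ms because one resource sits idle while the other finishes.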

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benefit depends on hardware balance: if GPU compute is very slow relative to I/O, gains shrink unless the scheduler partitions layers appropriately (§6.2).
  • Requires host storage (SSD/DRAM) and GPUDirect support for best I/O performance; implementation complexity is non-trivial (§5).
  • Adds steady-state storage of hidden states; although smaller than KV, it still consumes host space and needs lifecycle management (Table 3).
  • Not a universal win: a very high GPU-resident KV cache hit rate means restorations are rare, so HCache has fewer chances to help (§6.4).

When Not To Use

  • When GPU cache hit ratio is very high (>90%), since restorations are rare and benefits shrink.
  • When the target GPUs are extremely low-power (very slow GEMM) and storage bandwidth is high; the scheduler may mitigate this, but gains can be small.
  • When you cannot modify the serving stack, or cannot add host-side background daemons and the GPUDirect toolchain.

Failure Modes

  • Pipeline bubble misconfiguration: poor partitioning can waste IO or compute and make HCache slower than offload (observed in IO-heavy settings without bubble-free scheduler).
  • Small-write stalls: naive direct writes of hidden states to SSD can stall decoding unless two-stage saving is used (§6.3.3).
  • Integration complexity: multi-GPU and GPUDirect paths require careful buffer pinning and coordination.

Core Entities

Models

  • Llama2-7B
  • Llama2-13B
  • OPT-30B

Metrics

  • TTFT (Time To First Token)
  • TBT (Time Between Tokens)
  • restoration speed (tokens/sec)
  • per-token storage size

Datasets

  • ShareGPT4
  • L-Eval

Benchmarks

  • multi-round conversation trace (ShareGPT4)
  • long-context tasks (L-Eval)

Context Entities

Models

  • Gemma
  • Phi2
  • Qwen1.5