Restore LLM context faster by saving hidden states (half the IO, much less recompute)

October 7, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Evaluation on multiple models, traces, and hardware shows consistent latency/storage wins, but benefits depend on GPU vs I/O balance and cache hit rates.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 65%

Authors

Shiwei Gao, Youmin Chen, Jiwu Shu

Links

Abstract / PDF / Data

Why It Matters For Business

Stateful LLM services suffer long cold-start latencies when context is evicted from GPU memory. HCache reduces first-response latency and host storage needs by saving a smaller representation that is fast to project back into KV cache. That improves user experience for chatbots and RAG apps, lowers storage costs, and eases I/O bottlenecks.

Summary TLDR

HCache speeds up restoring LLM conversation or long-context state by saving intermediate hidden states instead of full KV cache or raw tokens. Hidden states are ~2x smaller than KV cache and can be projected into KV cache with cheap matrix ops. Combined with a scheduler that mixes methods and a chunked storage layout, HCache cuts first-token latency (TTFT) and storage needs vs. KV offload and vs. recomputation, while adding negligible token-generation overhead.
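
As a rough illustration of the projection step described above, the sketch below rebuilds K and V for one layer from saved hidden states using two matrix multiplies. The helper names (`matmul`, `restore_kv`), the toy dimensions, and the weight values are hypothetical. The sketch also leaves out the normalization and positional-encoding steps that a real transformer layer applies around these projections.

```python
from typing import List, Tuple

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix multiply: (n x d) @ (d x m) -> (n x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def restore_kv(hidden: Matrix, w_k: Matrix, w_v: Matrix) -> Tuple[Matrix, Matrix]:
    """Rebuild one layer's K and V from saved hidden states (hypothetical helper)."""
    return matmul(hidden, w_k), matmul(hidden, w_v)

# Two tokens, hidden size 2 (toy numbers, for illustration only).
hidden = [[1.0, 2.0], [3.0, 4.0]]
w_k = [[1.0, 0.0], [0.0, 1.0]]   # identity projection
w_v = [[2.0, 0.0], [0.0, 2.0]]   # scales every value by 2
k, v = restore_kv(hidden, w_k, w_v)
```

Only the hidden states need to be read from storage; the projection weights are already on the GPU as part of the model.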

Problem Statement

Stateful LLMs (multi-round chat, RAG) need to restore context (KV cache) when GPU memory is limited. The existing choices are to recompute from tokens or to offload KV to host; the first costs heavy computation and the second heavy I/O, and both cause large latency or storage costs. The paper asks: can we restore state faster by using a middle representation that trades some compute for less I/O?

Main Contribution

HCache: store per-layer hidden states (intermediate activations) to restore KV cache with cheaper projection ops instead of full recompute or whole-KV transfer.

Bubble-free restoration scheduler: profile hardware and split layers to mix hidden-state restoration with token recompute or KV offload so IO and compute finish together.
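
The idea of splitting layers so that IO and compute finish together can be sketched as a small search. This is a minimal sketch, not the paper's scheduler: the function name `best_split`, the per-layer timing inputs, and the example numbers are all hypothetical, and it assumes IO and compute overlap perfectly so the restore time is simply the slower of the two streams.

```python
def best_split(num_layers, hcache_io, hcache_compute, recompute_compute, kv_io):
    """Return (time, hcache_layers, complement) minimizing the pipelined restore time.

    All arguments except num_layers are per-layer times in the same unit.
    Assumes IO and compute overlap perfectly, so total time is the slower stream.
    """
    best = None
    for n_h in range(num_layers + 1):
        rest = num_layers - n_h
        candidates = {
            # Remaining layers recomputed from tokens: compute only, no IO.
            "recompute": max(n_h * hcache_io,
                             n_h * hcache_compute + rest * recompute_compute),
            # Remaining layers loaded as full KV cache: IO only, no compute.
            "kv_offload": max(n_h * hcache_io + rest * kv_io,
                              n_h * hcache_compute),
        }
        for name, t in candidates.items():
            if best is None or t < best[0]:
                best = (t, n_h, name)
    return best

# Hypothetical IO-bound setup: loading hidden states takes longer than projecting them.
restore_time, n_hcache, complement = best_split(
    num_layers=32, hcache_io=2.0, hcache_compute=1.0,
    recompute_compute=6.0, kv_io=4.0)
```

With these made-up numbers, restoring all 32 layers from hidden states would take 64 time units, while handing 4 layers to token recomputation fills the idle compute time and brings the total down to 56.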

Key Findings

Saving hidden states halves the I/O size compared with offloading KV cache.

Numbers: hidden states = 0.5× KV cache

Practical Use: Store hidden states to cut host-GPU transfer volume by ~2×; useful when PCIe/SSD bandwidth is the bottleneck.

Evidence Ref: §3.2; Figure 1
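
The 0.5× figure follows from simple arithmetic: KV cache stores two vectors (K and V) per token per layer, while a hidden state is one vector. The sketch below works this through for the Llama2-7B shape (32 layers, hidden size 4096, fp16). It assumes standard multi-head attention where K and V are as wide as the hidden state; models with grouped-query attention would give a different ratio. The function names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """K and V each store one hidden_size vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value

def hidden_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """A hidden state is one hidden_size vector per layer."""
    return num_layers * hidden_size * bytes_per_value

# Llama2-7B shape: 32 layers, hidden size 4096, fp16 (2 bytes per value).
kv = kv_bytes_per_token(32, 4096)          # 524288 bytes (512 KiB)
hidden = hidden_bytes_per_token(32, 4096)  # 262144 bytes (256 KiB)
ratio = kv / hidden                        # 2.0
```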

Projecting hidden states into KV cache needs much less compute than full token recomputation.

Numbers: recompute-from-hidden ≥6× faster than recompute-from-tokens

Practical Use: Use HCache to avoid heavy attention and FFN passes during restoration and lower CPU/GPU compute cost for misses.

Evidence Ref: §3.2 (theoretical) and §6.2 (empirical)
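
One way to see where a factor of about 6 can come from is a back-of-envelope FLOP count per token per layer. This is a common estimate under stated assumptions (a feed-forward block with 4× expansion, 2 FLOPs per multiply-add), not necessarily the paper's exact derivation; the function names are hypothetical.

```python
def projection_flops(d: int) -> int:
    """FLOPs per token per layer to compute K and V from a hidden state."""
    return 2 * (2 * d * d)             # two d x d matmuls, 2 FLOPs per multiply-add

def full_layer_flops(d: int, context_len: int = 0) -> int:
    """Approximate FLOPs per token per layer for a full forward pass."""
    attn_proj = 4 * (2 * d * d)        # Q, K, V and output projections
    ffn = 2 * (2 * d * 4 * d)          # two matmuls with 4x expansion (assumed)
    attn_scores = 4 * context_len * d  # QK^T and attention-weighted V, no causal discount
    return attn_proj + ffn + attn_scores

d = 4096
ratio_short = full_layer_flops(d) / projection_flops(d)                   # 6.0
ratio_long = full_layer_flops(d, context_len=4096) / projection_flops(d)  # 7.0
```

The ratio starts at 6 when the attention-score term is ignored and grows with context length, which is consistent with the "at least 6×" framing.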

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
TTFT (Time To First Token) | up to 1.93× faster vs KV offload; up to 5.73× vs recomputation | KV offload; token recomputation | 1.93× (KV offload), 5.73× (recompute) | L-Eval and ShareGPT4 traces | §6.1.1, §6.1.2; Figures 9, 10 | Figure 10, Figure 9
Per-token storage | HCache needs 1.92–2.40× less space per token | KV offload | 1.92–2.40× reduction | models: 7B/13B/30B (Table 3) | §6.1.3; Table 3 | Table 3

What To Try In 7 Days

Profile your serving hardware (GPU FLOPS vs SSD/PCIe bandwidth) to see if HCache fits your balance.

Instrument current TTFT and GPU cache miss rates on multi-round sessions or RAG workloads.

Prototype saving per-layer hidden states for a small model (7B) using the paper's two-stage snapshot to host DRAM and measure TTFT/TBT changes.
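
For the instrumentation step above, a minimal sketch of measuring TTFT and TBT from any token stream might look like the following. The function name `measure_stream` and the stand-in stream are hypothetical; in practice the iterable would be your serving stack's streaming response.

```python
import time
from typing import Iterable, List, Tuple

def measure_stream(tokens: Iterable[str], clock=time.perf_counter) -> Tuple[float, List[float]]:
    """Consume a token stream; return (TTFT, list of gaps between later tokens)."""
    start = clock()
    stamps = [clock() for _ in tokens]   # one timestamp as each token arrives
    if not stamps:
        raise ValueError("stream produced no tokens")
    ttft = stamps[0] - start
    tbt = [b - a for a, b in zip(stamps, stamps[1:])]
    return ttft, tbt

# Usage with a stand-in stream; replace with your server's streaming response.
ttft, tbt = measure_stream(iter(["Hello", ",", " world"]))
```

Recording these per request, split by whether the session's context was still in GPU memory, gives the baseline needed to judge any restoration method.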

Agent Features

Memory
short-term hidden-state storage

Optimization Features

Infra Optimization
aggregated SSD bandwidth via round-robin chunk placement
multi-GPU shard reads + allgather to parallelize reads
Model Optimization
Accuracy
System Optimization
chunk-based storage layout (64-token chunks across SSDs)
two-stage snapshot to host DRAM to avoid small writes
GPUDirect Storage via SPDK + GDRCopy
Inference Optimization
KV cache restoration via hidden-state projection
bubble-free scheduler to balance compute and IO
layer-wise partitioning for optimized GEMM sizes
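
The chunk-based layout with round-robin placement listed above can be sketched as a mapping from token ranges to SSDs. This is a minimal sketch under assumptions: the 64-token chunk size comes from the summary, but the function name, the tuple layout, and the choice to ignore per-layer organization are hypothetical.

```python
CHUNK_TOKENS = 64  # chunk size named in the summary

def chunk_placement(num_tokens: int, num_ssds: int, chunk_tokens: int = CHUNK_TOKENS):
    """Map each chunk of a sequence to an SSD in round-robin order.

    Returns a list of (chunk_id, ssd_id, first_token, last_token_exclusive).
    """
    placements = []
    num_chunks = (num_tokens + chunk_tokens - 1) // chunk_tokens  # ceiling division
    for chunk_id in range(num_chunks):
        first = chunk_id * chunk_tokens
        last = min(first + chunk_tokens, num_tokens)
        placements.append((chunk_id, chunk_id % num_ssds, first, last))
    return placements

layout = chunk_placement(num_tokens=200, num_ssds=4)
```

Because consecutive chunks land on different devices, reading a long context back can draw on the bandwidth of all SSDs at once.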

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Benefit depends on hardware balance: if GPU compute is very slow relative to I/O, gains shrink unless scheduler partitions appropriately (§6.2).

Requires host storage (SSD/DRAM) and GPUDirect support for best I/O performance; implementation complexity is non-trivial (§5).

When Not To Use

When GPU cache hit ratio is very high (>90%), since restorations are rare and benefits shrink.

When target GPUs are extremely low-power (very slow GEMM) and storage bandwidth is high; scheduler may mitigate but gains can be small.

Failure Modes

Pipeline bubble misconfiguration: poor partitioning can waste IO or compute and make HCache slower than offload (observed in IO-heavy settings without bubble-free scheduler).

Small-write stalls: naive direct writes of hidden states to SSD can stall decoding unless two-stage saving is used (§6.3.3).
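
The two-stage idea can be sketched as a buffer that absorbs per-token saves in host memory and only issues a storage write once a full chunk has accumulated. This is a simplified illustration, not the paper's implementation: the class name `TwoStageSaver`, the `write_chunk` callback, and the dummy payloads are hypothetical, and the flush runs synchronously here, whereas a real system would presumably move it off the decoding path.

```python
from typing import Callable, List

class TwoStageSaver:
    """Stage 1: append hidden states to a DRAM buffer. Stage 2: flush full chunks."""

    def __init__(self, write_chunk: Callable[[List[bytes]], None], chunk_tokens: int = 64):
        self.write_chunk = write_chunk      # stand-in for a large sequential SSD write
        self.chunk_tokens = chunk_tokens
        self.buffer: List[bytes] = []

    def append(self, hidden_state: bytes) -> None:
        """Called per decoded token; cheap because it only touches host memory."""
        self.buffer.append(hidden_state)
        if len(self.buffer) >= self.chunk_tokens:
            self.flush()

    def flush(self) -> None:
        """Write everything buffered as one batch, then clear the buffer."""
        if self.buffer:
            self.write_chunk(self.buffer)
            self.buffer = []

writes: List[int] = []
saver = TwoStageSaver(write_chunk=lambda chunk: writes.append(len(chunk)))
for _ in range(130):
    saver.append(b"\x00" * 8)   # dummy payload standing in for a hidden state
saver.flush()                    # end of session: flush the partial chunk
```

Decoding 130 tokens this way results in three storage writes (64, 64, and 2 tokens) instead of 130 small ones.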

Core Entities

Models

Llama2-7B
Llama2-13B
OPT-30B

Metrics

TTFT (Time To First Token)
TBT (Time Between Tokens)
restoration speed (tokens/sec)
per-token storage size

Datasets

ShareGPT4
L-Eval

Benchmarks

multi-round conversation trace (ShareGPT4)
long-context tasks (L-Eval)

Context Entities

Models

Gemma
Phi2
Qwen1.5