Overview
Evaluation on multiple models, traces, and hardware shows consistent latency/storage wins, but benefits depend on GPU vs I/O balance and cache hit rates.
Citations: 0
Evidence Strength: 0.85
Confidence: 0.82
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 80%
Novelty: 65%
Why It Matters For Business
Stateful LLM services suffer long cold-start latencies when context is evicted from GPU memory. HCache reduces first-response latency and host storage needs by storing a smaller representation that is fast to project back into the KV cache. That improves user experience for chatbots and RAG apps, lowers storage costs, and eases I/O bottlenecks.
Who Should Care
Summary TLDR
HCache speeds up restoring LLM conversation or long-context state by saving intermediate hidden states instead of the full KV cache or raw tokens. Hidden states are about 2× smaller than the KV cache and can be projected back into the KV cache with cheap matrix operations. Combined with a scheduler that mixes restoration methods and a chunked storage layout, HCache cuts time to first token (TTFT) and storage needs relative to both KV offload and recomputation, while adding negligible token-generation overhead.
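The projection step can be illustrated with a toy NumPy sketch. It assumes a single layer with standard multi-head attention (K and V each as wide as the hidden state) and leaves out normalization and positional encoding, which a real model would apply. The names `W_k`, `W_v`, and `restore_kv` are hypothetical and the sizes are arbitrary; none of them come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 8, 64  # toy sizes chosen for illustration only

# Hypothetical per-layer K/V projection weights (assumed shapes).
W_k = rng.standard_normal((d_model, d_model)).astype(np.float32)
W_v = rng.standard_normal((d_model, d_model)).astype(np.float32)

# Hidden state entering the layer's attention block: one vector per token.
hidden = rng.standard_normal((n_tokens, d_model)).astype(np.float32)

# What a KV-offload scheme would have to store for this layer.
k_cache = hidden @ W_k
v_cache = hidden @ W_v


def restore_kv(saved_hidden, w_k, w_v):
    """Rebuild K and V for one layer from the saved hidden state (two GEMMs)."""
    return saved_hidden @ w_k, saved_hidden @ w_v


# Restoring from the saved hidden state reproduces the original K and V.
k_restored, v_restored = restore_kv(hidden, W_k, W_v)
```

Under these assumptions only `hidden` needs to be stored per layer; `k_cache` and `v_cache` are recomputed on restore instead of being read back from host storage.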
Problem Statement
Stateful LLMs (multi-round chat, RAG) need to restore context (the KV cache) when GPU memory is limited. The existing choices, recomputing from tokens or offloading the KV cache to host storage, are either compute-heavy or I/O-heavy and incur large latency or storage costs. The paper asks: can state be restored faster by using an intermediate representation that trades some compute for less I/O?
Main Contribution
HCache: store per-layer hidden states (intermediate activations) to restore KV cache with cheaper projection ops instead of full recompute or whole-KV transfer.
Bubble-free restoration scheduler: profile the hardware and split layers so that hidden-state restoration is mixed with token recomputation or KV offload and I/O and compute finish together.
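The layer-splitting idea can be sketched with a simple cost model. This is a minimal illustration, not the paper's scheduler: it assumes I/O and compute overlap perfectly, that total time is the larger of the two, and that per-layer timings come from a prior profiling run. The function name, parameters, and example timings are all hypothetical.

```python
def best_layer_split(num_layers, t_io_hidden, t_proj, t_recompute, t_io_kv):
    """Pick how many layers to restore from hidden states.

    The remaining layers use one complementary method:
      - "recompute": no I/O, t_recompute of compute per layer
      - "kv_offload": t_io_kv of I/O per layer, no compute
    Returns (n_hidden, complement, estimated_time), minimizing
    max(total I/O time, total compute time).
    """
    best = None
    for n_hidden in range(num_layers + 1):
        rest = num_layers - n_hidden
        candidates = {
            "recompute": (n_hidden * t_io_hidden,
                          n_hidden * t_proj + rest * t_recompute),
            "kv_offload": (n_hidden * t_io_hidden + rest * t_io_kv,
                           n_hidden * t_proj),
        }
        for name, (io_time, compute_time) in candidates.items():
            estimate = max(io_time, compute_time)
            if best is None or estimate < best[2]:
                best = (n_hidden, name, estimate)
    return best


# Made-up per-layer timings in milliseconds for an I/O-bound setup.
split = best_layer_split(num_layers=32, t_io_hidden=2.0, t_proj=1.0,
                         t_recompute=6.0, t_io_kv=4.0)
```

With these made-up timings, restoring all 32 layers from hidden states would take 64 ms of I/O while compute sits partly idle. The sketch instead selects 28 hidden-state layers plus 4 recomputed layers, for an estimated 56 ms. The model ignores layer ordering and pipeline start-up costs.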
Key Findings
Saving hidden states halves the I/O size compared with offloading KV cache.
Projecting hidden states into KV cache needs much less compute than full token recomputation.
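A back-of-envelope estimate shows where both findings come from. The sketch below uses textbook per-token, per-layer counts for a standard transformer layer. Its assumptions are multi-head attention with K and V as wide as the hidden state, a feed-forward block with 4× expansion, and 2-byte (fp16) elements. It ignores attention-score computation, which grows with context length, so the recompute figure is a lower bound. These are illustrative assumptions, not numbers from the paper.

```python
def per_token_layer_costs(d_model, ffn_mult=4, bytes_per_elem=2):
    """Rough per-token, per-layer storage and FLOP counts (assumed model shape)."""
    gemm_flops = 2 * d_model * d_model  # one d_model x d_model matmul per token

    hidden_bytes = d_model * bytes_per_elem      # one hidden vector
    kv_bytes = 2 * d_model * bytes_per_elem      # one K vector plus one V vector

    projection_flops = 2 * gemm_flops            # K and V projections only
    attention_proj_flops = 4 * gemm_flops        # Q, K, V and output projections
    ffn_flops = 2 * ffn_mult * gemm_flops        # up and down projections
    recompute_flops = attention_proj_flops + ffn_flops

    return {
        "hidden_bytes": hidden_bytes,
        "kv_bytes": kv_bytes,
        "projection_flops": projection_flops,
        "recompute_flops": recompute_flops,
    }


costs = per_token_layer_costs(d_model=4096)
io_ratio = costs["kv_bytes"] / costs["hidden_bytes"]
flop_ratio = costs["recompute_flops"] / costs["projection_flops"]
```

Under these assumptions the KV cache is twice the size of the hidden state, and a full layer recompute costs at least six times the FLOPs of the K/V projection.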
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| TTFT (Time To First Token) | up to 1.93× faster vs KV offload; up to 5.73× faster vs recomputation | KV offload; token recomputation | 1.93× (KV offload), 5.73× (recomputation) | L-Eval and ShareGPT4 traces | §6.1.1, §6.1.2; Figures 9, 10 | Figure 9, Figure 10 |
| Per-token storage | HCache needs 1.92–2.40× less space per token | KV offload | 1.92–2.40× reduction | models: 7B/13B/30B (Table 3) | §6.1.3; Table 3 | Table 3 |
What To Try In 7 Days
Profile your serving hardware (GPU FLOPS vs SSD/PCIe bandwidth) to see if HCache fits your balance.
Instrument current TTFT and GPU cache miss rates on multi-round sessions or RAG workloads.
Prototype saving per-layer hidden states for a small model (7B) using the paper's two-stage snapshot to host DRAM, and measure the change in TTFT and TBT (time between tokens).
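For the instrumentation step, a small timing wrapper around a streaming response is usually enough to get started. The sketch below assumes the serving stack exposes generated tokens as a Python iterable and reads TBT as time between tokens. `measure_ttft_tbt` and `fake_stream` are hypothetical names, and `fake_stream` only simulates a slow first token followed by steady decoding.

```python
import time


def measure_ttft_tbt(token_stream):
    """Return (time to first token, mean time between tokens) in seconds."""
    start = time.perf_counter()
    stamps = []
    for _ in token_stream:
        stamps.append(time.perf_counter())
    if not stamps:
        raise ValueError("stream produced no tokens")
    ttft = stamps[0] - start
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    mean_tbt = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_tbt


def fake_stream(first_delay=0.2, gap=0.01, n_tokens=4):
    """Stand-in for a real token stream: slow first token, then steady decoding."""
    time.sleep(first_delay)  # simulates context restoration plus prefill
    yield "tok0"
    for i in range(1, n_tokens):
        time.sleep(gap)      # simulates per-token decode time
        yield f"tok{i}"


ttft, mean_tbt = measure_ttft_tbt(fake_stream())
```

Logging these two numbers per request, split by whether the session's context was still in GPU memory, gives the baseline against which any restoration method can be compared.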
Agent Features
Memory
Optimization Features
Infra Optimization
Model Optimization
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Benefit depends on hardware balance: if GPU compute is very slow relative to I/O, gains shrink unless the scheduler partitions layers appropriately (§6.2).
Requires host storage (SSD/DRAM) and GPUDirect support for best I/O performance; implementation complexity is non-trivial (§5).
When Not To Use
When the GPU cache hit ratio is very high (>90%): restorations are rare, so the benefit shrinks.
When target GPUs are extremely low-power (very slow GEMM) and storage bandwidth is high: the scheduler may mitigate this, but gains can be small.
Failure Modes
Pipeline bubble misconfiguration: poor partitioning can leave I/O or compute idle and make HCache slower than KV offload (observed in I/O-heavy settings without the bubble-free scheduler).
Small-write stalls: naive direct writes of hidden states to SSD can stall decoding unless two-stage saving is used (§6.3.3).
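The small-write problem can be illustrated with a generic buffered writer. This summary does not describe the paper's two-stage mechanism in detail, so the sketch is only one plausible reading: stage one copies each small record into host memory and returns immediately, and stage two flushes large chunks to disk from a background thread. The class name, chunk size, and record size are hypothetical.

```python
import os
import queue
import tempfile
import threading


class TwoStageSaver:
    """Buffer small writes in memory, then flush large chunks in the background."""

    def __init__(self, path, chunk_bytes=1 << 20):
        self._buf = bytearray()
        self._chunk_bytes = chunk_bytes
        self._queue = queue.Queue()
        self._file = open(path, "wb")
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def append(self, data):
        """Stage 1: copy into the host buffer; never touches the disk directly."""
        self._buf += data
        if len(self._buf) >= self._chunk_bytes:
            self._queue.put(bytes(self._buf))
            self._buf.clear()

    def _drain(self):
        """Stage 2: write queued chunks to disk, in order, off the caller's thread."""
        while True:
            chunk = self._queue.get()
            if chunk is None:  # sentinel: no more chunks
                break
            self._file.write(chunk)

    def close(self):
        """Flush any remaining bytes and wait for the writer thread to finish."""
        if self._buf:
            self._queue.put(bytes(self._buf))
            self._buf.clear()
        self._queue.put(None)
        self._worker.join()
        self._file.close()


# Simulate 1000 small per-token records of 64 bytes each.
out_path = os.path.join(tempfile.mkdtemp(), "hidden_states.bin")
saver = TwoStageSaver(out_path, chunk_bytes=4096)
for _ in range(1000):
    saver.append(b"\x00" * 64)
saver.close()
written = os.path.getsize(out_path)
```

The caller's loop only pays for a memory copy per record, and the disk sees a few large sequential writes instead of a thousand small ones. A production version would also need bounded buffering and error handling in the writer thread.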

