Overview
Production Readiness
0.8
Novelty Score
0.65
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Stateful LLM services suffer long cold-start latencies when context is evicted. HCache reduces first-response latency and host storage needs by using a smaller, fast-to-project representation. This improves the user experience for chatbots and RAG applications, lowers storage costs, and eases I/O bottlenecks.
Summary TLDR
HCache speeds up restoring LLM conversation or long-context state by saving intermediate hidden states instead of full KV cache or raw tokens. Hidden states are ~2x smaller than KV cache and can be projected into KV cache with cheap matrix ops. Combined with a scheduler that mixes methods and a chunked storage layout, HCache cuts first-token latency (TTFT) and storage needs vs. KV offload and vs. recomputation, while adding negligible token-generation overhead.
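To make the projection step concrete, here is a minimal NumPy sketch. It assumes a standard attention layer in which keys and values are plain linear projections of the layer's saved hidden states. The names `restore_kv_from_hidden`, `w_k`, and `w_v` and the toy dimensions are illustrative and not taken from the paper's code. Normalization before the projection and positional encoding on the keys are omitted.

```python
import numpy as np

def restore_kv_from_hidden(hidden, w_k, w_v):
    """Rebuild one layer's K and V from saved hidden states.

    hidden: (num_tokens, d_model) saved activations for this layer
    w_k, w_v: (d_model, d_model) projection weights (hypothetical names)
    """
    keys = hidden @ w_k      # one GEMM per projection; no attention or FFN pass
    values = hidden @ w_v
    return keys, values

rng = np.random.default_rng(0)
num_tokens, d_model = 8, 16   # toy sizes for illustration only
hidden = rng.standard_normal((num_tokens, d_model)).astype(np.float32)
w_k = rng.standard_normal((d_model, d_model)).astype(np.float32)
w_v = rng.standard_normal((d_model, d_model)).astype(np.float32)

keys, values = restore_kv_from_hidden(hidden, w_k, w_v)
```

Only `hidden` has to be read back from storage. The weights are already resident on the GPU as part of the model, which is why the restored K and V cost two matrix multiplications per layer rather than a full forward pass.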
Problem Statement
Stateful LLMs (multi-round chat, RAG) need to restore context (KV cache) when GPU memory is limited. The existing choices, recomputing from tokens or offloading KV cache to host storage, incur either heavy computation or heavy I/O, which leads to high latency or storage costs. The paper asks: can we restore state faster by using a middle representation that trades some compute for less I/O?
Main Contribution
HCache: store per-layer hidden states (intermediate activations) to restore KV cache with cheaper projection ops instead of full recompute or whole-KV transfer.
Bubble-free restoration scheduler: profile hardware and split layers to mix hidden-state restoration with token recompute or KV offload so IO and compute finish together.
Chunk-based storage + two-stage saving: store layer chunks (64-token chunks, round-robin across SSDs) and snapshot hidden states to host DRAM before flushing to avoid small random writes and stalls.
Implementation and evaluation on Llama2-7B/13B and OPT-30B with DeepSpeed-MII as the baseline; shows consistent improvements in TTFT, restoration speed, and storage across GPUs and SSD configurations.
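The layer-splitting idea behind the bubble-free scheduler can be sketched as a small search. This is a simplified model and not the paper's algorithm: it assumes I/O and compute overlap perfectly, that per-layer costs are constant, and that the leftover layers all use a single complementary method. The function name and all timing inputs are hypothetical values that would come from profiling.

```python
def partition_layers(num_layers, io_hidden, comp_hidden, io_kv, comp_recompute):
    """Pick how many layers to restore from hidden states.

    All timing arguments are per-layer costs from profiling (hypothetical
    inputs, any consistent unit). Returns (hidden_layers, complement, makespan).
    """
    best = None
    for k in range(num_layers + 1):
        rest = num_layers - k
        # Option A: the remaining layers are fetched as KV cache (I/O only).
        io_a = k * io_hidden + rest * io_kv
        comp_a = k * comp_hidden
        # Option B: the remaining layers are recomputed from tokens (compute only).
        io_b = k * io_hidden
        comp_b = k * comp_hidden + rest * comp_recompute
        for name, io_t, comp_t in (("kv_offload", io_a, comp_a),
                                   ("recompute", io_b, comp_b)):
            makespan = max(io_t, comp_t)  # the slower of the two streams finishes last
            if best is None or makespan < best[2]:
                best = (k, name, makespan)
    return best

# Made-up profile where hidden-state restoration is I/O-bound.
plan = partition_layers(num_layers=32, io_hidden=1.0, comp_hidden=0.5,
                        io_kv=2.0, comp_recompute=4.0)
```

With this made-up profile, hidden-state restoration leaves the GPU idle half the time, so the search hands a few layers to token recomputation to fill the gap. A compute-bound profile would instead push the leftover layers toward KV offload.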
Key Findings
Saving hidden states halves the I/O size compared with offloading KV cache.
Projecting hidden states into KV cache needs much less compute than full token recomputation.
HCache reduces Time To First Token (TTFT) vs. KV offload and vs. token recomputation in evaluated traces.
HCache lowers per-token storage need compared with KV offload.
Restoration speed depends on hardware balance; HCache yields consistent gains across platforms.
HCache adds minimal decoding overhead (TBT) when implemented with two-stage saving.
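The halved I/O size follows from simple arithmetic: per layer, a token's hidden state holds one vector of the model width, while its KV cache holds two. A minimal sketch, assuming standard multi-head attention, fp16 values, and hidden states saved for every layer (the function name is illustrative):

```python
def per_token_bytes(d_model, num_layers, bytes_per_value=2):
    """Return (hidden_state_bytes, kv_cache_bytes) stored per token.

    Assumes standard multi-head attention, where K and V each hold d_model
    values per layer; grouped-query attention would shrink the KV side.
    """
    hidden = d_model * bytes_per_value * num_layers
    kv = 2 * d_model * bytes_per_value * num_layers
    return hidden, kv

# Llama2-7B shape (hidden size 4096, 32 layers) with fp16 values.
hidden_bytes, kv_bytes = per_token_bytes(d_model=4096, num_layers=32)
```

Under these assumptions a Llama2-7B token needs 256 KiB as hidden states versus 512 KiB as KV cache. Actual on-disk sizes depend on how many layers the scheduler assigns to hidden-state restoration.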
Results
TTFT (Time To First Token)
Per-token storage
Restoration speed (throughput measured as restored tokens / time)
Decoding overhead (TBT)
Who Should Care
What To Try In 7 Days
Profile your serving hardware (GPU FLOPS vs SSD/PCIe bandwidth) to see if HCache fits your balance.
Instrument current TTFT and GPU cache miss rates on multi-round sessions or RAG workloads.
Prototype saving per-layer hidden states for a small model (7B) using the paper's two-stage snapshot to host DRAM and measure TTFT/TBT changes.
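For the instrumentation step, TTFT and TBT can be derived from timestamps your serving stack already logs. A minimal sketch, assuming you can record the request arrival time and the emission time of each generated token; the function name and argument layout are illustrative:

```python
def latency_metrics(request_time, token_times):
    """Compute TTFT and mean TBT from wall-clock timestamps in seconds.

    request_time: when the request arrived
    token_times: emission time of each generated token, in order
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_time
    # TBT is the gap between consecutive tokens; undefined for a single token.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_tbt = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_tbt

ttft, mean_tbt = latency_metrics(10.0, [10.5, 10.75, 11.0])
```

Collect these per request before and after the prototype, and separate requests that hit the GPU cache from those that required restoration, since only the latter are affected by the restoration method.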
Agent Features
Memory
- short-term hidden-state storage
Optimization Features
Infra Optimization
- aggregated SSD bandwidth via round-robin chunk placement
- multi-GPU shard reads + allgather to parallelize reads
Model Optimization
- Accuracy
System Optimization
- chunk-based storage layout (64-token chunks across SSDs)
- two-stage snapshot to host DRAM to avoid small writes
- GPUDirect Storage via SPDK + GDRCopy
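The chunk placement can be illustrated with a small mapping function. This is a sketch under assumptions: the 64-token chunk size comes from the summary, but the chunk numbering scheme, the `chunks_per_layer` parameter, and the function name are hypothetical and do not describe the paper's on-disk format.

```python
CHUNK_TOKENS = 64  # chunk size stated in the summary

def chunk_location(layer, token_index, num_ssds, chunks_per_layer):
    """Map one token's hidden state in one layer to (ssd, chunk_id, offset).

    Chunks are numbered layer by layer and dealt out round-robin across SSDs.
    This numbering is an assumption made for illustration.
    """
    chunk_in_layer = token_index // CHUNK_TOKENS
    if chunk_in_layer >= chunks_per_layer:
        raise ValueError("token_index exceeds the chunks reserved for this layer")
    chunk_id = layer * chunks_per_layer + chunk_in_layer
    ssd = chunk_id % num_ssds            # round-robin placement
    offset_in_chunk = token_index % CHUNK_TOKENS
    return ssd, chunk_id, offset_in_chunk

location = chunk_location(layer=1, token_index=130, num_ssds=4, chunks_per_layer=8)
```

Because consecutive chunks land on different devices, reading one layer's hidden states for a long context touches all SSDs at once, which is how the layout aggregates their bandwidth. Grouping 64 tokens per chunk keeps each write large enough to avoid small random I/O.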
Inference Optimization
- KV cache restoration via hidden-state projection
- bubble-free scheduler to balance compute and IO
- layer-wise partitioning for optimized GEMM sizes
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Benefit depends on hardware balance: if GPU compute is very slow relative to I/O, gains shrink unless the scheduler partitions layers appropriately (§6.2).
- Requires host storage (SSD/DRAM) and GPUDirect support for best I/O performance; implementation complexity is non-trivial (§5).
- Adds steady-state storage of hidden states; although smaller than KV, it still consumes host space and needs lifecycle management (Table 3).
- Not a win in all cases: when the GPU-resident KV cache hit rate is very high, restorations are rare and HCache has few occasions to help (§6.4).
When Not To Use
- When GPU cache hit ratio is very high (>90%), since restorations are rare and benefits shrink.
- When target GPUs are extremely low-power (very slow GEMM) and storage bandwidth is high; scheduler may mitigate but gains can be small.
- When you cannot modify the serving stack or add host-side background daemons and the GPUDirect toolchain.
Failure Modes
- Pipeline bubble misconfiguration: poor partitioning can waste IO or compute and make HCache slower than offload (observed in IO-heavy settings without bubble-free scheduler).
- Small-write stalls: naive direct writes of hidden states to SSD can stall decoding unless two-stage saving is used (§6.3.3).
- Integration complexity: multi-GPU and GPUDirect paths require careful buffer pinning and coordination.
Core Entities
Models
- Llama2-7B
- Llama2-13B
- OPT-30B
Metrics
- TTFT (Time To First Token)
- TBT (Time Between Tokens)
- restoration speed (tokens/sec)
- per-token storage size
Datasets
- ShareGPT4
- L-Eval
Benchmarks
- multi-round conversation trace (ShareGPT4)
- long-context tasks (L-Eval)
Context Entities
Models
- Gemma
- Phi2
- Qwen1.5

