Restore LLM context faster by saving hidden states (half the IO, much less recompute)

Overview

Decision Snapshot: Ready For Pilot

Evaluation on multiple models, traces, and hardware shows consistent latency/storage wins, but benefits depend on GPU vs I/O balance and cache hit rates.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 65%

Authors

Shiwei Gao, Youmin Chen, Jiwu Shu

Links

Abstract / PDF / Data

Why It Matters For Business

Stateful LLM services suffer long cold-start latencies when context is evicted from GPU memory. HCache reduces first-response latency and host storage needs by storing a smaller representation that can be projected back into the KV cache quickly. That improves user experience for chatbots and RAG apps, lowers storage costs, and eases I/O bottlenecks.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead

Summary TLDR

HCache speeds up restoring LLM conversation or long-context state by saving intermediate hidden states instead of full KV cache or raw tokens. Hidden states are ~2x smaller than KV cache and can be projected into KV cache with cheap matrix ops. Combined with a scheduler that mixes methods and a chunked storage layout, HCache cuts first-token latency (TTFT) and storage needs vs. KV offload and vs. recomputation, while adding negligible token-generation overhead.
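As a rough illustration of what "projecting hidden states into KV cache" means, the sketch below restores K and V for one layer with two matrix products. It is a minimal sketch, not the paper's implementation: the names `matmul` and `restore_kv` are hypothetical, the weights are toy values, and it assumes the saved hidden state is exactly the input to the layer's key and value projections (normalization and positional encoding are ignored). A real system would run these products as GPU GEMMs rather than in plain Python.

```python
def matmul(a, b):
    """Plain-Python matrix product; stands in for a GPU GEMM."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def restore_kv(hidden, w_k, w_v):
    """Rebuild one layer's K and V from saved hidden states (tokens x d_model)."""
    return matmul(hidden, w_k), matmul(hidden, w_v)


# Toy example: 2 tokens, d_model = 2 (illustrative values only).
hidden = [[1.0, 2.0], [0.5, -1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0]]  # identity, so K equals the hidden states
w_v = [[2.0, 0.0], [0.0, 2.0]]  # scales the hidden states by 2
k, v = restore_kv(hidden, w_k, w_v)
```

Only `hidden` has to be read from storage; `k` and `v` together hold twice as many values, which is where the I/O saving comes from.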

Problem Statement

Stateful LLMs (multi-round chat, RAG) need to restore context (the KV cache) when GPU memory is limited. The existing choices, recomputing from tokens or offloading the KV cache to host storage, are either compute-heavy or I/O-heavy, and both incur high latency or storage cost. The paper asks: can we restore state faster by using a middle representation that trades some compute for less I/O?

Main Contribution

HCache: store per-layer hidden states (intermediate activations) to restore KV cache with cheaper projection ops instead of full recompute or whole-KV transfer.

Bubble-free restoration scheduler: profile hardware and split layers to mix hidden-state restoration with token recompute or KV offload so IO and compute finish together.
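The scheduling idea can be pictured with a small planner. This is a simplified sketch under stated assumptions, not the paper's algorithm: it assumes I/O and compute run fully in parallel, that every layer costs the same, and that the per-layer times (all hypothetical inputs here) come from profiling. The function name `plan_restore` and its parameters are illustrative.

```python
def plan_restore(num_layers, io_hidden, proj, recompute, io_kv):
    """Pick how many layers to restore from hidden states.

    The remaining layers use one complementary method: token recompute
    (compute only) or KV offload (I/O only). Per-layer times share one unit.
    Returns (makespan, n_hidden, complement), minimizing max(IO, compute).
    """
    best = None
    for n_hidden in range(num_layers + 1):
        rest = num_layers - n_hidden
        candidates = {
            # name: (total I/O time, total compute time)
            "recompute": (n_hidden * io_hidden, n_hidden * proj + rest * recompute),
            "kv_offload": (n_hidden * io_hidden + rest * io_kv, n_hidden * proj),
        }
        for name, (io_time, compute_time) in candidates.items():
            plan = (max(io_time, compute_time), n_hidden, name)
            if best is None or plan[0] < best[0]:
                best = plan
    return best


# Illustrative, unmeasured numbers for an I/O-bound machine.
plan = plan_restore(num_layers=32, io_hidden=2.0, proj=1.0, recompute=6.0, io_kv=4.0)
```

With these made-up inputs, restoring all 32 layers from hidden states would leave the GPU idle while reads finish; handing a few layers to recompute shortens the total because the two streams end closer together.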

Key Findings

Saving hidden states halves the I/O size compared with offloading KV cache.

Numbers: hidden states = 0.5× KV cache

Practical Use: Store hidden states to cut host-GPU transfer volume by ~2×; useful when PCIe/SSD bandwidth is the bottleneck.

Evidence Ref: §3.2; Figure 1

Projecting hidden states into KV cache needs much less compute than full token recomputation.

Numbers: recompute-from-hidden ≥6× faster than recompute-from-tokens

Practical Use: Use HCache to avoid heavy attention and FFN passes during restoration and lower CPU/GPU compute cost for misses.

Evidence Ref: §3.2 (theoretical) and §6.2 (empirical)
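A back-of-envelope estimate shows where the 0.5× and 6× figures can come from. The sketch assumes a standard transformer layer with multi-head attention (K and V each as wide as the hidden state), a two-matrix FFN with 4× expansion, and 2 FLOPs per multiply-add; it ignores the attention-score term, which grows with context length and would only widen the gap. These are assumptions for illustration, not the paper's derivation.

```python
def per_token_layer_bytes(d_model, bytes_per_elem=2):
    """Bytes stored per token per layer: (hidden state, KV cache)."""
    hidden = d_model * bytes_per_elem
    kv = 2 * d_model * bytes_per_elem  # one K vector plus one V vector
    return hidden, kv


def per_token_layer_flops(d_model, ffn_mult=4):
    """FLOPs per token per layer: (hidden-to-KV projection, full layer)."""
    gemm = 2 * d_model * d_model            # one d_model x d_model projection
    projection = 2 * gemm                   # K and V projections only
    attn_linear = 4 * gemm                  # Q, K, V, and output projections
    ffn = 2 * (2 * d_model * ffn_mult * d_model)  # up and down projections
    return projection, attn_linear + ffn


hidden_bytes, kv_bytes = per_token_layer_bytes(4096)
proj_flops, full_flops = per_token_layer_flops(4096)
size_ratio = hidden_bytes / kv_bytes      # fraction of KV size that must be read
flops_ratio = full_flops / proj_flops     # how much compute projection avoids
```

Models that use grouped-query attention have a smaller KV cache per token, so the size ratio under that design would differ from this estimate.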

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
TTFT (Time To First Token)	up to 1.93× faster vs KV offload; up to 5.73× vs recomputation	KV offload; token recomputation	1.93× (KV offload), 5.73× (recompute)	L-Eval and ShareGPT4 traces	§6.1.1, §6.1.2; Figures 9,10	Figure 10, Figure 9
Per-token storage	HCache needs 1.92–2.40× less space per token	KV offload	1.92–2.40× reduction	models: 7B/13B/30B (Table 3)	§6.1.3; Table 3	Table 3

What To Try In 7 Days

Profile your serving hardware (GPU FLOPS vs SSD/PCIe bandwidth) to see if HCache fits your balance.

Instrument current TTFT and GPU cache miss rates on multi-round sessions or RAG workloads.

Prototype saving per-layer hidden states for a small model (7B) using the paper's two-stage snapshot to host DRAM and measure TTFT/TBT changes.
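For the instrumentation step, a small helper like the one below can time any token stream. It is a sketch: `measure_stream` is a hypothetical name, and it assumes your serving stack exposes generated tokens as a Python iterator that starts work when first consumed. The `clock` parameter exists only so the helper can be checked with a fake clock.

```python
import time


def measure_stream(token_iter, clock=time.perf_counter):
    """Return (ttft, mean_tbt) in seconds for an iterator that yields tokens.

    ttft is the delay until the first token; mean_tbt is the average gap
    between consecutive tokens, or None if fewer than two tokens arrive.
    """
    start = clock()
    stamps = [clock() for _ in token_iter]  # one timestamp per received token
    if not stamps:
        raise ValueError("stream produced no tokens")
    ttft = stamps[0] - start
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    mean_tbt = sum(gaps) / len(gaps) if gaps else None
    return ttft, mean_tbt
```

Running it on sessions that hit the GPU cache and on sessions that miss gives the baseline to compare a hidden-state prototype against.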

Agent Features

Memory

short-term hidden-state storage

Optimization Features

Infra Optimization

aggregated SSD bandwidth via round-robin chunk placement

multi-GPU shard reads + allgather to parallelize reads

Model Optimization

Accuracy

System Optimization

chunk-based storage layout (64-token chunks across SSDs)

two-stage snapshot to host DRAM to avoid small writes

GPUDirect Storage via SPDK + GDRCopy
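The chunked layout with round-robin placement can be pictured as a simple mapping from chunk index to device. This is a minimal sketch: only the 64-token chunk size comes from the summary; the function name, the slot numbering, and the tuple format are hypothetical.

```python
CHUNK_TOKENS = 64  # chunk size stated in the summary


def chunk_placement(num_tokens, num_ssds, chunk_tokens=CHUNK_TOKENS):
    """Map each chunk of a sequence to (chunk_id, ssd_index, slot_on_ssd)."""
    num_chunks = -(-num_tokens // chunk_tokens)  # ceiling division
    return [(c, c % num_ssds, c // num_ssds) for c in range(num_chunks)]


# A 300-token context spread over 4 SSDs needs 5 chunks.
layout = chunk_placement(300, 4)
```

Because consecutive chunks land on different devices, a long context can be read from all SSDs at once, which is the aggregated-bandwidth effect listed under Infra Optimization.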

Inference Optimization

KV cache restoration via hidden-state projection

bubble-free scheduler to balance compute and IO

layer-wise partitioning for optimized GEMM sizes

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset

https://arxiv.org/abs/2307.11088 (L-Eval reference)

Risks & Boundaries

Limitations

Benefit depends on hardware balance: if GPU compute is very slow relative to I/O, gains shrink unless scheduler partitions appropriately (§6.2).

Requires host storage (SSD/DRAM) and GPUDirect support for best I/O performance; implementation complexity is non-trivial (§5).

When Not To Use

When GPU cache hit ratio is very high (>90%), since restorations are rare and benefits shrink.

When target GPUs are extremely low-power (very slow GEMM) and storage bandwidth is high; scheduler may mitigate but gains can be small.

Failure Modes

Pipeline bubble misconfiguration: poor partitioning can waste IO or compute and make HCache slower than offload (observed in IO-heavy settings without bubble-free scheduler).

Small-write stalls: naive direct writes of hidden states to SSD can stall decoding unless two-stage saving is used (§6.3.3).
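One way to read the two-stage saving idea is as a buffer that absorbs per-token writes in host memory and hands storage only large chunks. The class below is a sketch under that reading, not the paper's code: `TwoStageSaver` and `write_chunk` are hypothetical names, and the flush here is synchronous, whereas a real system would presumably flush in the background so decoding is never blocked.

```python
class TwoStageSaver:
    """Stage 1: append per-token hidden states to a host-memory buffer.
    Stage 2: hand full chunks to storage as one large write."""

    def __init__(self, write_chunk, chunk_tokens=64):
        self.write_chunk = write_chunk    # callable that persists one chunk
        self.chunk_tokens = chunk_tokens
        self.buffer = []

    def append(self, hidden_state):
        """Buffer one token's hidden state; flush when a chunk is full."""
        self.buffer.append(hidden_state)
        if len(self.buffer) >= self.chunk_tokens:
            self.flush()

    def flush(self):
        """Write out whatever is buffered, including a partial last chunk."""
        if self.buffer:
            self.write_chunk(self.buffer)
            self.buffer = []


# Toy run: 10 tokens with 4-token chunks, collecting writes in a list.
writes = []
saver = TwoStageSaver(writes.append, chunk_tokens=4)
for token_state in range(10):
    saver.append(token_state)
saver.flush()  # end of session: persist the partial chunk
```

Ten appends result in three storage writes instead of ten, which is the effect that avoids small-write stalls.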

Core Entities

Models

Llama2-7B, Llama2-13B, OPT-30B

Metrics

TTFT (Time To First Token)

TBT (Time Between Tokens)

restoration speed (tokens/sec)

per-token storage size

Datasets

ShareGPT4, L-Eval

Benchmarks

multi-round conversation trace (ShareGPT4)

long-context tasks (L-Eval)

Context Entities

Models

Gemma, Phi2, Qwen1.5
