Store KV cache as compact PQ embeddings to fetch only relevant keys for long-context LLMs

July 1, 20248 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui

Links

Abstract / PDF

Why It Matters For Business

PQCache reduces GPU memory needs for long-context LLMs while keeping or improving accuracy, lowering hardware cost and enabling longer-context features without costly GPU scaling.

Summary TLDR

PQCache treats KVCache (keys/values used in self-attention) as an embedding retrieval problem. It compresses keys with Product Quantization (PQ) during prefilling, stores PQ codes/centroids on CPU/GPU, and at decode time uses approximate PQ search to fetch a small top-k subset of key-value pairs. This reduces GPU memory pressure while retaining or improving model quality across long-context benchmarks (e.g., +4.60% on InfiniteBench) and keeps per-token latency low via overlapping, prefetching, and a block-level GPU cache.

Problem Statement

KVCache (stored keys and values) grows linearly with sequence length and quickly exceeds GPU memory for long-context inference. Existing methods that drop or offload KVCache either hurt model quality or add unacceptable latency. A solution must keep model quality while minimizing I/O and latency.

Main Contribution

Reframe KVCache management as an approximate nearest-neighbor (ANNS) retrieval problem and apply Product Quantization (PQ) to keys

PQCache system that: PQ-compresses keys per head/layer during prefilling; keeps PQ centroids on GPU; uses PQ search to find top-k relevant tokens and fetch only their KV pairs

System-level optimizations: asynchronous PQ construction, overlapping GPU–CPU communication, prefetching PQ codes, and a block-level GPU cache to hide I/O latency

Key Findings

PQCache improves aggregate InfiniteBench scores versus prior selective-attention methods

Numbers+4.60% avg score vs baselines on InfiniteBench

On LongBench PQCache beats baselines when using fewer tokens

Numbers+1.74% (1/5 tokens) and +3.90% (1/10 tokens) avg vs baselines on LongBench

PQCache keeps decoding latency low and stable with sequence length

NumbersPer-token latency remains below human reading speed (~0.18s/token) and TT2T is near lowest among methods

Block-level GPU cache meaningfully reduces token fetch time

NumbersBlock-level cache reduces TPOT by 26.3% (4K) and 32.8% (8K); cache hit-rate ≈0.5–0.6 with 32 blocks

PQ structures cost is small and controllable

NumbersExample: PQ centroids ≈16 MB for m=2,b=6 on a 7B GQA model; PQ codes communicate ≤1/128 of original keys for chosen PQs

Results

InfiniteBench average score delta vs baselines

Value+4.60%

Baselineexisting selective-attention/offloading baselines

LongBench average score delta vs baselines

Value+1.74% (1/5 tokens); +3.90% (1/10 tokens)

Baselineexisting methods (SPARQ, InfLLM, H2O, SnapKV, PyramidKV)

GPU cache speedup on TPOT

ValueTPOT reduced 26.3% (4K cache) and 32.8% (8K cache) vs no-cache

Baselineno GPU cache

Who Should Care

What To Try In 7 Days

Run PQCache on a dev GPU+CPU server to offload KVCache and verify end-to-end latency on your longest prompts

Tune PQ hyperparameters (m, b) with low iterations to fit CPU budget, then test selective-attention ratios (e.g., 1/5, 1/10)

Add a small block-level GPU cache (32 blocks) and measure hit-rate and per-token latency to balance cost and speed

Agent Features

Memory

  • KVCache offload to CPU
  • block-level GPU cache for frequent blocks

Optimization Features

Token Efficiency

  • selective attention via approximate top-k retrieval

Infra Optimization

  • use CPU memory for full KVCache storage
  • trade CPU clustering iterations vs latency

System Optimization

  • asynchronous GPU-to-CPU offload
  • adaptive K-Means iterations to avoid blocking GPU
  • pre-fetching of next-layer PQ codes

Inference Optimization

  • Product Quantization (PQ) on keys for ANNS-based selective attention
  • Overlapped K-Means clustering on CPU with GPU computation
  • Prefetching PQ codes and centroids to GPU
  • Block-level GPU cache (LFU/LRU) to reduce fetch latency

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Performance depends on CPU clustering capacity; limited clustering iterations can reduce accuracy
  • PQ is an approximation: poor centroids or too few clusters hurt retrieval recall
  • Block-level cache and partitioning assumptions may not match all workloads
  • Long-running outputs may require periodic PQ re-construction

When Not To Use

  • GPU-only environments without spare CPU or slow CPU where clustering would block GPU
  • Workloads with extremely strict single-token latency constraints that cannot tolerate any CPU round-trip
  • Scenarios where per-token exact attention over the full context is required for correctness

Failure Modes

  • PQ approximation misses important keys causing degraded generation
  • adaptive K-Means clipping too aggressive, producing poor centroids
  • cache churn or mis-partitioning reduces hit-rate and increases fetch latency

Core Entities

Models

  • Llama-3.1-8B
  • Mistral-7B-Instruct-v0.2
  • Llama-3.1-70B

Metrics

  • average task score
  • Accuracy
  • Rouge-L
  • F1
  • Time To 2nd Token (TT2T)
  • Time Per Output Token (TPOT)

Datasets

  • LongBench
  • InfiniteBench
  • HotPotQA
  • GSM8k CoT
  • Needle-in-a-Haystack
  • XSUM
  • NarrativeQA
  • Qasper
  • MultiFieldQA
  • 2WikiMQA
  • MultiNews
  • TREC
  • TriviaQA
  • SAMSum
  • Musique
  • GovReport
  • QMSum

Benchmarks

  • LongBench
  • InfiniteBench