Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
PQCache reduces GPU memory needs for long-context LLMs while keeping or improving accuracy, lowering hardware cost and enabling longer-context features without costly GPU scaling.
Summary TLDR
PQCache treats KVCache (keys/values used in self-attention) as an embedding retrieval problem. It compresses keys with Product Quantization (PQ) during prefilling, stores PQ codes/centroids on CPU/GPU, and at decode time uses approximate PQ search to fetch a small top-k subset of key-value pairs. This reduces GPU memory pressure while retaining or improving model quality across long-context benchmarks (e.g., +4.60% on InfiniteBench) and keeps per-token latency low via overlapping, prefetching, and a block-level GPU cache.
Problem Statement
KVCache (stored keys and values) grows linearly with sequence length and quickly exceeds GPU memory for long-context inference. Existing methods that drop or offload KVCache either hurt model quality or add unacceptable latency. A solution must keep model quality while minimizing I/O and latency.
Main Contribution
Reframe KVCache management as an approximate nearest-neighbor (ANNS) retrieval problem and apply Product Quantization (PQ) to keys
PQCache system that: PQ-compresses keys per head/layer during prefilling; keeps PQ centroids on GPU; uses PQ search to find top-k relevant tokens and fetch only their KV pairs
System-level optimizations: asynchronous PQ construction, overlapping GPU–CPU communication, prefetching PQ codes, and a block-level GPU cache to hide I/O latency
Key Findings
PQCache improves aggregate InfiniteBench scores versus prior selective-attention methods
On LongBench PQCache beats baselines when using fewer tokens
PQCache keeps decoding latency low and stable with sequence length
Block-level GPU cache meaningfully reduces token fetch time
PQ structures cost is small and controllable
Results
InfiniteBench average score delta vs baselines
LongBench average score delta vs baselines
GPU cache speedup on TPOT
Who Should Care
What To Try In 7 Days
Run PQCache on a dev GPU+CPU server to offload KVCache and verify end-to-end latency on your longest prompts
Tune PQ hyperparameters (m, b) with low iterations to fit CPU budget, then test selective-attention ratios (e.g., 1/5, 1/10)
Add a small block-level GPU cache (32 blocks) and measure hit-rate and per-token latency to balance cost and speed
Agent Features
Memory
- KVCache offload to CPU
- block-level GPU cache for frequent blocks
Optimization Features
Token Efficiency
- selective attention via approximate top-k retrieval
Infra Optimization
- use CPU memory for full KVCache storage
- trade CPU clustering iterations vs latency
System Optimization
- asynchronous GPU-to-CPU offload
- adaptive K-Means iterations to avoid blocking GPU
- pre-fetching of next-layer PQ codes
Inference Optimization
- Product Quantization (PQ) on keys for ANNS-based selective attention
- Overlapped K-Means clustering on CPU with GPU computation
- Prefetching PQ codes and centroids to GPU
- Block-level GPU cache (LFU/LRU) to reduce fetch latency
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Performance depends on CPU clustering capacity; limited clustering iterations can reduce accuracy
- PQ is an approximation: poor centroids or too few clusters hurt retrieval recall
- Block-level cache and partitioning assumptions may not match all workloads
- Long-running outputs may require periodic PQ re-construction
When Not To Use
- GPU-only environments without spare CPU or slow CPU where clustering would block GPU
- Workloads with extremely strict single-token latency constraints that cannot tolerate any CPU round-trip
- Scenarios where per-token exact attention over the full context is required for correctness
Failure Modes
- PQ approximation misses important keys causing degraded generation
- adaptive K-Means clipping too aggressive, producing poor centroids
- cache churn or mis-partitioning reduces hit-rate and increases fetch latency
Core Entities
Models
- Llama-3.1-8B
- Mistral-7B-Instruct-v0.2
- Llama-3.1-70B
Metrics
- average task score
- Accuracy
- Rouge-L
- F1
- Time To 2nd Token (TT2T)
- Time Per Output Token (TPOT)
Datasets
- LongBench
- InfiniteBench
- HotPotQA
- GSM8k CoT
- Needle-in-a-Haystack
- XSUM
- NarrativeQA
- Qasper
- MultiFieldQA
- 2WikiMQA
- MultiNews
- TREC
- TriviaQA
- SAMSum
- Musique
- GovReport
- QMSum
Benchmarks
- LongBench
- InfiniteBench

