Compress KV cache to sub-4-bit with <0.1 PPL loss and enable million‑to‑10M token inference

January 31, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper evaluates across many models and datasets, provides kernel code and calibration timings, and reports both memory and latency gains; production integration needs engineering (sparse memory handling and kernel integration).

Citations: 9

Evidence Strength: 0.90

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

Links

Abstract / PDF / Code

Why It Matters For Business

Cut KV cache memory 3–7× and preserve accuracy so you can serve much longer contexts on existing GPUs, reducing infrastructure cost or enabling new long-document features.

Who Should Care

Summary TLDR

KVQuant is a practical recipe to quantize the KV cache (stored Keys and Values) so LLMs can run with very long contexts. It combines: per-channel Key quantization done before RoPE (rotary embeddings), a sensitivity-weighted non-uniform datatype (nuqX), per-vector dense-and-sparse outlier handling, attention-sink-aware retention of the first token, and custom CUDA kernels. On LLaMA-family and Mistral models KVQuant (nuq3 + 1% outliers) keeps perplexity within ~+0.07 on Wikitext-2 while cutting KV cache memory ~4.8×; kernels also report up to ~1.7× speedups vs fp16 matvecs. The method enables LLaMA-7B with 1M tokens on a single A100 and 10M tokens on 8 GPUs.

Problem Statement

For long-context inference the KV cache (stored Key/Value activations) dominates GPU memory and bandwidth. Existing activation quantization breaks at ultra-low bits (<4-bit) because outliers, channel structure, and RoPE rotations skew quantization ranges. The field needs a practical low-bit KV cache quantization method that keeps accuracy and reduces memory/bandwidth.
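To make the "channel structure" point concrete, here is a minimal sketch on synthetic data (not the paper's code or measurements). It assumes a toy Key matrix in which one channel carries a consistently large offset, then compares 3-bit min-max quantization done per token against the same quantizer applied per channel. The names `quantize_uniform`, `keys`, and `OUTLIER_CHANNEL` are hypothetical and chosen for illustration only.

```python
import random

def quantize_uniform(values, bits):
    """Min-max uniform quantization; returns the dequantized values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]

def mse(original, approx):
    """Mean squared error between two matrices given as lists of rows."""
    flat_o = [v for row in original for v in row]
    flat_a = [v for row in approx for v in row]
    return sum((o - a) ** 2 for o, a in zip(flat_o, flat_a)) / len(flat_o)

# Synthetic Keys (assumption): rows are tokens, columns are channels,
# and one channel has a large, consistent offset.
rng = random.Random(0)
NUM_TOKENS, NUM_CHANNELS, OUTLIER_CHANNEL = 64, 16, 3
keys = []
for _ in range(NUM_TOKENS):
    row = [rng.gauss(0.0, 1.0) for _ in range(NUM_CHANNELS)]
    row[OUTLIER_CHANNEL] += 30.0
    keys.append(row)

# Per-token: every row shares one range, which the outlier channel stretches.
per_token = [quantize_uniform(row, 3) for row in keys]

# Per-channel: every column gets its own range, so the offset is absorbed.
columns = [list(col) for col in zip(*keys)]
per_channel_cols = [quantize_uniform(col, 3) for col in columns]
per_channel = [list(row) for row in zip(*per_channel_cols)]

per_token_err = mse(keys, per_token)
per_channel_err = mse(keys, per_channel)
print(f"per-token MSE:   {per_token_err:.4f}")
print(f"per-channel MSE: {per_channel_err:.4f}")
```

In this toy setup the per-channel error is much smaller, because the outlier channel no longer dictates the quantization range of every other channel. Real Key distributions are more varied, so the size of the gap here should not be read as a result from the paper.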

Main Contribution

Per-channel Key quantization applied before RoPE to align with Key outlier channels and avoid RoPE-induced mixing.

nuqX: per-layer sensitivity-weighted non-uniform datatypes computed offline to place quantization signposts where they matter.
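One way to picture a sensitivity-weighted non-uniform datatype is as a weighted 1-D k-means over calibration values, where the centroids become the quantization signposts. The sketch below is an illustration under assumptions, not the paper's implementation: the `weights` list is a random stand-in for whatever sensitivity measure is used, and `weighted_kmeans_1d` and `encode` are hypothetical helper names.

```python
import random

def weighted_kmeans_1d(values, weights, bits, iters=25):
    """Lloyd iterations in 1-D; each point pulls its centroid by its weight."""
    k = 1 << bits
    ordered = sorted(values)
    # Initialise centroids at evenly spaced quantiles of the data.
    centroids = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    for _ in range(iters):
        sums, wsums = [0.0] * k, [0.0] * k
        for v, w in zip(values, weights):
            j = min(range(k), key=lambda c: abs(v - centroids[c]))
            sums[j] += w * v
            wsums[j] += w
        centroids = [sums[j] / wsums[j] if wsums[j] > 0 else centroids[j]
                     for j in range(k)]
    return sorted(centroids)

def encode(values, centroids):
    """Map each value to the index of its nearest signpost (the stored code)."""
    return [min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values]

def weighted_mse(values, weights, approx):
    return sum(w * (v - a) ** 2 for v, w, a in zip(values, weights, approx)) / sum(weights)

rng = random.Random(0)
values = [rng.gauss(0.0, 1.0) for _ in range(2000)]
weights = [rng.uniform(0.1, 1.0) for _ in values]  # stand-in sensitivities (assumption)

signposts = weighted_kmeans_1d(values, weights, bits=3)
codes = encode(values, signposts)
nuq_approx = [signposts[c] for c in codes]  # lookup-table dequantization

# Baseline: 3-bit uniform grid over the min-max range.
lo, hi = min(values), max(values)
step = (hi - lo) / 7
uniform_approx = [lo + round((v - lo) / step) * step for v in values]

nuq_err = weighted_mse(values, weights, nuq_approx)
uniform_err = weighted_mse(values, weights, uniform_approx)
print(f"nuq3 weighted MSE:    {nuq_err:.4f}")
print(f"uniform weighted MSE: {uniform_err:.4f}")
```

Because the signposts are fitted offline, inference only needs the 3-bit codes plus an 8-entry lookup table per calibrated group, which is what makes a LUT-based dequantization kernel a natural fit.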

Key Findings

3-bit KV cache with 1% sparse outliers keeps perplexity near fp16 on Wikitext-2

Numbers: LLaMA-7B PPL 5.75 vs fp16 5.68 (+0.07)

Practical Use: Use nuq3 with 1% per-vector outlier storage to get large KV compression with negligible accuracy loss on evaluated benchmarks.

Evidence Ref: Table 1, Table 12

KV cache memory reduced roughly 4.8× at 3-bit

Numbers: LLaMA-7B KV cache 64.0GB → 13.3GB (≈4.8×)

Practical Use: Expect ~5× reduction in activation memory for 3-bit nuq3-1%: this directly enables much longer contexts or fewer GPUs.

Evidence Ref: Figure 1, Table 1
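The fp16 figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the standard LLaMA-7B shape (32 layers, hidden size 4096), batch size 1, a 128K context taken as 131,072 tokens, and sizes reported in binary gigabytes; `kv_cache_bytes` is a hypothetical helper, not part of the KVQuant code.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, bits_per_value, batch_size=1):
    """Size of the KV cache: one Key and one Value vector per layer per token."""
    num_values = 2 * num_layers * hidden_size * seq_len * batch_size
    return num_values * bits_per_value / 8

GIB = 1024 ** 3
SEQ_LEN = 128 * 1024  # "128K" interpreted as 131,072 tokens (assumption)

fp16_gib = kv_cache_bytes(32, 4096, SEQ_LEN, 16) / GIB
dense_3bit_gib = kv_cache_bytes(32, 4096, SEQ_LEN, 3) / GIB
reported_ratio = 64.0 / 13.3  # figures quoted in this summary

print(f"fp16 KV cache:        {fp16_gib:.1f} GiB")
print(f"3-bit dense part:     {dense_3bit_gib:.1f} GiB")
print(f"reported compression: {reported_ratio:.1f}x")
```

The dense 3-bit codes alone come to 12.0 GiB. The remaining gap to the reported 13.3 GB is presumably the 1% sparse outliers and quantization metadata, although this summary does not give the exact breakdown.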

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Perplexity (LLaMA-7B, Wikitext-2) | 5.75 (nuq3-1%) | 5.68 (fp16) | +0.07 | Wikitext-2 | Table 1 (KVQuant-3bit-1%) | Table 1
KV cache memory (LLaMA-7B, seqlen 128K) | 13.3 GB (nuq3-1%) | 64.0 GB (fp16) | ≈4.8× smaller | memory estimate seqlen=128K | Table 1 and Figure 1 | Table 1

What To Try In 7 Days

Run nuq3-1% KVQuant calibration on your model with 16 calibration samples and test Wikitext-like perplexity.

Keep the first token in fp16 (attention-sink-aware) and extract ~1% outliers per-vector to observe big accuracy gains.

Integrate the provided CUDA kernels (or implement LUT-based dequantization + CSR/CSC outlier storage) to measure latency vs fp16 on your GPU.
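Before touching CUDA, the dense-and-sparse idea from the steps above can be prototyped in plain Python. The sketch below is a simplification under stated assumptions: it picks outliers as the top-k entries by magnitude in each vector (the real method may select them differently), keeps the first token unquantized as a stand-in for fp16 retention, uses a uniform 3-bit grid instead of nuq3, and stores outliers in a CSR-style triple. All function names are hypothetical.

```python
import random

def uniform_dequant(values, bits, lo, hi):
    """Clamp to [lo, hi], quantize on a uniform grid, return dequantized values."""
    scale = (hi - lo) / ((1 << bits) - 1) if hi > lo else 1.0
    return [lo + round((min(max(v, lo), hi) - lo) / scale) * scale for v in values]

def dense_and_sparse(vector, bits=3, outlier_fraction=0.01):
    """Quantize inliers on a tight range; keep the top-k magnitudes exact."""
    k = max(1, round(len(vector) * outlier_fraction))
    by_magnitude = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    outlier_idx = sorted(by_magnitude[:k])
    outlier_set = set(outlier_idx)
    inliers = [v for i, v in enumerate(vector) if i not in outlier_set]
    recon = uniform_dequant(vector, bits, min(inliers), max(inliers))
    for i in outlier_idx:
        recon[i] = vector[i]  # outliers bypass quantization
    return recon, outlier_idx

def to_csr(rows, rows_outlier_idx):
    """CSR-style storage of outliers: values, column indices, row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row, idx in zip(rows, rows_outlier_idx):
        for i in idx:
            values.append(row[i])
            col_idx.append(i)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def mse(rows_a, rows_b):
    pairs = [(a, b) for ra, rb in zip(rows_a, rows_b) for a, b in zip(ra, rb)]
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)

# Synthetic cache (assumption): 4 tokens x 128 dims with one spike per token.
rng = random.Random(0)
rows = []
for _ in range(4):
    row = [rng.gauss(0.0, 1.0) for _ in range(128)]
    row[rng.randrange(128)] = 50.0
    rows.append(row)

recon_rows, outlier_lists = [], []
for t, row in enumerate(rows):
    if t == 0:  # attention-sink-aware: first token kept unquantized
        recon_rows.append(list(row))
        outlier_lists.append([])
    else:
        recon, idx = dense_and_sparse(row)
        recon_rows.append(recon)
        outlier_lists.append(idx)

csr_values, csr_cols, csr_row_ptr = to_csr(rows, outlier_lists)
plain_rows = [uniform_dequant(r, 3, min(r), max(r)) for r in rows]
sparse_err = mse(rows, recon_rows)
plain_err = mse(rows, plain_rows)
print(f"plain 3-bit MSE:        {plain_err:.4f}")
print(f"dense-and-sparse MSE:   {sparse_err:.4f}")
print(f"CSR row pointers:       {csr_row_ptr}")
```

With 128-dimensional vectors a 1% budget rounds to one outlier per vector, which is enough here because each synthetic row has exactly one spike. On real activations the fraction, selection rule, and sparse layout are tuning choices to validate against perplexity and latency.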

Optimization Features

Token Efficiency
enables longer token windows with same GPUs
Infra Optimization
reduce multi-GPU memory needs for long contexts; optionally run top-k outlier detection on CPU in parallel
Model Optimization
non-uniform datatypes (nuqX); per-channel Key quantization
System Optimization
custom CUDA kernels with LUT dequantize; balanced sparse matvec kernel; CSR/CSC sparse layout for outliers
Inference Optimization
per-token Value quantization; pre-RoPE Key quantization; per-vector dense-and-sparse outlier extraction; attention-sink-aware retention of first token; offline per-layer calibration for Keys; online per-token outlier/scale computation for Values

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

The work targets inference only; it does not address the challenges of training with long contexts (>100k tokens).

Latency benchmarks focus on memory-bandwidth-bound generation, not batched prefill/prompt compression.

When Not To Use

When you require full-precision activations for downstream tasks (e.g., precise numeric outputs).

If you cannot modify inference kernels or add custom CUDA implementations.

Failure Modes

Very low-bit (2-bit) without proper outlier extraction can cause large perplexity regressions.

Incorrect offline calibration for Keys may hurt accuracy if outliers are not handled.

Core Entities

Models

LLaMA, Llama-2, Llama-3, Mistral

Metrics

perplexity, passkey retrieval success rate, latency (microseconds), KV cache size (GB)

Datasets

Wikitext-2, C4, LongBench, RULER, passkey retrieval benchmark

Benchmarks

Wikitext-2 perplexity, C4 perplexity, LongBench, RULER, passkey retrieval