Compress KV cache to sub-4-bit with <0.1 PPL loss and enable million‑to‑10M token inference

Overview

Decision Snapshot: Ready For Pilot

The paper evaluates across many models and datasets, provides kernel code and calibration timings, and reports both memory and latency gains; production integration needs engineering (sparse memory handling and kernel integration).

Citations: 9

Evidence Strength: 0.90

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

Why It Matters For Business

Cut KV cache memory 3–7× and preserve accuracy so you can serve much longer contexts on existing GPUs, reducing infrastructure cost or enabling new long-document features.

Who Should Care

ML Engineer, Engineering Lead, CTO, Founder

Summary TLDR

KVQuant is a practical recipe for quantizing the KV cache (the stored Keys and Values) so that LLMs can run with very long contexts. It combines per-channel Key quantization done before RoPE (rotary position embeddings), a sensitivity-weighted non-uniform datatype (nuqX), per-vector dense-and-sparse outlier handling, attention-sink-aware retention of the first token, and custom CUDA kernels. On LLaMA-family and Mistral models, KVQuant (nuq3 + 1% outliers) keeps Wikitext-2 perplexity close to fp16 (+0.07 on LLaMA-7B) while cutting KV cache memory ~4.8×; the custom kernels are also reported to run up to ~1.7× faster than fp16 matvecs. The method enables LLaMA-7B inference with 1M tokens of context on a single A100 and 10M tokens on 8 GPUs.

Problem Statement

For long-context inference the KV cache (stored Key/Value activations) dominates GPU memory and bandwidth. Existing activation quantization breaks at ultra-low bits (<4-bit) because outliers, channel structure, and RoPE rotations skew quantization ranges. The field needs a practical low-bit KV cache quantization method that keeps accuracy and reduces memory/bandwidth.
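To make the channel-structure point concrete, here is a toy sketch (not the paper's code) that compares 3-bit uniform quantization with one range per token against one range per channel when a single channel carries large values. The synthetic data, the outlier channel index, and all function names are assumptions for illustration; RoPE is not modeled.

```python
import random

def quantize_uniform(values, bits):
    """Asymmetric uniform quantize-then-dequantize of a list of floats."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    if hi == lo:
        return list(values)
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def per_token_error(keys, bits):
    # One range per row (token): the outlier channel stretches every row's range.
    return sum(mse(row, quantize_uniform(row, bits)) for row in keys) / len(keys)

def per_channel_error(keys, bits):
    # One range per column (channel): the outlier channel gets its own range.
    cols = [list(col) for col in zip(*keys)]
    return sum(mse(col, quantize_uniform(col, bits)) for col in cols) / len(cols)

random.seed(0)
num_tokens, num_channels, outlier_channel = 64, 16, 3  # toy sizes (assumed)
keys = [[random.gauss(0.0, 1.0) for _ in range(num_channels)]
        for _ in range(num_tokens)]
for row in keys:
    row[outlier_channel] = 40.0 + random.gauss(0.0, 1.0)  # synthetic outlier channel

err_token = per_token_error(keys, bits=3)
err_channel = per_channel_error(keys, bits=3)
print(f"per-token MSE:   {err_token:.4f}")
print(f"per-channel MSE: {err_channel:.4f}")
```

In this toy setup the per-channel grouping isolates the large channel, so the remaining channels keep a tight range and a much smaller error.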

Main Contribution

Per-channel Key quantization applied before RoPE to align with Key outlier channels and avoid RoPE-induced mixing.

nuqX: per-layer sensitivity-weighted non-uniform datatypes computed offline to place quantization signposts where they matter.
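The idea behind a sensitivity-weighted non-uniform datatype can be sketched as a weighted 1-D k-means over calibration values. This is a minimal illustration under assumptions: the weights below are random placeholders standing in for real sensitivity scores, and `weighted_kmeans_1d` is a hypothetical helper, not a function from the KVQuant repository.

```python
import random

def weighted_kmeans_1d(values, weights, bits, iters=25):
    """Weighted Lloyd's algorithm in 1-D; returns 2**bits sorted centroids."""
    k = 1 << bits
    ordered = sorted(values)
    # Initialise centroids at evenly spaced quantiles of the data.
    centroids = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    for _ in range(iters):
        sums, totals = [0.0] * k, [0.0] * k
        for v, w in zip(values, weights):
            j = min(range(k), key=lambda c: abs(v - centroids[c]))
            sums[j] += w * v
            totals[j] += w
        # Keep the previous centroid if a cluster received no weight.
        centroids = [sums[j] / totals[j] if totals[j] > 0 else centroids[j]
                     for j in range(k)]
    return sorted(centroids)

def weighted_error(values, weights, grid):
    """Weighted mean squared error after rounding each value to its nearest grid point."""
    total = sum(w * min((v - g) ** 2 for g in grid) for v, w in zip(values, weights))
    return total / sum(weights)

random.seed(0)
values = [random.gauss(0.0, 1.0) for _ in range(1000)]   # stand-in calibration data
weights = [random.uniform(0.1, 1.0) for _ in values]     # placeholder sensitivities

signposts = weighted_kmeans_1d(values, weights, bits=3)
lo, hi = min(values), max(values)
uniform_grid = [lo + i * (hi - lo) / 7 for i in range(8)]

err_nuq = weighted_error(values, weights, signposts)
err_uniform = weighted_error(values, weights, uniform_grid)
print(f"non-uniform weighted MSE: {err_nuq:.4f}")
print(f"uniform weighted MSE:     {err_uniform:.4f}")
```

Because the signposts are fitted offline, inference only needs a small lookup table per layer rather than a per-tensor search.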

Key Findings

3-bit KV cache with 1% sparse outliers keeps perplexity near fp16 on Wikitext-2

Numbers: LLaMA-7B PPL 5.75 vs fp16 5.68 (+0.07)

Practical Use: Use nuq3 with 1% per-vector outlier storage to get large KV compression with negligible accuracy loss on evaluated benchmarks.

Evidence Ref: Table 1, Table 12

KV cache memory reduced roughly 4.8× at 3-bit

Numbers: LLaMA-7B KV cache 64.0 GB → 13.3 GB (≈4.8×)

Practical Use: Expect a ~5× reduction in KV cache memory with nuq3-1%; this directly enables much longer contexts or fewer GPUs.

Evidence Ref: Figure 1, Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity (LLaMA-7B, Wikitext-2)	5.75 (nuq3-1%)	5.68 (fp16)	+0.07	Wikitext-2	Table 1 (KVQuant-3bit-1%)	Table 1
KV cache memory (LLaMA-7B, seqlen 128K)	13.3 GB (nuq3-1%)	64.0 GB (fp16)	≈4.8× smaller	memory estimate seqlen=128K	Table 1 and Figure 1	Table 1
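The fp16 figure in the table can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes LLaMA-7B's 32 layers and hidden size 4096, batch size 1, and treats "GB" as GiB and "128K" as 131,072 tokens; `kv_cache_bytes` is a hypothetical helper. It models only the dense quantized values, not the sparse outliers or quantization metadata.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, bits_per_value, batch_size=1):
    """Dense KV cache size in bytes; the factor 2 covers Keys and Values."""
    total_bits = 2 * num_layers * hidden_size * seq_len * batch_size * bits_per_value
    return total_bits / 8

GIB = 1024 ** 3
seq_len = 128 * 1024

fp16_gib = kv_cache_bytes(32, 4096, seq_len, bits_per_value=16) / GIB
dense3_gib = kv_cache_bytes(32, 4096, seq_len, bits_per_value=3) / GIB

print(f"fp16 KV cache:        {fp16_gib:.1f} GiB")    # 64.0 GiB
print(f"3-bit dense KV cache: {dense3_gib:.1f} GiB")  # 12.0 GiB
print(f"reported ratio:       {64.0 / 13.3:.1f}x")    # 4.8x
```

The gap between the 12.0 GiB dense estimate and the reported 13.3 GB is presumably the 1% sparse outliers and related metadata, which this sketch leaves out.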

What To Try In 7 Days

Run nuq3-1% KVQuant calibration on your model with 16 calibration samples and test Wikitext-like perplexity.

Keep the first token in fp16 (attention-sink-aware) and extract ~1% outliers per-vector to observe big accuracy gains.

Integrate the provided CUDA kernels (or implement LUT-based dequantization + CSR/CSC outlier storage) to measure latency vs fp16 on your GPU.
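Before touching kernels, the outlier and lookup-table steps above can be prototyped in plain Python. The sketch below is an assumption-laden toy: it picks outliers by largest magnitude, uses a uniform lookup table in place of a calibrated nuq3 table, and stores outliers in a dict as a stand-in for CSR/CSC. All function names are hypothetical.

```python
import random

def split_outliers(vector, outlier_fraction):
    """Return (dense, sparse); sparse maps index -> value for the largest-magnitude entries."""
    count = max(1, round(len(vector) * outlier_fraction))
    ranked = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    sparse = {i: vector[i] for i in ranked[:count]}
    dense = [0.0 if i in sparse else v for i, v in enumerate(vector)]
    return dense, sparse

def quantize_to_lut(dense, skip, bits):
    """Build a uniform LUT over the non-outlier range; return (codes, lut)."""
    kept = [v for i, v in enumerate(dense) if i not in skip]
    lo, hi = min(kept), max(kept)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    lut = [lo + k * step for k in range(levels + 1)]
    codes = [min(levels, max(0, round((v - lo) / step))) for v in dense]
    return codes, lut

def dequantize(codes, lut, sparse):
    out = [lut[c] for c in codes]   # LUT-based dequantization of the dense part
    for i, v in sparse.items():     # outliers are restored at full precision
        out[i] = v
    return out

def roundtrip(vector, bits, outlier_fraction):
    if outlier_fraction > 0:
        dense, sparse = split_outliers(vector, outlier_fraction)
    else:
        dense, sparse = list(vector), {}
    codes, lut = quantize_to_lut(dense, sparse, bits)
    return dequantize(codes, lut, sparse)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
vector = [random.gauss(0.0, 1.0) for _ in range(1000)]
for i, spike in zip((10, 200, 400, 600, 800), (30.0, -25.0, 40.0, -35.0, 28.0)):
    vector[i] = spike               # synthetic outliers

err_plain = mse(vector, roundtrip(vector, bits=3, outlier_fraction=0.0))
err_sparse = mse(vector, roundtrip(vector, bits=3, outlier_fraction=0.01))
print(f"3-bit, no outliers removed: {err_plain:.4f}")
print(f"3-bit, 1% outliers sparse:  {err_sparse:.4f}")
```

A prototype like this is useful for checking accuracy trends on your own activations; latency conclusions still require the real kernels on your GPU.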

Optimization Features

Token Efficiency

enables longer token windows with same GPUs

Infra Optimization

reduce multi-GPU memory needs for long contexts

optionally run topk outlier detection on CPU in parallel

Model Optimization

non-uniform datatypes (nuqX)

per-channel Key quantization

System Optimization

custom CUDA kernels with LUT dequantize

balanced sparse matvec kernel

CSR/CSC sparse layout for outliers

Inference Optimization

per-token Value quantization

pre-RoPE Key quantization

per-vector dense-and-sparse outlier extraction

attention-sink-aware retention of first token

offline per-layer calibration for Keys

online per-token outlier/scale computation for Values
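As a rough picture of how online per-token Value quantization and first-token retention fit together, here is a toy cache. It is a sketch under assumptions: it uses uniform quantization rather than nuqX, skips outlier extraction, and applies first-token retention to the Value cache for illustration; `ValueCache` is a hypothetical class, not part of the released code.

```python
import random

class ValueCache:
    """Toy Value cache: token 0 stays in full precision, later tokens are quantized per token."""

    def __init__(self, bits):
        self.bits = bits
        self.entries = []  # ("fp", vector) or ("q", codes, lo, step)

    def append(self, value_vector):
        if not self.entries:
            # Attention-sink token: kept unquantized.
            self.entries.append(("fp", list(value_vector)))
            return
        # Range and step are computed online from this token's vector only.
        lo, hi = min(value_vector), max(value_vector)
        levels = (1 << self.bits) - 1
        step = (hi - lo) / levels if hi > lo else 1.0
        codes = [round((v - lo) / step) for v in value_vector]
        self.entries.append(("q", codes, lo, step))

    def read(self, index):
        entry = self.entries[index]
        if entry[0] == "fp":
            return list(entry[1])
        _, codes, lo, step = entry
        return [lo + c * step for c in codes]

random.seed(0)
tokens = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
cache = ValueCache(bits=3)
for token in tokens:
    cache.append(token)

first_exact = cache.read(0) == tokens[0]
max_err = max(abs(a - b) for a, b in zip(cache.read(2), tokens[2]))
print("first token exact:", first_exact)
print(f"max abs error on token 2: {max_err:.4f}")
```

Keys differ from this picture in the source's description: their calibration is done offline per layer, so only Values need scale computation at append time.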

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/SqueezeAILab/KVQuant

Risks & Boundaries

Limitations

The work targets inference only; it does not address the problems of training models on long contexts (>100K tokens).

Latency benchmarks focus on memory-bandwidth-bound generation, not batched prefill/prompt compression.

When Not To Use

When you require full-precision activations for downstream tasks (e.g., precise numeric outputs).

If you cannot modify inference kernels or add custom CUDA implementations.

Failure Modes

Very low-bit (2-bit) without proper outlier extraction can cause large perplexity regressions.

Incorrect offline calibration for Keys may hurt accuracy if outliers are not handled.

Core Entities

Models

LLaMA, Llama-2, Llama-3, Mistral

Metrics

perplexity, passkey retrieval success rate, latency (microseconds), KV cache size (GB)

Datasets

Wikitext-2, C4, LongBench, RULER, passkey retrieval benchmark

Benchmarks

Wikitext-2 perplexity, C4 perplexity, LongBench, RULER, passkey retrieval

