Overview
Production Readiness: 0.7
Novelty Score: 0.6
Cost Impact Score: 0.8
Citation Count: 2
Why It Matters For Business
WKVQuant cuts the decoding memory of a 13B model from ~27GB to ~7GB while keeping accuracy close to full precision. This enables cheaper GPU options and larger batch sizes or longer sequences without retraining.
Summary TLDR
WKVQuant is a post-training quantization (PTQ) recipe that targets model weights and the key/value (KV) cache only. It adds three components: Past-Only Quantization (use the full-precision current KV, quantize the past cache), two-dimensional quantization (channel smoothing + token-wise scaling), and cross-block reconstruction regularization (optimize quantization parameters with a downstream-aware loss). On LLaMA-family models, W4KV4 (4-bit weights + 4-bit KV) achieves memory savings close to full weight-activation quantization while keeping accuracy close to weight-only W4. Practical upshot: cut KV memory drastically with a small accuracy loss, and avoid the large accuracy collapse caused by quantizing short-lived temporary activations.
Problem Statement
Deploying LLMs is memory-bound: model weights and the growing KV cache take up most of the memory. Weight-only quantization preserves accuracy but saves limited memory. Weight+activation quantization saves more memory but often breaks accuracy, because temporary activations and activation outliers are sensitive to quantization. What is needed is a practical quantization scheme that reduces KV cache memory without the accuracy collapse of full activation quantization.
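To make the memory argument concrete, here is a rough back-of-envelope estimate in Python. The architecture numbers (parameter count, layer count, hidden size) are assumed values for a 13B LLaMA-style model and do not come from this summary. The estimate ignores quantization scales, zero-points, and temporary activations, so it will not reproduce the reported figures exactly.

```python
# Back-of-envelope decoding memory estimate (illustrative only).
# Assumed architecture numbers for a 13B LLaMA-style model; they are not
# taken from this summary.
N_PARAMS = 13.0e9   # assumed parameter count
N_LAYERS = 40       # assumed number of transformer blocks
HIDDEN = 5120       # assumed hidden size (heads * head_dim)


def decoding_memory_gb(weight_bits, kv_bits, batch=1, seq_len=2048):
    """Return (weights_gb, kv_gb) in units of 1e9 bytes, ignoring scales/zero-points."""
    weights_bytes = N_PARAMS * weight_bits / 8
    # Each cached token stores one key and one value vector per layer.
    kv_bytes = 2 * N_LAYERS * HIDDEN * batch * seq_len * kv_bits / 8
    return weights_bytes / 1e9, kv_bytes / 1e9


for name, w_bits, kv_bits in [("FP16", 16, 16), ("W4 only", 4, 16), ("W4KV4", 4, 4)]:
    w_gb, kv_gb = decoding_memory_gb(w_bits, kv_bits)
    total = w_gb + kv_gb
    print(f"{name:8s} weights={w_gb:5.2f} GB  kv={kv_gb:5.2f} GB  total={total:5.2f} GB")
```

Calling `decoding_memory_gb(16, 16, batch=16)` shows why the KV cache matters: under these assumptions the FP16 cache at batch size 16 is roughly as large as the FP16 weights, so weight-only quantization leaves about half of the footprint untouched.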
Main Contribution
WKVQuant: a PTQ framework designed to quantize weights and KV cache only.
Past-Only Quantization (POQ): keep current token KV in full precision, quantize only past cached KV during decode.
Two-dimensional Quantization: static channel-wise smoothing plus dynamic token-wise scaling to reduce KV quantization error.
Cross-block Reconstruction Regularization (CRR): optimize quantization parameters with a downstream-aware MAE loss across subsequent blocks (k=5).
Empirical result: W4KV4 saves KV memory like W4A4 but keeps accuracy close to W4 (weight-only) on LLaMA/LLaMA2.
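The Past-Only Quantization item above can be illustrated with a toy single-head decode step in NumPy. This is a sketch under stated assumptions, not the authors' implementation: the class `PastOnlyKVCache` and its helpers are hypothetical names, the per-token asymmetric quantizer is a simple stand-in, and 4-bit codes are stored unpacked in `uint8` for readability.

```python
import numpy as np


def quantize_per_token(x, bits=4):
    """Asymmetric uniform quantization with one (scale, offset) pair per row."""
    qmax = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    q = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo


def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo


class PastOnlyKVCache:
    """Toy cache: past K/V are stored quantized, the current K/V stay full precision."""

    def __init__(self, bits=4):
        self.bits = bits
        self.k_past = []  # one (codes, scale, offset) tuple per cached token
        self.v_past = []

    def attend(self, q_vec, k_new, v_new):
        # Dequantize past entries and append the *unquantized* current token.
        ks = [dequantize(*e) for e in self.k_past] + [k_new[None, :]]
        vs = [dequantize(*e) for e in self.v_past] + [v_new[None, :]]
        keys = np.concatenate(ks, axis=0)
        values = np.concatenate(vs, axis=0)
        scores = keys @ q_vec / np.sqrt(q_vec.shape[0])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out = weights @ values
        # The current token is quantized only after it has been used.
        self.k_past.append(quantize_per_token(k_new[None, :], self.bits))
        self.v_past.append(quantize_per_token(v_new[None, :], self.bits))
        return out
```

The design point is the ordering inside `attend`: the newest key and value contribute to attention at full precision and only enter the cache in quantized form afterwards, so each token is read unquantized exactly once.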
Key Findings
WKVQuant (W4KV4) maintains long-input task performance close to full precision and weight-only quantization while far outperforming weight+activation (W4A4) on long-context tasks.
W4KV4 reduces decoding memory almost as much as weight+activation quantization.
Past-Only Quantization is the single most critical component for accuracy in WKVQuant.
Quantizing temporary activations causes catastrophic degradation.
Results
Longtext avg (LLaMA-2-13B)
Perplexity on WikiText2 (LLaMA-2-13B)
Decoding memory (bs=1, len=2048, LLaMA-2-13B)
Ablation: remove POQ (LLaMA-2-7B)
Temporary activation quantization risk
Who Should Care
What To Try In 7 Days
Run WKVQuant W4KV4 on one LLaMA-family checkpoint and compare memory vs FP16 and W4A4.
Enable Past-Only Quantization in decode to preserve accuracy and measure token-level output quality.
Calibrate quantization parameters on 128 samples of 2048 tokens each and use CRR with k=5; measure perplexity and LongBench scores.
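For the calibration step, one way to draw the samples is to cut random contiguous windows from a tokenized corpus. The helper below is a hypothetical sketch using only the standard library. It assumes `token_ids` is a flat list of token ids produced by whatever tokenizer matches the checkpoint, and the random-window strategy itself is an assumption rather than something this summary specifies.

```python
import random


def sample_calibration_segments(token_ids, n_samples=128, seq_len=2048, seed=0):
    """Draw n_samples random contiguous windows of seq_len tokens from a token stream."""
    if len(token_ids) < seq_len:
        raise ValueError("token stream is shorter than one segment")
    rng = random.Random(seed)
    last_start = len(token_ids) - seq_len
    # randint is inclusive on both ends, so the final window is reachable.
    starts = [rng.randint(0, last_start) for _ in range(n_samples)]
    return [token_ids[s:s + seq_len] for s in starts]


# Usage with a stand-in token stream (real use would pass tokenizer output).
segments = sample_calibration_segments(list(range(10_000)))
print(len(segments), len(segments[0]))
```

Fixing the seed keeps the calibration set identical across runs, which makes perplexity comparisons between configurations easier to interpret.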
Optimization Features
Token Efficiency
- reduces decoding memory, including the KV cache (e.g., 27.1GB→6.8GB for LLaMA-2-13B)
Infra Optimization
- calibration runs on a single A100; optimization takes ~3–4 hours (7B ≈ 3h, 13B ≈ 4h)
Model Optimization
- weight quantization (W4, group size 128)
- learnable clipping parameters (γ, β)
- OmniQuant-style weight PTQ
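The weight-side items above can be sketched as group-wise fake-quantization with clipping. The function below is a hypothetical NumPy illustration: it assumes that γ and β scale each group's maximum and minimum before the step size is computed, and it treats them as fixed scalars, whereas the list above describes them as learnable. It is not the library's or the authors' code.

```python
import numpy as np


def quantize_weight_groups(w, gamma=1.0, beta=1.0, bits=4, group_size=128):
    """Group-wise asymmetric fake-quantization with clipping strengths in (0, 1]."""
    out_features, in_features = w.shape
    if in_features % group_size != 0:
        raise ValueError("in_features must be divisible by group_size")
    groups = w.reshape(out_features, in_features // group_size, group_size)
    qmax = 2 ** bits - 1
    hi = gamma * groups.max(axis=-1, keepdims=True)  # clipped upper bound
    lo = beta * groups.min(axis=-1, keepdims=True)   # clipped lower bound
    scale = np.maximum(hi - lo, 1e-8) / qmax
    zero = np.round(-lo / scale)
    # Values outside the clipped range saturate at 0 or qmax.
    q = np.clip(np.round(groups / scale) + zero, 0, qmax)
    return ((q - zero) * scale).reshape(w.shape)


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 256))
for gamma in (1.0, 0.8):
    err = np.abs(quantize_weight_groups(w, gamma=gamma, beta=gamma) - w).mean()
    print(f"gamma=beta={gamma}: mean abs error {err:.4f}")
```

Shrinking γ and β trades saturation error on the largest weights for a finer step on the rest of the group; making them learnable lets the calibration step pick that trade-off per group.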
System Optimization
- Accuracy
Training Optimization
- Cross-block Reconstruction Regularization (CRR, MAE across k=5 blocks)
- optimize smoothing s and shift δ via AdamW for 5 epochs
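The cross-block loss above can be illustrated with toy residual blocks in NumPy. This sketch only evaluates the loss; it does not run the AdamW optimization. The names `make_block`, `fake_quant`, and `crr_loss` are hypothetical, the blocks are stand-ins for transformer layers, and the choice to quantize only the first block of the window while the following ones stay in full precision is an assumption.

```python
import numpy as np


def make_block(weight):
    """Toy residual block x + tanh(x @ W), standing in for a transformer block."""
    return lambda x: x + np.tanh(x @ weight)


def fake_quant(w, bits=4):
    """Per-tensor asymmetric quantize-dequantize."""
    qmax = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = max(hi - lo, 1e-8) / qmax
    return np.round((w - lo) / scale) * scale + lo


def crr_loss(x, weights, i, k=5, bits=4):
    """MAE between full-precision and quantized paths after blocks i .. i+k-1."""
    y_fp, y_q = x, x
    for j, w in enumerate(weights[i:i + k]):
        y_fp = make_block(w)(y_fp)
        # Assumption: only block i is quantized; later blocks stay full precision.
        y_q = make_block(fake_quant(w, bits) if j == 0 else w)(y_q)
    return np.abs(y_fp - y_q).mean()


rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(16, 16)) for _ in range(8)]
x = rng.normal(size=(4, 16))
print("local loss (k=1):      ", crr_loss(x, weights, i=0, k=1))
print("cross-block loss (k=5):", crr_loss(x, weights, i=0, k=5))
```

With `k=1` this reduces to an ordinary per-block reconstruction loss. A larger `k` measures the error after it has propagated through the following blocks, which is the "downstream-aware" part.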
Inference Optimization
- Past-Only Quantization (POQ): full-precision current KV, quantized past KV
- Two-dimensional Quantization: static channel smoothing + dynamic token-wise scaling
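The two-dimensional quantization item above combines a static per-channel transform with a dynamic per-token quantizer. The NumPy sketch below shows that combination on synthetic data with one outlier channel. It is illustrative only: the per-channel parameters are set here by a min/max heuristic on calibration data (an assumption made for brevity, whereas the summary says s and δ are optimized), and all function names are hypothetical.

```python
import numpy as np


def token_quant_dequant(x, bits=4):
    """Dynamic per-token asymmetric quantization followed by dequantization."""
    qmax = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    return np.round((x - lo) / scale) * scale + lo


def fit_channel_params(calib):
    """Static per-channel shift and smoothing scale (heuristic, not learned)."""
    hi, lo = calib.max(axis=0), calib.min(axis=0)
    delta = (hi + lo) / 2                   # centre each channel
    s = np.maximum((hi - lo) / 2, 1e-8)     # bring channels to a common range
    return s, delta


def two_dim_quant_dequant(x, s, delta, bits=4):
    y = (x - delta) / s                     # channel dimension: static smoothing
    y_hat = token_quant_dequant(y, bits)    # token dimension: dynamic scaling
    return y_hat * s + delta                # undo the smoothing


rng = np.random.default_rng(0)
channel_scale = np.ones(64)
channel_scale[3] = 50.0                     # one synthetic outlier channel
calib = rng.normal(size=(512, 64)) * channel_scale
x = rng.normal(size=(16, 64)) * channel_scale

s, delta = fit_channel_params(calib)
err_token_only = np.abs(token_quant_dequant(x) - x).mean()
err_two_dim = np.abs(two_dim_quant_dequant(x, s, delta) - x).mean()
print(f"token-only MAE: {err_token_only:.4f}  two-dimensional MAE: {err_two_dim:.4f}")
```

Without smoothing, the outlier channel stretches every token's quantization range and the ordinary channels lose almost all resolution; aligning the channels first lets the per-token scale cover them more evenly.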
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Does not quantize temporary activations; in prefill or very large-batch settings this can use more memory than full weight-activation quantization.
- Cannot fully leverage platform INT8 acceleration because not all tensors share the same low bit-width.
- CRR calibration adds optimization time (≈3h for 7B, ≈4h for 13B) prior to deployment.
- Evaluations are on LLaMA-family models; behavior on other architectures is untested.
When Not To Use
- You need maximal hardware INT8 throughput across all ops and cannot mix precisions.
- You plan to quantize temporary activations despite known high sensitivity.
- You cannot afford the one-time CRR calibration step or per-model tuning.
Failure Modes
- Disabling POQ sharply reduces accuracy on long-context tasks (ablation shows large drop).
- Token-wise outliers can still damage per-token quantization if group sizes are inappropriate.
- Suboptimal calibration dataset or group-size choices (notably group=128 used here) can raise perplexity.
Core Entities
Models
- LLaMA-2-13B
- LLaMA-2-7B
- LLaMA-7B
- LLaMA-13B
- GPTQ (PTQ method)
- OmniQuant (PTQ method)
Metrics
- Longtext avg
- Zero-shot avg
- Perplexity (ppl)
- Decoding memory (GB)
- F1 (task datasets)
Datasets
- WikiText2
- PTB
- C4
- LongBench (Qasper, 2WikiMultihopQA, HotpotQA, TriviaQA, LCC, MultiFieldQA-en)
- WikiText2 calibration segments
Benchmarks
- Longtext avg (LongBench)
- Zero-shot avg (PIQA, ARC-Challenge, HellaSwag, WinoGrande)
- Perplexity (WikiText2, PTB, C4)

