Quantize weights + KV cache (not all activations) to save large memory with much less accuracy loss

February 19, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Scores reflect clear, reproducible PTQ experiments on multiple LLaMA variants with ablations. A code release is not stated, so reproduction requires re-implementation and calibration.

Citations: 2

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie

Why It Matters For Business

WKVQuant cuts decoding memory of 13B models from ~27GB to ~7GB while keeping accuracy near full-precision; this enables cheaper GPU options and larger batch/sequence support without retraining.

Who Should Care

Summary TLDR

WKVQuant is a post-training quantization (PTQ) recipe that quantizes only model weights and the key/value (KV) cache. It adds three components: Past-Only Quantization (keep the current token's KV in full precision and quantize only the past cache), two-dimensional quantization (static channel smoothing plus dynamic token-wise scaling), and cross-block reconstruction regularization (optimize quantization parameters against a downstream-aware loss). On LLaMA-family models, W4KV4 (4-bit weights + 4-bit KV) reaches memory use close to full weight-activation quantization while keeping accuracy close to weight-only W4. Practical upshot: cut KV memory drastically with small accuracy loss, and avoid the large accuracy collapse that comes from quantizing short-lived temporary activations.

Problem Statement

Deploying LLMs is memory-bound: model weights and the growing KV cache take most of the memory. Weight-only quantization preserves accuracy but saves limited memory. Weight+activation quantization saves more memory but often breaks accuracy, because temporary activations and activation outliers are sensitive to quantization. What is needed is a practical quantization scheme that reduces KV cache memory without the accuracy collapse of full activation quantization.

Main Contribution

WKVQuant: a PTQ framework designed to quantize weights and KV cache only.

Past-Only Quantization (POQ): keep current token KV in full precision, quantize only past cached KV during decode.
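
The Past-Only Quantization idea can be illustrated with a minimal single-head decode step. This is a sketch under stated assumptions, not the authors' implementation: the function names (`fake_quant`, `decode_step_poq`), the per-token asymmetric quantizer, and the use of dequantized floats to stand in for a packed 4-bit cache are all hypothetical choices made for illustration.

```python
import numpy as np

def fake_quant(x, bits=4):
    """Asymmetric uniform quantize-dequantize with one scale per row (token).
    Illustrative quantizer only; the paper's exact scheme may differ."""
    qmax = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    q = np.clip(np.round((x - lo) / scale), 0, qmax)
    return q * scale + lo

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def decode_step_poq(q, k_new, v_new, k_cache_q, v_cache_q, bits=4):
    """One decode step: the current token's K/V stay in full precision,
    only the past cache holds quantized values.

    q, k_new, v_new: (1, d) arrays for the current token.
    k_cache_q, v_cache_q: (t, d) past cache, already quantized
    (stored dequantized here for simplicity).
    """
    # Attention sees quantized past entries plus the unquantized current entry.
    k_all = np.concatenate([k_cache_q, k_new], axis=0)
    v_all = np.concatenate([v_cache_q, v_new], axis=0)
    scores = q @ k_all.T / np.sqrt(q.shape[-1])
    out = softmax(scores) @ v_all
    # Only after it has been used is the current K/V quantized and cached.
    k_cache_q = np.concatenate([k_cache_q, fake_quant(k_new, bits)], axis=0)
    v_cache_q = np.concatenate([v_cache_q, fake_quant(v_new, bits)], axis=0)
    return out, k_cache_q, v_cache_q

# Toy usage: decode four tokens starting from an empty cache.
rng = np.random.default_rng(0)
d = 8
k_cache, v_cache = np.zeros((0, d)), np.zeros((0, d))
for _ in range(4):
    q, k, v = rng.normal(size=(3, 1, d))
    out, k_cache, v_cache = decode_step_poq(q, k, v, k_cache, v_cache)
```

The design point is the ordering: quantization happens when an entry is written to the cache, not before the current step's attention, so each token's own K/V contributes to its output without quantization error.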

Key Findings

WKVQuant (W4KV4) maintains long-input task performance close to full precision and weight-only quantization while far outperforming weight+activation (W4A4) on long-context tasks.

Numbers: LLaMA-2-13B Longtext avg: FP16 34.12, GPTQ W4 34.06, OmniQuant W4A4 16.35, WKVQuant W4KV4 32.52

Practical Use: If you need long-context accuracy, try WKVQuant W4KV4 instead of W4A4: similar accuracy to W4 and much better than W4A4 on long-text tasks.

Evidence Ref: Table 2/3

W4KV4 reduces decoding memory almost as much as weight+activation quantization.

Numbers: LLaMA-2-13B, bs=1, len=2048 memory: FP16 27.1GB → W4 8.0GB → W4KV4 6.8GB → W4A4 6.8GB

Practical Use: You can cut memory from ~27GB to ~7GB for 13B models using WKVQuant, enabling cheaper GPU deployments with little extra accuracy cost.

Evidence Ref: Table 7
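
A back-of-envelope estimate helps show where the memory figures above come from. The sketch below counts only weights and KV cache; it ignores quantization scales and zero-points, temporary activations, and framework overhead, so it is not expected to reproduce Table 7 exactly. The model shape (about 13e9 parameters, 40 layers, hidden size 5120) is assumed from the public LLaMA-2-13B configuration, and `decode_memory_gb` is a hypothetical helper.

```python
def decode_memory_gb(n_params, n_layers, hidden, seq_len, batch,
                     weight_bits, kv_bits):
    """Weights plus KV cache in GB (1e9 bytes).
    Ignores quantization metadata and temporary activations."""
    weight_bytes = n_params * weight_bits / 8
    # K and V each store one hidden-size vector per token per layer.
    kv_bytes = 2 * n_layers * hidden * seq_len * batch * kv_bits / 8
    return (weight_bytes + kv_bytes) / 1e9

# Assumed LLaMA-2-13B shape; bs=1 and len=2048 match the setting quoted above.
cfg = dict(n_params=13e9, n_layers=40, hidden=5120, seq_len=2048, batch=1)

fp16 = decode_memory_gb(**cfg, weight_bits=16, kv_bits=16)
w4 = decode_memory_gb(**cfg, weight_bits=4, kv_bits=16)
w4kv4 = decode_memory_gb(**cfg, weight_bits=4, kv_bits=4)

print(f"FP16 {fp16:.1f} GB, W4 {w4:.1f} GB, W4KV4 {w4kv4:.1f} GB")
```

Under these assumptions the estimate gives roughly 27.7, 8.2, and 6.9 GB, within a few percent of the reported 27.1, 8.0, and 6.8 GB. Because the KV term grows linearly with `seq_len * batch`, the gap between W4 and W4KV4 widens at larger batch sizes or longer contexts.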

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Longtext avg (LLaMA-2-13B) | WKVQuant W4KV4 32.52 | FP16 34.12 | −1.60 vs FP16 | LongBench (multi long-text datasets) | Table 2/3 | Table 2/3
Perplexity on WikiText2 (LLaMA-2-13B) | WKVQuant W4KV4 5.00 ppl | FP16 4.88 ppl | +0.12 ppl | WikiText2 | Table 2 (WikiText2 ppl) | Table 2

What To Try In 7 Days

Run WKVQuant W4KV4 on one LLaMA-family checkpoint and compare memory vs FP16 and W4A4.

Enable Past-Only Quantization in decode to preserve accuracy and measure token-level output quality.

Calibrate quant params on 128 samples of 2048 tokens each and use CRR with k=5; measure perplexity and LongBench scores.
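
For the CRR step, the following sketch shows one plausible reading of a cross-block loss: the error introduced by quantizing block i is measured as mean absolute error at the output of a window of k blocks, not at block i's own output. Several things here are assumptions for illustration and not taken from the paper: the toy residual block, the symmetric weight quantizer, the choice to run the downstream blocks in full precision, and the names `block`, `fake_quant_weight`, and `crr_loss`. The actual method also optimizes quantization parameters against this kind of loss, which the sketch does not do.

```python
import numpy as np

def fake_quant_weight(w, bits=4):
    """Symmetric per-output-row quantize-dequantize of a weight matrix."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def block(x, w):
    """Toy residual block standing in for a transformer block (assumption)."""
    return x + np.tanh(x @ w.T)

def crr_loss(x, weights, i, k=5, bits=4):
    """MAE between full-precision and quantized paths, measured at the end
    of a k-block window starting at block i."""
    last = min(i + k, len(weights))  # clip the window at the final block
    ref = x
    for w in weights[i:last]:
        ref = block(ref, w)
    # Quantize only block i; the remaining blocks in the window run in
    # full precision (an assumption of this sketch).
    out = block(x, fake_quant_weight(weights[i], bits))
    for w in weights[i + 1:last]:
        out = block(out, w)
    return np.abs(out - ref).mean()

# Toy usage: 8 blocks, 4 tokens, 16 channels.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.2, size=(16, 16)) for _ in range(8)]
x = rng.normal(size=(4, 16))
loss_w4 = crr_loss(x, weights, i=0, k=5, bits=4)
```

In a real calibration loop this loss would be minimized over the quantization parameters block by block, using the calibration samples as `x`.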

Optimization Features

Token Efficiency
reduces KV cache memory footprint (e.g., total decoding memory 27.1GB → 6.8GB for LLaMA-2-13B at bs=1, len=2048)
Infra Optimization
calibration done on a single A100; optimization time ~3–4 hours (7B ≈ 3h, 13B ≈ 4h)
Model Optimization
weight quantization (W4, group size 128); learnable clipping parameters (γ, β); OmniQuant-style weight PTQ
System Optimization
Accuracy
Training Optimization
Cross-block Reconstruction Regularization (CRR, MAE across k=5 blocks); optimize smoothing s and shift δ via AdamW for 5 epochs
Inference Optimization
Past-Only Quantization (POQ): full-precision current KV, quantized past KV; Two-dimensional Quantization: static channel smoothing + dynamic token-wise scaling
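
The two-dimensional quantization listed above can be sketched as a static per-channel transform followed by a dynamic per-token quantizer. This is a minimal illustration, not the paper's implementation: in the source, the smoothing s and shift δ are learned with AdamW, whereas here they are derived from calibration statistics as a stand-in, and the function names are hypothetical.

```python
import numpy as np

def tokenwise_fake_quant(x, bits=4):
    """Dynamic per-token (per-row) asymmetric quantize-dequantize."""
    qmax = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    return np.clip(np.round((x - lo) / scale), 0, qmax) * scale + lo

def two_dim_fake_quant(x, shift, smooth, bits=4):
    """Static channel smoothing, then dynamic token-wise quantization.

    x: (tokens, channels); shift, smooth: (channels,) static parameters.
    """
    y = (x - shift) / smooth               # column-wise: align channel ranges
    y_hat = tokenwise_fake_quant(y, bits)  # row-wise: one scale per token
    return y_hat * smooth + shift          # undo the smoothing

rng = np.random.default_rng(0)

# Synthetic calibration data with one outlier channel (index 3).
calib = rng.normal(size=(256, 16))
calib[:, 3] = calib[:, 3] * 20 + 5

# Stand-in for learned parameters: per-channel mean and max deviation.
shift = calib.mean(axis=0)
smooth = np.abs(calib - shift).max(axis=0)

# Fresh KV-like tensor drawn from the same distribution.
x = rng.normal(size=(32, 16))
x[:, 3] = x[:, 3] * 20 + 5

err_plain = np.abs(tokenwise_fake_quant(x) - x).mean()
err_2d = np.abs(two_dim_fake_quant(x, shift, smooth) - x).mean()
```

In this synthetic setup the outlier channel stretches every token's quantization range, so token-wise scaling alone wastes most of the 4-bit grid on it; smoothing the channels first lets the per-token scale fit all channels, and `err_2d` comes out lower than `err_plain`.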

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Does not quantize temporary activations; in some prefill or very large-batch cases this leaves memory use higher than with full activation quantization.

Cannot fully leverage platform INT8 acceleration because not all tensors share the same low bit-width.

When Not To Use

You need maximal hardware INT8 throughput across all ops and cannot mix precisions.

You plan to quantize temporary activations despite known high sensitivity.

Failure Modes

Disabling POQ sharply reduces accuracy on long-context tasks (ablation shows large drop).

Token-wise outliers can still damage per-token quantization if group sizes are inappropriate.

Core Entities

Models

LLaMA-2-13B, LLaMA-2-7B, LLaMA-7B, LLaMA-13B, GPTQ, OmniQuant

Metrics

Longtext avg, Zero-shot avg, Perplexity (ppl), Decoding memory (GB), F1 (task datasets)

Datasets

WikiText2, PTB, C4, LongBench (Qasper, 2WikiMultihopQA, HotpotQA, TriviaQA, LCC, MultiFieldQA-en), WikiText2 calibration segments

Benchmarks

Longtext avg (LongBench), Zero-shot avg (PIQA, ARC-Challenge, HellaSwag, WinoGrande), Perplexity (WikiText2, PTB, C4)