Quantize weights + KV cache (not all activations) to save substantial memory with far less accuracy loss

February 19, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

2

Authors

Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie

Links

Abstract / PDF

Why It Matters For Business

WKVQuant cuts decoding memory for a 13B model from ~27 GB to ~7 GB (batch size 1, 2048 tokens) while keeping accuracy near full precision. This enables cheaper GPU options and larger batch sizes or longer sequences without retraining.

Summary TLDR

WKVQuant is a post-training quantization (PTQ) recipe that targets only model weights and the key/value (KV) cache. It adds three components: Past-Only Quantization (use the full-precision current KV, quantize the past cache), two-dimensional quantization (channel smoothing + token-wise scaling), and cross-block reconstruction regularization (optimize quantization parameters with a downstream-aware loss). On LLaMA-family models, W4KV4 (4-bit weights + 4-bit KV) reaches memory use close to full weight-activation quantization while keeping accuracy close to weight-only W4. Practical upshot: cut KV memory drastically with small accuracy loss, and avoid the large accuracy collapse that comes from quantizing short-lived temporary activations.

Problem Statement

Deploying LLMs is memory-bound: model weights and the growing KV cache take most memory. Weight-only quantization keeps accuracy but saves limited memory. Weight+activation quantization saves more memory but often breaks accuracy because temporary activations and activation outliers are sensitive. We need a practical quantization that reduces KV cache memory without the accuracy collapse of full activation quantization.

Main Contribution

WKVQuant: a PTQ framework designed to quantize weights and KV cache only.

Past-Only Quantization (POQ): keep current token KV in full precision, quantize only past cached KV during decode.

Two-dimensional Quantization: static channel-wise smoothing plus dynamic token-wise scaling to reduce KV quantization error.

Cross-block Reconstruction Regularization (CRR): optimize quantization parameters with a downstream-aware MAE loss across subsequent blocks (k=5).

Empirical result: W4KV4 saves KV memory like W4A4 but keeps accuracy close to W4 (weight-only) on LLaMA/LLaMA2.
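The Past-Only Quantization idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `PastOnlyCache`, the helpers `quantize` and `dequantize`, and the simple per-token min/max quantizer are all assumptions made for illustration, and plain Python lists stand in for tensors. The point it shows is the ordering: the current token's entry is used in full precision, and it is quantized only when it becomes part of the past.

```python
def quantize(vec, bits=4):
    """Asymmetric uniform quantization of one token's vector (hypothetical helper)."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant vector
    codes = [round((x - lo) / scale) for x in vec]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate real values."""
    return [c * scale + lo for c in codes]


class PastOnlyCache:
    """Keeps past entries quantized; the current entry is served in full precision."""

    def __init__(self, bits=4):
        self.bits = bits
        self.past = []  # list of (codes, scale, lo) tuples

    def step(self, current):
        """Return what attention would see at this step, then archive `current`."""
        # Past tokens come from the low-bit cache; the current token is untouched.
        visible = [dequantize(*q) for q in self.past] + [list(current)]
        # Only now is the current entry quantized, for use in later steps.
        self.past.append(quantize(current, self.bits))
        return visible
```

Calling `step` twice shows the effect: the first token is returned exactly on its own step, and as a 4-bit approximation on the next one.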

Key Findings

WKVQuant (W4KV4) maintains long-input task performance close to full precision and weight-only quantization while far outperforming weight+activation (W4A4) on long-context tasks.

Numbers: LLaMA-2-13B Longtext avg: FP16 34.12, GPTQ W4 34.06, OmniQuant W4A4 16.35, WKVQuant W4KV4 32.52

W4KV4 reduces decoding memory almost as much as weight+activation quantization.

Numbers: LLaMA-2-13B, bs=1, len=2048, memory: FP16 27.1 GB → W4 8.0 GB → W4KV4 6.8 GB → W4A4 6.8 GB
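A rough back-of-envelope check of the KV-cache share of these numbers is sketched below. It assumes the publicly documented LLaMA-2-13B shape (40 layers, hidden size 5120, standard multi-head attention) and ignores the storage for quantization scales and zero points, so it is an estimate rather than a reproduction of the paper's measurement. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(layers, hidden, tokens, batch, bits):
    """Size of the KV cache: two tensors (K and V) per layer,
    each of shape [batch, tokens, hidden], stored at `bits` per value."""
    return 2 * layers * batch * tokens * hidden * bits // 8


GB = 1e9  # decimal gigabytes; the source does not say whether it reports GB or GiB

# Assumed LLaMA-2-13B shape: 40 layers, hidden size 5120.
fp16 = kv_cache_bytes(layers=40, hidden=5120, tokens=2048, batch=1, bits=16)
int4 = kv_cache_bytes(layers=40, hidden=5120, tokens=2048, batch=1, bits=4)

print(f"FP16 KV cache:  {fp16 / GB:.2f} GB")
print(f"4-bit KV cache: {int4 / GB:.2f} GB")
print(f"Saved:          {(fp16 - int4) / GB:.2f} GB")
```

Under these assumptions the cache shrinks from about 1.68 GB to about 0.42 GB, a saving of roughly 1.26 GB, which is in line with the gap between W4 (8.0 GB) and W4KV4 (6.8 GB) reported above. Because the cache grows linearly with batch size and sequence length, the saving grows with both.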

Past-Only Quantization is the single most critical component for accuracy in WKVQuant.

Numbers: Ablation (LLaMA-2-7B): full WKVQuant score 25.29 → without POQ 19.95 (−5.34)

Quantizing temporary activations causes catastrophic degradation.

Numbers: LLaMA-2-13B perplexity: FP16 4.88 → INT4 temporary activations 785.56

Results

Longtext avg (LLaMA-2-13B)

Value: WKVQuant W4KV4 32.52

Baseline: FP16 34.12

Perplexity on WikiText2 (LLaMA-2-13B)

Value: WKVQuant W4KV4 5.00 ppl

Baseline: FP16 4.88 ppl

Decoding memory (bs=1, len=2048, LLaMA-2-13B)

Value: W4KV4 6.8 GB

Baseline: FP16 27.1 GB

Ablation: remove POQ (LLaMA-2-7B)

Value: WKVQuant 25.29 → without POQ 19.95

Baseline: WKVQuant

Temporary activation quantization risk

Value: INT4 temporary activations 785.56 ppl

Baseline: FP16 4.88 ppl

Who Should Care

What To Try In 7 Days

Run WKVQuant W4KV4 on one LLaMA-family checkpoint and compare memory vs FP16 and W4A4.

Enable Past-Only Quantization in decode to preserve accuracy and measure token-level output quality.

Calibrate quantization parameters on 128 samples of 2048 tokens each and use CRR with k=5; measure perplexity and LongBench scores.

Optimization Features

Token Efficiency

  • reduces decoding memory (weights + KV cache), e.g., 27.1 GB → 6.8 GB for LLaMA-2-13B at bs=1, len=2048

Infra Optimization

  • calibration done on a single A100; optimization time ~3–4 hours (7B ≈ 3h, 13B ≈ 4h)

Model Optimization

  • weight quantization (W4, group size 128)
  • learnable clipping parameters (γ, β)
  • OmniQuant-style weight PTQ
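As an illustration of the items above, here is a minimal sketch of group-wise asymmetric weight quantization with clipping factors γ and β, following the general form of OmniQuant's learnable weight clipping as we understand it. In the actual method γ and β are learned during calibration; here they are plain arguments, and the function names `quantize_group` and `quantize_weights` are hypothetical.

```python
def quantize_group(weights, bits=4, gamma=1.0, beta=1.0):
    """Fake-quantize one weight group using the clipped range
    [beta * min(W), gamma * max(W)] and return the dequantized values."""
    levels = (1 << bits) - 1
    hi = gamma * max(weights)
    lo = beta * min(weights)
    step = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    zero = round(-lo / step)          # integer zero point
    out = []
    for w in weights:
        # Round to the integer grid and clamp; clamping is where clipping bites.
        code = min(max(round(w / step) + zero, 0), levels)
        out.append((code - zero) * step)
    return out


def quantize_weights(row, group_size=128, **kwargs):
    """Quantize each contiguous group of `group_size` weights independently."""
    out = []
    for start in range(0, len(row), group_size):
        out.extend(quantize_group(row[start:start + group_size], **kwargs))
    return out
```

With γ = β = 1 the full range of each group is kept. Choosing γ < 1 shrinks the representable maximum, which trades a larger error on the few largest weights for a finer step on all the others.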

System Optimization

  • Accuracy

Training Optimization

  • Cross-block Reconstruction Regularization (CRR, MAE across k=5 blocks)
  • optimize smoothing s and shift δ via AdamW for 5 epochs
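The cross-block loss named above can be sketched as follows. This only shows how such a loss could be computed, not how it is minimized (the source states that AdamW is used for that). The exact window convention, the function names `mae` and `cross_block_loss`, and the use of plain callables on Python lists as "blocks" are assumptions for illustration.

```python
def mae(a, b):
    """Mean absolute error between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cross_block_loss(fp_blocks, quant_block, x, i, k=5):
    """MAE between the full-precision and quantized paths after k blocks.

    Only block i is replaced by `quant_block`; the following k-1 blocks are
    run unchanged on both paths, so the loss reflects how the quantization
    error of block i propagates downstream (assumed window: blocks i..i+k-1).
    """
    ref = x
    for block in fp_blocks[i:i + k]:
        ref = block(ref)
    out = quant_block(x)
    for block in fp_blocks[i + 1:i + k]:
        out = block(out)
    return mae(ref, out)
```

Setting `k=1` reduces this to an ordinary local reconstruction loss on block i alone, which makes the contrast with the cross-block variant easy to test on toy blocks.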

Inference Optimization

  • Past-Only Quantization (POQ): full-precision current KV, quantized past KV
  • Two-dimensional Quantization: static channel smoothing + dynamic token-wise scaling
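The two-dimensional scheme above can be sketched in a few lines. This is a simplified illustration, not the authors' code: the per-channel shift and smoothing values are assumed to have been calibrated offline (the source names them δ and s), the per-token quantizer is a basic min/max one, and the function names `quantize_2d` and `dequantize_2d` are hypothetical.

```python
def quantize_2d(x, shift, smooth, bits=4):
    """x: list of token rows, each a list of channel values.
    shift, smooth: static per-channel parameters (assumed precalibrated)."""
    levels = (1 << bits) - 1
    packed = []
    for row in x:
        # Channel dimension: static shift and smoothing align channel ranges.
        y = [(v - d) / s for v, d, s in zip(row, shift, smooth)]
        # Token dimension: scale and offset come from this token's own range.
        lo, hi = min(y), max(y)
        step = (hi - lo) / levels or 1.0
        codes = [round((v - lo) / step) for v in y]
        packed.append((codes, step, lo))
    return packed


def dequantize_2d(packed, shift, smooth):
    """Invert quantize_2d: undo the token scaling, then the channel smoothing."""
    out = []
    for codes, step, lo in packed:
        y = [c * step + lo for c in codes]
        out.append([v * s + d for v, d, s in zip(y, shift, smooth)])
    return out
```

The benefit shows up when one channel carries an outlier: without the channel step, that channel dominates each token's range and the small channels collapse onto one or two quantization levels.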

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Does not quantize temporary activations, so during prefill or with very large batches it can use more memory than full activation quantization.
  • Cannot fully leverage platform INT8 acceleration because not all tensors share the same low bit-width.
  • CRR calibration adds optimization time (≈3h for 7B, ≈4h for 13B) prior to deployment.
  • Evaluations are on LLaMA-family models; behavior on other architectures is untested.

When Not To Use

  • You need maximal hardware INT8 throughput across all ops and cannot mix precisions.
  • You plan to quantize temporary activations despite known high sensitivity.
  • You cannot afford the one-time CRR calibration step or per-model tuning.

Failure Modes

  • Disabling POQ sharply reduces accuracy on long-context tasks (ablation shows large drop).
  • Token-wise outliers can still damage per-token quantization if group sizes are inappropriate.
  • Suboptimal calibration dataset or group-size choices (notably group=128 used here) can raise perplexity.

Core Entities

Models

  • LLaMA-2-13B
  • LLaMA-2-7B
  • LLaMA-7B
  • LLaMA-13B
  • GPTQ (baseline quantization method)
  • OmniQuant (baseline quantization method)

Metrics

  • Longtext avg
  • Zero-shot avg
  • Perplexity (ppl)
  • Decoding memory (GB)
  • F1 (task datasets)

Datasets

  • WikiText2
  • PTB
  • C4
  • LongBench (Qasper, 2WikiMultihopQA, HotpotQA, TriviaQA, LCC, MultiFieldQA-en)
  • WikiText2 calibration segments

Benchmarks

  • Longtext avg (LongBench)
  • Zero-shot avg (PIQA, ARC-Challenge, HellaSwag, WinoGrande)
  • Perplexity (WikiText2, PTB, C4)