Quantize weights + KV cache (not all activations) to save large memory with much less accuracy loss

Overview

Decision Snapshot: Ready For Pilot

Scores reflect clear, reproducible PTQ experiments on multiple LLaMA variants with ablations; code release is not stated so reproduction requires re-implementation and calibration.

Citations: 2

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie


Why It Matters For Business

WKVQuant cuts decoding memory of 13B models from ~27GB to ~7GB while keeping accuracy near full-precision; this enables cheaper GPU options and larger batch/sequence support without retraining.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

WKVQuant is a post-training quantization (PTQ) recipe that targets model weights and the key/value (KV) cache only. It adds three components: Past-Only Quantization (use the full-precision current KV, quantize the past cache), two-dimensional quantization (channel smoothing + token-wise scaling), and cross-block reconstruction regularization (optimize quantization parameters with a downstream-aware loss). On LLaMA-family models, W4KV4 (4-bit weights + 4-bit KV) reaches memory use close to full weight-activation quantization while keeping accuracy close to weight-only W4. Practical upshot: cut KV memory drastically with small accuracy loss and avoid the large accuracy collapse from quantizing short-lived temporary activations.

Problem Statement

Deploying LLMs is memory-bound: model weights and the growing KV cache take most of the memory. Weight-only quantization keeps accuracy but saves limited memory. Weight+activation quantization saves more memory but often breaks accuracy because temporary activations and activation outliers are sensitive to quantization. What is needed is a practical quantization method that reduces KV cache memory without the accuracy collapse of full activation quantization.

Main Contribution

WKVQuant: a PTQ framework designed to quantize weights and KV cache only.

Past-Only Quantization (POQ): keep current token KV in full precision, quantize only past cached KV during decode.
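
The POQ idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the class name `PastOnlyKVCache`, the helper functions, and the simple per-vector 4-bit asymmetric quantizer are hypothetical stand-ins, not the paper's implementation. The point it shows is the ordering: the current token's K/V enters attention unquantized, and is quantized only when it becomes part of the past cache.

```python
import math

def quantize(vec, bits=4):
    """Asymmetric uniform quantization of one vector (illustrative quantizer)."""
    lo, hi = min(vec), max(vec)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant vectors
    codes = [round((v - lo) / scale) for v in vec]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return [c * scale + lo for c in codes]

class PastOnlyKVCache:
    """Hypothetical cache: past K/V stored in low precision, current K/V kept exact."""

    def __init__(self, bits=4):
        self.bits = bits
        self.keys = []    # quantized past keys
        self.values = []  # quantized past values

    def attend(self, q, k_cur, v_cur):
        # Past entries are dequantized; the current token's K/V is used as-is.
        ks = [dequantize(*e) for e in self.keys] + [k_cur]
        vs = [dequantize(*e) for e in self.values] + [v_cur]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in ks]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out = [sum(wi * v[j] for wi, v in zip(w, vs)) / z for j in range(len(v_cur))]
        # Only after attention is computed does the current K/V join the quantized past.
        self.keys.append(quantize(k_cur, self.bits))
        self.values.append(quantize(v_cur, self.bits))
        return out
```

With an empty cache, the first call returns `v_cur` unchanged, because no quantized entry takes part in that step.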

Key Findings

WKVQuant (W4KV4) maintains long-input task performance close to full precision and weight-only quantization while far outperforming weight+activation (W4A4) on long-context tasks.

Numbers: LLaMA-2-13B Longtext avg: FP16 34.12, GPTQ W4 34.06, OmniQuant W4A4 16.35, WKVQuant W4KV4 32.52

Practical Use: If you need long-context accuracy, try WKVQuant W4KV4 instead of W4A4: similar accuracy to W4 and much better than W4A4 on long-text tasks.

Evidence Ref: Table 2/3

W4KV4 reduces decoding memory almost as much as weight+activation quantization.

Numbers: LLaMA-2-13B bs=1 len=2048 memory: FP16 27.1GB → W4 8.0GB → W4KV4 6.8GB → W4A4 6.8GB

Practical Use: You can cut memory from ~27GB to ~7GB for 13B models using WKVQuant, enabling cheaper GPU deployments with little extra accuracy cost.

Evidence Ref: Table 7
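
A back-of-the-envelope estimate helps show where the memory numbers above come from. The sketch below assumes the standard LLaMA-2-13B shape (40 layers, hidden size 5120) and roughly 13.0e9 parameters, and it counts only weights plus KV cache; quantization scales, zero-points, and temporary activations are ignored, so it will not reproduce the reported figures exactly. The function name `decode_memory_gb` is hypothetical.

```python
def decode_memory_gb(n_params, n_layers, hidden, seq_len, batch, w_bits, kv_bits):
    """Rough decode-time memory: weights + KV cache, in decimal GB."""
    weight_bytes = n_params * w_bits / 8
    # K and V each hold one hidden-size vector per layer, per token, per sequence.
    kv_bytes = 2 * n_layers * hidden * seq_len * batch * kv_bits / 8
    return (weight_bytes + kv_bytes) / 1e9

# Assumed LLaMA-2-13B configuration at bs=1, len=2048.
cfg = dict(n_params=13.0e9, n_layers=40, hidden=5120, seq_len=2048, batch=1)

for name, w_bits, kv_bits in [("FP16", 16, 16), ("W4", 4, 16), ("W4KV4", 4, 4)]:
    print(name, round(decode_memory_gb(**cfg, w_bits=w_bits, kv_bits=kv_bits), 1), "GB")
```

Under these assumptions the estimates land within about 0.6 GB of the reported values. The KV term grows linearly with batch size and sequence length, so the gap between W4 and W4KV4 widens as either increases.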

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Longtext avg (LLaMA-2-13B)	WKVQuant W4KV4 32.52	FP16 34.12	−1.60 vs FP16	LongBench (multi long-text datasets)	Table 2/3	Table 2/3
Perplexity on WikiText2 (LLaMA-2-13B)	WKVQuant W4KV4 5.00 ppl	FP16 4.88 ppl	+0.12 ppl	WikiText2	Table 2 (WikiText2 ppl)	Table 2

What To Try In 7 Days

Run WKVQuant W4KV4 on one LLaMA-family checkpoint and compare memory vs FP16 and W4A4.

Enable Past-Only Quantization in decode to preserve accuracy and measure token-level output quality.

Calibrate quantization parameters on 128 samples of 2048 tokens each and use CRR with k=5; measure perplexity and LongBench scores.
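
For the calibration step, one plausible way to build the 128 segments of 2048 tokens is to draw random contiguous windows from a tokenized corpus. The sketch below assumes random sampling with a fixed seed; the source only states the sample count and length, so the sampling scheme and the name `sample_calibration_segments` are assumptions.

```python
import random

def sample_calibration_segments(token_ids, n_samples=128, seq_len=2048, seed=0):
    """Draw n_samples contiguous windows of seq_len tokens from a token list."""
    if len(token_ids) < seq_len:
        raise ValueError("corpus is shorter than one calibration segment")
    rng = random.Random(seed)  # fixed seed so calibration is repeatable
    last_start = len(token_ids) - seq_len
    starts = [rng.randint(0, last_start) for _ in range(n_samples)]
    return [token_ids[s:s + seq_len] for s in starts]

# Usage with a placeholder corpus of integer token ids.
segments = sample_calibration_segments(list(range(100_000)))
```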

Optimization Features

Token Efficiency

reduces decoding memory, including the KV cache (e.g., 27.1GB → 6.8GB for LLaMA-2-13B at bs=1, len=2048)

Infra Optimization

calibration done on a single A100; optimization time ~3–4 hours (7B≈3h,13B≈4h)

Model Optimization

weight quantization (W4, group size 128)

learnable clipping parameters (γ, β)

OmniQuant-style weight PTQ
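
A minimal sketch of group-wise 4-bit weight quantization with clipping factors may make these items concrete. It follows the OmniQuant-style formulation as commonly described (γ scales the group maximum, β scales the group minimum); here γ and β are plain floats rather than learned parameters, and the function names are hypothetical.

```python
def quantize_weight_group(w, bits=4, gamma=1.0, beta=1.0):
    """Quantize-dequantize one weight group with a clipped asymmetric range."""
    levels = 2 ** bits - 1
    hi = gamma * max(w)  # clipped upper bound
    lo = beta * min(w)   # clipped lower bound
    step = (hi - lo) / levels or 1.0
    zero = -round(lo / step)  # integer zero-point
    out = []
    for v in w:
        q = min(max(round(v / step) + zero, 0), levels)  # clamp to the 4-bit grid
        out.append((q - zero) * step)
    return out

def quantize_weight_row(row, group_size=128, **kwargs):
    """Apply the group quantizer to consecutive groups of one weight row."""
    out = []
    for i in range(0, len(row), group_size):
        out.extend(quantize_weight_group(row[i:i + group_size], **kwargs))
    return out
```

With γ = β = 1 the full range of each group is kept. Values below 1 shrink the range, which trades clipping error on outliers for a finer step on the remaining weights.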

System Optimization

Accuracy

Training Optimization

Cross-block Reconstruction Regularization (CRR, MAE across k=5 blocks)

optimize smoothing s and shift δ via AdamW for 5 epochs
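
The CRR loss can be illustrated with toy blocks. The sketch below assumes that the quantized version of block i is followed by the unmodified downstream blocks, and that the loss is the mean absolute error measured k blocks later; this reading of "MAE across k=5 blocks" and the names `crr_loss` and `run_blocks` are assumptions for illustration, with blocks modeled as plain Python callables on lists.

```python
def mae(a, b):
    """Mean absolute error between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def run_blocks(blocks, x):
    for block in blocks:
        x = block(x)
    return x

def crr_loss(fp_blocks, quant_block, i, x, k=5):
    """MAE between the full-precision and quantized paths, k blocks downstream of i."""
    window = fp_blocks[i:i + k]
    target = run_blocks(window, x)
    # Quantized path: block i is swapped for its quantized version,
    # then the output flows through the remaining blocks of the window.
    approx = run_blocks([quant_block] + window[1:], x)
    return mae(target, approx)

# Toy usage: each "block" scales its input; the quantized block is slightly off.
fp_blocks = [lambda v, a=a: [a * t for t in v] for a in (1.0, 2.0, 0.5, 1.5, 1.0, 3.0)]
loss = crr_loss(fp_blocks, lambda v: [1.1 * t for t in v], i=0, x=[1.0, -2.0])
```

Measuring the error several blocks downstream, rather than at the output of block i alone, penalizes quantization errors according to how much they matter to later layers.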

Inference Optimization

Past-Only Quantization (POQ): full-precision current KV, quantized past KV

Two-dimensional Quantization: static channel smoothing + dynamic token-wise scaling
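
Two-dimensional quantization can be sketched as two steps: a static per-channel transform followed by a dynamic per-token quantizer. The sketch below assumes the channel transform is `(x - shift) / smooth` with calibrated per-channel values, and that each token row then gets its own 4-bit asymmetric range at run time. The function name and this exact form are illustrative assumptions.

```python
def two_dim_quantize(kv, shift, smooth, bits=4):
    """Quantize-dequantize a [tokens][channels] matrix along both dimensions."""
    levels = 2 ** bits - 1
    out = []
    for row in kv:
        # Channel dimension (static): per-channel shift and smoothing from calibration.
        y = [(v - d) / s for v, d, s in zip(row, shift, smooth)]
        # Token dimension (dynamic): range computed per token at run time.
        lo, hi = min(y), max(y)
        scale = (hi - lo) / levels or 1.0
        yq = [round((v - lo) / scale) * scale + lo for v in y]
        # Map back to the original channel scale.
        out.append([v * s + d for v, d, s in zip(yq, shift, smooth)])
    return out

# Usage: channel 2 carries a large offset that a per-channel shift removes.
kv = [[0.1, -0.2, 50.0, 0.3], [0.2, 0.1, 48.0, -0.1]]
approx = two_dim_quantize(kv, shift=[0.0, 0.0, 49.0, 0.0], smooth=[1.0] * 4)
```

Without the channel-wise step, the outlier channel would dominate each token's range and push the small channels onto a single quantization level.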

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Does not quantize temporary activations; in some prefill or very large-batch cases that raises memory compared to full activation quantization.

Cannot fully leverage platform INT8 acceleration because not all tensors share the same low bit-width.

When Not To Use

You need maximal hardware INT8 throughput across all ops and cannot mix precisions.

You plan to quantize temporary activations despite known high sensitivity.

Failure Modes

Disabling POQ sharply reduces accuracy on long-context tasks (ablation shows large drop).

Token-wise outliers can still damage per-token quantization if group sizes are inappropriate.
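
The group-size failure mode can be illustrated with a toy token vector. The sketch below uses a plain per-group 4-bit asymmetric quantizer (an assumption, not the paper's exact quantizer) and compares one group spanning the whole token against smaller groups when a single outlier is present. The function names are hypothetical.

```python
def quantize_token(vec, group_size, bits=4):
    """Quantize-dequantize one token vector in independent groups."""
    levels = 2 ** bits - 1
    out = []
    for i in range(0, len(vec), group_size):
        g = vec[i:i + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / levels or 1.0
        out.extend(round((v - lo) / scale) * scale + lo for v in g)
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# A 128-channel token with small values and one large outlier.
token = [((i * 7) % 11 - 5) / 10 for i in range(128)]
token[3] = 40.0

err_one_group = mean_abs_error(token, quantize_token(token, group_size=128))
err_small_groups = mean_abs_error(token, quantize_token(token, group_size=32))
```

With one group, the outlier stretches the quantization step for every channel. With groups of 32, only the group that contains the outlier pays that cost, so the average error drops.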

Core Entities

Models

LLaMA-2-13B, LLaMA-2-7B, LLaMA-7B, LLaMA-13B, GPTQ, OmniQuant

Metrics

Longtext avg, Zero-shot avg, Perplexity (ppl), Decoding memory (GB), F1 (task datasets)

Datasets

WikiText2; PTB; C4; LongBench (Qasper, 2WikiMultihopQA, HotpotQA, TriviaQA, LCC, MultiFieldQA-en); WikiText2 calibration segments

Benchmarks

Longtext avg (LongBench); Zero-shot avg (PIQA, ARC-Challenge, HellaSwag, WinoGrande); Perplexity (WikiText2, PTB, C4)
