Overview
Scores reflect clear, reproducible PTQ experiments on multiple LLaMA variants with ablations. No code release is stated, so reproduction requires re-implementation and calibration.
Citations: 2
Evidence Strength: 0.75
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
WKVQuant cuts the decoding memory of 13B models from ~27 GB to ~7 GB while keeping accuracy near full precision. This enables cheaper GPU options and larger batch sizes or longer sequences without retraining.
Summary TLDR
WKVQuant is a post-training quantization (PTQ) recipe that quantizes only the model weights and the key/value (KV) cache. It adds three components: Past-Only Quantization (keep the current token's KV in full precision and quantize only the past cache), two-dimensional quantization (channel smoothing plus token-wise scaling), and cross-block reconstruction regularization (optimize quantization parameters with a downstream-aware loss). On LLaMA-family models, W4KV4 (4-bit weights + 4-bit KV) brings memory use close to that of full weight-activation quantization while keeping accuracy close to weight-only W4. Practical upshot: cut KV memory drastically with a small accuracy loss and avoid the large accuracy collapse that comes from quantizing short-lived temporary activations.
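The two-dimensional quantization idea can be illustrated with a minimal sketch. It assumes that channel smoothing is a per-channel shift and scale and that token-wise scaling is an asymmetric 4-bit min/max quantizer applied to each token vector. Both are assumptions made for illustration: the summary does not say how the smoothing parameters are obtained, and here they are simply read off calibration statistics. All function names are hypothetical.

```python
def channel_stats(calib):
    """Per-channel shift (midpoint) and scale (half-range) from calibration rows.
    Assumption: derived from min/max statistics, not learned."""
    shifts, scales = [], []
    for c in range(len(calib[0])):
        col = [row[c] for row in calib]
        lo, hi = min(col), max(col)
        shifts.append((hi + lo) / 2)
        scales.append(max((hi - lo) / 2, 1e-8))
    return shifts, scales

def quantize_token(vec, bits=4):
    """Asymmetric min/max quantization of one token vector."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / step) for v in vec]
    return codes, step, lo

def dequantize_token(codes, step, lo):
    return [c * step + lo for c in codes]

def two_dim_roundtrip(vec, shifts, scales, bits=4):
    """Smooth along channels, quantize per token, then undo both steps."""
    smoothed = [(v - s) / a for v, s, a in zip(vec, shifts, scales)]
    restored = dequantize_token(*quantize_token(smoothed, bits))
    return [r * a + s for r, s, a in zip(restored, shifts, scales)]

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Toy data: the last channel carries a persistent outlier.
calib = [
    [0.1, -0.2, 0.3, 40.0],
    [-0.3, 0.1, -0.1, 44.0],
    [0.2, 0.3, -0.2, 36.0],
]
token = [0.15, -0.1, 0.2, 42.0]

shifts, scales = channel_stats(calib)
err_plain = max_abs_error(token, dequantize_token(*quantize_token(token)))
err_smooth = max_abs_error(token, two_dim_roundtrip(token, shifts, scales))
print(err_plain, err_smooth)
```

On this toy input the smoothed path has the smaller worst-case error, because the outlier channel no longer dictates the quantization step for the other channels.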
Problem Statement
Deploying LLMs is memory-bound: model weights and the growing KV cache take most memory. Weight-only quantization keeps accuracy but saves limited memory. Weight+activation quantization saves more memory but often breaks accuracy because temporary activations and activation outliers are sensitive. We need a practical quantization that reduces KV cache memory without the accuracy collapse of full activation quantization.
Main Contribution
WKVQuant: a PTQ framework designed to quantize weights and KV cache only.
Past-Only Quantization (POQ): keep current token KV in full precision, quantize only past cached KV during decode.
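Past-Only Quantization can be sketched as a decode step for a single attention head: attention reads the current token's key and value in full precision alongside the dequantized past cache, and the current key and value are quantized only when they are written to the cache. The class name, the min/max quantizer, and the single-head layout are assumptions made for illustration, not the paper's implementation.

```python
import math

def quantize(vec, bits=4):
    """Asymmetric min/max quantization of one vector (hypothetical helper)."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / step) for v in vec], step, lo

def dequantize(codes, step, lo):
    return [c * step + lo for c in codes]

def attend(query, keys, values):
    """Plain softmax attention for one query over lists of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

class PastOnlyKVCache:
    """Stores past K/V as low-bit codes; the current token stays full precision."""

    def __init__(self, bits=4):
        self.bits = bits
        self.keys = []
        self.values = []

    def decode_step(self, query, key, value):
        # Past entries are read back from their quantized form.
        past_k = [dequantize(*q) for q in self.keys]
        past_v = [dequantize(*q) for q in self.values]
        # The current token's K/V enter attention unquantized.
        out = attend(query, past_k + [key], past_v + [value])
        # Only now do they become "past" and get quantized for storage.
        self.keys.append(quantize(key, self.bits))
        self.values.append(quantize(value, self.bits))
        return out

cache = PastOnlyKVCache(bits=4)
first = cache.decode_step([1.0, 0.0], [0.3, -0.7], [0.25, 0.9])
second = cache.decode_step([0.5, 0.5], [0.1, 0.4], [-0.2, 0.6])
```

On the first step the cache is empty, so the output is the full-precision value vector itself. Quantization error only enters through tokens that are already in the past.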
Key Findings
WKVQuant (W4KV4) maintains long-input task performance close to full precision and weight-only quantization while far outperforming weight+activation (W4A4) on long-context tasks.
W4KV4 reduces decoding memory almost as much as weight+activation quantization.
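A back-of-envelope estimate shows where the memory goes. The sketch below assumes a nominal 13e9 parameters, the published LLaMA-2-13B shape (40 layers, hidden size 5120), and an illustrative batch of 1 with 2048 cached tokens. It ignores quantization metadata (scales, zero points), temporary activations, and framework overhead, so it is an approximation and not the paper's measurement.

```python
GB = 1e9

# Assumed model shape and workload (see the lead-in for what is ignored).
N_PARAMS = 13e9
N_LAYERS = 40
HIDDEN = 5120
SEQ_LEN = 2048
BATCH = 1

def weight_bytes(n_params, bits):
    return n_params * bits / 8

def kv_cache_bytes(n_layers, hidden, seq_len, batch, bits):
    # Two tensors (K and V) per layer, one hidden-sized vector per cached token.
    return 2 * n_layers * hidden * seq_len * batch * bits / 8

def total_gb(weight_bits, kv_bits):
    weights = weight_bytes(N_PARAMS, weight_bits)
    kv = kv_cache_bytes(N_LAYERS, HIDDEN, SEQ_LEN, BATCH, kv_bits)
    return (weights + kv) / GB

fp16_gb = total_gb(16, 16)   # full precision
w4_gb = total_gb(4, 16)      # weight-only 4-bit
w4kv4_gb = total_gb(4, 4)    # 4-bit weights and 4-bit KV cache

print(f"FP16:  {fp16_gb:.2f} GB")
print(f"W4:    {w4_gb:.2f} GB")
print(f"W4KV4: {w4kv4_gb:.2f} GB")
```

Under these assumptions the totals come to about 27.7 GB, 8.2 GB, and 6.9 GB. The KV term grows linearly with batch size and sequence length, so the gap between W4 and W4KV4 widens as either increases.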
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Longtext avg (LLaMA-2-13B) | WKVQuant W4KV4 32.52 | FP16 34.12 | −1.60 vs FP16 | LongBench (multi long-text datasets) | Table 2/3 | Table 2/3 |
| Perplexity on WikiText2 (LLaMA-2-13B) | WKVQuant W4KV4 5.00 ppl | FP16 4.88 ppl | +0.12 ppl | WikiText2 | Table 2 (WikiText2 ppl) | Table 2 |
What To Try In 7 Days
Run WKVQuant W4KV4 on one LLaMA-family checkpoint and compare memory vs FP16 and W4A4.
Enable Past-Only Quantization in decode to preserve accuracy and measure token-level output quality.
Calibrate quantization parameters on 128 samples of 2048 tokens each and use CRR with k=5; measure perplexity and LongBench scores.
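The cross-block idea behind CRR can be sketched on a toy model: instead of scoring a quantization parameter by the error at the output of the block being quantized, score it by the error that remains after the signal has passed through the following blocks. Everything here is an assumption made for illustration: the elementwise toy blocks, the grid search over a clipping value, the mean-absolute-difference loss, and the reading of k as the number of blocks spanned. The summary states only that the loss is downstream-aware and that k=5 is used.

```python
import math

def fake_quant(w, clip, bits=4):
    """Symmetric fake-quantization of one weight with range [-clip, clip]."""
    levels = (1 << (bits - 1)) - 1
    step = clip / levels
    code = max(-levels, min(levels, round(w / step)))
    return code * step

def run_blocks(blocks, xs):
    """Toy blocks: elementwise scale by the block's weights, then tanh."""
    for weights in blocks:
        xs = [math.tanh(w * x) for w, x in zip(weights, xs)]
    return xs

def mean_abs_diff(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cross_block_loss(blocks, i, k, clip, calib_inputs):
    """Error from quantizing block i, measured after blocks i .. i+k-1."""
    span = blocks[i:i + k]
    quant_span = [[fake_quant(w, clip) for w in span[0]]] + span[1:]
    total = 0.0
    for xs in calib_inputs:
        total += mean_abs_diff(run_blocks(span, xs), run_blocks(quant_span, xs))
    return total / len(calib_inputs)

def pick_clip(blocks, i, k, candidates, calib_inputs):
    """Grid search for the clipping value with the lowest loss."""
    return min(candidates,
               key=lambda c: cross_block_loss(blocks, i, k, c, calib_inputs))

blocks = [[0.9, -1.4, 0.3], [1.1, 0.7, -0.8], [0.5, 1.2, 0.6]]
calib_inputs = [[0.2, -0.5, 1.0], [1.5, 0.3, -0.7]]
candidates = [0.8, 1.0, 1.2, 1.4, 1.6]

local_clip = pick_clip(blocks, 0, 1, candidates, calib_inputs)  # block-local loss
crr_clip = pick_clip(blocks, 0, 3, candidates, calib_inputs)    # loss after 3 blocks
print(local_clip, crr_clip)
```

The toy uses three blocks and two calibration rows only to stay small. A real run would use the 128 calibration samples and k=5 from the item above, with gradient-based optimization in place of a grid search.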
Risks & Boundaries
Limitations
Does not quantize temporary activations, so in some prefill or very-large-batch cases memory use is higher than with full activation quantization.
Cannot fully leverage platform INT8 acceleration because not all tensors share the same low bit-width.
When Not To Use
You need maximal hardware INT8 throughput across all ops and cannot mix precisions.
You plan to quantize temporary activations despite known high sensitivity.
Failure Modes
Disabling POQ sharply reduces accuracy on long-context tasks (ablation shows large drop).
Token-wise outliers can still damage per-token quantization if group sizes are inappropriate.
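The group-size failure mode can be seen on a single token vector. The sketch below assumes an asymmetric 4-bit min/max quantizer applied independently to each group of channels within a token; the helper names and the toy values are hypothetical.

```python
def quant_dequant(vals, bits=4):
    """Quantize a group with its own min/max range, then dequantize."""
    lo, hi = min(vals), max(vals)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / step) * step + lo for v in vals]

def groupwise_roundtrip(token, group_size, bits=4):
    """Split one token vector into groups and quantize each separately."""
    out = []
    for start in range(0, len(token), group_size):
        out.extend(quant_dequant(token[start:start + group_size], bits))
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# One token with a single outlier in the last channel.
token = [0.1, -0.2, 0.3, 0.05, -0.15, 0.25, -0.3, 12.0]

err_whole = mean_abs_error(token, groupwise_roundtrip(token, group_size=8))
err_groups = mean_abs_error(token, groupwise_roundtrip(token, group_size=4))
print(err_whole, err_groups)
```

With one group spanning the whole token, the outlier stretches the quantization step for every channel. With groups of four, only the channels that share a group with the outlier are affected, which is why the choice of group size matters.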

