Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
20
Why It Matters For Business
RPTQ reduces the memory footprint of large language models by up to about 80%, enabling cheaper hosting, fewer GPUs per deployment, and longer context lengths for customers, all without retraining the model.
Summary TLDR
RPTQ clusters activation channels that share similar numeric ranges, reorders them so channels in the same cluster are adjacent, and applies per-cluster static uniform quantization. The reorder is fused into layer‑norm writes and weight layouts to avoid runtime copies. On OPT models, the work reports 3-bit activation quantization for the first time and cuts overall memory (weights + activations + KV cache) by roughly 73–80% on large configurations, while keeping perplexity and zero-shot accuracy close to FP16 on the evaluated benchmarks.
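The cluster-then-reorder step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the deterministic centroid initialisation, and the six channel ranges are all assumptions made for the example.

```python
# Sketch only: tiny k-means over per-channel (min, max) pairs, then a
# permutation that places channels of the same cluster next to each other.

def kmeans_ranges(ranges, k, iters=20):
    """Cluster (min, max) points; returns one cluster label per channel."""
    # Assumption: deterministic init from evenly spaced channels sorted by range width.
    order = sorted(range(len(ranges)), key=lambda c: ranges[c][1] - ranges[c][0])
    step = max(1, len(order) // k)
    centroids = [ranges[order[min(i * step, len(order) - 1)]] for i in range(k)]
    labels = [0] * len(ranges)
    for _ in range(iters):
        # Assign each channel to the nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda j: (lo - centroids[j][0]) ** 2 + (hi - centroids[j][1]) ** 2)
            for lo, hi in ranges
        ]
        # Move each centroid to the mean of its members; keep it unchanged if empty.
        for j in range(k):
            members = [ranges[c] for c, label in enumerate(labels) if label == j]
            if members:
                centroids[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels

def reorder_permutation(labels):
    """Channel indices sorted by cluster label (stable within a cluster)."""
    return sorted(range(len(labels)), key=lambda c: labels[c])

# Hypothetical calibration ranges for six channels: three narrow, three wide.
channel_ranges = [(-0.1, 0.1), (-8.0, 9.0), (-0.2, 0.15),
                  (-7.5, 8.5), (-0.12, 0.09), (-9.0, 8.0)]
labels = kmeans_ranges(channel_ranges, k=2)
perm = reorder_permutation(labels)
```

With these made-up ranges the narrow channels (0, 2, 4) end up contiguous, followed by the wide ones (1, 3, 5), so each cluster can then be quantized with its own scale and zero point.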
Problem Statement
Activations in transformer LLMs vary widely across hidden-dimension channels. Per-tensor static quantization applies a single range to every channel, so it either wastes resolution on narrow-range channels or clips wide-range ones, and the resulting errors are large. Existing fixes (outlier handling, per-channel scaling) do not resolve these per-channel range differences efficiently for low-bit static PTQ.
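A small numeric sketch makes the problem concrete. It assumes plain asymmetric uniform quantization and two made-up channels; the helper names and values are illustrative and do not come from the paper.

```python
def quantize_dequantize(values, lo, hi, bits):
    """Asymmetric uniform quantization to `bits` bits over [lo, hi], then back to float."""
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    out = []
    for v in values:
        q = round((v - lo) / scale)
        q = max(0, min(levels, q))  # clamp to the representable integer range
        out.append(q * scale + lo)
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical activations: one narrow channel and one wide channel.
narrow = [-0.10, -0.05, 0.00, 0.04, 0.10]
wide = [-8.0, -3.0, 0.5, 4.0, 9.0]

# Per-tensor: both channels share the range of the widest channel.
lo, hi = min(narrow + wide), max(narrow + wide)
shared = quantize_dequantize(narrow, lo, hi, bits=4)

# Range-aware: the narrow channel gets a range fitted to its own values.
own = quantize_dequantize(narrow, min(narrow), max(narrow), bits=4)

err_shared = mean_abs_error(narrow, shared)
err_own = mean_abs_error(narrow, own)
```

Under the shared range, every value of the narrow channel lands on the same 4-bit level, so the channel's information is lost entirely; with its own range, the error stays within half a quantization step.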
Main Contribution
Identify that per-channel range differences (not only outliers) block low-bit activation PTQ in LLMs.
Propose RPTQ: cluster channels by their (min,max) range, reorder channels so cluster members are contiguous, then apply per-cluster static uniform quantization.
Eliminate runtime reorder cost by writing reordered outputs from layer‑norm and pre-reordering weight matrices so inference adds no extra memory copies.
Show RPTQ enables activation quantization down to 3 bits on OPT models and present KV-cache-only quantization variants (W4A4KV, W4A3KV, W3A3KV) to target the dominant memory consumer.
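The claim that reordering adds no runtime cost relies on a simple identity: permuting the columns of the activations and the rows of the following weight matrix with the same permutation leaves the matrix product unchanged. The sketch below checks this on toy data; the matrices and the permutation are hypothetical, and the plain-Python matmul stands in for a real kernel.

```python
def matmul(x, w):
    """Plain-Python matrix product: x is (n, d), w is (d, m)."""
    return [[sum(row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
            for row in x]

def reorder_columns(x, perm):
    """Reorder activation channels (columns) according to perm."""
    return [[row[p] for p in perm] for row in x]

def reorder_rows(w, perm):
    """Reorder weight input channels (rows) according to the same perm."""
    return [w[p] for p in perm]

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # activations: 2 tokens, 3 channels
w = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]   # weights: 3 input channels, 2 outputs
perm = [2, 0, 1]                             # hypothetical cluster-sorted channel order

baseline = matmul(x, w)
# In deployment the weight rows would be permuted once, offline, and the
# producer (e.g. layer norm) would write its output already in `perm` order.
fused = matmul(reorder_columns(x, perm), reorder_rows(w, perm))
```

Because the weight permutation happens offline and the activation permutation is folded into the write that already occurs, no extra copy is needed at inference time.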
Key Findings
RPTQ enables low-bit activation quantization with small accuracy loss on large OPT models.
Targeting only key/value caches reduces memory dramatically with modest accuracy drop.
Clustering channels reduces quantization error; using more clusters generally lowers perplexity.
Reorder cost can be removed at runtime by fusing with layer norm writes and pre-reordering weights.
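The KV-cache finding can be sanity-checked with back-of-the-envelope arithmetic. The sketch below uses the published OPT-175b shape (96 layers, hidden size 12288); the batch size and sequence length are hypothetical, and quantization metadata (scales, zero points) is ignored, so the figures describe the KV cache alone and are not the paper's reported overall reductions.

```python
def kv_cache_bytes(layers, hidden, seq_len, batch, bits):
    """Bytes for keys and values across all layers (ignores quantization metadata)."""
    return 2 * layers * hidden * seq_len * batch * bits // 8

# OPT-175b shape; batch and sequence length are assumptions for illustration.
fp16 = kv_cache_bytes(96, 12288, seq_len=2048, batch=8, bits=16)
int4 = kv_cache_bytes(96, 12288, seq_len=2048, batch=8, bits=4)
int3 = kv_cache_bytes(96, 12288, seq_len=2048, batch=8, bits=3)

fp16_gib = fp16 / 1024 ** 3     # size of the FP16 KV cache in GiB
saving_4bit = 1 - int4 / fp16   # fraction of the KV cache saved at 4 bits
saving_3bit = 1 - int3 / fp16   # fraction of the KV cache saved at 3 bits
```

Under these assumptions the FP16 cache occupies 72 GiB, and 4-bit or 3-bit storage removes 75% or 81.25% of it, respectively.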
Results
Perplexity (OPT-175b, WikiText2)
Memory reduction (OPT family, typical configs)
KV-cache-only memory reduction (OPT-175b)
Perplexity gap vs FP16 (OPT-175b)
Who Should Care
What To Try In 7 Days
Run calibration on your model with 256–512 representative samples, then apply RPTQ with 8-bit and 4-bit activations and measure perplexity and end-to-end memory.
If the KV cache dominates memory, try KV-only quantization (W4A4KV or W3A3KV) for large savings with little effort.
Before production rollout, fuse the reorder into layer‑norm writes and pre-reorder linear weights so that inference performs no runtime copies.
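The calibration step amounts to tracking a running minimum and maximum per channel across the calibration samples. Here is a minimal sketch, assuming the activations have already been captured as lists of token rows; in a real model they would come from forward hooks, and the function names and toy batches are hypothetical.

```python
def update_channel_ranges(ranges, activations):
    """Fold one calibration batch (rows = tokens, columns = channels) into running (min, max)."""
    for row in activations:
        for c, v in enumerate(row):
            lo, hi = ranges[c]
            ranges[c] = (min(lo, v), max(hi, v))
    return ranges

def calibrate(batches, num_channels):
    """Return the per-channel (min, max) observed over all calibration batches."""
    ranges = [(float("inf"), float("-inf"))] * num_channels
    for batch in batches:
        update_channel_ranges(ranges, batch)
    return ranges

# Two toy batches with two channels; a real run would use 256-512 samples.
calibration_batches = [
    [[0.1, -2.0], [-0.3, 5.0]],
    [[0.2, 7.0]],
]
channel_ranges = calibrate(calibration_batches, num_channels=2)
```

The resulting (min, max) pairs are the points that get clustered, which is why an unrepresentative calibration set translates directly into poor cluster assignments.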
Optimization Features
Infra Optimization
- reduces GPU memory need, enabling fewer GPUs per model shard
Model Optimization
- weight quantization combined with GPTQ
System Optimization
- reduce memory transfers between devices by lowering activation/KV cache size
Training Optimization
- none (post-training method)
Inference Optimization
- fuse reorder into layer‑norm writes
- pre-reorder weight matrices to match activation order
- KV-cache-only quantization
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires per-model calibration; small calibration sets can cause instability in some reorders (R2/R3).
- 3-bit integer compute is not widely supported on current GPUs; runtime may cast low-bit integers to 4/8-bit and lose speed gains.
- Cluster counts and calibration size must be tuned per-model; larger models need more clusters and more calibration data.
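To illustrate the 3-bit limitation: values can still be stored packed at 3 bits each, but they must be widened to a supported integer type before arithmetic. The following is a sketch of one possible packing scheme, with an assumed little-endian bit layout; it is not taken from the paper or from any particular runtime.

```python
def pack_3bit(values):
    """Pack integers in [0, 7] into bytes, 3 bits each, little-endian bit order."""
    acc, nbits, out = 0, 0, bytearray()
    for v in values:
        if not 0 <= v <= 7:
            raise ValueError("value does not fit in 3 bits")
        acc |= v << nbits
        nbits += 3
        while nbits >= 8:           # flush every complete byte
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                       # flush the trailing partial byte
        out.append(acc & 0xFF)
    return bytes(out)

def unpack_3bit(data, count):
    """Widen packed 3-bit codes back to ordinary integers before compute."""
    acc, nbits, out = 0, 0, []
    for b in data:
        acc |= b << nbits
        nbits += 8
        while nbits >= 3 and len(out) < count:
            out.append(acc & 0b111)
            acc >>= 3
            nbits -= 3
    return out

codes = [0, 7, 3, 5, 1, 6, 2, 4]    # hypothetical 3-bit quantized values
packed = pack_3bit(codes)
restored = unpack_3bit(packed, len(codes))
```

Eight codes fit in three bytes, so the storage saving survives; the unpack step is the casting overhead the limitation refers to.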
When Not To Use
- If you cannot run a representative calibration set before deployment.
- If your inference hardware lacks efficient low-bit integer arithmetic and you cannot accept casting overhead.
- If you need dynamic quantization adapting to each input at runtime (RPTQ focuses on static PTQ).
Failure Modes
- Insufficient calibration data causes cluster misassignment and spikes in perplexity for some layers.
- Incorrectly applied reordering misaligns channels in residual or projection paths and produces wrong outputs.
- Hardware casting of unsupported low-bit types negates expected speed/memory benefits.
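A cheap guard against the channel-misalignment failure mode is to validate each permutation and confirm that its inverse restores the original channel order before merging with an unpermuted path. The helpers below are a hypothetical sketch of such a check.

```python
def is_valid_permutation(perm):
    """True if perm contains every index 0..n-1 exactly once."""
    return sorted(perm) == list(range(len(perm)))

def inverse_permutation(perm):
    """Return inv such that reordering by perm and then by inv is the identity."""
    inv = [0] * len(perm)
    for new_pos, old_idx in enumerate(perm):
        inv[old_idx] = new_pos
    return inv

channels = [10.0, 20.0, 30.0, 40.0]   # toy activation vector in original order
perm = [2, 0, 3, 1]                   # hypothetical cluster-sorted order

reordered = [channels[p] for p in perm]
inv = inverse_permutation(perm)
restored = [reordered[i] for i in inv]
```

Running a check like this per layer during export would flag a mismatched permutation before it silently corrupts outputs.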
Core Entities
Models
- OPT-1.3b
- OPT-6.7b
- OPT-13b
- OPT-30b
- OPT-66b
- OPT-175b
Metrics
- perplexity
- accuracy
- memory (GB) or percent reduction
Datasets
- WikiText2
- Penn Treebank
- C4
- LAMBADA
- PIQA
- ARC
- OpenBookQA
- BoolQ
Benchmarks
- perplexity
- accuracy

