Overview
The method is practical (post-training, static quantization) and delivers clear memory savings, but its sensitivity to calibration size, cluster tuning, and GPU integer support lowers turnkey readiness.
Citations: 20
Evidence Strength: 0.60
Confidence: 0.78
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
RPTQ reduces the memory footprint of large language models by up to ~80% without retraining, which enables cheaper hosting, fewer GPUs per deployment, and longer context lengths for customers.
Summary TLDR
RPTQ clusters activation channels that share similar numeric ranges, reorders them so channels in the same cluster are adjacent, and applies per-cluster static uniform quantization. The reordering is fused into layer-norm writes and weight layouts to avoid runtime copies. On OPT models, the authors report that RPTQ is the first method to reach 3-bit activation quantization, and that it cuts overall memory (weights + activations + KV cache) by ~73–80% on large configurations while keeping perplexity and zero-shot accuracy close to FP16 on the evaluated benchmarks.
Problem Statement
Activations in transformer LLMs vary widely across hidden-dimension channels. Per-tensor static quantization treats all channels the same and causes large errors. Existing fixes (outlier handling, per-channel scaling) do not solve the per-channel range differences efficiently for low-bit static PTQ.
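The effect of a shared range can be illustrated with a minimal sketch. The two channels, their values, and the helper names below are hypothetical and chosen only to show the mechanism; they are not taken from the paper.

```python
def quantize_dequantize(values, lo, hi, bits):
    """Uniform quantization to `bits` bits over [lo, hi], followed by dequantization."""
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    out = []
    for v in values:
        clipped = min(max(v, lo), hi)
        q = round((clipped - lo) / scale)  # integer code in [0, levels]
        out.append(lo + q * scale)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical activations: one narrow-range channel, one wide-range channel.
narrow = [-0.10, -0.05, 0.00, 0.05, 0.10]
wide = [-40.0, -20.0, 0.0, 20.0, 40.0]
bits = 4

# Per-tensor: a single (lo, hi) is shared by every channel.
shared_lo, shared_hi = min(narrow + wide), max(narrow + wide)
err_shared = mean_abs_error(
    narrow, quantize_dequantize(narrow, shared_lo, shared_hi, bits)
)

# Range matched to the channel itself.
err_own = mean_abs_error(
    narrow, quantize_dequantize(narrow, min(narrow), max(narrow), bits)
)

print(f"narrow channel error, shared range: {err_shared:.4f}")
print(f"narrow channel error, own range:    {err_own:.4f}")
```

With the shared range, the quantization step is set by the wide channel, so every value in the narrow channel falls onto one of two neighbouring levels and the error exceeds the channel's entire range.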
Main Contribution
Identify that per-channel range differences (not only outliers) block low-bit activation PTQ in LLMs.
Propose RPTQ: cluster channels by their (min,max) range, reorder channels so cluster members are contiguous, then apply per-cluster static uniform quantization.
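The cluster, reorder, and per-cluster range steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the k-means routine, its deterministic initialisation, the `rptq_plan` helper, and the choice to take each cluster's range as the min of mins and max of maxes are all assumptions made for the example.

```python
def kmeans_2d(points, k, iters=20):
    """Plain Lloyd's k-means on 2-D points with a deterministic init."""
    ordered = sorted(points)
    # Seed centers with evenly spaced points from the sorted list.
    centers = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


def rptq_plan(calib, k):
    """calib: list of calibration samples, each a list of per-channel activations."""
    n_ch = len(calib[0])
    # Step 1: per-channel (min, max) over the calibration set.
    ranges = [(min(s[c] for s in calib), max(s[c] for s in calib)) for c in range(n_ch)]
    # Step 2: cluster channels by their (min, max) point.
    labels = kmeans_2d(ranges, k)
    # Step 3: permutation that makes members of each cluster contiguous.
    perm = sorted(range(n_ch), key=lambda c: (labels[c], c))
    # Step 4: one static range per cluster (assumed: min of mins, max of maxes).
    cluster_ranges = {}
    for c in range(n_ch):
        lo, hi = ranges[c]
        cur = cluster_ranges.get(labels[c])
        cluster_ranges[labels[c]] = (
            (lo, hi) if cur is None else (min(cur[0], lo), max(cur[1], hi))
        )
    return perm, labels, cluster_ranges


# Hypothetical calibration data: channels 0 and 2 are wide, 1 and 3 are narrow.
calib = [
    [-30.0, -0.1, -28.0, -0.2],
    [32.0, 0.1, 29.0, 0.15],
    [5.0, 0.0, -3.0, 0.05],
]
perm, labels, cluster_ranges = rptq_plan(calib, k=2)
reordered = [[sample[c] for c in perm] for sample in calib]
print(perm, cluster_ranges)
```

After reordering, each cluster occupies a contiguous slice of the hidden dimension, so a single static scale and zero point can be applied per slice.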
Key Findings
RPTQ enables low-bit activation quantization with small accuracy loss on large OPT models.
Targeting only key/value caches reduces memory dramatically with modest accuracy drop.
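A back-of-the-envelope estimate shows why the KV cache is an attractive target. The function below is a rough sketch: it counts only the cached key and value elements and ignores the storage for scales and zero points, and the layer count, hidden size, batch size, and sequence length are illustrative values rather than configurations from the paper's tables.

```python
def kv_cache_bytes(n_layers, hidden, batch, seq_len, bits):
    """Approximate KV cache size: keys and values for every layer and token."""
    elements = 2 * n_layers * batch * seq_len * hidden  # 2 = keys + values
    return elements * bits / 8


# Illustrative configuration (assumed, not taken from the paper).
cfg = dict(n_layers=96, hidden=12288, batch=1, seq_len=2048)

fp16 = kv_cache_bytes(bits=16, **cfg)
int4 = kv_cache_bytes(bits=4, **cfg)
int3 = kv_cache_bytes(bits=3, **cfg)

gib = 1024 ** 3
print(f"FP16 KV cache:  {fp16 / gib:.2f} GiB")
print(f"4-bit KV cache: {int4 / gib:.2f} GiB ({1 - int4 / fp16:.1%} smaller)")
print(f"3-bit KV cache: {int3 / gib:.2f} GiB ({1 - int3 / fp16:.1%} smaller)")
```

Because the cache grows linearly with batch size and sequence length, the absolute saving grows with longer contexts even though the ratio stays fixed.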
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Perplexity (OPT-175b, WikiText2) | 8.43 (W4A8) | 8.34 (FP16) | +0.09 | WikiText2 | Table 1 (OPT-175b WIKI) | Table 1 |
| Memory reduction (OPT family, typical configs) | W4A8 ~63% reduction; W4A4 ~75% reduction | W16A16 / FP16 | — | Server batch/seq configs in Table 3 | Table 3; Sec 5.3 | Table 3 |
What To Try In 7 Days
Run calibration on your model with 256–512 representative samples, then apply RPTQ with 8-bit and 4-bit activations and measure perplexity and end-to-end memory.
If the KV cache dominates memory, try KV-only quantization (W4A4KV or W3A3KV) to get large savings quickly.
Before production rollout, fuse the reordering into layer-norm writes and pre-reorder linear weights offline to avoid runtime copies.
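The reason the reorder can be fused without changing results is that permuting a linear layer's input channels and permuting the matching weight rows cancel out. The following is a minimal sketch with hypothetical values; `matvec` is a helper written for the example, and the weight is assumed to be stored as (in_features, out_features).

```python
def matvec(x, w):
    """y[j] = sum_i x[i] * w[i][j], with w shaped (in_features, out_features)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


# Hypothetical activation vector and weight matrix.
x = [1.0, 2.0, 3.0, 4.0]
w = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, -1.0],
    [0.5, 0.5],
]
perm = [0, 2, 1, 3]  # channel order produced by the clustering step

# In a fused deployment the layer norm would write its output in this order
# directly, so no separate gather is needed at runtime.
x_reordered = [x[i] for i in perm]
# The weight rows are permuted once, offline.
w_reordered = [w[i] for i in perm]

y_ref = matvec(x, w)
y_fused = matvec(x_reordered, w_reordered)
print(y_ref, y_fused)
```

The same check is a useful regression test when wiring the permutation through residual and projection paths: any mismatch between the two outputs points to a misaligned channel order.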
Risks & Boundaries
Limitations
Requires per-model calibration; small calibration sets can cause instability in some reorders (R2/R3).
3-bit integer compute is not widely supported on current GPUs; runtime may cast low-bit integers to 4/8-bit and lose speed gains.
When Not To Use
If you cannot run a representative calibration set before deployment.
If your inference hardware lacks efficient low-bit integer arithmetic and you cannot accept casting overhead.
Failure Modes
Insufficient calibration data causes cluster misassignment and spikes in perplexity for some layers.
Incorrectly applied reordering misaligns channels in residual or projection paths and produces wrong outputs.

