Cluster activation channels by range, reorder them, then quantize—cuts LLM activation memory up to ~80% while keeping accuracy near FP16

Overview

Decision Snapshot: Needs Validation

Method is practical (post-training, static quantization) and provides clear memory wins; sensitivity to calibration size, cluster tuning, and GPU integer support lowers turnkey readiness.

Citations: 20

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, Bingzhe Wu

Why It Matters For Business

RPTQ reduces memory footprint on large language models by up to ~80%, enabling cheaper hosting, fewer GPUs per deployment, and longer context lengths for customers without retraining models.

Who Should Care

ML Engineer, Engineering Lead, CTO, Founder, Data Scientist

Summary TLDR

RPTQ clusters activation channels that share similar numeric ranges, reorders them so channels in the same cluster are adjacent, and applies per-cluster static uniform quantization. The reordering is fused into layer‑norm writes and weight layouts to avoid runtime copies. On OPT models, RPTQ reaches 3-bit activation quantization, which the authors report as a first, and cuts overall memory (weights + activations + KV cache) by ~73–80% on large configs while keeping perplexity and zero-shot accuracy close to FP16 on the evaluated benchmarks.

Problem Statement

Activations in transformer LLMs vary widely across hidden-dimension channels. Per-tensor static quantization treats all channels the same and causes large errors. Existing fixes (outlier handling, per-channel scaling) do not solve the per-channel range differences efficiently for low-bit static PTQ.
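To make the range problem concrete, here is a minimal sketch that compares the rounding error on a narrow-range channel when it shares one quantization range with a wide-range channel and when it gets a range of its own. The channel values and the helper names `quantize_dequantize` and `mean_abs_error` are illustrative assumptions, not taken from the paper.

```python
def quantize_dequantize(values, lo, hi, bits):
    """Asymmetric uniform quantization onto [lo, hi], followed by dequantization."""
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    out = []
    for v in values:
        q = max(0, min(levels, round((v - lo) / scale)))
        out.append(q * scale + lo)
    return out


def mean_abs_error(reference, approx):
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


# Two hypothetical channels: one with a narrow range, one with a wide range.
narrow = [-0.10, -0.05, 0.00, 0.05, 0.10]
wide = [-40.0, -20.0, 0.0, 20.0, 40.0]

# Per-tensor style: both channels share a single (min, max) range.
shared_lo, shared_hi = min(narrow + wide), max(narrow + wide)
narrow_shared = quantize_dequantize(narrow, shared_lo, shared_hi, bits=4)

# Range-aware style: the narrow channel is quantized over its own range.
narrow_own = quantize_dequantize(narrow, min(narrow), max(narrow), bits=4)

err_shared = mean_abs_error(narrow, narrow_shared)
err_own = mean_abs_error(narrow, narrow_own)
print(f"shared range error: {err_shared:.4f}, own range error: {err_own:.4f}")
```

With a shared range, the 4-bit grid step is several units wide, so every value in the narrow channel collapses onto one or two grid points. That is the error source the problem statement describes.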

Main Contribution

Identify that per-channel range differences (not only outliers) block low-bit activation PTQ in LLMs.

Propose RPTQ: cluster channels by their (min,max) range, reorder channels so cluster members are contiguous, then apply per-cluster static uniform quantization.
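The cluster, reorder, and quantize steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the tiny K-means routine `kmeans_2d`, its deterministic initialization, the calibration values in `calib`, and the helper names are all hypothetical. The summary only states that channels are clustered by their (min, max) range.

```python
def kmeans_2d(points, k, iters=20):
    """Tiny K-means on 2-D points with a deterministic, evenly spaced init."""
    n = len(points)
    order = sorted(range(n), key=lambda i: points[i])
    centroids = [points[order[(2 * j + 1) * n // (2 * k)]] for j in range(k)]
    labels = [0] * n
    for _ in range(iters):
        # Assignment step: nearest centroid in (min, max) space.
        for i, (a, b) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (a - centroids[c][0]) ** 2 + (b - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(n) if labels[i] == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


def cluster_params(ranges, labels, bits):
    """One (lo, scale) pair per cluster, covering every member channel's range."""
    params = {}
    for c in set(labels):
        lo = min(r[0] for r, lab in zip(ranges, labels) if lab == c)
        hi = max(r[1] for r, lab in zip(ranges, labels) if lab == c)
        params[c] = (lo, (hi - lo) / ((1 << bits) - 1) or 1.0)
    return params


def quantize_reordered(x, perm, labels, params, bits):
    """Reorder x by perm, then quantize-dequantize each channel with its cluster's params."""
    levels = (1 << bits) - 1
    out = []
    for c in perm:
        lo, scale = params[labels[c]]
        q = max(0, min(levels, round((x[c] - lo) / scale)))
        out.append(q * scale + lo)
    return out


# calib[t][c]: activation of channel c at calibration token t (made-up values).
calib = [
    [0.1, -30.0, 0.2, 25.0, -0.1, 5.0],
    [-0.2, 40.0, 0.0, -35.0, 0.3, -4.0],
    [0.05, -10.0, -0.1, 15.0, 0.1, 6.0],
]
num_channels = len(calib[0])
ranges = [
    (min(row[c] for row in calib), max(row[c] for row in calib))
    for c in range(num_channels)
]
labels = kmeans_2d(ranges, k=3)
# Stable sort by cluster label makes members of each cluster contiguous.
perm = sorted(range(num_channels), key=lambda c: labels[c])
params = cluster_params(ranges, labels, bits=4)

x = calib[0]
x_hat = quantize_reordered(x, perm, labels, params, bits=4)
print("perm:", perm)
```

In this toy data the two wide channels, the three narrow channels, and the mid-range channel land in separate clusters, so each group gets a quantization step matched to its own range.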

Key Findings

RPTQ enables low-bit activation quantization with small accuracy loss on large OPT models.

Numbers: OPT-175b: W4A8 perplexity loss <0.5; W4A4 loss <3 (evaluated datasets)

Practical Use: You can run OPT-175b with 4- or 8-bit activations and expect near-FP16 perplexity on standard text benchmarks after calibration.

Evidence Ref: Abstract; Sec 5.2; Table 1

Quantizing only the key/value cache cuts memory sharply with a modest accuracy drop.

Numbers: W4A4KV reduces memory ~73%; W3A3KV reduces memory ~80% (evaluated configs)

Practical Use: If KV cache dominates your memory (long contexts or large batches), quantize only KV to get most of the memory win while preserving model outputs.

Evidence Ref: Abstract; Sec 5.3; Table 3
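A back-of-the-envelope calculation shows where reductions of this size come from. The sketch below counts only weights and the KV cache and ignores quantization scales, zero points, and other activations, which is presumably why the reported ~73% and ~80% sit slightly below the idealized figures. The layer count (96) and hidden size (12288) are the public OPT-175b shape; the batch size and sequence length are hypothetical, and `memory_gb` is an illustrative helper.

```python
def memory_gb(n_params, n_layers, hidden, batch, seq_len, weight_bits, kv_bits):
    """Rough memory estimate in GB: weights plus key/value cache only."""
    weight_bytes = n_params * weight_bits / 8
    # Factor 2: one key tensor and one value tensor per layer.
    kv_bytes = 2 * n_layers * batch * seq_len * hidden * kv_bits / 8
    return (weight_bytes + kv_bytes) / 1e9


# OPT-175b-like shape; batch and seq_len are assumptions for illustration.
shape = dict(n_params=175e9, n_layers=96, hidden=12288, batch=8, seq_len=2048)

fp16 = memory_gb(**shape, weight_bits=16, kv_bits=16)
w4kv4 = memory_gb(**shape, weight_bits=4, kv_bits=4)
w3kv3 = memory_gb(**shape, weight_bits=3, kv_bits=3)

reduction_w4 = 1 - w4kv4 / fp16
reduction_w3 = 1 - w3kv3 / fp16
print(f"FP16: {fp16:.1f} GB, W4A4KV: {w4kv4:.1f} GB, W3A3KV: {w3kv3:.1f} GB")
print(f"idealized reduction: {reduction_w4:.1%} (4-bit), {reduction_w3:.2%} (3-bit)")
```

Because the KV term scales with batch size and sequence length, its share of total memory grows with longer contexts, which is why KV-only quantization pays off most in those settings.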

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity (OPT-175b, WikiText2)	FP16 8.34 vs W4A8 8.43	FP16	+0.09	WikiText2	Table 1 (OPT-175b WIKI)	Table 1
Memory reduction (OPT family, typical configs)	W4A8 ~63% reduction; W4A4 ~75% reduction	W16A16 / FP16	—	Server batch/seq configs in Table 3	Table 3; Sec 5.3	Table 3

What To Try In 7 Days

Run calibration on your model with 256–512 representative samples, then apply RPTQ with 8- and 4-bit activations and measure perplexity and end-to-end memory

If KV cache dominates memory, try KV-only quantization (W4A4KV or W3A3KV) to get large wins quickly

Before production rollout, fuse the reordering into layer‑norm writes and pre-reorder linear weights so that no runtime copies are needed
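The weight pre-reordering in the last step relies on a simple identity: permuting the input channels and the matching weight columns with the same permutation leaves the layer output unchanged. A minimal sketch with made-up values follows; `matvec` and `reorder_columns` are illustrative helpers, and `perm` stands in for a cluster-derived channel order.

```python
def matvec(W, x):
    """Dense matrix-vector product: one output per row of W."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def reorder_columns(W, perm):
    """Permute the input-channel (column) axis of W offline."""
    return [[row[p] for p in perm] for row in W]


W = [
    [1.0, 2.0, 3.0, 4.0],
    [0.5, -1.0, 0.0, 2.0],
]
x = [10.0, -0.1, 20.0, 0.2]
perm = [0, 2, 1, 3]  # hypothetical order: wide channels first, then narrow ones

# At runtime the producer (e.g. layer norm) writes channels directly in perm order.
x_reordered = [x[p] for p in perm]
# Offline, the consumer's weights are permuted once to match.
W_reordered = reorder_columns(W, perm)

y_ref = matvec(W, x)
y_fused = matvec(W_reordered, x_reordered)
print(y_ref, y_fused)
```

Comparing `y_ref` and `y_fused` on a few real inputs is a cheap regression test to run after exporting reordered weights.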

Optimization Features

Infra Optimization

reduces GPU memory need, enabling fewer GPUs per model shard

Model Optimization

weights quantized with GPTQ, combined with RPTQ activation quantization

System Optimization

reduce memory transfers between devices by lowering activation/KV cache size

Training Optimization

none (post-training method)

Inference Optimization

fuse reorder into layer‑norm writes

pre-reorder weight matrices to match activation order

KV-cache-only quantization

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/hahnyuan/RPTQ4LLM

Risks & Boundaries

Limitations

Requires per-model calibration; small calibration sets can cause instability in some reorders (R2/R3).

3-bit integer compute is not widely supported on current GPUs; runtime may cast low-bit integers to 4/8-bit and lose speed gains.

When Not To Use

If you cannot run a representative calibration set before deployment.

If your inference hardware lacks efficient low-bit integer arithmetic and you cannot accept casting overhead.

Failure Modes

Insufficient calibration data causes cluster misassignment and spikes in perplexity for some layers.

Incorrectly applied reordering can misalign channels in residual or projection paths, producing wrong outputs.
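One way to probe the calibration-size failure mode is to measure how often held-out activations fall outside the calibrated per-channel ranges. The sketch below uses synthetic Gaussian channels; the data, the per-channel scales, and the helper names `channel_ranges` and `clip_rate` are assumptions for illustration. It is a diagnostic idea, not a procedure from the paper.

```python
import random


def channel_ranges(samples):
    """Per-channel (min, max) over a list of activation vectors."""
    n = len(samples[0])
    return [(min(s[c] for s in samples), max(s[c] for s in samples)) for c in range(n)]


def clip_rate(ranges, samples):
    """Fraction of values that fall outside the calibrated range of their channel."""
    total = clipped = 0
    for s in samples:
        for (lo, hi), v in zip(ranges, s):
            total += 1
            clipped += v < lo or v > hi
    return clipped / total


rng = random.Random(0)
channel_scales = [0.1, 1.0, 10.0, 50.0]  # hypothetical per-channel spreads


def draw():
    return [rng.gauss(0.0, s) for s in channel_scales]


calib_pool = [draw() for _ in range(512)]
held_out = [draw() for _ in range(512)]

rate_small = clip_rate(channel_ranges(calib_pool[:8]), held_out)
rate_large = clip_rate(channel_ranges(calib_pool), held_out)
print(f"clip rate with 8 samples: {rate_small:.3f}, with 512 samples: {rate_large:.3f}")
```

A clip rate that keeps dropping as samples are added suggests the calibration set is still too small for the range estimates, and therefore the cluster assignments, to have stabilized.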

Core Entities

Models

OPT-1.3b, OPT-6.7b, OPT-13b, OPT-30b, OPT-66b, OPT-175b

Metrics

perplexity, accuracy, memory (GB) or percent reduction

Datasets

WikiText2, Penn Treebank, C4, LAMBADA, PIQA, ARC, OpenBookQA, BoolQ

Benchmarks

perplexity, accuracy
