Cluster activation channels by range, reorder them, then quantize—cuts LLM activation memory up to ~80% while keeping accuracy near FP16

April 3, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

20

Authors

Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, Bingzhe Wu

Links

Abstract / PDF

Why It Matters For Business

RPTQ reduces the memory footprint of large language models by up to ~80%, enabling cheaper hosting, fewer GPUs per deployment, and longer context lengths for customers without retraining models.

Summary TLDR

RPTQ clusters activation channels that share similar numeric ranges, reorders them so channels in the same cluster are adjacent, and applies per-cluster static uniform quantization. The reorder is fused into layer‑norm writes and weight layouts to avoid runtime copies. On OPT models, the authors report reaching 3-bit activation quantization for the first time and cutting overall memory (weights + activations + KV cache) by ~73–80% on large configs, while keeping perplexity and zero-shot accuracy close to FP16 on the evaluated benchmarks.

Problem Statement

Activations in transformer LLMs vary widely across hidden-dimension channels. Per-tensor static quantization treats all channels the same and causes large errors. Existing fixes (outlier handling, per-channel scaling) do not solve the per-channel range differences efficiently for low-bit static PTQ.
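The effect described above can be illustrated with a small numerical sketch. It is not taken from the paper: the two channel ranges, the sample count, and the helper names `quantize_dequantize` and `mse` are all made up for illustration. It compares the error on a narrow-range channel when it shares one quantization range with a wide channel against the error when it has its own range.

```python
import random

def quantize_dequantize(values, lo, hi, bits):
    """Asymmetric uniform quantization over [lo, hi] with `bits` bits, then dequantize."""
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    out = []
    for v in values:
        clipped = min(max(v, lo), hi)
        q = round((clipped - lo) / scale)   # integer code in [0, levels]
        out.append(q * scale + lo)          # value the model would actually see
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
# Hypothetical activations: one narrow channel, one wide channel.
narrow = [random.uniform(-0.1, 0.1) for _ in range(1000)]
wide = [random.uniform(-20.0, 20.0) for _ in range(1000)]

# Per-tensor: both channels share a single range set by the wide channel.
shared_lo, shared_hi = min(narrow + wide), max(narrow + wide)
err_shared = mse(narrow, quantize_dequantize(narrow, shared_lo, shared_hi, bits=4))

# Range-aware: the narrow channel is quantized over its own range.
err_own = mse(narrow, quantize_dequantize(narrow, min(narrow), max(narrow), bits=4))
```

With a shared 4-bit range, the step size is far larger than the narrow channel's entire range, so every value in that channel collapses onto one or two codes. Giving similar-range channels their own quantization parameters removes that error source.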

Main Contribution

Identify that per-channel range differences (not only outliers) block low-bit activation PTQ in LLMs.

Propose RPTQ: cluster channels by their (min,max) range, reorder channels so cluster members are contiguous, then apply per-cluster static uniform quantization.

Eliminate runtime reorder cost by writing reordered outputs from layer‑norm and pre-reordering weight matrices so inference adds no extra memory copies.

Show RPTQ enables activation quantization down to 3 bits on OPT models and present KV-cache-only quantization variants (W4A4KV, W4A3KV, W3A3KV) to target the dominant memory consumer.
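The cluster, reorder, and quantize steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the hand-written `kmeans`, the helpers `rptq_plan` and `quantize_reordered`, and the toy calibration data are all hypothetical, and a real pipeline would operate on tensors rather than lists.

```python
def kmeans(points, k, iters=20):
    """Tiny Lloyd's k-means on 2-D points; returns one cluster label per point."""
    # Deterministic init: k points spread evenly over the sorted list.
    ordered = sorted(points)
    centers = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels

def rptq_plan(calib, k):
    """calib: list of token rows, each a list of channel values.
    Returns (perm, groups): perm lists original channel indices in reordered
    order; groups holds (start, end, lo, hi) for each contiguous cluster."""
    n_ch = len(calib[0])
    ranges = [(min(r[c] for r in calib), max(r[c] for r in calib)) for c in range(n_ch)]
    labels = kmeans(ranges, k)
    perm = sorted(range(n_ch), key=lambda c: labels[c])  # stable: groups by cluster
    groups, start = [], 0
    for lab in sorted(set(labels)):
        members = [c for c in perm if labels[c] == lab]
        lo = min(ranges[c][0] for c in members)
        hi = max(ranges[c][1] for c in members)
        groups.append((start, start + len(members), lo, hi))
        start += len(members)
    return perm, groups

def quantize_reordered(row, perm, groups, bits):
    """Reorder one activation row, then quantize-dequantize each cluster with its own range."""
    reordered = [row[c] for c in perm]
    levels = (1 << bits) - 1
    out = []
    for start, end, lo, hi in groups:
        scale = (hi - lo) / levels if hi > lo else 1.0
        for v in reordered[start:end]:
            q = round((min(max(v, lo), hi) - lo) / scale)
            out.append(q * scale + lo)
    return out

# Toy calibration set: channels 0 and 2 are narrow, channels 1 and 3 are wide.
calib = [[-0.1, -20.0, -0.2, -18.0], [0.1, 20.0, 0.2, 19.0]]
perm, groups = rptq_plan(calib, k=2)
out = quantize_reordered([0.05, 10.0, -0.1, -5.0], perm, groups, bits=4)
```

In this toy case the two wide channels end up adjacent, followed by the two narrow ones, and each contiguous block carries a single static (lo, hi) pair. The output stays in reordered channel order, which is why downstream weights must be permuted to match.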

Key Findings

RPTQ enables low-bit activation quantization with small accuracy loss on large OPT models.

Numbers: OPT-175b W4A8 perplexity loss <0.5; W4A4 loss <3 (evaluated datasets)

Targeting only key/value caches reduces memory dramatically with modest accuracy drop.

Numbers: W4A4KV reduces memory ~73%; W3A3KV reduces memory ~80% (evaluated configs)

Clustering channels reduces quantization error; using more clusters generally lowers perplexity.

Numbers: in the ablation, perplexity decreases as clusters increase; the authors used 32 clusters for R1/R4/R5 and 4 for R2/R3

Reorder cost can be removed at runtime by fusing with layer norm writes and pre-reordering weights.

Numbers: reorder fused into layer‑norm and weight layout so inference has 'zero overhead related to reordering'
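Why pre-reordering the weights makes the runtime reorder free follows from a simple identity: permuting the input channels of an activation and the matching rows of the next weight matrix leaves the product unchanged. The sketch below checks this on a toy example; `matvec`, the values, and the permutation are illustrative assumptions, not the paper's kernels.

```python
def matvec(x, W):
    """Compute y = x @ W for a vector x (length d_in) and W given as d_in rows of d_out values."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

x = [1.0, 2.0, 3.0]                          # one token's activations
W = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]    # next linear layer, d_in=3, d_out=2
perm = [2, 0, 1]                             # channel order chosen by clustering

x_reordered = [x[i] for i in perm]   # what a fused layer norm would write directly
W_reordered = [W[i] for i in perm]   # row permutation done once, offline

y = matvec(x, W)
y_reordered = matvec(x_reordered, W_reordered)
```

Both products give the same output vector, so the only remaining work is writing the layer-norm result in the permuted order, which costs no extra copy when it is folded into the write itself.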

Results

Perplexity (OPT-175b, WikiText2)

Value: FP16 8.34 vs W4A8 8.43

Baseline: FP16

Memory reduction (OPT family, typical configs)

Value: W4A8 ~63% reduction; W4A4 ~75% reduction

Baseline: W16A16 / FP16

KV-cache-only memory reduction (OPT-175b)

Value: W4A4KV ~73%; W3A3KV ~80%

Baseline: FP16 weights/activations

Perplexity gap vs FP16 (OPT-175b)

Value: W4A8 gap <0.5; W4A4 gap <3

Baseline: FP16
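For intuition about where memory reductions of this size come from, here is a back-of-the-envelope estimator. It is an idealized sketch, not the paper's accounting: it counts only weights and the KV cache, ignores scales, zero points, and any tensors left unquantized, and uses the commonly cited OPT-175b shape (96 layers, hidden size 12288) with an assumed batch size and sequence length.

```python
def memory_bytes(n_params, n_layers, hidden, seq_len, batch, weight_bits, kv_bits):
    """Weights plus KV cache, ignoring scales, zero points, and unquantized tensors."""
    weights = n_params * weight_bits / 8
    # Keys and values: one hidden-size vector each, per layer, per token.
    kv_cache = 2 * n_layers * hidden * seq_len * batch * kv_bits / 8
    return weights + kv_cache

# Assumed configuration; batch and seq_len are illustrative choices.
shape = dict(n_params=175e9, n_layers=96, hidden=12288, seq_len=2048, batch=16)

fp16 = memory_bytes(**shape, weight_bits=16, kv_bits=16)
w4a4kv = memory_bytes(**shape, weight_bits=4, kv_bits=4)
w4_kv16 = memory_bytes(**shape, weight_bits=4, kv_bits=16)

reduction_w4a4kv = 1 - w4a4kv / fp16      # weights and KV cache both at 4 bits
reduction_w4_only = 1 - w4_kv16 / fp16    # weights at 4 bits, KV cache left in FP16
```

In this simplified model, moving both weights and KV cache from 16 to 4 bits gives exactly a 75% reduction, which is an upper bound because of the overheads the model ignores. Leaving the KV cache in FP16 gives a noticeably smaller reduction at this batch size, which is the motivation for the KV-cache-only variants.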

Who Should Care

What To Try In 7 Days

Run calibration on your model with 256–512 representative samples and apply RPTQ with 8-bit and 4-bit activations to measure perplexity and end-to-end memory.

If KV cache dominates memory, try KV-only quantization (W4A4KV or W3A3KV) to get large wins quickly.

Fuse the reordering into layer‑norm output writes and pre-reorder linear weights offline, so that no runtime copies remain before production rollout.
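The calibration step amounts to recording each channel's minimum and maximum over the representative samples. A minimal sketch of such a tracker follows; the class name `RangeTracker` and the toy batches are hypothetical, and in practice the rows would come from hooks on the layers you intend to quantize.

```python
class RangeTracker:
    """Running per-channel min/max over calibration batches."""

    def __init__(self, n_channels):
        self.lo = [float("inf")] * n_channels
        self.hi = [float("-inf")] * n_channels

    def update(self, batch):
        """batch: list of token rows, each a list with one value per channel."""
        for row in batch:
            for c, v in enumerate(row):
                if v < self.lo[c]:
                    self.lo[c] = v
                if v > self.hi[c]:
                    self.hi[c] = v

    def ranges(self):
        """(min, max) per channel, the points a clustering step would consume."""
        return list(zip(self.lo, self.hi))

tracker = RangeTracker(n_channels=3)
tracker.update([[0.1, -5.0, 2.0], [-0.2, 7.0, 1.0]])   # first calibration batch
tracker.update([[0.05, 9.0, -3.0]])                    # second calibration batch
channel_ranges = tracker.ranges()
```

Because the ranges are static after calibration, any input whose activations fall outside them is clipped at inference time, which is one reason the calibration set needs to be representative.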

Optimization Features

Infra Optimization

  • reduces GPU memory need, enabling fewer GPUs per model shard

Model Optimization

  • weight quantization combined with GPTQ

System Optimization

  • reduce memory transfers between devices by lowering activation/KV cache size

Training Optimization

  • none (post-training method)

Inference Optimization

  • fuse reorder into layer‑norm writes
  • pre-reorder weight matrices to match activation order
  • KV-cache-only quantization

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires per-model calibration; small calibration sets can cause instability in some reorders (R2/R3).
  • 3-bit integer compute is not widely supported on current GPUs; runtime may cast low-bit integers to 4/8-bit and lose speed gains.
  • Cluster counts and calibration size must be tuned per-model; larger models need more clusters and more calibration data.

When Not To Use

  • If you cannot run a representative calibration set before deployment.
  • If your inference hardware lacks efficient low-bit integer arithmetic and you cannot accept casting overhead.
  • If you need dynamic quantization adapting to each input at runtime (RPTQ focuses on static PTQ).

Failure Modes

  • Insufficient calibration data causes cluster misassignment and spikes in perplexity for some layers.
  • Applying the reordering inconsistently misaligns channels in residual or projection paths and produces wrong outputs.
  • Hardware casting of unsupported low-bit types negates expected speed/memory benefits.

Core Entities

Models

  • OPT-1.3b
  • OPT-6.7b
  • OPT-13b
  • OPT-30b
  • OPT-66b
  • OPT-175b

Metrics

  • perplexity
  • accuracy
  • memory (GB) or percent reduction

Datasets

  • WikiText2
  • Penn Treebank
  • C4
  • LAMBADA
  • PIQA
  • ARC
  • OpenBookQA
  • BoolQ

Benchmarks

  • perplexity
  • accuracy