OmniQuant: learnable clipping and equivalent transforms give PTQ QAT-like quality for very low-bit LLM quantization

August 25, 2023 · 8 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

13

Authors

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo

Why It Matters For Business

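OmniQuant lets teams quantize large models to very low-bit formats within PTQ-level data and time budgets. In the reported LLaMA-7B W4A16g128 setup, weight memory falls from 12.6G to 3.8G and throughput nearly doubles (69.2 to 134.2 tokens/s), and the quantized model needs no extra runtime operations compared with a standard quantized model.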

Summary TLDR

OmniQuant is a post-training quantization (PTQ) pipeline that learns a small set of quantization parameters via gradient-based, block-wise optimization while keeping the full-precision weights frozen. It adds two components: Learnable Weight Clipping (LWC), which optimizes the clipping range used to quantize weights, and Learnable Equivalent Transformation (LET), which moves quantization difficulty from activations to weights. With 128 calibration samples on a single A100 GPU, OmniQuant quantizes LLaMA/OPT/Falcon models (7B–180B) to low-bit formats (W4A4, W2A16, W6A6, W3A16). The reported results show large perplexity and zero-shot accuracy gains over prior PTQ baselines, faster inference, and much smaller weight storage, with no added runtime cost.
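
The clipping idea can be illustrated with a minimal sketch. It assumes a standard asymmetric uniform quantizer whose range is shrunk by two clipping strengths passed through a sigmoid; the function name, the `gamma_logit` and `beta_logit` parameters, and the toy row are hypothetical, and the strengths are fixed by hand here rather than learned by gradient descent as in the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lwc_fake_quantize(row, n_bits, gamma_logit=None, beta_logit=None):
    """Quantize-dequantize one weight row with optional clipping.

    gamma_logit and beta_logit are hypothetical names for the learnable
    scalars; sigmoid keeps the clipping strengths in (0, 1). None disables
    clipping, which gives plain min-max quantization.
    """
    gamma = 1.0 if gamma_logit is None else sigmoid(gamma_logit)
    beta = 1.0 if beta_logit is None else sigmoid(beta_logit)
    hi = gamma * max(row)          # clipped upper bound
    lo = beta * min(row)           # clipped lower bound
    levels = 2 ** n_bits - 1
    step = (hi - lo) / levels      # assumes hi > lo
    zero_point = -round(lo / step)
    out = []
    for w in row:
        q = min(max(round(w / step) + zero_point, 0), levels)  # integer code
        out.append((q - zero_point) * step)                    # back to float
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy row: many small weights plus one larger value.
small = [-0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4]
row = small * 3 + [1.2]

minmax_err = mse(row, lwc_fake_quantize(row, n_bits=2))
clipped_err = mse(row, lwc_fake_quantize(row, n_bits=2, gamma_logit=0.0))  # gamma = 0.5
print(minmax_err, clipped_err)
```

On this toy row, halving the upper bound saturates the single large weight but gives the many small weights a finer grid, so the overall error drops. The sketch scores weight error for simplicity; the pipeline described above tunes its parameters against block outputs instead.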

Problem Statement

Hand-crafted PTQ rules fail at very low bits (e.g., W4A4 or W2A16). Quantization-aware training (QAT) recovers accuracy but is expensive in GPU hours and data. Can we get QAT-level accuracy while keeping PTQ's time and data efficiency?

Main Contribution

A differentiable, block-wise PTQ pipeline (OmniQuant) that optimizes a small set of learnable quantization parameters instead of tuning all weights.

Learnable Weight Clipping (LWC) that adapts clipping strengths to reduce weight quantization error.

Learnable Equivalent Transformation (LET) that applies learned channel-wise scaling and shifting (in linear layers and in attention) to move the difficulty of activation outliers into the weights. The learned LWC and LET parameters both fuse into the final weights, so no runtime cost is added.
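
Why a channel-wise scale and shift can be absorbed without changing the layer output is easiest to see on a toy linear layer. The sketch below assumes the usual identity Y = XW + B = ((X - shift) / scale)(scale * W) + (B + shift * W); the names `equivalent_transform`, `scale`, and `shift` are hypothetical, and the values are fixed by hand rather than learned.

```python
def linear(x, w, b):
    """y[t][j] = sum_i x[t][i] * w[i][j] + b[j], with nested lists."""
    return [[sum(xi * w[i][j] for i, xi in enumerate(tok)) + b[j]
             for j in range(len(b))] for tok in x]

def equivalent_transform(x, w, b, scale, shift):
    """Move per-channel scale/shift from the activations into weights and bias."""
    # Activations: subtract the shift, then divide by the scale (per input channel).
    x_t = [[(xi - shift[i]) / scale[i] for i, xi in enumerate(tok)] for tok in x]
    # Weights: multiply row i (input channel i) by the same scale.
    w_t = [[scale[i] * wij for wij in w_row] for i, w_row in enumerate(w)]
    # Bias: absorb the contribution of the removed shift.
    b_t = [b[j] + sum(shift[i] * w[i][j] for i in range(len(w)))
           for j in range(len(b))]
    return x_t, w_t, b_t

# Two tokens, three input channels; channel 1 carries an outlier around 42.
x = [[0.5, 40.0, -0.25], [0.25, 44.0, 0.5]]
w = [[0.5, -1.0], [0.25, 0.5], [-0.5, 1.0]]
b = [0.1, -0.2]
scale = [1.0, 8.0, 1.0]
shift = [0.0, 42.0, 0.0]

x_t, w_t, b_t = equivalent_transform(x, w, b, scale, shift)
y_ref = linear(x, w, b)
y_let = linear(x_t, w_t, b_t)
print(y_ref, y_let)
```

After the transform the outlier channel holds values of magnitude 0.25 instead of about 40, so the activations are much easier to quantize, while the layer output is unchanged up to floating-point rounding. In a real model the activation-side operation would itself be folded into a preceding layer, which is what keeps the runtime cost at zero.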

Key Findings

OmniQuant turns catastrophic W2A16 degradation into usable models.

Numbers: LLaMA-13B W2A16 perplexity 13.21 vs GPTQ 3832 (paper text)

Large average zero-shot accuracy gains at aggressive W4A4 weight-activation quantization.

Numbers: Average accuracy improved by +4.99% to +11.80% across models at W4A4 (Section 4.3, Table 2)

OmniQuant is practical on a single GPU with tiny calibration sets.

Numbers: Uses 128 samples; LLaMA-7B weight-only 1.1h, weight-activation 1.6h on one A100 (Table A12)

Quantized models reduce weight memory and accelerate inference in practice.

Numbers: LLaMA-7B weight memory 12.6G→3.8G and tokens/s 69.2→134.2 for W4A16g128 (Table 3)

Results

Perplexity (generation)

Value: LLaMA-13B W2A16 PPL=13.21 (OmniQuant)

Baseline: GPTQ PPL=3832 (reported in text)
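
For readers less familiar with the metric, perplexity is the exponential of the average per-token negative log-likelihood. The helper below is a generic sketch of that definition, not the paper's evaluation script; `perplexity` and the toy inputs are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 1/4 to every observed token.
uniform_4 = [math.log(0.25)] * 10
ppl = perplexity(uniform_4)
print(ppl)

# Per-token negative log-likelihood implied by the two reported perplexities.
nll_omniquant = math.log(13.21)
nll_gptq = math.log(3832)
print(nll_omniquant, nll_gptq)
```

Read this way, a perplexity of 13.21 corresponds to about 2.58 nats per token, while 3832 corresponds to about 8.25 nats per token, which is why the GPTQ W2A16 result counts as a collapse rather than a moderate regression.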

Accuracy

Value: LLaMA-7B W4A4 avg=52.65% (OmniQuant)

Baseline: SmoothQuant avg=38.41%

Throughput (tokens/s) and weight memory

Value: LLaMA-7B FP16 at 69.2 tokens/s with 12.6G weight memory (WM) → W4A16g128 at 134.2 tokens/s with 3.8G WM

Baseline: FP16
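
The ratios implied by these figures can be checked with a few lines of arithmetic. The storage estimate at the end is a back-of-envelope sketch that assumes one 16-bit scale and one 4-bit zero point per group of 128 weights; that layout is an assumption for illustration, not something stated in the summary.

```python
# Reported LLaMA-7B numbers (FP16 vs W4A16g128).
fp16_tps, w4_tps = 69.2, 134.2
fp16_gb, w4_gb = 12.6, 3.8

speedup = w4_tps / fp16_tps        # throughput ratio
compression = fp16_gb / w4_gb      # weight-memory ratio

# Assumed storage layout: 4-bit codes plus per-group metadata.
weight_bits = 4
group_size = 128
scale_bits, zero_bits = 16, 4      # assumption, see lead-in
bits_per_weight = weight_bits + (scale_bits + zero_bits) / group_size
ideal_compression = 16 / bits_per_weight

print(round(speedup, 2), round(compression, 2))
print(bits_per_weight, round(ideal_compression, 2))
```

Under these assumptions the idealized ratio is higher than the measured one. The summary does not explain the gap; one plausible reason, stated here only as a guess, is that some tensors are kept at higher precision.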

Calibration/training cost

Value: 128 samples, 20 epochs (40 for W2A16); LLaMA-7B weight-only 1.1h, weight-activation 1.6h on a single A100

Baseline: QAT requires hundreds of GPU hours; GPTQ completes some models in ~1h

Who Should Care

What To Try In 7 Days

Run OmniQuant on a 7B model with 128 calibration samples to measure memory and tokens/s gains.

Try W4A16g128 weight-only quantization first to get large memory savings and speedups with minimal accuracy drop.

If you must lower activation bits, run W4A4 with OmniQuant and evaluate zero-shot accuracy on key tasks before full deployment.
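
To compare tokens/s before and after quantization, a small timing harness is usually enough. The sketch below is generic: `generate_fn` stands for whatever call produces `n_new_tokens` tokens in your serving stack, and both names are hypothetical placeholders rather than part of OmniQuant or MLC-LLM.

```python
import time

def tokens_per_second(generate_fn, prompt, n_new_tokens, warmup=1, repeats=3):
    """Best-of-N decoding throughput for a callable that generates tokens."""
    for _ in range(warmup):                 # warm caches before timing
        generate_fn(prompt, n_new_tokens)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        generate_fn(prompt, n_new_tokens)
        best = min(best, time.perf_counter() - start)
    return n_new_tokens / best

# Stand-in generator so the sketch runs without a model.
def fake_generate(prompt, n_new_tokens):
    time.sleep(0.01)
    return [0] * n_new_tokens

tps = tokens_per_second(fake_generate, "hello", n_new_tokens=10)
print(tps)
```

When timing a real GPU model, make sure the generation call blocks until the device has finished, and keep the prompt, batch size, and number of generated tokens identical across the FP16 and quantized runs.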

Agent Features

Tool Use

  • MLC-LLM

Architectures

  • transformer

Optimization Features

Token Efficiency

  • Improves tokens/s over FP16 on the reported hardware (e.g., ~2× on LLaMA-7B at W4A16g128)

Infra Optimization

  • Calibration is feasible on a single GPU (A100-40G/80G)

Model Optimization

  • Learnable Weight Clipping (LWC)
  • Learnable Equivalent Transformation (LET)
  • Block-wise differentiable parameter optimization
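
The block-wise objective can be sketched on a toy stack of linear blocks. This is an illustration under loud assumptions: a grid search over one shared clipping strength per block stands in for the gradient-based optimizer, each "block" is a single matrix, and the target for each block is taken to be the full-precision block output. All function names are hypothetical.

```python
import random

def fake_quantize(row, n_bits, clip):
    """Asymmetric quantize-dequantize with the range shrunk by `clip`."""
    hi, lo = clip * max(row), clip * min(row)
    levels = 2 ** n_bits - 1
    step = (hi - lo) / levels            # assumes the row is not constant
    zero = -round(lo / step)
    return [(min(max(round(w / step) + zero, 0), levels) - zero) * step for w in row]

def forward(x, w):
    """Toy block: y[t][j] = sum_i x[t][i] * w[j][i] (w stored as [out][in])."""
    return [[sum(xi * wi for xi, wi in zip(tok, w_row)) for w_row in w] for tok in x]

def output_error(x_fp, x_q, w, w_q):
    """Squared error between full-precision and quantized block outputs."""
    ref, out = forward(x_fp, w), forward(x_q, w_q)
    return sum((a - b) ** 2 for r, o in zip(ref, out) for a, b in zip(r, o))

def calibrate(x, blocks, n_bits, candidates=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    x_fp, x_q = x, x                     # full-precision and quantized streams
    chosen = []
    for w in blocks:                     # one block at a time; weights stay frozen
        best_err, best_clip, best_wq = None, None, None
        for clip in candidates:          # grid search replaces gradient steps
            w_q = [fake_quantize(row, n_bits, clip) for row in w]
            err = output_error(x_fp, x_q, w, w_q)
            if best_err is None or err < best_err:
                best_err, best_clip, best_wq = err, clip, w_q
        chosen.append((best_clip, best_err))
        x_fp, x_q = forward(x_fp, w), forward(x_q, best_wq)
    return chosen

rng = random.Random(0)
x = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
blocks = [[[rng.gauss(0, 0.5) for _ in range(4)] for _ in range(4)] for _ in range(3)]
chosen = calibrate(x, blocks, n_bits=3)
print(chosen)
```

Only the small set of quantization parameters changes per block, and only one block's tensors need to be considered at a time, which fits the single-GPU calibration described in this section.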

System Optimization

  • Fuses learned scaling/clipping into stored weights for zero runtime overhead

Training Optimization

  • Small calibration set (128 samples)
  • Per-block SGD/AdamW on quantization parameters

Inference Optimization

  • Per-channel and group-wise weight quantization (INT2/3/4/6)
  • Per-token activation quantization (for weight-activation settings)
  • No extra runtime ops; learned parameters fused into weights
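
The difference between these granularities can be shown with a short sketch. It assumes plain min-max asymmetric quantization and uses a group size of 4 so the example stays readable (the W4A16g128 setting uses groups of 128); the function names are hypothetical.

```python
def quant_dequant(values, n_bits):
    """Min-max asymmetric quantize-dequantize of one list of floats."""
    lo, hi = min(values), max(values)
    if hi == lo:                         # degenerate constant group: leave as is
        return list(values)
    levels = 2 ** n_bits - 1
    step = (hi - lo) / levels
    zero = -round(lo / step)
    return [(min(max(round(v / step) + zero, 0), levels) - zero) * step for v in values]

def groupwise_quant(row, n_bits, group_size):
    """Each contiguous group of weights gets its own scale and zero point."""
    out = []
    for start in range(0, len(row), group_size):
        out.extend(quant_dequant(row[start:start + group_size], n_bits))
    return out

def per_token_quant(acts, n_bits):
    """Each token (row of activations) gets its own scale and zero point."""
    return [quant_dequant(token, n_bits) for token in acts]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Small-magnitude values followed by large-magnitude values.
row = [0.1, -0.2, 0.25, 0.0, 5.0, -4.0, 3.0, 1.0]

row_err = mse(row, groupwise_quant(row, n_bits=4, group_size=len(row)))  # one range
group_err = mse(row, groupwise_quant(row, n_bits=4, group_size=4))       # two ranges
tokens = per_token_quant([row[:4], row[4:]], n_bits=4)
print(row_err, group_err)
```

With a single range, the large values dictate the step size and the small values all collapse to zero; giving each group or token its own range keeps them distinguishable, at the cost of storing extra scales and zero points.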

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Requires per-model calibration and a short training pass, which takes longer than pure GPTQ (reported as roughly 5× slower).
  • Some low-bit modes (INT2/INT3) lack efficient hardware support today.
  • SoftMax outputs are sensitive: 4-bit SoftMax produced large degradation in experiments.
  • LET gives limited benefit for some models (e.g., turned off for LLaMA in weight-only cases).

When Not To Use

  • When you have zero GPU time and must use fully training-free PTQ like plain GPTQ.
  • If your deployment hardware does not support the targeted low-bit integer formats.
  • If you require truly lossless FP16 behavior for critical tasks.

Failure Modes

  • Over-aggressive SoftMax quantization (<=4-bit) can break generation quality.
  • Poor initialization or unstable gradients in LET can reduce benefit (some layers excluded in paper).
  • Calibration bias: although the authors show robustness across calibration datasets, an extreme domain mismatch could still hurt accuracy.

Core Entities

Models

  • LLaMA (7B-65B)
  • LLaMA-2 (7B-70B)
  • OPT (125M-66B)
  • Falcon-180B
  • LLaMA-2-chat
  • GPTQ
  • AWQ
  • SmoothQuant
  • Outlier Suppression+
  • LLM-QAT

Metrics

  • Perplexity
  • Accuracy
  • Tokens per second (throughput)
  • Weight memory (GB)

Datasets

  • WikiText2
  • C4
  • PTB
  • Pile
  • Vicuna benchmark
  • lm-eval-harness (zero-shot tasks: PIQA, ARC, BoolQ, HellaSwag, Winogrande)

Benchmarks

  • Perplexity on WikiText2/C4/PTB
  • Zero-shot accuracy (lm-eval-harness tasks)
  • Vicuna GPT-4 pairwise evaluation