OmniQuant: learnable clipping and equivalent transforms give PTQ QAT-like quality for very low-bit LLM quantization

August 25, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

OmniQuant is a practical PTQ upgrade: experiments show large models quantized on one GPU using public datasets, with clear gains in perplexity, accuracy, memory and throughput; some hardware and bit-format support (INT2/INT3) may still be immature.

Citations: 13

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo

Links

Abstract / PDF / Code / Data

Why It Matters For Business

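OmniQuant lets teams quantize large models to very low-bit formats on PTQ-level data and time budgets. It cuts weight storage and, in the reported setups, roughly doubles throughput (e.g., ~2× on 7B W4A16g128). The resulting model runs exactly like a standard quantized model, because the learned parameters are fused into the weights.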

Who Should Care

Summary TLDR

OmniQuant is a post-training quantization (PTQ) pipeline that learns a small set of quantization parameters via gradient-based, block-wise optimization while keeping the full-precision weights frozen. It adds two components: Learnable Weight Clipping (LWC), which optimizes the weight clipping range, and Learnable Equivalent Transformation (LET), which shifts quantization difficulty from activations to weights. With 128 calibration samples on a single A100 GPU, OmniQuant quantizes LLaMA/OPT/Falcon models (7B–180B) to low-bit formats (W4A4, W2A16, W6A6, W3A16). Compared with prior PTQ baselines it reports large gains in perplexity and zero-shot accuracy; compared with FP16 it reports faster inference and much smaller weight storage, with no added runtime cost.
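
The idea behind an equivalent transformation can be illustrated with a small sketch. It assumes a channel-wise shift and scale applied to the activations, with the inverse folded into the weights and bias so that the layer output does not change. The function names, the example values, and the hand-picked `scale` and `shift` are illustrative assumptions; in OmniQuant these parameters are learned, and this is not the authors' implementation.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def linear(x, w, bias):
    """y = x @ w + bias, with x as (tokens x in_ch) and w as (in_ch x out_ch)."""
    return [[v + b for v, b in zip(row, bias)] for row in matmul(x, w)]


def equivalent_transform(x, w, bias, scale, shift):
    """Move per-channel shift/scale from activations into weights and bias.

    x_t = (x - shift) / scale, w_t = scale * w, bias_t = bias + shift @ w,
    so linear(x_t, w_t, bias_t) equals linear(x, w, bias).
    """
    x_t = [[(v - d) / s for v, d, s in zip(row, shift, scale)] for row in x]
    w_t = [[s * v for v in row] for row, s in zip(w, scale)]
    shift_out = matmul([shift], w)[0]
    bias_t = [b + c for b, c in zip(bias, shift_out)]
    return x_t, w_t, bias_t


# Hypothetical activations: the second input channel carries large outliers.
x = [[0.1, 40.0], [-0.2, 44.0]]
w = [[1.0, -1.0], [0.5, 0.25]]
bias = [0.0, 1.0]

# Hand-picked here for illustration; learned in the method described above.
scale = [1.0, 4.0]
shift = [0.0, 42.0]

x_t, w_t, bias_t = equivalent_transform(x, w, bias, scale, shift)
y_ref = linear(x, w, bias)
y_new = linear(x_t, w_t, bias_t)
```

After the transform the activations span a much narrower range, which makes them easier to quantize, while `w_t` and `bias_t` can be computed once offline and stored, so no extra operation is needed at inference time.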

Problem Statement

Hand-crafted PTQ rules fail at very low bits (e.g., W4A4 or W2A16). Quantization-aware training (QAT) recovers accuracy but is expensive in GPU hours and data. Can we get QAT-level accuracy while keeping PTQ's time and data efficiency?

Main Contribution

A differentiable, block-wise PTQ pipeline (OmniQuant) that optimizes a small set of learnable quantization parameters instead of tuning all weights.

Learnable Weight Clipping (LWC) that adapts clipping strengths to reduce weight quantization error.
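
A minimal sketch of why a tunable clipping range helps, assuming asymmetric uniform quantization in which two clipping strengths `gamma` and `beta` (names chosen here for illustration) shrink the maximum and minimum of a weight row before the quantization step is computed. A coarse grid search stands in for the gradient-based optimization described above, so this shows the effect of clipping rather than the training procedure.

```python
def quantize_row(weights, bits, gamma=1.0, beta=1.0):
    """Quantize-dequantize one weight row with a clipped min/max range."""
    levels = 2 ** bits - 1
    w_max = gamma * max(weights)
    w_min = beta * min(weights)
    step = (w_max - w_min) / levels
    if step <= 0:
        # Degenerate range: nothing sensible to quantize, return unchanged.
        return list(weights)
    zero_point = -round(w_min / step)
    out = []
    for w in weights:
        q = round(w / step) + zero_point
        q = min(max(q, 0), levels)  # values outside the range are clipped
        out.append((q - zero_point) * step)
    return out


def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_clipping(weights, bits, steps=20):
    """Grid-search gamma and beta in (0, 1]; a stand-in for gradient descent."""
    best = (1.0, 1.0, mse(weights, quantize_row(weights, bits)))
    for i in range(1, steps + 1):
        for j in range(1, steps + 1):
            gamma, beta = i / steps, j / steps
            err = mse(weights, quantize_row(weights, bits, gamma, beta))
            if err < best[2]:
                best = (gamma, beta, err)
    return best


# Hypothetical weight row with a single large outlier.
weights = [-0.5, -0.3, -0.1, 0.0, 0.1, 0.2, 0.4, 4.0]
baseline = mse(weights, quantize_row(weights, bits=2))
gamma, beta, err = search_clipping(weights, bits=2)
```

On this toy row the searched clipping strengths give a lower reconstruction error than plain min/max quantization, because the range is no longer dictated entirely by the outlier.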

Key Findings

OmniQuant turns catastrophic W2A16 degradation into usable models.

Numbers: LLaMA-13B W2A16 perplexity 13.21 vs GPTQ 3832 (paper text)

Practical Use: You can quantize large models to 2-bit weights (W2A16) and retain usable generation quality instead of catastrophic collapse.

Evidence Ref: Introduction and Figure 1; Table 1

Large average zero-shot accuracy gains at aggressive W4A4 weight-activation quantization.

Numbers: Average accuracy improved by +4.99% to +11.80% across models at W4A4 (Section 4.3, Table 2)

Practical Use: If you need low-bit activation+weight inference (W4A4), OmniQuant often recovers substantial zero-shot task performance versus existing PTQ heuristics.

Evidence Ref: Section 4.3, Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Perplexity (generation) | LLaMA-13B W2A16 PPL=13.21 (OmniQuant) | GPTQ PPL=3832 (reported in text) | Huge reduction vs GPTQ | WikiText2 (reported in paper) | Intro & Table 1 | Figure 1 & Table 1
Accuracy | LLaMA-7B W4A4 avg=52.65% (OmniQuant) | SmoothQuant avg=38.41% | +14.24 percentage points | Six zero-shot tasks (Table 2) | Section 4.3, Table 2 | Table 2

What To Try In 7 Days

Run OmniQuant on a 7B model with 128 calibration samples to measure memory and tokens/s gains.

Try W4A16g128 weight-only quantization first to get large memory savings and speedups with minimal accuracy drop.

If you must lower activation bits, run W4A4 with OmniQuant and evaluate zero-shot accuracy on key tasks before full deployment.
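
Before running the experiments above, a back-of-the-envelope estimate of weight storage can help set expectations. The sketch below assumes every parameter is quantized, and that each group of weights stores one scale and one zero point at 16 bits each; both are simplifying assumptions (real checkpoints usually keep embeddings and some layers in higher precision), and `weight_memory_gb` is a hypothetical helper, not part of any OmniQuant tooling.

```python
def weight_memory_gb(num_params, bits, group_size=None, scale_bits=16, zero_bits=16):
    """Rough weight storage in GB (10**9 bytes) for a quantized model.

    group_size=None means no per-group metadata is counted (e.g. plain FP16).
    scale_bits and zero_bits are assumed per-group overheads.
    """
    total_bits = num_params * bits
    if group_size is not None:
        groups = num_params / group_size
        total_bits += groups * (scale_bits + zero_bits)
    return total_bits / 8 / 1e9


params_7b = 7_000_000_000
fp16_gb = weight_memory_gb(params_7b, bits=16)
w4g128_gb = weight_memory_gb(params_7b, bits=4, group_size=128)
w2g128_gb = weight_memory_gb(params_7b, bits=2, group_size=128)

print(f"FP16:      {fp16_gb:.2f} GB")
print(f"W4A16g128: {w4g128_gb:.2f} GB")
print(f"W2A16g128: {w2g128_gb:.2f} GB")
```

Under these assumptions a 7B model drops from 14 GB in FP16 to roughly 3.7 GB at W4A16g128. Measured numbers will differ, so treat this as a sanity check against what the actual run reports.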

Agent Features

Tool Use: MLC-LLM

Architectures: transformer

Optimization Features

Token Efficiency: Improves tokens/s vs FP16 in reported hardware (e.g., ~2× on 7B W4A16g128)

Infra Optimization: Single-GPU calibrations (A100-40G/80G) feasible

Model Optimization: Learnable Weight Clipping (LWC); Learnable Equivalent Transformation (LET); Block-wise differentiable parameter optimization

System Optimization: Fuses learned scaling/clipping into stored weights for zero runtime overhead

Training Optimization: Small calibration set (128 samples); Per-block SGD/AdamW on quantization parameters

Inference Optimization: Per-channel and group-wise weight quantization (INT2/3/4/6); Per-token activation quantization (for weight-activation settings); No extra runtime ops; learned parameters fused into weights
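
The granularity terms in the list above (per-token, group-wise) can be made concrete with a small sketch. It uses symmetric quantize-dequantize with a shared scale per slice, which is a simplification chosen for brevity and not necessarily the scheme used in the paper; all function names are illustrative.

```python
def fake_quant(values, bits):
    """Symmetric quantize-dequantize of a list using one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)
    scale = peak / qmax
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def per_tensor(x, bits):
    """One scale for the whole activation matrix."""
    flat = fake_quant([v for row in x for v in row], bits)
    n = len(x[0])
    return [flat[i * n:(i + 1) * n] for i in range(len(x))]


def per_token(x, bits):
    """One scale per row (token) of the activation matrix."""
    return [fake_quant(row, bits) for row in x]


def group_wise(row, bits, group_size):
    """One scale per contiguous group of weights within a row."""
    out = []
    for start in range(0, len(row), group_size):
        out.extend(fake_quant(row[start:start + group_size], bits))
    return out


def mse_2d(a, b):
    """Mean squared error between two equally shaped lists of rows."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Hypothetical activations: the second token is about 100x larger than the first.
acts = [[0.1, -0.2, 0.3, 0.05], [10.0, -20.0, 30.0, 5.0]]
err_tensor = mse_2d(acts, per_tensor(acts, bits=4))
err_token = mse_2d(acts, per_token(acts, bits=4))

# Hypothetical weight row quantized in groups of two.
w_row = [0.01, -0.02, 1.0, -2.0]
w_grouped = group_wise(w_row, bits=4, group_size=2)
```

With a single shared scale the small token is rounded entirely to zero, while per-token scales preserve it; group-wise weight quantization applies the same idea along the weight dimension at the cost of storing one scale per group.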

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Requires per-model calibration and a small training pass, which takes longer than pure GPTQ (reported as roughly 5× slower).

Some low-bit modes (INT2/INT3) lack efficient hardware support today.

When Not To Use

When you have zero GPU time and must use fully training-free PTQ like plain GPTQ.

If your deployment hardware does not support the targeted low-bit integer formats.

Failure Modes

Over-aggressive SoftMax quantization (<=4-bit) can break generation quality.

Poor initialization or unstable gradients in LET can reduce benefit (some layers excluded in paper).

Core Entities

Models

LLaMA (7B-65B), LLaMA-2 (7B-70B), OPT (125M-66B), Falcon-180B, LLaMA-2-chat, GPTQ, AWQ, SmoothQuant, Outlier Suppression+, LLM-QAT

Metrics

Perplexity, Accuracy, Tokens per second (throughput), Weight memory (GB)

Datasets

WikiText2, C4, PTB, Pile, Vicuna benchmark, lm-eval-harness (zero-shot tasks: PIQA, ARC, BoolQ, HellaSwag, Winogrande)

Benchmarks

Perplexity on WikiText2/C4/PTB, Accuracy, Vicuna GPT-4 pairwise evaluation