OmniQuant: learnable clipping and equivalent transforms give PTQ QAT-like quality for very low-bit LLM quantization

Overview

Decision Snapshot: Ready For Pilot

OmniQuant is a practical PTQ upgrade: experiments show large models quantized on one GPU using public datasets, with clear gains in perplexity, accuracy, memory and throughput; some hardware and bit-format support (INT2/INT3) may still be immature.

Citations: 13

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo

Links

Abstract / PDF / Code / Data

Why It Matters For Business

OmniQuant lets teams quantize large models to very low-bit formats with PTQ-level data and time budgets, cutting weight storage and often doubling throughput while adding no runtime operations beyond those of a standard quantized model.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, CTO, Product Manager

Summary TLDR

OmniQuant is a post-training quantization (PTQ) pipeline that learns a small set of quantization parameters via gradient-based block-wise optimization. It adds Learnable Weight Clipping (LWC) and Learnable Equivalent Transformation (LET) to move quantization difficulty from activations to weights and to optimize clipping, while keeping full-precision weights frozen. With 128 calibration samples on a single A100 GPU, OmniQuant quantizes LLaMA/OPT/Falcon models (7B–180B) to low-bit formats (W4A4, W2A16, W6A6, W3A16) with large gains in perplexity and zero-shot accuracy, faster inference and much smaller weight storage, without adding runtime cost.
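The equivalent-transformation idea in the summary can be checked numerically on a toy linear layer. The sketch below assumes a per-channel shift-and-scale form, where activations become (X - shift) / scale and the inverse is folded into the weights and bias; the `scale` and `shift` values are hypothetical stand-ins for learned parameters, and the helper names are illustrative rather than taken from the OmniQuant code.

```python
# Toy check of an equivalent transformation on one linear layer (pure Python).
# Shapes: X is tokens x in_features, W is in_features x out_features, B is out_features.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def linear(x, w, b):
    y = matmul(x, w)
    return [[y[i][j] + b[j] for j in range(len(b))] for i in range(len(y))]

def apply_transform(x, w, b, scale, shift):
    """Assumed form: X' = (X - shift) / scale, W' = scale * W (per input channel),
    B' = B + shift @ W. The layer output is unchanged in exact arithmetic."""
    n_in, n_out = len(w), len(w[0])
    x_t = [[(row[c] - shift[c]) / scale[c] for c in range(n_in)] for row in x]
    w_t = [[scale[c] * w[c][j] for j in range(n_out)] for c in range(n_in)]
    b_t = [b[j] + sum(shift[c] * w[c][j] for c in range(n_in)) for j in range(n_out)]
    return x_t, w_t, b_t

X = [[0.5, 40.0], [-0.25, 36.0]]   # channel 1 carries outlier-sized activations
W = [[0.2, -0.1], [0.05, 0.3]]
B = [0.1, -0.2]
scale = [1.0, 8.0]                  # hypothetical "learned" values
shift = [0.0, 38.0]

X_t, W_t, B_t = apply_transform(X, W, B, scale, shift)
y_ref = linear(X, W, B)
y_new = linear(X_t, W_t, B_t)

max_diff = max(abs(p - q) for r1, r2 in zip(y_ref, y_new) for p, q in zip(r1, r2))
act_range_before = max(abs(v) for row in X for v in row)
act_range_after = max(abs(v) for row in X_t for v in row)
print(max_diff, act_range_before, act_range_after)
```

The outputs agree up to floating-point error while the activation range shrinks, which is the sense in which quantization difficulty moves from activations to weights. Because the transformed weights and bias can be stored directly, the transform itself needs no extra operation at inference time.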

Problem Statement

Hand-crafted PTQ rules fail at very low bits (e.g., W4A4 or W2A16). Quantization-aware training (QAT) recovers accuracy but is expensive in GPU hours and data. Can we get QAT-level accuracy while keeping PTQ's time and data efficiency?

Main Contribution

A differentiable, block-wise PTQ pipeline (OmniQuant) that optimizes a small set of learnable quantization parameters instead of tuning all weights.

Learnable Weight Clipping (LWC) that adapts clipping strengths to reduce weight quantization error.
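A minimal sketch of the clipping idea follows, assuming a uniform asymmetric quantizer whose range is shrunk by two clipping strengths, `gamma` (upper) and `beta` (lower), each in (0, 1]. The exact parameterization in the paper may differ. The source describes gradient-based block-wise optimization; this sketch swaps in a plain grid search over `gamma` and scores candidates by weight MSE, purely to keep the example dependency-free.

```python
def quantize_clipped(weights, n_bits, gamma=1.0, beta=1.0):
    """Fake-quantize a list of weights with a clipped min-max range (assumed form)."""
    levels = 2 ** n_bits - 1
    w_max = gamma * max(weights)          # clipped upper bound
    w_min = beta * min(weights)           # clipped lower bound
    step = (w_max - w_min) / levels
    zero_point = -round(w_min / step)
    dequantized = []
    for w in weights:
        q = min(max(round(w / step) + zero_point, 0), levels)  # clamp to the integer grid
        dequantized.append((q - zero_point) * step)
    return dequantized

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy weights: a tight cluster plus one mild outlier (0.26).
weights = [-0.1, 0.0, 0.1, 0.2] * 3 + [0.26]

base_err = mse(weights, quantize_clipped(weights, n_bits=2))

# Grid search stands in for the learned update of gamma (an assumption of this sketch).
candidates = [g / 10 for g in range(10, 4, -1)]   # 1.0, 0.9, ..., 0.5
best_gamma = min(candidates,
                 key=lambda g: mse(weights, quantize_clipped(weights, 2, gamma=g)))
best_err = mse(weights, quantize_clipped(weights, 2, gamma=best_gamma))
print(f"no clipping: {base_err:.6f}, best gamma {best_gamma}: {best_err:.6f}")
```

On this toy data, clipping the outlier slightly lets the 2-bit grid fit the cluster more tightly, so total error drops. With a much larger outlier, weight MSE can favor no clipping at all, which is one reason to tune the clipping strength per model rather than fix it by rule.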

Key Findings

OmniQuant turns catastrophic W2A16 degradation into usable models.

Numbers: LLaMA-13B W2A16 perplexity 13.21 vs GPTQ 3832 (paper text)

Practical Use: You can quantize large models to 2-bit weights (W2A16) and retain usable generation quality instead of catastrophic collapse.

Evidence Ref: Introduction and Figure 1; Table 1

Large average zero-shot accuracy gains at aggressive W4A4 weight-activation quantization.

Numbers: Average accuracy improved by +4.99% to +11.80% across models at W4A4 (Section 4.3, Table 2)

Practical Use: If you need low-bit activation+weight inference (W4A4), OmniQuant often recovers substantial zero-shot task performance versus existing PTQ heuristics.

Evidence Ref: Section 4.3, Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity (generation)	LLaMA-13B W2A16 PPL=13.21 (OmniQuant)	GPTQ PPL=3832 (reported in text)	About 290× lower than GPTQ	WikiText2 (reported in paper)	Intro & Table 1	Figure 1 & Table 1
Accuracy	LLaMA-7B W4A4 avg=52.65% (OmniQuant)	SmoothQuant avg=38.41%	+14.24 percentage points	Six zero-shot tasks (Table 2)	Section 4.3, Table 2	Table 2
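The deltas in the table can be reproduced from the reported values alone. The short check below uses only the four numbers quoted in the rows above; the variable names are illustrative.

```python
# Reported values from the Results table.
omniquant_ppl, gptq_ppl = 13.21, 3832.0          # LLaMA-13B, W2A16
omniquant_acc, smoothquant_acc = 52.65, 38.41    # LLaMA-7B, W4A4, average accuracy in %

ppl_ratio = gptq_ppl / omniquant_ppl             # how many times lower the perplexity is
acc_delta_pp = omniquant_acc - smoothquant_acc   # difference in percentage points

print(f"perplexity ratio: {ppl_ratio:.1f}x, accuracy delta: {acc_delta_pp:.2f} pp")
```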

What To Try In 7 Days

Run OmniQuant on a 7B model with 128 calibration samples to measure memory and tokens/s gains.

Try W4A16g128 weight-only quantization first to get large memory savings and speedups with minimal accuracy drop.

If you must lower activation bits, run W4A4 with OmniQuant and evaluate zero-shot accuracy on key tasks before full deployment.
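Before running the pilot, a back-of-the-envelope estimate of weight storage helps set expectations. The sketch below is an assumption-laden approximation, not a measurement: it treats the model as a nominal 7e9 parameters, assumes one 16-bit scale and one zero point (at the weight bit width) per group, and ignores embeddings or layers kept in higher precision. The function name and defaults are hypothetical.

```python
def weight_memory_gb(n_params, weight_bits, group_size=None, scale_bits=16, zero_bits=None):
    """Rough weight-storage estimate in GB (10^9 bytes).

    With group-wise quantization, each group of `group_size` weights is assumed to
    carry one scale and one zero point as overhead.
    """
    total_bits = n_params * weight_bits
    if group_size:
        n_groups = n_params / group_size
        zp_bits = weight_bits if zero_bits is None else zero_bits
        total_bits += n_groups * (scale_bits + zp_bits)
    return total_bits / 8 / 1e9

fp16_gb = weight_memory_gb(7e9, 16)                    # FP16 baseline
w4g128_gb = weight_memory_gb(7e9, 4, group_size=128)   # W4A16g128
w2g128_gb = weight_memory_gb(7e9, 2, group_size=128)   # W2A16g128

print(f"FP16: {fp16_gb:.2f} GB, W4g128: {w4g128_gb:.2f} GB, W2g128: {w2g128_gb:.2f} GB")
```

Under these assumptions the group overhead is small next to the weights themselves, so most of the saving comes from the bit width. Compare the estimate against the memory you actually measure, since runtime buffers and unquantized layers add to the real footprint.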

Agent Features

Tool Use

MLC-LLM

Architectures

transformer

Optimization Features

Token Efficiency

Improves tokens/s vs FP16 on the reported hardware (e.g., ~2× on 7B W4A16g128)

Infra Optimization

Single-GPU calibrations (A100-40G/80G) feasible

Model Optimization

Learnable Weight Clipping (LWC)

Learnable Equivalent Transformation (LET)

Block-wise differentiable parameter optimization

System Optimization

Fuses learned scaling/clipping into stored weights for zero runtime overhead

Training Optimization

Small calibration set (128 samples)

Per-block SGD/AdamW on quantization parameters

Inference Optimization

Per-channel and group-wise weight quantization (INT2/3/4/6)

Per-token activation quantization (for weight-activation settings)

No extra runtime ops; learned parameters fused into weights
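The granularity terms above can be illustrated with a plain min-max quantizer. This is a generic sketch of group-wise weight quantization and per-token activation quantization, not OmniQuant's implementation; the helper names and toy values are assumptions made for illustration.

```python
def minmax_quantize(values, n_bits):
    """Asymmetric min-max fake-quantization of one group of values."""
    levels = 2 ** n_bits - 1
    lo, hi = min(values), max(values)
    step = (hi - lo) / levels or 1.0     # fall back to 1.0 for constant groups
    return [lo + round((v - lo) / step) * step for v in values]

def quantize_groupwise(row, n_bits, group_size):
    """Quantize one weight row in consecutive groups, each with its own range."""
    out = []
    for start in range(0, len(row), group_size):
        out.extend(minmax_quantize(row[start:start + group_size], n_bits))
    return out

def quantize_per_token(activations, n_bits):
    """Quantize each token (row of activations) with its own range."""
    return [minmax_quantize(token, n_bits) for token in activations]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# One weight row mixing small and large magnitudes.
row = [0.01, -0.02, 0.03, -0.04, 1.0, -2.0, 3.0, -4.0]
whole_row_err = mse(row, minmax_quantize(row, 4))          # one range for the row
grouped_err = mse(row, quantize_groupwise(row, 4, 4))      # one range per 4 weights

# Two tokens with very different activation magnitudes.
acts = [[0.1, 0.2, -0.1, 0.05], [10.0, -20.0, 5.0, 0.0]]
acts_q = quantize_per_token(acts, 4)
token0_max_err = max(abs(a - b) for a, b in zip(acts[0], acts_q[0]))

print(whole_row_err, grouped_err, token0_max_err)
```

Smaller groups keep the small-magnitude weights from sharing a range with the large ones, at the cost of storing more scales and zero points. Per-token ranges play the same role for activations, where magnitudes vary from token to token.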

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/OpenGVLab/OmniQuant

Data URLs

https://huggingface.co/datasets/wikitext

https://huggingface.co/datasets/c4

Risks & Boundaries

Limitations

Requires per-model calibration and a small training pass (longer than pure GPTQ; shown ~5× slower than GPTQ).

Some low-bit modes (INT2/INT3) lack efficient hardware support today.

When Not To Use

When you have zero GPU time and must use fully training-free PTQ like plain GPTQ.

If your deployment hardware does not support the targeted low-bit integer formats.

Failure Modes

Over-aggressive SoftMax quantization (<=4-bit) can break generation quality.

Poor initialization or unstable gradients in LET can reduce benefit (some layers excluded in paper).

Core Entities

Models

LLaMA (7B-65B), LLaMA-2 (7B-70B), OPT (125M-66B), Falcon-180B, LLaMA-2-chat, GPTQ, AWQ, SmoothQuant, Outlier Suppression+, LLM-QAT

Metrics

Perplexity, Accuracy, Tokens per second (throughput), Weight memory (GB)

Datasets

WikiText2, C4, PTB, Pile, Vicuna benchmark, lm-eval-harness (zero-shot tasks: PIQA, ARC, BoolQ, HellaSwag, Winogrande)

Benchmarks

Perplexity on WikiText2/C4/PTB, Accuracy, Vicuna GPT-4 pairwise evaluation
