Overview
Production Readiness
0.8
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
13
Why It Matters For Business
OmniQuant lets teams quantize large models to very low-bit formats within PTQ-level data and time budgets, cutting weight storage and, in the reported setups, roughly doubling throughput. The learned parameters fuse into the weights, so inference runs exactly like any standard quantized model.
Summary TLDR
OmniQuant is a post-training quantization (PTQ) pipeline that learns a small set of quantization parameters via gradient-based, block-wise optimization while keeping the full-precision weights frozen. It adds Learnable Weight Clipping (LWC) to optimize weight clipping ranges and Learnable Equivalent Transformation (LET) to shift quantization difficulty from activations to weights. With 128 calibration samples on a single A100 GPU, OmniQuant quantizes LLaMA/OPT/Falcon models (7B–180B) to low-bit formats (W4A4, W6A6, W2A16, W3A16), improving perplexity and zero-shot accuracy, speeding up inference, and shrinking weight storage without adding runtime cost.
Problem Statement
Hand-crafted PTQ rules fail at very low bits (e.g., W4A4 or W2A16). Quantization-aware training (QAT) recovers accuracy but is expensive in GPU hours and data. Can we get QAT-level accuracy while keeping PTQ's time and data efficiency?
Main Contribution
A differentiable, block-wise PTQ pipeline (OmniQuant) that optimizes a small set of learnable quantization parameters instead of tuning all weights.
Learnable Weight Clipping (LWC) that adapts clipping strengths to reduce weight quantization error.
Learnable Equivalent Transformation (LET) that applies learned channel-wise scaling/shifting (including attention) to move activation outlier difficulty into weights; both fuse into final weights so no runtime cost is added.
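As a rough illustration of the LWC idea above, the sketch below fake-quantizes one weight row with a uniform asymmetric quantizer whose range is shrunk by two clipping strengths, gamma (applied to the row maximum) and beta (applied to the row minimum). The function name, the pure-Python form, and the hand-picked gamma/beta values are assumptions for illustration only; OmniQuant learns these strengths by gradient descent per block, which this sketch does not attempt.

```python
def lwc_fake_quantize(weights, n_bits, gamma=1.0, beta=1.0):
    """Quantize then dequantize one weight row.

    gamma, beta in (0, 1] shrink the max/min bounds; gamma = beta = 1
    reduces to plain MinMax quantization. Assumes the row is not constant.
    """
    w_max, w_min = max(weights), min(weights)
    levels = 2 ** n_bits - 1
    # Step size of the clipped range.
    step = (gamma * w_max - beta * w_min) / levels
    zero_point = -round(beta * w_min / step)
    out = []
    for w in weights:
        q = round(w / step) + zero_point
        q = min(max(q, 0), levels)           # clamp to the integer grid
        out.append((q - zero_point) * step)  # map back to real values
    return out


row = [-1.0, -0.4, 0.0, 0.3, 0.8, 2.0]

# MinMax range [-1, 2] with 2 bits: step 1.0, inner weights collapse to 0.
print(lwc_fake_quantize(row, n_bits=2))
# [-1.0, 0.0, 0.0, 0.0, 1.0, 2.0]

# Clipped range [-0.5, 1.0]: step 0.5, finer grid inside, extremes saturate.
print(lwc_fake_quantize(row, n_bits=2, gamma=0.5, beta=0.5))
# [-0.5, -0.5, 0.0, 0.5, 1.0, 1.0]
```

Clipping trades saturation error on the extremes for a finer grid on the bulk of the weights; which setting wins depends on the weight distribution, which is why the strengths are optimized rather than fixed.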
Key Findings
OmniQuant turns catastrophic W2A16 degradation into usable models.
Large average zero-shot accuracy gains at aggressive W4A4 weight-activation quantization.
OmniQuant is practical on a single GPU with tiny calibration sets.
Quantized models reduce weight memory and accelerate inference in practice.
Results
Perplexity (generation)
Accuracy
Throughput (tokens/s) and weight memory
Calibration/training cost
Who Should Care
What To Try In 7 Days
Run OmniQuant on a 7B model with 128 calibration samples to measure memory and tokens/s gains.
Try W4A16g128 weight-only quantization first to get large memory savings and speedups with minimal accuracy drop.
If you must lower activation bits, run W4A4 with OmniQuant and evaluate zero-shot accuracy on key tasks before full deployment.
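Before running these experiments, a back-of-the-envelope estimate of weight storage can set expectations. The helper below is a hypothetical sketch: it assumes a nominal 7e9 parameters, that every parameter is quantized, and that each group stores a 16-bit scale and a 16-bit zero point. Real checkpoints keep some tensors (for example embeddings) in higher precision and may pack metadata differently, so measured numbers will differ.

```python
def weight_memory_gb(n_params, weight_bits, group_size=None,
                     scale_bits=16, zero_bits=16):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    group_size=None means no per-group metadata is counted.
    scale_bits and zero_bits are assumed metadata widths per group.
    """
    total_bits = n_params * weight_bits
    if group_size is not None:
        n_groups = n_params / group_size
        total_bits += n_groups * (scale_bits + zero_bits)
    return total_bits / 8 / 1e9


fp16 = weight_memory_gb(7e9, 16)
w4g128 = weight_memory_gb(7e9, 4, group_size=128)
print(f"FP16:      {fp16:.2f} GB")    # FP16:      14.00 GB
print(f"W4A16g128: {w4g128:.2f} GB")  # W4A16g128: 3.72 GB
```

Under these assumptions the group metadata adds 0.25 bits per weight, so W4A16g128 costs about 4.25 bits per parameter rather than exactly 4.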
Agent Features
Tool Use
- MLC-LLM
Architectures
- transformer
Optimization Features
Token Efficiency
- Improves tokens/s over FP16 on the reported hardware (e.g., ~2× on 7B W4A16g128)
Infra Optimization
- Calibration is feasible on a single GPU (A100-40G/80G)
Model Optimization
- Learnable Weight Clipping (LWC)
- Learnable Equivalent Transformation (LET)
- Block-wise differentiable parameter optimization
System Optimization
- Fuses learned scaling/clipping into stored weights for zero runtime overhead
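The reason fusing costs nothing at runtime is that the channel-wise scale and shift form an exact algebraic rewrite of a linear layer: Y = XW + B = ((X - shift) / scale)(scale * W) + (B + shift W). The sketch below checks that identity numerically on a tiny layer. All names and values are illustrative assumptions; here scale and shift are picked by hand, whereas OmniQuant learns them, and in a real model the activation-side division would be folded into the preceding layer rather than applied as a separate op.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def linear(x, w, bias):
    """y = x @ w + bias, with x as tokens x in_channels."""
    return [[v + bias[j] for j, v in enumerate(row)] for row in matmul(x, w)]


def let_transform(x, w, bias, scale, shift):
    """Rewrite (x, w, bias) so the layer output is unchanged.

    scale and shift have one entry per input channel.
    """
    x_t = [[(v - shift[c]) / scale[c] for c, v in enumerate(row)] for row in x]
    w_t = [[scale[c] * v for v in row] for c, row in enumerate(w)]
    shift_out = matmul([shift], w)[0]          # contribution of shift @ w
    bias_t = [b + d for b, d in zip(bias, shift_out)]
    return x_t, w_t, bias_t


# Channel 1 carries activation outliers.
x = [[1.0, 100.0], [2.0, -80.0]]
w = [[0.5, -1.0], [0.02, 0.03]]
bias = [0.1, -0.2]

x_t, w_t, bias_t = let_transform(x, w, bias, scale=[1.0, 50.0], shift=[0.0, 10.0])
print(linear(x, w, bias))
print(linear(x_t, w_t, bias_t))  # same values up to floating-point rounding
print(x_t)                        # outlier channel now has magnitude below 2
```

The transformed activations have a much smaller dynamic range and are therefore easier to quantize, while the matching weight rows grow by the same factor; that is the sense in which difficulty moves from activations into weights.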
Training Optimization
- Small calibration set (128 samples)
- Per-block SGD/AdamW on quantization parameters
Inference Optimization
- Per-channel and group-wise weight quantization (INT2/3/4/6)
- Per-token activation quantization (for weight-activation settings)
- No extra runtime ops; learned parameters fused into weights
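To make the granularity terms above concrete, the sketch below applies the same quantizer at two granularities: per group of weights (the "g128" in W4A16g128 means groups of 128) and per token for activations. The plain MinMax asymmetric quantizer, the function names, and the toy group size of 4 are assumptions for illustration, not the paper's exact kernels.

```python
def minmax_fake_quantize(values, n_bits):
    """Quantize then dequantize a list with its own min/max range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant input needs no quantization grid
    levels = 2 ** n_bits - 1
    step = (hi - lo) / levels
    zero_point = -round(lo / step)
    out = []
    for v in values:
        q = min(max(round(v / step) + zero_point, 0), levels)
        out.append((q - zero_point) * step)
    return out


def groupwise_fake_quantize(row, n_bits, group_size):
    """Each contiguous group of weights gets its own scale and zero point."""
    out = []
    for start in range(0, len(row), group_size):
        out.extend(minmax_fake_quantize(row[start:start + group_size], n_bits))
    return out


def per_token_fake_quantize(activations, n_bits):
    """Each token (row) of the activation matrix gets its own range."""
    return [minmax_fake_quantize(token, n_bits) for token in activations]


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


row = [0.0, 0.1, 0.2, 0.3, 10.0, 20.0, 30.0, 40.0]
whole = minmax_fake_quantize(row, n_bits=2)
grouped = groupwise_fake_quantize(row, n_bits=2, group_size=4)
print(max_abs_error(row, whole))    # large: one range must cover 0..40
print(max_abs_error(row, grouped))  # near zero: each group has its own range
```

Smaller groups track local ranges more closely at the cost of storing more scales and zero points, which is the trade-off behind choosing a group size such as 128.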
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Requires per-model calibration and a small training pass, reported as roughly 5× slower than GPTQ.
- Some low-bit modes (INT2/INT3) lack efficient hardware support today.
- SoftMax outputs are sensitive: 4-bit SoftMax produced large degradation in experiments.
- LET gives limited benefit for some models (e.g., turned off for LLaMA in weight-only cases).
When Not To Use
- When you have zero GPU time and must use fully training-free PTQ like plain GPTQ.
- If your deployment hardware does not support the targeted low-bit integer formats.
- If you require truly lossless FP16 behavior for critical tasks.
Failure Modes
- Over-aggressive SoftMax quantization (<=4-bit) can break generation quality.
- Poor initialization or unstable gradients in LET can reduce its benefit (the paper excludes some layers).
- Calibration bias: although the authors report robustness across calibration datasets, extreme domain mismatch could still hurt.
Core Entities
Models
- LLaMA (7B-65B)
- LLaMA-2 (7B-70B)
- OPT (125M-66B)
- Falcon-180B
- LLaMA-2-chat
- GPTQ
- AWQ
- SmoothQuant
- Outlier Suppression+
- LLM-QAT
Metrics
- Perplexity
- Accuracy
- Tokens per second (throughput)
- Weight memory (GB)
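For readers less familiar with the first metric, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The helper below is a minimal sketch that assumes natural-log token probabilities are already available from the model; the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/4 to every token behaves like a
# uniform choice among 4 options, so its perplexity is about 4.
print(perplexity([math.log(0.25)] * 10))
```

Comparing this value between the FP16 model and its quantized version on the same text gives a direct measure of how much quality the quantization cost.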
Datasets
- WikiText2
- C4
- PTB
- Pile
- Vicuna benchmark
- lm-eval-harness (zero-shot tasks: PIQA, ARC, BoolQ, HellaSwag, Winogrande)
Benchmarks
- Perplexity on WikiText2/C4/PTB
- Zero-shot accuracy (lm-eval-harness tasks)
- Vicuna GPT-4 pairwise evaluation

