Overview
The survey synthesizes many reproduced results: 8-bit quantization is mature for production; 4-bit is promising with QAT; binary and sub-2-bit quantization remains experimental and risky.
Citations: 3
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 30%
Why It Matters For Business
Quantizing ViTs to 8-bit often preserves accuracy while shrinking weight memory (to about half of FP16 or a quarter of FP32) and improving throughput on INT8-capable hardware, enabling real-time and edge deployment at lower cost.
Summary TLDR
This paper surveys methods for reducing the size and compute cost of Vision Transformers (ViTs) by quantizing weights and activations and by designing hardware that exploits low-bit formats. It explains ViT runtime bottlenecks and reviews post-training quantization (PTQ), quantization-aware training (QAT), data-free quantization (DFQ), binary approaches, integer approximations for non-linear ops, and FPGA/ASIC/GPU accelerators. The survey highlights that 8-bit quantization is routinely near-lossless on ImageNet and that QAT can retain useful accuracy at 4-bit, while binary (<2-bit) methods still suffer large accuracy drops and need more research.
Problem Statement
Vision Transformers perform well but are large and compute-heavy. Their self-attention grows quadratically with token count, making deployment on edge and low-latency systems hard. The paper asks: which quantization algorithms and hardware designs let ViTs run efficiently without unacceptable accuracy loss?
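The quadratic growth can be checked with a rough FLOP count. The sketch below is illustrative and not taken from the survey: the function names are hypothetical, one multiply-add is counted as 2 FLOPs, and softmax, layer norm, and the MLP are ignored. The 197-token, 768-dimension setting assumes a ViT-Base-style model on 224x224 input with 16x16 patches plus a class token.

```python
def attention_matmul_flops(num_tokens: int, embed_dim: int) -> int:
    """FLOPs for QK^T and (attention weights) @ V in one layer.

    Each product costs 2 * N * N * D summed over heads, so the total is 4 * N^2 * D.
    """
    return 4 * num_tokens * num_tokens * embed_dim


def projection_flops(num_tokens: int, embed_dim: int) -> int:
    """FLOPs for the Q, K, V, and output projections: 4 matmuls of 2 * N * D * D."""
    return 8 * num_tokens * embed_dim * embed_dim


def tokens_for(image_size: int, patch_size: int = 16) -> int:
    """Patch tokens plus one class token (assumed ViT-style tokenization)."""
    return (image_size // patch_size) ** 2 + 1


base = tokens_for(224)   # 197 tokens
large = tokens_for(448)  # 785 tokens
dim = 768                # assumed embedding width

attn_growth = attention_matmul_flops(large, dim) / attention_matmul_flops(base, dim)
proj_growth = projection_flops(large, dim) / projection_flops(base, dim)

print(f"tokens: {base} -> {large}")
print(f"attention matmul FLOPs grow {attn_growth:.1f}x")
print(f"projection FLOPs grow {proj_growth:.1f}x")
```

Doubling the input resolution roughly quadruples the token count, so the token-by-token matmuls grow with the square of that factor while the projections grow only linearly in it.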
Main Contribution
Clear taxonomy and comparison of ViT quantization methods (PTQ, QAT, DFQ, binary).
Operation-level analysis of ViT bottlenecks using a roofline perspective and per-op FLOPs/MOPs.
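A roofline perspective compares each operation's arithmetic intensity (FLOPs per byte moved) with the hardware's ratio of peak compute to peak memory bandwidth. The sketch below is a minimal illustration, not the survey's analysis: the peak numbers describe a hypothetical device, the byte counts assume FP32 operands read or written once, and the 8 FLOPs per element for an elementwise op is an assumed placeholder.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def is_memory_bound(intensity: float, peak_flops_per_s: float, peak_bytes_per_s: float) -> bool:
    """An op is memory bound when its intensity falls below the roofline ridge point."""
    ridge = peak_flops_per_s / peak_bytes_per_s
    return intensity < ridge


def attainable_flops_per_s(intensity: float, peak_flops_per_s: float, peak_bytes_per_s: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops_per_s, intensity * peak_bytes_per_s)


# Hypothetical device: 10 TFLOP/s peak compute, 500 GB/s memory bandwidth.
PEAK_FLOPS = 10e12
PEAK_BW = 500e9

n, d, bytes_per_value = 197, 768, 4  # tokens, width, FP32 bytes (assumed)

# Linear layer (N x D) @ (D x D): read input and weights, write output.
matmul_ai = arithmetic_intensity(2 * n * d * d, bytes_per_value * (n * d + d * d + n * d))

# Elementwise op over N x D: read and write each value once, assumed 8 FLOPs per element.
elementwise_ai = arithmetic_intensity(8 * n * d, bytes_per_value * 2 * n * d)

for name, ai in [("matmul", matmul_ai), ("elementwise", elementwise_ai)]:
    bound = "memory" if is_memory_bound(ai, PEAK_FLOPS, PEAK_BW) else "compute"
    print(f"{name}: {ai:.1f} FLOPs/byte, {bound} bound")
```

Under these assumptions the linear layer sits above the ridge point and the elementwise op sits well below it, which is the kind of split a per-op profile is meant to expose.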
Key Findings
8-bit quantization typically keeps near-original ImageNet accuracy for DeiT-Base.
Quantization-aware training (QAT) can preserve or exceed FP32 accuracy at 4-bit on evaluated models.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 82.67 (Evol-Q) / 81.48 (PTQ4ViT) / 81.20 (FQ-ViT) | 81.85 FP32 | +0.82 / -0.37 / -0.65 | ImageNet | Table III | Table III |
| Accuracy | 83.00 (Q-ViT) | 81.85 FP32 | +1.15 | ImageNet | Table III | Table III |
What To Try In 7 Days
Profile your ViT model for per-op FLOPs and memory traffic (use a roofline analysis to separate memory-bound from compute-bound layers).
Run 8-bit PTQ on a small calibration set and measure Top-1 accuracy and latency on target hardware.
If 4-bit is required, prototype QAT with knowledge distillation on a trimmed dataset and compare gain vs training cost.
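The PTQ step above can be sketched in a few lines. This is a minimal illustration of symmetric, per-tensor, max-abs calibration in pure Python, assuming a flat list of values stands in for a tensor; it is not any specific method from the survey, and all function names are hypothetical. A real experiment would use a quantization toolkit and measure Top-1 accuracy and latency on the target hardware.

```python
def calibrate_scale(calibration_values, num_bits=8):
    """Pick a scale so the largest calibration magnitude maps to the top integer level."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer level and clip to the symmetric signed range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(levels, scale):
    """Map integer levels back to real values."""
    return [q * scale for q in levels]


def max_roundtrip_error(values, num_bits):
    """Worst-case absolute error after quantizing and dequantizing."""
    scale = calibrate_scale(values, num_bits)
    restored = dequantize(quantize(values, scale, num_bits), scale)
    return max(abs(v - r) for v, r in zip(values, restored))


calibration = [-1.0, -0.5, 0.0, 0.25, 1.0]  # toy stand-in for calibration activations
err_8bit = max_roundtrip_error(calibration, num_bits=8)
err_4bit = max_roundtrip_error(calibration, num_bits=4)
print(f"8-bit max error: {err_8bit:.4f}")
print(f"4-bit max error: {err_4bit:.4f}")
```

The step size grows as the bit width shrinks, so the round-trip error at 4-bit is much larger than at 8-bit on the same data, which is one reason to budget for QAT when 4-bit is required.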
Optimization Features
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
PTQ often fails below 8-bit without careful calibration or retraining.
Binary quantization loses >10% top-1 on ImageNet for standard ViT backbones.
When Not To Use
Do not use naive PTQ when you need reliable sub-8-bit accuracy.
Avoid full binary quantization for tasks where accuracy matters (detection, medical imaging).
Failure Modes
Outlier channels dominate scale selection and degrade accuracy.
Weight oscillation during QAT causes instability and accuracy loss.
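The outlier failure mode can be reproduced on toy data. In the sketch below, one channel has values about a thousand times larger than the other; a single per-tensor max-abs scale is set by the outlier channel, so the small channel collapses to zero. Per-channel scaling is shown as a common mitigation and is an assumption here, not a recommendation quoted from the survey; the data and function names are hypothetical.

```python
def max_abs_scale(values, num_bits=8):
    """Symmetric scale that maps the largest magnitude to the top integer level."""
    qmax = 2 ** (num_bits - 1) - 1
    return max(abs(v) for v in values) / qmax


def quantize(values, scale, num_bits=8):
    """Round to the nearest level and clip to the symmetric signed range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def mean_roundtrip_error(values, scale, num_bits=8):
    """Mean absolute error after quantizing and dequantizing."""
    levels = quantize(values, scale, num_bits)
    return sum(abs(v - q * scale) for v, q in zip(values, levels)) / len(values)


small_channel = [0.01 * i for i in range(-5, 6)]    # values in [-0.05, 0.05]
outlier_channel = [10.0 * i for i in range(-5, 6)]  # values in [-50, 50]

# One scale for the whole tensor: chosen by the outlier channel.
per_tensor_scale = max_abs_scale(small_channel + outlier_channel)
# One scale per channel: each channel uses its own range.
per_channel_scale = max_abs_scale(small_channel)

small_levels = quantize(small_channel, per_tensor_scale)
err_per_tensor = mean_roundtrip_error(small_channel, per_tensor_scale)
err_per_channel = mean_roundtrip_error(small_channel, per_channel_scale)

print("small channel levels with per-tensor scale:", small_levels)
print(f"mean error, per-tensor scale:  {err_per_tensor:.5f}")
print(f"mean error, per-channel scale: {err_per_channel:.5f}")
```

With the shared scale every value in the small channel rounds to level 0, so its information is lost entirely, while a per-channel scale keeps the error within half a quantization step.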

