Overview
Strong engineering and empirical results for encoder inference on Ampere GPUs. Results are limited for decoder models, and the speedups depend on GPU support for INT4 (via CUTLASS).
Citations: 6
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 40%
Why It Matters For Business
On Ampere GPUs, INT4 computation can sharply reduce latency and cost for encoder-based workloads (search, classification, embedding). But it is risky to use for autoregressive generation (chatbots, text generation) until activation-quantization problems are solved.
Summary TLDR
The paper shows 4-bit weight+activation (W4A4) quantization can keep accuracy for encoder (BERT) and encoder-decoder (BART) models while enabling large latency wins with optimized GPU kernels. Decoder-only models (GPT) suffer large quality loss from 4-bit activation quantization. The authors release a tuned INT4 encoder inference pipeline (CUTLASS-based) that achieves up to 8.5× latency speedup over FP16 and improves prior INT8 performance by up to 1.7×.
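To make "W4A4" concrete, here is a minimal sketch of symmetric 4-bit quantization in plain Python. It assumes a single per-tensor scale and a signed range of [-7, 7]; the summary does not state the authors' exact quantizer, so the function names and the scaling rule are illustrative only.

```python
def quantize_symmetric_int4(values):
    """Map floats to signed 4-bit codes in [-7, 7].

    Assumption (not from the paper): one scale per tensor,
    chosen as max|x| / 7.
    """
    qmax = 7  # -8 is left unused so the grid stays symmetric around zero
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map 4-bit codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.70, -0.32, 0.11, 0.0]
codes, scale = quantize_symmetric_int4(weights)
restored = dequantize(codes, scale)
print(codes)  # [7, -3, 1, 0]
```

In W4A4 the same kind of mapping is applied to both weights and activations, so the matrix multiply itself can run on integer codes; the rounding error per value is at most half a quantization step.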
Problem Statement
Can full INT4 computation (weights and activations) be used for transformer inference to double hardware throughput and reduce latency without unacceptable quality loss? And how can fast, end-to-end INT4 inference be implemented on GPUs?
Main Contribution
System: an end-to-end, highly optimized INT4 encoder inference pipeline (CUTLASS kernels, fused quant/dequant, FlashAttention, CUDA graph).
Empirical: a broad QAT+KD study of W4A4 across model types, showing that encoder and encoder-decoder models tolerate W4A4 while decoder-only models do not.
Key Findings
Encoder models (BERT) keep accuracy under W4A4 QAT+KD.
Encoder-decoder models (BART) show only small quality drops under W4A4.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| BERT-base accuracy | 84.20 (FP32) → 84.31 (W4A4 symmetric) | FP32 | +0.11 | MNLI validation | Table 1 (BERT-base MNLI) | Table 1 |
| BART-base ROUGE-Lsum | 42.87 (FP32) → 41.92 (W4A4 symmetric) | FP32 | -0.95 | CNNDailyMail validation | Table 1 (BART-base) | Table 1 |
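The Delta column can be checked with a few lines of arithmetic. The sketch below recomputes the absolute deltas from the two reported score pairs and adds a relative change for context; the relative figures are derived here and do not appear in the source.

```python
rows = [
    # (label, FP32 score, W4A4 score), copied from the table above
    ("BERT-base MNLI accuracy", 84.20, 84.31),
    ("BART-base CNNDailyMail ROUGE-Lsum", 42.87, 41.92),
]


def score_delta(baseline, quantized):
    """Return (absolute delta in points, relative delta in percent)."""
    absolute = round(quantized - baseline, 2)
    relative_pct = round(100.0 * (quantized - baseline) / baseline, 2)
    return absolute, relative_pct


deltas = {label: score_delta(fp32, w4a4) for label, fp32, w4a4 in rows}
for label, (absolute, relative_pct) in deltas.items():
    print(f"{label}: {absolute:+.2f} points ({relative_pct:+.2f}% relative)")
```

Both absolute deltas match the table: +0.11 for BERT-base and -0.95 for BART-base.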
What To Try In 7 Days
Benchmark W4A4 encoder inference at a representative batch size × sequence length using the authors' INT4 pipeline or CUTLASS kernels.
If using BERT/BART for classification or summarization, run QAT+KD with W4A4 on a held-out dataset and measure quality vs FP16.
Do not quantize GPT activations to INT4 yet; test weight-only 4-bit quantization (W4) or mixed W4A8 as a safer step.
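For the benchmarking step, a small timing harness is usually enough to get a first latency comparison. The sketch below is a generic wall-clock harness, not the authors' tooling: `run_fp16_stub` and `run_int4_stub` are hypothetical placeholders to be replaced with real forward passes at the chosen batch size and sequence length. On a GPU, each timed call would also need to wait for the device to finish before the clock is read, otherwise only the kernel launch is measured.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, iters=20):
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):  # warm-up runs are not recorded
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms, candidate_ms):
    """Latency speedup of the candidate relative to the baseline."""
    return baseline_ms / candidate_ms


# Hypothetical stand-ins for real FP16 and INT4 inference calls.
def run_fp16_stub():
    sum(i * i for i in range(20000))


def run_int4_stub():
    sum(i * i for i in range(5000))


fp16_ms = median_latency_ms(run_fp16_stub)
int4_ms = median_latency_ms(run_int4_stub)
print(f"FP16 {fp16_ms:.3f} ms, INT4 {int4_ms:.3f} ms, "
      f"speedup {speedup(fp16_ms, int4_ms):.2f}x")
```

Using the median rather than the mean keeps one slow outlier run from skewing the comparison.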
Risks & Boundaries
Limitations
W4A4 fails or degrades decoding/generation (GPT) due to activation quantization sensitivity.
Results target NVIDIA Ampere GPUs and CUTLASS INT4; other hardware may not match gains.
When Not To Use
Autoregressive text generation (GPT-style) where generation quality matters.
Non-Ampere GPUs or hardware without efficient INT4 support.
Failure Modes
Activation quantization causes large early-token perplexity spikes in GPT (positional PPL gap >100 on early tokens).
Pretrained models can have wider activation ranges, which makes them harder to quantize than models trained from scratch.
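The effect of a wide activation range is easy to reproduce on toy data. The sketch below assumes symmetric per-tensor INT4 quantization with scale max|x| / 7, which is an illustrative choice and not necessarily the paper's quantizer. It compares the round-trip error of a narrow set of values with the same values plus one large outlier.

```python
def int4_roundtrip_error(values):
    """Mean absolute error after symmetric per-tensor INT4 quantization.

    Assumption: one scale for the whole tensor, scale = max|x| / 7.
    """
    qmax = 7
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


narrow = [0.1 * i for i in range(-7, 8)]  # values spanning [-0.7, 0.7]
wide = narrow + [20.0]                    # same values plus one outlier

narrow_error = int4_roundtrip_error(narrow)
wide_error = int4_roundtrip_error(wide)
print(f"narrow range error: {narrow_error:.4f}")
print(f"with outlier error: {wide_error:.4f}")
```

With the outlier present, the single scale stretches to cover 20.0, so every small value rounds to zero and its information is lost. With only 15 usable levels, 4-bit activations are far more exposed to this than 8-bit ones.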

