INT4 (4-bit) gives big latency wins for encoder models with little accuracy loss, but breaks decoder-only generators; optimized INT4 kernels

January 27, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Strong engineering and empirical results for encoder inference on Ampere GPUs. Applicability to decoder models is limited, and the speedups depend on GPU INT4 support through CUTLASS.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 40%

Authors

Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He


Why It Matters For Business

On Ampere GPUs, INT4 computation can sharply reduce latency and cost for encoder-based workloads (search, classification, embedding). But it is risky to use for autoregressive generation (chatbots, text generation) until activation-quantization problems are solved.

Summary TLDR

The paper shows 4-bit weight+activation (W4A4) quantization can keep accuracy for encoder (BERT) and encoder-decoder (BART) models while enabling large latency wins with optimized GPU kernels. Decoder-only models (GPT) suffer large quality loss from 4-bit activation quantization. The authors release a tuned INT4 encoder inference pipeline (CUTLASS-based) that achieves up to 8.5× latency speedup over FP16 and improves prior INT8 performance by up to 1.7×.
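
To make "W4A4 symmetric" concrete, the following is a minimal sketch of symmetric 4-bit quantization on a single row of values. It is an illustration and not the authors' implementation: the function names, the clipping range of [-7, 7], and the single scale per row are assumptions made for this example.

```python
def quantize_symmetric_int4(values):
    """Map floats to signed 4-bit codes in [-7, 7] using one shared scale.

    Assumption for this sketch: one scale per row, range [-7, 7].
    """
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero row, where any positive scale works.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.30, 0.12, 0.0]
codes, scale = quantize_symmetric_int4(weights)
restored = dequantize(codes, scale)
print(codes)     # integer codes, each representable in 4 bits
print(restored)  # approximation of the original weights
```

The rounding error of each value is at most half the scale, so rows with a few large outliers get a coarse grid for every other value. That is one reason wide activation ranges make 4-bit activation quantization hard.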

Problem Statement

Can full INT4 computation (weights and activations) be used for transformer inference to double hardware throughput and reduce latency without unacceptable quality loss? And how can fast, end-to-end INT4 inference be implemented on GPUs?

Main Contribution

System: an end-to-end, highly optimized INT4 encoder inference pipeline (CUTLASS kernels, fused quant/dequant, FlashAttention, CUDA graph).

Empirical: a broad QAT+KD study of W4A4 across model types, showing that encoder and encoder-decoder models tolerate W4A4 while decoder-only models do not.

Key Findings

Encoder models (BERT) keep accuracy under W4A4 QAT+KD.

Numbers: BERT-base MNLI 84.20 (FP32) → 84.31 (W4A4 symmetric)

Practical Use: You can deploy BERT with 4-bit weights+activations and expect no measurable drop on common classification tasks; try W4A4 to cut memory and unlock INT4 kernels.

Evidence Ref: Table 1; Section 3.2

Encoder-decoder models (BART) show only small quality drops under W4A4.

Numbers: BART-base RLsum 42.87 (FP32) → 41.92 (W4A4 symmetric), drop ≤1 point

Practical Use: W4A4 is viable for summarization with minor quality loss; evaluate on your summarization dataset before full rollout.

Evidence Ref: Table 1; Section 3.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 84.20 (FP32) → 84.31 (W4A4 symmetric) | FP32 | +0.11 | MNLI validation | Table 1 (BERT-base MNLI) | Table 1
BART-base Rouge Lsum | 42.87 (FP32) → 41.92 (W4A4 symmetric) | FP32 | -0.95 | CNNDailyMail validation | Table 1 (BART-base) | Table 1

What To Try In 7 Days

Benchmark W4A4 encoder inference on a representative bs×seq using the authors' INT4 pipeline or CUTLASS kernels.

If using BERT/BART for classification or summarization, run QAT+KD with W4A4 on a held-out dataset and measure quality vs FP16.

Do not quantize GPT activations to INT4 yet; test weight-only 4-bit quantization (w4) or mixed W4A8 as a safer step.
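
The benchmarking step can be organized with a small timing harness such as the sketch below. It is a generic illustration, not part of the authors' pipeline: `benchmark`, `speedup`, and the `run_inference(batch_size, seq_len)` callable are hypothetical names. The callable is assumed to block until the GPU work has finished (in PyTorch, for example, by calling `torch.cuda.synchronize()` before returning), otherwise the timings only measure kernel launches.

```python
import statistics
import time


def benchmark(run_inference, batch_sizes, seq_lens, warmup=3, iters=10):
    """Return the median latency in seconds for each (batch_size, seq_len) pair.

    run_inference is a hypothetical callable supplied by the user; it must
    run one forward pass and return only after the work is complete.
    """
    results = {}
    for bs in batch_sizes:
        for seq in seq_lens:
            # Warm-up runs absorb one-time costs such as kernel selection.
            for _ in range(warmup):
                run_inference(bs, seq)
            timings = []
            for _ in range(iters):
                start = time.perf_counter()
                run_inference(bs, seq)
                timings.append(time.perf_counter() - start)
            results[(bs, seq)] = statistics.median(timings)
    return results


def speedup(baseline, candidate):
    """Per-shape speedup of candidate over baseline (values above 1 are faster)."""
    return {shape: baseline[shape] / candidate[shape] for shape in baseline}
```

Running `benchmark` once with an FP16 callable and once with an INT4 callable, then passing both result dictionaries to `speedup`, gives a per-shape comparison on your own bs×seq grid.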

Optimization Features

Token Efficiency: Token-wise dynamic activation quantization (min/max per token)
Infra Optimization: Targeted to NVIDIA Ampere GPUs (A6000); relies on CUTLASS INT4 support
Model Optimization: W4A4 quantization (weights+activations) via QAT+KD; group-wise row quantization for weights
System Optimization: Pre-tuned GEMM schedules with CUTLASS profiler; packing INT4 into INT8 tensors for current PyTorch support
Training Optimization: Quantization-aware training with knowledge distillation; exhaustive hyperparameter search per model
Inference Optimization: Custom CUTLASS INT4 GEMM kernels; fused quantize/dequantize kernels to avoid extra memory traffic; FlashAttention integration for FP16 attention; CUDA graph to reduce kernel launch overhead; per-GEMM tunable quantization strategy (modular enable/disable)
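
Two of the items above, token-wise min/max activation quantization and packing INT4 values into INT8 storage, can be sketched in plain Python. This is an illustrative sketch and not the authors' kernels: the function names are made up for the example, the asymmetric min/max mapping to codes 0..15 is an assumption about how "min/max per token" is applied, and the low-nibble-first packing order is likewise an assumption.

```python
def quantize_token_minmax(token):
    """Asymmetric 4-bit quantization of one token's activations.

    The scale and offset come from this token's own min and max, so they
    are recomputed for every token at inference time (dynamic quantization).
    """
    lo, hi = min(token), max(token)
    scale = (hi - lo) / 15 if hi > lo else 1.0  # 16 levels: codes 0..15
    codes = [max(0, min(15, round((x - lo) / scale))) for x in token]
    return codes, scale, lo


def dequantize_token(codes, scale, lo):
    """Recover approximate activations from codes, scale, and offset."""
    return [c * scale + lo for c in codes]


def pack_int4(codes):
    """Pack pairs of 4-bit codes into bytes, low nibble first (assumed order)."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )


def unpack_int4(packed, count):
    """Inverse of pack_int4; count drops any padding nibble."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes[:count]


token = [0.0, 1.5, 3.0, -1.5]
codes, scale, lo = quantize_token_minmax(token)
packed = pack_int4(codes)
assert unpack_int4(packed, len(codes)) == codes
```

Packing halves the storage relative to one code per byte, which is why a framework without a native 4-bit tensor type can still hold INT4 data in INT8 tensors.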

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Public datasets (MNLI, QQP, CNNDailyMail, XSum, PTB, Wikitext-2/103)

Risks & Boundaries

Limitations

W4A4 fails or degrades decoding/generation (GPT) due to activation quantization sensitivity.

Results target NVIDIA Ampere GPUs and CUTLASS INT4; other hardware may not match gains.

When Not To Use

Autoregressive text generation (GPT-style) where generation quality matters.

Non-Ampere GPUs or hardware without efficient INT4 support.

Failure Modes

Activation quantization causes large early-token perplexity spikes in GPT (positional PPL gap >100 on early tokens).

Pretrained models can have wider activation ranges, which makes them harder to quantize than models trained from scratch.
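
To check for the early-token spike on your own model, one option is to compare perplexity position by position between the baseline and the quantized model. The sketch below assumes positional perplexity means the exponential of the mean negative log-likelihood at each token position. That definition, the function names, and the input layout (one list of per-token losses in nats per sequence) are assumptions for illustration and may differ from the paper's exact procedure.

```python
import math


def positional_perplexity(token_nlls):
    """Perplexity at each position, averaged over sequences that reach it.

    token_nlls: list of sequences, each a list of per-token negative
    log-likelihoods in nats (assumed input layout for this sketch).
    """
    max_len = max(len(seq) for seq in token_nlls)
    ppl = []
    for pos in range(max_len):
        losses = [seq[pos] for seq in token_nlls if len(seq) > pos]
        ppl.append(math.exp(sum(losses) / len(losses)))
    return ppl


def positional_gap(baseline_nlls, quantized_nlls):
    """Quantized minus baseline perplexity at each shared position."""
    base = positional_perplexity(baseline_nlls)
    quant = positional_perplexity(quantized_nlls)
    return [q - b for b, q in zip(base, quant)]
```

A gap that is large at the first few positions and shrinks later would match the failure mode described above.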

Core Entities

Models

BERT-base, BERT-large, BART-base, BART-large, GPT2-base, GPT2-medium

Metrics

Accuracy, F1, Rouge Lsum, Perplexity, Latency speedup

Datasets

MNLI, QQP, CNNDailyMail, XSum, PTB, Wikitext-2, Wikitext-103, GLUE

Benchmarks

GLUE; Summarization (CNNDailyMail, XSum); Causal language modeling (PTB, Wikitext)