Practical guide to compressing Transformers: quantization, pruning, distillation and efficient architectures

February 5, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The survey aggregates many empirical results and practical recipes; recommendations are grounded in cited benchmarks but the field moves quickly.

Citations: 13

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 85%

Production readiness: 75%

Novelty: 40%

Authors

Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao

Why It Matters For Business

Compression makes large Transformers affordable to run and store: use post-training quantization and structured pruning for immediate cost and latency gains without full retraining.

Who Should Care

Summary TLDR

This paper surveys methods to make Transformer models smaller and faster. It covers four main classes: quantization (lower-precision numbers), pruning (removing weights, heads, or tokens), knowledge distillation (training small models from big ones), and efficient architectures (attention alternatives and state-space models, SSMs). The survey compares practical results (ImageNet, WikiText2, GLUE), highlights training-cost trade-offs, and recommends combining light retraining with post-training tools for large models.

Problem Statement

Large Transformers (LLMs and LVMs) are costly to store and run. Direct retraining is usually infeasible for billion-parameter models, so practical compression methods must cut memory and compute while keeping accuracy and limiting retraining cost.

Main Contribution

Comprehensive taxonomy of Transformer compression methods: quantization, pruning, distillation, and efficient architectures.

Side-by-side summaries and representative methods for NLP and vision (tables with ImageNet, WikiText2, GLUE results).

Key Findings

8-bit and 6-bit post-training quantization often works well, but extreme low-bit (4-bit or below) frequently degrades performance.

Numbers: Table 2: ViT-B top-1 84.54 (FP) vs 8-bit PTQ 76.98, 6-bit PTQ 75.26/81.65 depending on method

Practical Use: Use PTQ for 8/6-bit; expect to need QAT or specialized PTQ for reliable <=4-bit results.

Evidence Ref: Table 2
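
A minimal sketch of why error tends to stay small at 8 bits and grow quickly toward 4 bits, assuming plain round-to-nearest symmetric quantization with a single absmax scale. This is an illustration only, not PTQ-ViT or any other method from Table 2; the synthetic Gaussian weights and the helper names quantize_symmetric and mse are assumptions made for the example. It measures weight reconstruction error, not model accuracy.

```python
import random

def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric uniform quantization, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))   # integer level, clipped
        out.append(q * scale)                          # back to float
    return out

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
# Assumption: synthetic Gaussian "weights" stand in for a real layer.
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

errors = {bits: mse(weights, quantize_symmetric(weights, bits)) for bits in (8, 6, 4)}
for bits, err in errors.items():
    print(f"{bits}-bit reconstruction MSE: {err:.6f}")
```

Each bit removed roughly doubles the quantization step, so the squared reconstruction error grows by roughly 4x per bit under this scheme. That is one reason the low-bit regime usually needs QAT or a more careful PTQ method than this sketch.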

Careful per-channel scaling and outlier handling recover much of quantized LLM accuracy.

Numbers: Table 3: LLaMA 7B FP16 PPL 5.68; OmniQuant 6/6 PPL 5.96; OS+ 6/6 PPL 5.76

Practical Use: When quantizing LLMs, apply per-channel scaling and outlier suppression before expensive retraining.

Evidence Ref: Table 3
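
A minimal sketch of the per-channel scaling idea, assuming a toy weight matrix in which one output channel has much larger magnitudes than the rest. It is not an implementation of OmniQuant or OS+; the matrix, the outlier channel, and the helper names absmax_scale, fake_quant and matrix_mse are hypothetical and chosen for illustration. It compares one shared scale against one scale per channel.

```python
import random

def absmax_scale(values, bits):
    """Scale that maps the largest magnitude onto the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    largest = max(abs(v) for v in values)
    return largest / qmax if largest > 0 else 1.0

def fake_quant(values, scale, bits):
    """Quantize to signed integers with the given scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def matrix_mse(a, b):
    """Mean squared error over all entries of two same-shaped matrices."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

random.seed(0)
BITS = 8
# Assumption: 8 output channels x 256 inputs with small-magnitude weights...
weights = [[random.gauss(0.0, 0.02) for _ in range(256)] for _ in range(8)]
# ...except one outlier channel with about 100x larger magnitude.
weights[3] = [random.gauss(0.0, 2.0) for _ in range(256)]

# Per-tensor: a single scale, dominated by the outlier channel.
tensor_scale = absmax_scale([w for row in weights for w in row], BITS)
per_tensor = [fake_quant(row, tensor_scale, BITS) for row in weights]

# Per-channel: each output channel gets its own scale.
per_channel = [fake_quant(row, absmax_scale(row, BITS), BITS) for row in weights]

per_tensor_err = matrix_mse(weights, per_tensor)
per_channel_err = matrix_mse(weights, per_channel)
print(f"per-tensor MSE:  {per_tensor_err:.8f}")
print(f"per-channel MSE: {per_channel_err:.8f}")
```

With a shared scale, the outlier channel forces a coarse step that rounds most of the small channels toward zero; per-channel scales avoid that at the cost of storing one scale per channel.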

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 84.54 (FP16) | n/a | n/a | ImageNet-1k | Table 2 full-precision baseline | Table 2
Accuracy | 76.98 (8-bit PTQ-ViT) | 84.54 (FP) | -7.56 | ImageNet-1k | Table 2 PTQ-ViT result | Table 2

What To Try In 7 Days

Run INT8 PTQ (SmoothQuant/GPTQ/OmniQuant) on your FP16 model and measure latency/memory.

Apply structured pruning to create a hardware-friendly smaller model shape and test throughput.

Distill a smaller student using teacher outputs for your key task to cut inference cost quickly.
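
A minimal sketch of the distillation step, assuming the common soft-label formulation (temperature-softened teacher and student distributions compared with KL divergence, mixed with the usual hard-label cross-entropy). The survey covers several distillation variants; this is only one of them, and the function names, the temperature of 2.0, and the mixing weight alpha are assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft term: how far the student's softened distribution is from the teacher's.
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Hard term: ordinary cross-entropy against the ground-truth label.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard

# Hypothetical logits for a 3-class task; class 0 is the true label.
teacher = [4.0, 1.0, 0.5]
student_close = [3.5, 1.2, 0.4]
student_far = [0.5, 1.0, 4.0]

loss_close = distillation_loss(student_close, teacher, label=0)
loss_far = distillation_loss(student_far, teacher, label=0)
print(f"student close to teacher: {loss_close:.4f}")
print(f"student far from teacher: {loss_far:.4f}")
```

In a real setup the same loss would be computed on batches inside a training loop with an autodiff framework; the pure-Python version here only shows how the two terms are combined.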

Optimization Features

Token Efficiency
dynamic context/token pruning; sparse attention and global+local attention
Infra Optimization
use of FasterTransformer, Megatron, DeepSpeed for large-model throughput
Model Optimization
quantization (PTQ and QAT); structured and unstructured pruning; knowledge distillation; efficient architecture (SSMs, linear attention, MoE)
System Optimization
mixed-precision and per-channel scale migration; hardware-aware architecture choices (Mamba, RetNet, RWKV)
Training Optimization
LoRA; post-training calibration with limited data; cross-block reconstruction for PTQ
Inference Optimization
INT8 inference; FlashAttention and IO-aware kernels; speculative sampling for decoding; early exiting and token pruning

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Survey summarizes many methods but cannot exhaustively benchmark them under the same setup.

Performance numbers come from original papers with heterogeneous protocols and hardware.

When Not To Use

If you require exact, reproducible low-level implementation details (this is a survey, not a code release).

If you need definitive single-benchmark comparisons (differences in evaluation protocol limit direct comparison).

Failure Modes

Aggressive low-bit quantization (<=4-bit) can cause large accuracy drops unless carefully tuned or retrained.

Unstructured sparsity reduces memory but may not give latency gains on commodity hardware.
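
A minimal sketch of the difference between the two pruning styles, assuming simple magnitude criteria (smallest absolute values for unstructured pruning, smallest row norms for structured pruning). The function names and the toy 8 x 16 matrix are hypothetical; real pruning methods in the survey use more elaborate importance scores.

```python
import random

def unstructured_prune(matrix, sparsity):
    """Zero the smallest-magnitude entries; the matrix shape does not change."""
    flat = sorted(abs(w) for row in matrix for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in matrix]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]

def structured_prune_rows(matrix, keep_ratio):
    """Drop whole rows (output neurons) with the smallest squared L2 norm."""
    n_keep = max(1, int(len(matrix) * keep_ratio))
    norms = [(sum(w * w for w in row), i) for i, row in enumerate(matrix)]
    keep = sorted(i for _, i in sorted(norms, reverse=True)[:n_keep])
    # In a real network the next layer's matching input columns must be removed too.
    return [matrix[i][:] for i in keep]

random.seed(0)
dense = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]

masked = unstructured_prune(dense, sparsity=0.5)
smaller = structured_prune_rows(dense, keep_ratio=0.5)

zeros = sum(1 for row in masked for w in row if w == 0.0)
print(f"unstructured: shape {len(masked)}x{len(masked[0])}, zeros {zeros}")
print(f"structured:   shape {len(smaller)}x{len(smaller[0])}")
```

The unstructured result is still an 8 x 16 matrix, so a dense matrix multiply does the same amount of work unless the hardware or kernel exploits the zeros. The structured result is a genuinely smaller matrix that any dense kernel processes faster.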

Core Entities

Models

Transformer, ViT, DeiT, LLaMA, GPT-3, GPT-4, OPT, S4, Hyena, RetNet, RWKV, Mamba

Metrics

Accuracy, Perplexity (PPL), GLUE score, Inference latency, Speedup ratio, Compression rate (sparsity %)

Datasets

ImageNet-1k, WikiText2, GLUE, ImageNet-21k, BookCorpus

Benchmarks

Accuracy, WikiText2 perplexity, GLUE average score

Context Entities

Models

BERT, DistilBERT, TinyBERT, MobileBERT, PaLM, Swin, CLIP, BLIP

Metrics

Memory footprint (GB), FLOPs, Throughput

Datasets

ImageNet-1k/21k, Common Crawl (for LLM pretraining)

Benchmarks

GLUE tasks, SQuAD2