Practical guide to compressing Transformers: quantization, pruning, distillation and efficient architectures

February 5, 2024 · 7 min

Overview

Production Readiness

0.75

Novelty Score

0.4

Cost Impact Score

0.85

Citation Count

13

Authors

Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao

Why It Matters For Business

Compression makes large Transformers affordable to run and store: use post-training quantization and structured pruning for immediate cost and latency gains without full retraining.

Summary TLDR

This paper surveys methods to make Transformer models smaller and faster. It covers four main classes—quantization (lower-precision numbers), pruning (remove weights/heads/tokens), knowledge distillation (train small models from big ones), and efficient architectures (attention alternatives and SSMs). The survey compares practical results (ImageNet, WikiText2, GLUE), highlights training-cost trade-offs, and recommends combining light retraining with post-training tools for large models.

Problem Statement

Large Transformers (LLMs and LVMs) are costly to store and run. Direct retraining is usually infeasible for billion-parameter models, so practical compression methods must cut memory and compute while keeping accuracy and limiting retraining cost.

Main Contribution

Comprehensive taxonomy of Transformer compression methods: quantization, pruning, knowledge distillation, and efficient architecture design.

Side-by-side summaries and representative methods for NLP and vision (tables with ImageNet, WikiText2, GLUE results).

Practical discussion of training-efficiency, how methods combine, and future directions like SSMs and low-cost post-training pipelines.

Key Findings

8-bit and 6-bit post-training quantization often works well, although the accuracy retained depends heavily on the method used; extreme low-bit quantization (4-bit or below) frequently degrades performance.

Numbers (Table 2): ViT-B top-1 accuracy is 84.54 (FP) vs. 76.98 with 8-bit PTQ; 6-bit PTQ gives 75.26 or 81.65 depending on the method.
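To make the bit-width trade-off concrete, here is a minimal sketch of round-to-nearest symmetric uniform quantization, assuming a single shared scale per tensor. It is a toy illustration on synthetic Gaussian values, not the PTQ procedure of any method in the survey; the function names are made up for this example.

```python
import random

def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization with one shared scale (toy sketch)."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    # Map to integers, clamp to the representable range, then map back to floats.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
# Synthetic stand-in for a weight tensor; real weights are an assumption away.
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

errors = {}
for bits in (8, 6, 4, 2):
    errors[bits] = mean_squared_error(weights, quantize_symmetric(weights, bits))
    print(f"{bits}-bit reconstruction MSE: {errors[bits]:.6f}")
```

The reconstruction error grows as the bit width shrinks because the same value range is covered by fewer levels, which is one intuition for why very low-bit settings need more careful handling.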

Careful per-channel scaling and outlier handling recover much of quantized LLM accuracy.

Numbers (Table 3): LLaMA 7B perplexity is 5.68 in FP16; OmniQuant 6/6 reaches 5.96 and OS+ 6/6 reaches 5.76.
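The effect of per-channel scaling can be seen in a small sketch. It assumes a toy two-channel tensor in which one channel has a much larger range (standing in for an outlier channel) and compares one shared scale against one scale per channel. All names and values are illustrative and do not reproduce OmniQuant or OS+.

```python
def scale_for(values, bits):
    """Symmetric scale that maps the largest magnitude to the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize(values, bits, scale):
    """Quantize-dequantize with a given scale (round to nearest, clamped)."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

bits = 6
channels = [
    [0.01 * i for i in range(-50, 51)],  # small-range channel, values in [-0.5, 0.5]
    [0.4 * i for i in range(-50, 51)],   # large-range "outlier" channel, values in [-20, 20]
]

# Per-tensor: one scale, dictated by the outlier channel.
shared_scale = scale_for([v for ch in channels for v in ch], bits)
per_tensor_err = [mse(ch, quantize(ch, bits, shared_scale)) for ch in channels]

# Per-channel: each channel gets its own scale.
per_channel_err = [mse(ch, quantize(ch, bits, scale_for(ch, bits))) for ch in channels]

for i in range(len(channels)):
    print(f"channel {i}: per-tensor MSE {per_tensor_err[i]:.6f}, "
          f"per-channel MSE {per_channel_err[i]:.6f}")
```

With a shared scale the small-range channel is squeezed into only a few levels, while its own scale restores resolution; the outlier channel is unaffected because it set the shared scale in the first place.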

Pruning large LMs can reach very high sparsity but gives limited wall-clock speedups unless structured pruning is used.

Numbers (Table 6): SparseGPT at 50–60% sparsity gives a 1.54–1.79× GPU speedup; structured pruning yields ~1.18× at 20% compression.
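The difference between the two pruning styles can be sketched on a toy matrix. The example below assumes simple magnitude-based criteria (smallest absolute value for unstructured pruning, smallest row norm for structured pruning); these are illustrative choices, not the SparseGPT algorithm or any specific method from the survey.

```python
def prune_unstructured(matrix, sparsity):
    """Zero the smallest-magnitude entries; the matrix keeps its shape.
    Ties at the threshold may zero slightly more than the requested fraction."""
    flat = sorted(abs(v) for row in matrix for v in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in matrix]
    threshold = flat[k - 1]
    return [[0.0 if abs(v) <= threshold else v for v in row] for row in matrix]

def prune_structured_rows(matrix, keep_ratio):
    """Drop whole rows with the smallest squared L2 norm; returns a smaller dense matrix."""
    n_keep = max(1, round(len(matrix) * keep_ratio))
    ranked = sorted(range(len(matrix)),
                    key=lambda i: sum(v * v for v in matrix[i]),
                    reverse=True)
    keep = sorted(ranked[:n_keep])  # preserve original row order
    return [matrix[i][:] for i in keep]

W = [
    [0.9, -0.1, 0.05, 0.7],
    [0.02, 0.03, -0.01, 0.04],
    [-0.8, 0.6, 0.2, -0.3],
    [0.15, -0.25, 0.5, 0.35],
]

sparse = prune_unstructured(W, sparsity=0.5)
compact = prune_structured_rows(W, keep_ratio=0.75)

print("unstructured shape:", len(sparse), "x", len(sparse[0]))    # still 4 x 4
print("structured shape:  ", len(compact), "x", len(compact[0]))  # 3 x 4
```

The unstructured result has the same shape as the original, so a dense kernel does the same amount of work unless the hardware or library exploits the zeros. The structured result is a genuinely smaller dense matrix, which is why it tends to translate more directly into throughput.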

Knowledge distillation yields big inference speedups with modest accuracy loss on standard NLP benchmarks.

Numbers (Table 4): DistilBERT gives a ~3× speedup; TinyBERT gives a ~9.4× speedup with a GLUE score of 76.5 vs. 79.6 for BERT base.
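As a rough illustration of what "training a student from teacher outputs" means, the sketch below computes a generic soft-target distillation loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. This is a common textbook formulation and is an assumption here; DistilBERT and TinyBERT each use their own, richer training objectives.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, times temperature squared."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return kl * temperature ** 2

# Hypothetical logits for a 3-class task.
teacher = [1.0, 2.0, 3.0]
good_student = [0.9, 2.1, 2.9]
poor_student = [3.0, 2.0, 1.0]

print("close student loss:", distillation_loss(good_student, teacher))
print("poor student loss: ", distillation_loss(poor_student, teacher))
```

In practice this term is usually combined with the ordinary task loss on ground-truth labels; the mixing weight and temperature are hyperparameters.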

Results

ViT-B top-1 accuracy (full precision)

Value: 84.54 (FP16)

ViT-B top-1 accuracy after 8-bit PTQ

Value: 76.98 (8-bit PTQ-ViT)

Baseline: 84.54 (FP)

WikiText2 perplexity (LLaMA 7B)

Value: 5.68 (FP16)

WikiText2 perplexity after OmniQuant

Value: 5.96 (6-bit weights/6-bit activations)

Baseline: 5.68 (FP16)

Pruning sparsity and speedup (OPT-175B via SparseGPT)

Value: 50–60% sparsity -> 1.54–1.79× GPU speedup

Who Should Care

What To Try In 7 Days

Run INT8 PTQ (SmoothQuant/GPTQ/OmniQuant) on your FP16 model and measure latency/memory.

Apply structured pruning to create a hardware-friendly smaller model shape and test throughput.

Distill a smaller student using teacher outputs for your key task to cut inference cost quickly.
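Before running the INT8 experiment from the first item, it can help to estimate the expected weight-memory savings. The sketch below is simple arithmetic under stated assumptions: it counts weights only, uses a nominal 7 billion parameters for a "7B" model, and ignores activations, the KV cache, and the small overhead of quantization scales.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB (weights only, no runtime overhead)."""
    return num_params * bits_per_weight / 8 / 2**30

num_params = 7e9  # nominal parameter count, an assumption for illustration

for label, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"{label}: {weight_memory_gib(num_params, bits):.2f} GiB")
```

Halving the bit width halves the weight footprint, so FP16 to INT8 takes the nominal 7B model from roughly 13 GiB of weights to roughly 6.5 GiB. Measured memory will be higher once runtime buffers are included.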

Optimization Features

Token Efficiency

  • dynamic context/token pruning
  • sparse attention and global+local attention
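A minimal sketch of the token-pruning idea from the list above, assuming each token already has an importance score (for example, the attention it receives; how scores are obtained is method-specific and not shown). The function name and the scores are hypothetical.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return [tokens[i] for i in sorted(top)]

tokens = ["[CLS]", "the", "model", "is", "very", "large"]
scores = [0.30, 0.05, 0.25, 0.04, 0.06, 0.30]  # made-up importance scores

kept = prune_tokens(tokens, scores, keep_ratio=0.5)
print(kept)  # ['[CLS]', 'model', 'large']
```

Later layers then process only the kept tokens, which reduces attention cost since that cost grows with sequence length.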

Infra Optimization

  • use of FasterTransformer, Megatron, and DeepSpeed for large-model throughput

Model Optimization

  • quantization (PTQ and QAT)
  • structured and unstructured pruning
  • knowledge distillation
  • efficient architecture (SSMs, linear attention, MoE)

System Optimization

  • mixed-precision and per-channel scale migration
  • hardware-aware architecture choices (Mamba, RetNet, RWKV)
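The per-channel scale migration entry above can be sketched as follows. The example assumes the SmoothQuant-style rule s_j = max|X_j|^α / max|W_j|^(1-α), dividing activation channel j by s_j and multiplying the matching weight row by s_j so that the product is unchanged. Values and names are illustrative only.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def migrate_scales(x, w, alpha=0.5):
    """Shift activation range into the weights, channel by channel."""
    n_ch = len(w)
    scales = []
    for j in range(n_ch):
        act_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        ok = act_max > 0 and w_max > 0
        scales.append(act_max ** alpha / w_max ** (1 - alpha) if ok else 1.0)
    x_s = [[row[j] / scales[j] for j in range(n_ch)] for row in x]
    w_s = [[v * scales[j] for v in w[j]] for j in range(n_ch)]
    return x_s, w_s, scales

# Toy activations (3 tokens x 2 channels); channel 1 carries large outliers.
x = [[0.5, 40.0], [-0.3, -35.0], [0.2, 50.0]]
w = [[0.4, -0.6], [0.5, 0.3]]  # 2 channels x 2 outputs

x_s, w_s, scales = migrate_scales(x, w)

print("activation max per channel before:", [max(abs(r[j]) for r in x) for j in range(2)])
print("activation max per channel after: ", [max(abs(r[j]) for r in x_s) for j in range(2)])
print("output unchanged:", all(abs(p - q) < 1e-9
      for r1, r2 in zip(matmul(x, w), matmul(x_s, w_s)) for p, q in zip(r1, r2)))
```

After migration the outlier activation channel has a much smaller range, which makes the activations easier to quantize, at the cost of a somewhat wider weight range.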

Training Optimization

  • LoRA
  • post-training calibration with limited data
  • cross-block reconstruction for PTQ
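For the LoRA entry above, a short calculation shows why it keeps retraining cheap. LoRA freezes a d × k weight matrix and trains a low-rank update made of a d × r matrix and an r × k matrix. The layer size and rank below are assumptions chosen for illustration.

```python
def full_params(d, k):
    """Trainable parameters when fine-tuning the full d x k matrix."""
    return d * k

def lora_params(d, k, r):
    """Trainable parameters for a rank-r update: a d x r and an r x k matrix."""
    return r * (d + k)

d, k, r = 4096, 4096, 8  # hypothetical layer size and rank

print("full fine-tuning:", full_params(d, k))     # 16777216
print("LoRA (rank 8):   ", lora_params(d, k, r))  # 65536
print(f"trainable fraction: {lora_params(d, k, r) / full_params(d, k):.2%}")  # 0.39%
```

For this layer the low-rank update trains well under one percent of the parameters that full fine-tuning would touch, which is why it pairs naturally with compressed models that need a light recovery step.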

Inference Optimization

  • INT8 inference
  • FlashAttention and IO-aware kernels
  • speculative sampling for decoding
  • early exiting and token pruning

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Survey summarizes many methods but cannot exhaustively benchmark them under the same setup.
  • Performance numbers come from original papers with heterogeneous protocols and hardware.
  • Rapidly evolving tools (post-2023) may change best-practice recommendations quickly.

When Not To Use

  • If you require exact reproducible low-level implementation details—this is a survey, not a code release.
  • If you need definitive single-benchmark comparisons—differences in eval protocol limit direct comparison.

Failure Modes

  • Aggressive low-bit quantization (<=4-bit) can cause large accuracy drops unless carefully tuned or retrained.
  • Unstructured sparsity reduces memory but may not give latency gains on commodity hardware.
  • Distillation can inherit teacher biases and may fail if teacher outputs are unavailable (closed API).

Core Entities

Models

  • Transformer
  • ViT
  • DeiT
  • LLaMA
  • GPT-3
  • GPT-4
  • OPT
  • S4
  • Hyena
  • RetNet
  • RWKV
  • Mamba

Metrics

  • Accuracy
  • Perplexity (PPL)
  • GLUE score
  • Inference latency
  • Speedup ratio
  • Compression rate (sparsity %)

Datasets

  • ImageNet-1k
  • WikiText2
  • GLUE
  • ImageNet-21k
  • BookCorpus

Benchmarks

  • ImageNet top-1 accuracy
  • WikiText2 perplexity
  • GLUE average score

Context Entities

Models

  • BERT
  • DistilBERT
  • TinyBERT
  • MobileBERT
  • PaLM
  • Swin
  • CLIP
  • BLIP

Metrics

  • Memory footprint (GB)
  • FLOPs
  • Throughput

Datasets

  • ImageNet-1k/21k
  • Common Crawl (for LLM pretrain)

Benchmarks

  • GLUE tasks
  • SQuAD2