Practical guide to compressing Transformers: quantization, pruning, distillation and efficient architectures

Overview

Decision Snapshot: Ready For Pilot

The survey aggregates many empirical results and practical recipes; recommendations are grounded in cited benchmarks but the field moves quickly.

Citations: 13

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 85%

Production readiness: 75%

Novelty: 40%

Authors

Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao

Links

Abstract / PDF

Why It Matters For Business

Compression makes large Transformers affordable to run and store: use post-training quantization and structured pruning for immediate cost and latency gains without full retraining.

Who Should Care

ML Engineer, Product Manager, CTO, Founder

Summary TLDR

This paper surveys methods to make Transformer models smaller and faster. It covers four main classes: quantization (lower-precision numbers), pruning (removing weights, heads, or tokens), knowledge distillation (training small models from big ones), and efficient architectures (attention alternatives and SSMs). The survey compares practical results (ImageNet, WikiText2, GLUE), highlights training-cost trade-offs, and recommends combining light retraining with post-training tools for large models.

Problem Statement

Large Transformers (LLMs and LVMs) are costly to store and run. Direct retraining is usually infeasible for billion-parameter models, so practical compression methods must cut memory and compute while keeping accuracy and limiting retraining cost.

Main Contribution

Comprehensive taxonomy of Transformer compression methods: quantization, pruning, distillation, architecture.

Side-by-side summaries and representative methods for NLP and vision (tables with ImageNet, WikiText2, GLUE results).

Key Findings

8-bit and 6-bit post-training quantization often works well, but extreme low-bit (4-bit or below) frequently degrades performance.

Numbers: Table 2, ViT-B top-1 accuracy: 84.54 (FP) vs. 76.98 with 8-bit PTQ and 75.26 or 81.65 with 6-bit PTQ, depending on method

Practical Use: Use PTQ for 8/6-bit; expect to need QAT or specialized PTQ for reliable <=4-bit results.

Evidence Ref: Table 2
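To make the bit-width trade-off concrete, here is a minimal sketch of symmetric uniform quantization applied to synthetic Gaussian weights. It is not any method from the survey's Table 2; the function names, the random weights, and the choice of mean squared error as the measure are all assumptions made for illustration.

```python
import random

def quantize_dequantize(values, bits):
    """Symmetric uniform quantization to signed `bits`-bit integers, then back to float."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit, 31 for 6-bit, 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale for the whole tensor
    out = []
    for v in values:
        q = round(v / scale)                        # nearest integer level
        q = max(-qmax, min(qmax, q))                # clamp to the representable range
        out.append(q * scale)                       # dequantize
    return out

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical stand-in for a weight tensor: 4096 samples from N(0, 1).
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

errors = {
    bits: mean_squared_error(weights, quantize_dequantize(weights, bits))
    for bits in (8, 6, 4)
}
for bits, err in errors.items():
    print(f"{bits}-bit round-trip MSE: {err:.6f}")
```

Each bit removed roughly doubles the step size, so the round-trip error grows quickly as the bit-width falls. Reconstruction error on weights is only a proxy; the accuracy figures above come from the original papers, not from a script like this.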

Careful per-channel scaling and outlier handling recover much of quantized LLM accuracy.

Numbers: Table 3, LLaMA 7B: FP16 PPL 5.68; OmniQuant 6/6 PPL 5.96; OS+ 6/6 PPL 5.76

Practical Use: When quantizing LLMs, apply per-channel scaling and outlier suppression before expensive retraining.

Evidence Ref: Table 3
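The effect of per-channel scaling can be sketched on a toy matrix in which one channel holds outliers. This is only an illustration of the general idea, not the OmniQuant or OS+ procedure; the matrix values and helper names are hypothetical.

```python
def quantize_dequantize(values, scale, qmax):
    """Round to the nearest integer level, clamp, and map back to float."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def per_tensor(matrix, bits=8):
    """One scale shared by every channel (row) of the matrix."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for row in matrix for v in row) / qmax) or 1.0
    return [quantize_dequantize(row, scale, qmax) for row in matrix]

def per_channel(matrix, bits=8):
    """A separate scale for each channel (row)."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in matrix:
        scale = (max(abs(v) for v in row) / qmax) or 1.0
        out.append(quantize_dequantize(row, scale, qmax))
    return out

def mse(a, b):
    pairs = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)

# Hypothetical weights: channel 0 is small, channel 1 contains large outliers.
W = [
    [0.010, -0.020, 0.015, 0.005],
    [50.0, -30.0, 10.0, 5.0],
]

err_tensor = mse(W, per_tensor(W))
err_channel = mse(W, per_channel(W))
print(f"per-tensor MSE:  {err_tensor:.8f}")
print(f"per-channel MSE: {err_channel:.8f}")
```

With a single shared scale, the outlier channel sets the step size and the small channel collapses to zero. Giving each channel its own scale keeps the small channel representable, which is the intuition behind the practical advice above.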

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	84.54 (FP16)	—	—	ImageNet-1k	Table 2 full-precision baseline	Table 2
Accuracy	76.98 (8-bit PTQ-ViT)	84.54 (FP)	-7.56	ImageNet-1k	Table 2 PTQ-ViT result	Table 2
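The Delta column is the quantized value minus the full-precision baseline. The small sketch below recomputes it for the PTQ-ViT row and also expresses it as a relative change; the helper names are illustrative, and the relative figure is derived here rather than reported in the survey.

```python
def accuracy_delta(value, baseline):
    """Absolute change in percentage points (value minus baseline)."""
    return round(value - baseline, 2)

def relative_change_pct(value, baseline):
    """Change relative to the baseline, in percent."""
    return round(100.0 * (value - baseline) / baseline, 2)

# Table 2 figures quoted above: ViT-B on ImageNet-1k.
baseline_fp = 84.54
ptq_vit_8bit = 76.98

delta = accuracy_delta(ptq_vit_8bit, baseline_fp)
relative = relative_change_pct(ptq_vit_8bit, baseline_fp)
print(f"delta: {delta} points, relative change: {relative}%")
```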

What To Try In 7 Days

Run INT8 PTQ (SmoothQuant/GPTQ/OmniQuant) on your FP16 model and measure latency/memory.

Apply structured pruning to create a hardware-friendly smaller model shape and test throughput.

Distill a smaller student using teacher outputs for your key task to cut inference cost quickly.
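For the distillation step, a common starting point is a soft-label loss that pulls the student's temperature-softened output distribution toward the teacher's. The sketch below is a generic formulation, assumed here for illustration rather than taken from the survey; the logits, the temperature of 2.0, and the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)   # teacher soft targets
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits for one example over three classes.
teacher = [4.0, 1.0, 0.5]
student_far = [0.5, 1.0, 4.0]
student_close = [3.8, 1.1, 0.6]

loss_far = distillation_loss(student_far, teacher)
loss_close = distillation_loss(student_close, teacher)
print(f"far student loss:   {loss_far:.4f}")
print(f"close student loss: {loss_close:.4f}")
```

In practice this term is usually mixed with the ordinary task loss on ground-truth labels; the mixing weight and temperature are tuning choices that depend on the task.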

Optimization Features

Token Efficiency

dynamic context/token pruning

sparse attention and global+local attention

Infra Optimization

use of FasterTransformer, Megatron, and DeepSpeed for large-model throughput

Model Optimization

quantization (PTQ and QAT)

structured and unstructured pruning

knowledge distillation

efficient architecture (SSMs, linear attention, MoE)

System Optimization

mixed-precision and per-channel scale migration

hardware-aware architecture choices (Mamba, RetNet, RWKV)

Training Optimization

LoRA

post-training calibration with limited data

cross-block reconstruction for PTQ

Inference Optimization

INT8 inference

FlashAttention and IO-aware kernels

speculative sampling for decoding

early exiting and token pruning

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Survey summarizes many methods but cannot exhaustively benchmark them under the same setup.

Performance numbers come from original papers with heterogeneous protocols and hardware.

When Not To Use

If you require exact, reproducible low-level implementation details: this is a survey, not a code release.

If you need definitive single-benchmark comparisons: differences in evaluation protocol limit direct comparison.

Failure Modes

Aggressive low-bit quantization (<=4-bit) can cause large accuracy drops unless carefully tuned or retrained.

Unstructured sparsity reduces memory but may not give latency gains on commodity hardware.
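The difference between the two pruning styles shows up in the shape of the result. The sketch below uses a toy 4x4 matrix with magnitude-based criteria; the matrix, the 50% sparsity target, and the function names are assumptions for illustration, not a method from the survey.

```python
def unstructured_prune(matrix, sparsity):
    """Zero the smallest-magnitude entries; the matrix keeps its original shape.

    Entries tied with the threshold are also zeroed, so ties can prune slightly more.
    """
    flat = sorted(abs(v) for row in matrix for v in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [list(row) for row in matrix]
    threshold = flat[k - 1]
    return [[0.0 if abs(v) <= threshold else v for v in row] for row in matrix]

def structured_prune_rows(matrix, keep):
    """Keep the `keep` rows with the largest squared L2 norm; the result is smaller and dense."""
    ranked = sorted(
        range(len(matrix)),
        key=lambda i: sum(v * v for v in matrix[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])               # preserve the original row order
    return [list(matrix[i]) for i in kept]

# Hypothetical weight matrix: each row is one output unit.
W = [
    [0.90, -0.10, 0.40, 0.05],
    [0.02, 0.03, -0.01, 0.04],
    [-0.70, 0.60, 0.20, -0.30],
    [0.15, -0.25, 0.35, 0.45],
]

sparse = unstructured_prune(W, sparsity=0.5)
dense_small = structured_prune_rows(W, keep=2)
zeros = sum(1 for row in sparse for v in row if v == 0.0)
print(f"unstructured: shape {len(sparse)}x{len(sparse[0])}, zeros={zeros}")
print(f"structured:   shape {len(dense_small)}x{len(dense_small[0])}")
```

The unstructured result still occupies a 4x4 dense layout, so a standard dense matrix multiply does the same amount of work unless the hardware or kernel exploits sparsity. The structured result is a genuinely smaller dense matrix, which is why it tends to translate into latency gains more directly.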

Core Entities

Models

Transformer, ViT, DeiT, LLaMA, GPT-3, GPT-4, OPT, S4, Hyena, RetNet, RWKV, Mamba

Metrics

Accuracy, Perplexity (PPL), GLUE score, Inference latency, Speedup ratio, Compression rate (sparsity %)

Datasets

ImageNet-1k, WikiText2, GLUE, ImageNet-21k, BookCorpus

Benchmarks

Accuracy, WikiText2 perplexity, GLUE average score

Context Entities

Models

BERT, DistilBERT, TinyBERT, MobileBERT, PaLM, Swin, CLIP, BLIP

Metrics

Memory footprint (GB), FLOPs, Throughput

Datasets

ImageNet-1k/21k, Common Crawl (for LLM pretraining)

Benchmarks

GLUE tasks, SQuAD2
