Overview
Production Readiness
0.75
Novelty Score
0.4
Cost Impact Score
0.85
Citation Count
13
Why It Matters For Business
Compression makes large Transformers affordable to run and store: use post-training quantization and structured pruning for immediate cost and latency gains without full retraining.
Summary TLDR
This paper surveys methods for making Transformer models smaller and faster. It covers four main classes: quantization (lower-precision numbers), pruning (removing weights, heads, or tokens), knowledge distillation (training small models from big ones), and efficient architectures (attention alternatives and state space models, SSMs). The survey compares reported results on ImageNet, WikiText2, and GLUE, highlights training-cost trade-offs, and recommends combining light retraining with post-training tools for large models.
Problem Statement
Large Transformers (LLMs and LVMs) are costly to store and run. Direct retraining is usually infeasible for billion-parameter models, so practical compression methods must cut memory and compute while keeping accuracy and limiting retraining cost.
Main Contribution
Comprehensive taxonomy of Transformer compression methods: quantization, pruning, distillation, architecture.
Side-by-side summaries and representative methods for NLP and vision (tables with ImageNet, WikiText2, GLUE results).
Practical discussion of training-efficiency, how methods combine, and future directions like SSMs and low-cost post-training pipelines.
Key Findings
8-bit and 6-bit post-training quantization often work well, but extreme low-bit settings (4-bit or below) frequently degrade accuracy.
Careful per-channel scaling and outlier handling recover much of quantized LLM accuracy.
Pruning large LMs can reach very high sparsity but gives limited wall-clock speedups unless structured pruning is used.
Knowledge distillation yields big inference speedups with modest accuracy loss on standard NLP benchmarks.
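The per-channel scaling finding above can be illustrated with a minimal sketch. It is not taken from the survey: the weight matrix is synthetic, the quantizer is a plain symmetric INT8 round-trip, and all function names are hypothetical. One row is given a much wider range to stand in for an outlier channel.

```python
import random

def quantize_dequantize(values, scale):
    """Symmetric INT8 round-trip: map to integers in [-127, 127], then back to floats."""
    out = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        out.append(q * scale)
    return out

def squared_error(original, restored):
    return sum((a - b) ** 2 for a, b in zip(original, restored))

def per_tensor_error(matrix):
    """One scale shared by every row of the matrix."""
    scale = max(abs(v) for row in matrix for v in row) / 127
    return sum(squared_error(row, quantize_dequantize(row, scale)) for row in matrix)

def per_channel_error(matrix):
    """One scale per row (treated here as one output channel)."""
    total = 0.0
    for row in matrix:
        scale = max(abs(v) for v in row) / 127
        total += squared_error(row, quantize_dequantize(row, scale))
    return total

# Synthetic weights (an assumption for illustration): 8 channels of small values,
# with channel 0 replaced by an outlier channel that has a much larger range.
random.seed(0)
weights = [[random.uniform(-0.1, 0.1) for _ in range(64)] for _ in range(8)]
weights[0] = [random.uniform(-20.0, 20.0) for _ in range(64)]

pt_err = per_tensor_error(weights)
pc_err = per_channel_error(weights)
print(f"per-tensor squared error:  {pt_err:.6f}")
print(f"per-channel squared error: {pc_err:.6f}")
```

With a shared scale, the outlier channel stretches the quantization grid so far that most small weights collapse to zero or one step. A per-channel scale keeps the grid fine for the ordinary channels, which is why the second error is lower in this toy setup.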
Results
Accuracy
WikiText2 perplexity (LLaMA 7B)
WikiText2 perplexity after OmniQuant
Pruning sparsity and speedup (OPT-175B via SparseGPT)
Who Should Care
What To Try In 7 Days
Run INT8 PTQ (SmoothQuant/GPTQ/OmniQuant) on your FP16 model and measure latency/memory.
Apply structured pruning to create a hardware-friendly smaller model shape and test throughput.
Distill a smaller student using teacher outputs for your key task to cut inference cost quickly.
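To make the structured-pruning step above concrete, here is a minimal sketch for a two-layer MLP block. It assumes a simple L2-norm importance score, which is one common heuristic and not necessarily what any method in the survey uses; `prune_hidden_neurons` and the random weights are hypothetical.

```python
import random

def prune_hidden_neurons(w1, w2, keep):
    """Drop the hidden units whose rows in w1 have the smallest L2 norm.

    w1 has shape hidden x in, w2 has shape out x hidden (lists of rows).
    Removing a hidden unit deletes one row of w1 and the matching column of w2,
    so the result is a pair of smaller dense matrices.
    """
    norms = [sum(v * v for v in row) for row in w1]
    ranked = sorted(range(len(w1)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # keep original ordering of surviving units
    new_w1 = [w1[i] for i in kept]
    new_w2 = [[row[i] for i in kept] for row in w2]
    return new_w1, new_w2, kept

random.seed(0)
w1 = [[random.gauss(0, 1) for _ in range(6)] for _ in range(8)]  # hidden=8, in=6
w2 = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]  # out=3, hidden=8

small_w1, small_w2, kept = prune_hidden_neurons(w1, w2, keep=4)
print(len(small_w1), "x", len(small_w1[0]))  # 4 x 6
print(len(small_w2), "x", len(small_w2[0]))  # 3 x 4
```

Because the pruned layers are still dense, they run on standard matrix-multiply kernels without sparse-format support, which is the property that makes the smaller shape hardware-friendly.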
Optimization Features
Token Efficiency
- dynamic context/token pruning
- sparse attention and global+local attention
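Token pruning, as listed above, can be sketched as keeping only the highest-scoring tokens before later layers run. This is a minimal illustration under stated assumptions: the importance scores are made-up numbers (in practice they might come from attention weights), and `prune_tokens` is a hypothetical helper, not an API from the survey.

```python
import math

def prune_tokens(tokens, scores, keep_ratio=0.5, always_keep=(0,)):
    """Keep the top-scoring tokens, preserving their original order.

    Indices in always_keep (for example a [CLS] token at position 0) are
    retained regardless of score.
    """
    n_keep = max(len(always_keep), math.ceil(len(tokens) * keep_ratio))
    kept = set(always_keep)
    for i in sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True):
        if len(kept) >= n_keep:
            break
        kept.add(i)
    order = sorted(kept)
    return [tokens[i] for i in order], order

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat"]
scores = [0.05, 0.02, 0.30, 0.25, 0.03, 0.02, 0.33]  # hypothetical importance scores

kept_tokens, kept_idx = prune_tokens(tokens, scores, keep_ratio=0.5)
print(kept_tokens)  # ['[CLS]', 'cat', 'sat', 'mat']
```

Since self-attention cost grows with the square of the sequence length, halving the token count in this way cuts the attention work in subsequent layers to roughly a quarter.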
Infra Optimization
- use of FasterTransformer, Megatron, DeepSpeed for large-model throughput
Model Optimization
- quantization (PTQ and QAT)
- structured and unstructured pruning
- knowledge distillation
- efficient architecture (SSMs, linear attention, MoE)
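For the knowledge distillation entry above, a minimal sketch of a soft-label loss is shown here. It assumes the common temperature-scaled KL divergence between teacher and student distributions; the survey covers many distillation variants, and the logits, temperature, and function names below are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]  # hypothetical teacher logits for one example
student = [2.0, 1.5, 0.5]  # hypothetical student logits for the same example

print(f"loss when student differs: {distillation_loss(student, teacher):.4f}")
print(f"loss when student matches: {distillation_loss(teacher, teacher):.4f}")
```

In practice this term is usually mixed with the ordinary task loss on ground-truth labels. It also requires access to teacher logits or probabilities, which is not always available.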
System Optimization
- mixed-precision and per-channel scale migration
- hardware-aware architecture choices (Mamba, RetNet, RWKV)
Training Optimization
- LoRA
- post-training calibration with limited data
- cross-block reconstruction for PTQ
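The LoRA entry above can be illustrated with a small sketch of the low-rank update y = Wx + (alpha / r) * B(Ax), where W stays frozen and only A and B are trained. The layer sizes, values, and helper names are assumptions chosen for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(w, a, b, x, alpha, r):
    """Frozen base projection plus a scaled low-rank correction.

    w: d_out x d_in (frozen), a: r x d_in, b: d_out x r (both trainable).
    """
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))  # B (A x), never materializes the full B A
    return [y + (alpha / r) * u for y, u in zip(base, update)]

def lora_param_counts(d_in, d_out, r):
    """Trainable parameters: full fine-tuning vs. a rank-r adapter."""
    return d_in * d_out, r * (d_in + d_out)

# Tiny example with B initialised to zero, so the adapter starts as a no-op.
d_in, d_out, r, alpha = 4, 3, 2, 4
w = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]
a = [[0.01 * (j + 1) for j in range(d_in)] for _ in range(r)]
b = [[0.0] * r for _ in range(d_out)]
x = [1.0, 2.0, 3.0, 4.0]

print(lora_forward(w, a, b, x, alpha, r) == matvec(w, x))  # True

full, adapter = lora_param_counts(4096, 4096, 8)
print(full, adapter)  # 16777216 65536
```

For a 4096 x 4096 projection at rank 8, the adapter trains about 0.4% as many parameters as full fine-tuning, which is why it fits the low retraining budgets discussed in this summary.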
Inference Optimization
- INT8 inference
- FlashAttention and IO-aware kernels
- speculative sampling for decoding
- early exiting and token pruning
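Speculative sampling, listed above, lets a small draft model propose tokens that the large target model then accepts or rejects. The sketch below shows the standard acceptance rule for a single proposed token, under the assumption that both models expose full next-token probability vectors; the function name and the toy distributions are hypothetical.

```python
import random

def speculative_step(draft_token, p_target, q_draft, rng):
    """Accept or replace one token proposed by the draft model.

    Accept with probability min(1, p/q). On rejection, resample from the
    normalised residual max(0, p - q), which keeps the overall output
    distributed according to the target model.
    Returns (token, accepted).
    """
    accept_prob = min(1.0, p_target[draft_token] / q_draft[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True

    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    threshold = rng.random() * sum(residual)
    running = 0.0
    for token, weight in enumerate(residual):
        running += weight
        if threshold < running:
            return token, False
    # Floating-point fallback: return the token with the largest residual mass.
    return max(range(len(residual)), key=lambda t: residual[t]), False

rng = random.Random(0)
p_target = [0.6, 0.3, 0.1]  # hypothetical target-model distribution
q_draft = [0.3, 0.5, 0.2]   # hypothetical draft-model distribution

# The draft model proposed token 1, which it rates higher than the target does.
token, accepted = speculative_step(1, p_target, q_draft, rng)
print(token, accepted)
```

The speedup comes from verifying several draft tokens in one target-model forward pass; this sketch covers only the per-token accept or reject decision, not the batching.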
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Survey summarizes many methods but cannot exhaustively benchmark them under the same setup.
- Performance numbers come from original papers with heterogeneous protocols and hardware.
- Rapidly evolving tools (post-2023) may change best-practice recommendations quickly.
When Not To Use
- If you need exact, reproducible low-level implementation details: this is a survey, not a code release.
- If you need definitive single-benchmark comparisons: differences in evaluation protocol limit direct comparison.
Failure Modes
- Aggressive low-bit quantization (<=4-bit) can cause large accuracy drops unless carefully tuned or retrained.
- Unstructured sparsity reduces memory but may not give latency gains on commodity hardware.
- Distillation can inherit teacher biases and may fail if teacher outputs are unavailable (closed API).
Core Entities
Models
- Transformer
- ViT
- DeiT
- LLaMA
- GPT-3
- GPT-4
- OPT
- S4
- Hyena
- RetNet
- RWKV
- Mamba
Metrics
- Accuracy
- Perplexity (PPL)
- GLUE score
- Inference latency
- Speedup ratio
- Compression rate (sparsity %)
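Three of the metrics above reduce to short formulas, sketched here with hypothetical helper names and made-up inputs. Perplexity is the exponential of the average negative log-probability per token, sparsity is the share of weights that are exactly zero, and speedup is the ratio of baseline to compressed latency.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def sparsity_percent(weights):
    """Percentage of weights that are exactly zero."""
    zeros = sum(1 for w in weights if w == 0)
    return 100.0 * zeros / len(weights)

def speedup_ratio(baseline_latency_ms, compressed_latency_ms):
    """How many times faster the compressed model runs."""
    return baseline_latency_ms / compressed_latency_ms

# A model that gives every token probability 0.25 has perplexity of about 4,
# the same as guessing uniformly among four options.
print(f"{perplexity([0.25, 0.25, 0.25, 0.25]):.2f}")
print(sparsity_percent([0.0, 0.0, 0.0, 1.5]))  # 75.0
print(speedup_ratio(120.0, 40.0))              # 3.0
```

Lower perplexity is better, while higher sparsity and speedup indicate more compression. As the findings above note, high sparsity does not by itself guarantee a high speedup.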
Datasets
- ImageNet-1k
- WikiText2
- GLUE
- ImageNet-21k
- BookCorpus
Benchmarks
- Accuracy
- WikiText2 perplexity
- GLUE average score
Context Entities
Models
- BERT
- DistilBERT
- TinyBERT
- MobileBERT
- PaLM
- Swin
- CLIP
- BLIP
Metrics
- Memory footprint (GB)
- FLOPs
- Throughput
Datasets
- ImageNet-1k/21k
- Common Crawl (for LLM pretrain)
Benchmarks
- GLUE tasks
- SQuAD 2.0

