Overview
The survey aggregates many empirical results and practical recipes; recommendations are grounded in cited benchmarks but the field moves quickly.
Citations: 13
Evidence Strength: 0.80
Confidence: 0.88
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/5
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 85%
Production readiness: 75%
Novelty: 40%
Why It Matters For Business
Compression makes large Transformers affordable to run and store: use post-training quantization and structured pruning for immediate cost and latency gains without full retraining.
Summary TLDR
This paper surveys methods to make Transformer models smaller and faster. It covers four main classes: quantization (lower-precision numbers), pruning (removing weights, heads, or tokens), knowledge distillation (training small models from large ones), and efficient architectures (attention alternatives and state space models, SSMs). The survey compares reported results on ImageNet, WikiText2, and GLUE, highlights training-cost trade-offs, and recommends combining light retraining with post-training tools for large models.
Problem Statement
Large Transformers, both large language models (LLMs) and large vision models (LVMs), are costly to store and run. Full retraining is usually infeasible for billion-parameter models, so practical compression methods must cut memory and compute while preserving accuracy and limiting retraining cost.
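To make the storage cost concrete, the following is a back-of-the-envelope sketch of weight memory at different precisions. The 7B parameter count is a hypothetical example, not a figure from the survey, and the estimate deliberately ignores activations, KV cache, and quantization metadata such as scales and zero-points.

```python
# Rough weight-storage estimate only. Activations, KV cache, and
# quantization metadata (scales, zero-points) are ignored, so real
# memory use is higher than these numbers.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_params: int, bits: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    num_params = 7_000_000_000  # hypothetical 7B-parameter model
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: {weight_memory_gb(num_params, bits):.1f} GB")
```

Halving the bit width halves the weight footprint, which is why precision reduction is usually the first lever considered for billion-parameter models.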
Main Contribution
Comprehensive taxonomy of Transformer compression methods: quantization, pruning, distillation, architecture.
Side-by-side summaries and representative methods for NLP and vision (tables with ImageNet, WikiText2, GLUE results).
Key Findings
8-bit and 6-bit post-training quantization often works well, but extreme low-bit (4-bit or below) frequently degrades performance.
Careful per-channel scaling and outlier handling recover much of the accuracy lost when quantizing LLMs.
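The effect of per-channel scaling can be illustrated with a minimal sketch of symmetric max-abs INT8 quantization. This is a generic textbook scheme, not the procedure of any specific method named in the survey, and the toy weight matrix is made up for illustration. With a single shared scale, one outlier channel forces a coarse step size on every other channel; a separate scale per channel avoids that.

```python
def max_abs_scale(values, qmax=127):
    """Scale that maps the largest magnitude onto qmax."""
    m = max(abs(v) for v in values)
    return m / qmax if m > 0 else 1.0


def quantize_symmetric(values, scale, qmax=127):
    """Round values/scale to integers and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integer codes back to approximate real values."""
    return [q * scale for q in q_values]


def mean_abs_error(weights, per_channel):
    """Mean absolute round-trip error for a matrix given as a list of rows.

    Each row is treated as one channel. With per_channel=False a single
    scale derived from the whole matrix is shared by all rows.
    """
    flat = [v for row in weights for v in row]
    shared = max_abs_scale(flat)
    total = 0.0
    for row in weights:
        scale = max_abs_scale(row) if per_channel else shared
        restored = dequantize(quantize_symmetric(row, scale), scale)
        total += sum(abs(a - b) for a, b in zip(row, restored))
    return total / len(flat)


# Toy data (hypothetical): one ordinary channel, one outlier channel.
weights = [
    [0.01, -0.02, 0.03, -0.015],
    [8.0, -6.5, 7.2, -7.9],
]
err_tensor = mean_abs_error(weights, per_channel=False)
err_channel = mean_abs_error(weights, per_channel=True)

if __name__ == "__main__":
    print(f"per-tensor error:  {err_tensor:.5f}")
    print(f"per-channel error: {err_channel:.5f}")
```

In this toy case the shared scale rounds every weight in the small channel to zero, while the per-channel scale preserves them. Real methods add further steps on top of this idea, such as calibration data and explicit outlier handling.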
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 84.54 (FP16) | — | — | ImageNet-1k | Table 2 full-precision baseline | Table 2 |
| Accuracy | 76.98 (8-bit PTQ-ViT) | 84.54 (FP) | -7.56 | ImageNet-1k | Table 2 PTQ-ViT result | Table 2 |
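As a quick check of the Delta column, the sketch below recomputes the absolute difference from the two Table 2 values quoted above and also derives the relative change, which the table does not report. The helper name `accuracy_delta` is introduced here for illustration only.

```python
def accuracy_delta(value: float, baseline: float):
    """Return (absolute delta in points, relative change in percent)."""
    absolute = value - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Values from the Results table: 8-bit PTQ-ViT vs. full-precision baseline.
absolute, relative = accuracy_delta(76.98, 84.54)

if __name__ == "__main__":
    print(f"absolute: {absolute:+.2f} points, relative: {relative:+.2f}%")
```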
What To Try In 7 Days
Run INT8 PTQ (SmoothQuant/GPTQ/OmniQuant) on your FP16 model and measure latency/memory.
Apply structured pruning to create a hardware-friendly smaller model shape and test throughput.
Distill a smaller student using teacher outputs for your key task to cut inference cost quickly.
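For the distillation item, a minimal sketch of the commonly used soft-target loss may help: the student is trained to match the teacher's temperature-softened output distribution, measured by KL divergence and scaled by the squared temperature. This is the generic formulation, not a recipe taken from the survey; the logits and the temperature value are hypothetical, and a real training loop would typically combine this term with the ordinary task loss.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, times T^2."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]        # hypothetical teacher logits
    close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
    far_student = [0.5, 4.0, 1.0]    # disagrees with the teacher
    print(distillation_loss(close_student, teacher))
    print(distillation_loss(far_student, teacher))
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative ranking of the non-top classes.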
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Survey summarizes many methods but cannot exhaustively benchmark them under the same setup.
Performance numbers come from original papers with heterogeneous protocols and hardware.
When Not To Use
If you require exact, reproducible low-level implementation details (this is a survey, not a code release).
If you need definitive single-benchmark comparisons (differences in evaluation protocol limit direct comparison).
Failure Modes
Aggressive low-bit quantization (<=4-bit) can cause large accuracy drops unless carefully tuned or retrained.
Unstructured sparsity reduces memory but may not give latency gains on commodity hardware.
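The sparsity failure mode can be seen in a small sketch that contrasts magnitude-based unstructured pruning with row-wise structured pruning. Both functions and the toy matrix are illustrative assumptions, not methods from the survey. The point is that unstructured pruning leaves the matrix shape untouched, so a dense kernel does the same amount of work, while structured pruning yields a genuinely smaller dense matrix.

```python
def prune_unstructured(weights, sparsity):
    """Zero the smallest-magnitude entries; the matrix shape is unchanged.

    Ties at the threshold are all zeroed, so slightly more than the
    requested fraction may be pruned.
    """
    flat = sorted(abs(v) for row in weights for v in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(v) <= threshold else v for v in row] for row in weights]


def prune_structured_rows(weights, keep):
    """Keep the `keep` rows with the largest squared L2 norm."""
    norms = [(sum(v * v for v in row), i) for i, row in enumerate(weights)]
    kept = sorted(i for _, i in sorted(norms, reverse=True)[:keep])
    return [weights[i][:] for i in kept]


def dense_matvec_macs(weights):
    """Multiply-accumulates a dense matrix-vector kernel performs, zeros included."""
    return sum(len(row) for row in weights)


# Toy 4x4 weight matrix (hypothetical values).
weights = [
    [0.9, 0.01, 0.7, -0.02],
    [0.03, 0.5, -0.01, 0.04],
    [-0.6, 0.02, 0.8, 0.05],
    [0.06, -0.4, 0.03, 0.3],
]
unstructured = prune_unstructured(weights, sparsity=0.5)
structured = prune_structured_rows(weights, keep=2)

if __name__ == "__main__":
    print("unstructured MACs:", dense_matvec_macs(unstructured))
    print("structured MACs:  ", dense_matvec_macs(structured))
```

Half of the entries are zero after unstructured pruning, yet the dense multiply-accumulate count is unchanged; turning those zeros into speed requires sparse storage formats and kernels that the target hardware may not support efficiently.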

