Overview
The survey synthesizes many public results and highlights practical, tested algorithms (SparseGPT, OPTQ, LoRA); recommendations are evidence-based but depend on reported benchmarks and hardware specifics.
Citations: 3
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 90%
Production readiness: 70%
Novelty: 30%
Why It Matters For Business
Compressing LLMs cuts hosting and inference costs and enables deployment on cheaper hardware; low-cost, post-training methods make this feasible without retraining large models.
Summary TLDR
This paper surveys methods to make large language models smaller and cheaper. It covers pruning, quantization, knowledge distillation, low-rank methods, parameter sharing and efficient architectures. The authors highlight two priorities for practical work: (1) focus on low-cost algorithms that work on huge LLMs, and (2) prefer iterative, task-aware optimization over single-shot layer reconstruction. They also analyze three representative algorithms (SparseGPT for pruning, OPTQ for quantization, LoRA for low-rank adaptation) and list open research directions for activation quantization, accurate LLM pruning, and combining methods.
Problem Statement
Large pretrained language models cost a lot to store and run. The field has many compression algorithms, but practitioners struggle to pick methods that scale to billion-parameter LLMs and keep accuracy while staying cheap to apply.
Main Contribution
A broad taxonomy and summary of compression methods, including pruning, quantization, distillation, low-rank approximation, parameter sharing, and architecture changes.
In-depth analyses of three representative, practical algorithms: SparseGPT (pruning), OPTQ (quantization), and LoRA (low-rank adaptation).
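To make the low-rank idea behind LoRA concrete, here is a minimal sketch in plain Python. It assumes the usual formulation in which a frozen weight matrix W gets a trainable update (alpha / r) * B * A, with B initialized to zero so the adapted layer starts out identical to the base layer. The helper names (`matvec`, `init_lora`, `lora_forward`, `trainable_params`) and the 0.01 standard deviation are illustrative choices, not taken from the survey.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def init_lora(d_out, d_in, r, seed=0):
    """Create adapter factors: A is small random (r x d_in), B is zero (d_out x r)."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]  # std is an assumption
    b = [[0.0] * r for _ in range(d_out)]
    return a, b

def lora_forward(w, a, b, x, alpha):
    """Compute W x + (alpha / r) * B (A x) without forming the merged matrix."""
    r = len(a)
    base = matvec(w, x)               # frozen path
    update = matvec(b, matvec(a, x))  # low-rank trainable path
    return [h + (alpha / r) * u for h, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    """Number of adapter parameters for one weight matrix."""
    return r * (d_in + d_out)

# Tiny example: a 2x3 frozen weight with a rank-1 adapter.
w = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]
a, b = init_lora(d_out=2, d_in=3, r=1)
x = [1.0, 1.0, 1.0]
y = lora_forward(w, a, b, x, alpha=2.0)  # equals W x because B starts at zero

# Parameter count for a hypothetical 4096 x 4096 projection at rank 8.
full = 4096 * 4096
adapter = trainable_params(4096, 4096, 8)
```

For the hypothetical 4096 × 4096 layer, the rank-8 adapter trains 65,536 parameters instead of 16,777,216, a 256× reduction in trainable weights for that matrix.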
Key Findings
Low-cost, post-training methods now enable compression of very large LLMs without full retraining.
Weight-only post-training quantization (OPTQ family) compresses LLMs to low bit widths with small quality loss.
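As a rough illustration of what weight-only post-training quantization does, the sketch below applies plain round-to-nearest quantization with one scale per row. This is deliberately simpler than OPTQ, which additionally compensates for rounding error using second-order information; the function names and the example row are hypothetical.

```python
def quantize_row(row, bits=4):
    """Symmetric round-to-nearest quantization of one weight row."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit signed integers
    absmax = max(abs(v) for v in row)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Clamp to the signed integer range [-qmax - 1, qmax].
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]

row = [0.70, -0.31, 0.12, 0.0, -0.66, 0.48]
codes, scale = quantize_row(row, bits=4)
restored = dequantize_row(codes, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

Because the scale is set by the largest magnitude in the row, nothing is clipped and the per-weight error stays within half a quantization step (`scale / 2`). Storing 4-bit codes instead of 16-bit floats cuts weight memory roughly 4×, before accounting for the per-row scales.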
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Pruning runtime | ~3 hours on a single A100 (SparseGPT on OPT-175B) | unpruned OPT-175B | n/a (time to prune, not inference speed) | n/a | SparseGPT prunes 175B LLMs in 3 hours on one A100 | Section 3.4 |
| Inference speedup (structured pruning) | up to 15× | uncompressed BERT | -3.97% avg accuracy (on reported tasks) | MNLI/QQP/SQuAD aggregate (Table 2) | ZipLM reports 15× speedup with a ~3.97% avg accuracy drop | Table 2 |
What To Try In 7 Days
Run weight-only OPTQ (also known as GPTQ) on a production model to cut memory use and measure task-level quality.
If fine-tuning is needed, test LoRA to reduce GPU memory and speed up iteration.
If inference cost dominates, prototype structured pruning (coarse granularity) on a dev model and measure real latency on target hardware.
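For the structured-pruning experiment, a coarse-granularity prototype can be as simple as dropping whole output neurons so the layer stays dense but becomes smaller. The sketch below uses row L2 norm as the importance score, which is a basic magnitude heuristic chosen for illustration and not the criterion used by ZipLM or SparseGPT; `prune_rows` and the toy weights are hypothetical.

```python
import math

def prune_rows(w, bias, keep_ratio):
    """Keep the output rows with the largest L2 norm; return a smaller dense layer."""
    n_keep = max(1, int(len(w) * keep_ratio))
    norms = [math.sqrt(sum(v * v for v in row)) for row in w]
    ranked = sorted(range(len(w)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # preserve the original row order
    return [w[i] for i in keep], [bias[i] for i in keep], keep

w = [[1.0, 1.0], [0.1, 0.0], [2.0, 0.0], [0.0, 0.2]]
bias = [0.5, 0.0, -0.5, 0.1]
w_small, bias_small, kept = prune_rows(w, bias, keep_ratio=0.5)
```

Here rows 0 and 2 survive, halving the layer's output width. In a real network the next layer's input columns must be pruned to match `kept`, and latency should then be timed on the target hardware, since a smaller matrix does not guarantee a proportional speedup.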
Risks & Boundaries
Limitations
The survey compiles existing papers and runs no new experiments that compare all methods under a single protocol.
Emphasis on low-cost methods may under-cover some high-cost but high-accuracy approaches.
When Not To Use
If you require the best possible accuracy and can afford large-scale retraining, prefer specialized retraining pipelines based on quantization-aware training (QAT), knowledge distillation (KD), or low-rank approximation (LRA).
Do not rely on aggressive activation quantization (below 8 bits) for critical tasks without validation.
Failure Modes
High compression can cause large task-specific accuracy drops, especially for decoder-only LLMs.
Activation outliers break naive PTQ, causing large performance regressions.
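The outlier failure mode can be reproduced with a few lines of arithmetic. The sketch below assumes naive per-tensor absmax quantization, where a single scale is derived from the largest magnitude in the tensor; the activation values are synthetic and `mean_quant_error` is a hypothetical helper.

```python
def mean_quant_error(values, bits=8):
    """Mean absolute error after per-tensor absmax round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax  # one scale for the whole tensor
    restored = [round(v / scale) * scale for v in values]
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

# Synthetic activations in [-1, 1], then the same values plus one large outlier.
normal = [0.1 * i for i in range(-10, 11)]
with_outlier = normal + [100.0]

err_normal = mean_quant_error(normal)
err_outlier = mean_quant_error(with_outlier)
```

The single outlier stretches the scale by about 100×, so most of the ordinary activations collapse onto just three integer levels (-1, 0, 1) and the mean error grows by more than an order of magnitude. This is why such tensors typically need finer-grained scales or explicit outlier handling rather than a single per-tensor scale.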

