Overview
Production Readiness: 0.7
Novelty Score: 0.3
Cost Impact Score: 0.9
Citation Count: 3
Why It Matters For Business
Compressing LLMs cuts hosting and inference costs and enables deployment on cheaper hardware; low-cost, post-training methods make this feasible without retraining large models.
Summary TLDR
This paper surveys methods to make large language models smaller and cheaper. It covers pruning, quantization, knowledge distillation, low-rank methods, parameter sharing and efficient architectures. The authors highlight two priorities for practical work: (1) focus on low-cost algorithms that work on huge LLMs, and (2) prefer iterative, task-aware optimization over single-shot layer reconstruction. They also analyze three representative algorithms (SparseGPT for pruning, OPTQ for quantization, LoRA for low-rank adaptation) and list open research directions for activation quantization, accurate LLM pruning, and combining methods.
Problem Statement
Large pretrained language models are expensive to store and run. The field has many compression algorithms, but practitioners struggle to pick methods that scale to billion-parameter LLMs, preserve accuracy, and remain cheap to apply.
Main Contribution
- A broad taxonomy and summary of compression methods, including pruning, quantization, distillation, low-rank approximation, parameter sharing, and architecture changes.
- In-depth analyses of three representative, practical algorithms: SparseGPT (pruning), OPTQ (quantization), and LoRA (low-rank adaptors).
- A discussion of desired properties for low-cost compression and a list of research directions for LLM-focused compression.
Key Findings
- Low-cost, post-training methods now enable compression of very large LLMs without full retraining.
- Weight-only post-training quantization (OPTQ family) compresses LLMs to low bit widths with small quality loss.
- Pruning can deliver large runtime speedups, but pruning granularity sets the trade-off: coarse (structured) patterns are easier for hardware to accelerate, while fine-grained (unstructured) patterns tend to preserve accuracy better.
- Low-rank adaptors (LoRA/PEFT) dramatically cut fine-tuning memory while preserving accuracy.
- Activation quantization below 8 bits remains fragile and causes major quality drops without special handling.
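To make the weight-only quantization finding concrete, here is a minimal sketch of round-to-nearest (RTN) quantization with per-group scales. This is a simplified baseline, not OPTQ itself: OPTQ additionally compensates the error of each quantized column using calibration data. The function names, the 4-bit setting, and the group size of 64 are illustrative assumptions.

```python
import numpy as np

def quantize_weights_rtn(w, bits=4, group_size=64):
    """Asymmetric round-to-nearest quantization with one scale per group.

    Hypothetical helper for illustration; not the OPTQ algorithm.
    """
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must split evenly into groups"
    g = w.reshape(rows, cols // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    levels = 2 ** bits - 1                      # e.g. 15 for 4 bits
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)    # guard constant groups
    q = np.clip(np.round((g - w_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, w_min, shape):
    """Map integer codes back to approximate floating-point weights."""
    return (q * scale + w_min).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 128)).astype(np.float32)
q, scale, w_min = quantize_weights_rtn(w, bits=4, group_size=64)
w_hat = dequantize(q, scale, w_min, w.shape)
max_err = float(np.abs(w - w_hat).max())        # bounded by half a step
```

Smaller groups lower the reconstruction error at the cost of storing more scales and offsets, which is the kind of trade-off to measure against task-level quality.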
Results
Prune-scale runtime
Inference speedup (structured pruning)
Weight-only quantization performance
Fine-tuning memory reduction
Extreme low-rank compression
Who Should Care
What To Try In 7 Days
- Run weight-only OPTQ (also known as GPTQ) on a production model to cut memory use and measure task-level quality.
- If fine-tuning is needed, test LoRA to reduce GPU memory and speed up iteration.
- If inference cost dominates, prototype structured pruning (coarse granularity) on a dev model and measure real latency on target hardware.
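Before testing LoRA on a real model, it can help to see why it reduces fine-tuning memory. The sketch below, in plain NumPy, freezes a weight matrix and adds a trainable low-rank update scaled by alpha / rank. The class name `LoRALinear`, the rank, and the layer size are illustrative assumptions; a real experiment would use a maintained library rather than this toy.

```python
import numpy as np

class LoRALinear:
    """Toy linear layer with a frozen weight and a low-rank update B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                         # frozen, never updated
        self.A = rng.normal(scale=0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))             # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scaling * update

    def trainable_params(self):
        return self.A.size + self.B.size

w = np.random.default_rng(1).normal(size=(512, 512))
layer = LoRALinear(w, rank=4)
x = np.ones((2, 512))
y = layer.forward(x)
ratio = layer.trainable_params() / w.size            # fraction that is trained
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the two small factors would need gradients and optimizer state.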
Optimization Features
Token Efficiency
- token pruning (PoWER-BERT, LTP)
Infra Optimization
- A100-optimized sparsity patterns (2:4)
- mixed-precision & hardware-aware deployment
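The 2:4 pattern listed above keeps at most two nonzero values in every group of four consecutive weights. The sketch below builds such a mask by magnitude, which is an assumption made for illustration: SparseGPT and similar methods choose which weights to drop with more careful criteria, and the function name is hypothetical.

```python
import numpy as np

def prune_2_of_4(w):
    """Zero the two smallest-magnitude weights in each group of four."""
    assert w.size % 4 == 0, "weight count must be a multiple of 4"
    flat = w.reshape(-1, 4)
    # Indices of the two smallest |w| entries per group.
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]
    mask = np.ones_like(flat, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (flat * mask).reshape(w.shape), mask.reshape(w.shape)

w = np.array([[0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]])
pruned, mask = prune_2_of_4(w)
# pruned keeps 0.9 and -0.7 in the first group, 0.3 and -0.4 in the second
```

The mask alone does not produce a speedup; the sparse weights still have to be stored and executed through a kernel that supports this pattern on the target accelerator.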
Model Optimization
- pruning
- quantization
- low-rank approximation
- parameter sharing
- efficient architecture design
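As one concrete instance of the low-rank approximation entry above, a weight matrix can be replaced by two thin factors obtained from a truncated SVD. This is the simplest textbook variant and is offered as an assumption-laden sketch, not as the specific method any surveyed paper uses; the function name and matrix sizes are illustrative.

```python
import numpy as np

def low_rank_factors(w, rank):
    """Return (left, right) with left @ right approximating w at the given rank."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    left = u[:, :rank] * s[:rank]     # fold singular values into the left factor
    right = vt[:rank, :]
    return left, right

rng = np.random.default_rng(0)
# Construct a 64 x 48 matrix that has rank 2 by design.
w = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 48))
left, right = low_rank_factors(w, rank=2)

stored = left.size + right.size       # 64*2 + 2*48 values
dense = w.size                        # 64*48 values
```

Real weight matrices are rarely exactly low rank, so the rank has to be chosen by checking how much task quality is lost at each setting.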
System Optimization
- column-wise compensation (SparseGPT/OPTQ)
- block-wise and hardware-aware pruning (ZipLM)
Training Optimization
- LoRA
- knowledge distillation
- quantization-aware training (QAT)
- lightweight QAT (partial parameter updates)
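For the knowledge distillation entry above, a common formulation trains the student to match the teacher's temperature-softened output distribution. The sketch below computes that loss as a KL divergence scaled by the squared temperature. The logits, the temperature of 2.0, and the function names are illustrative assumptions, and surveyed methods may distill other signals such as intermediate features.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on temperature-softened distributions."""
    p_t = softmax(teacher_logits / temperature)
    p_s = softmax(student_logits / temperature)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return float(kl.mean() * temperature ** 2)

teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student = np.array([[2.5, 1.5, 0.5], [0.5, 2.0, 0.4]])
loss = distillation_loss(student, teacher)
```

In practice this term is usually combined with the ordinary task loss on ground-truth labels.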
Inference Optimization
- weight-only quantization
- structured pruning (sublayer/layer, 2:4 sparsity)
- outlier-aware quantization (OWQ, RPTQ)
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Survey compiles existing papers; no new experiments to compare all methods under a single protocol.
- Emphasis on low-cost methods may under-cover some high-cost but high-accuracy approaches.
- Hardware-specific speedups depend on target accelerator and software stack.
When Not To Use
- If you require the best possible accuracy and can afford large-scale retraining, prefer specialized retraining pipelines based on quantization-aware training (QAT), knowledge distillation (KD), or low-rank approximation (LRA).
- Do not rely on aggressive activation quantization (<8-bit) for critical tasks without validation.
- Avoid one-shot low-cost pruning for high-compression targets without iterative compensation.
Failure Modes
- High compression can cause large task-specific accuracy drops, especially for decoder-only LLMs.
- Activation outliers break naive PTQ, causing large performance regressions.
- Mixed-precision schemes can hurt inference throughput on certain hardware stacks.
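The activation-outlier failure mode can be reproduced on synthetic data. The sketch below applies naive symmetric per-tensor quantization at 4 bits to a Gaussian activation vector, with and without one injected outlier. The data, the outlier value of 100, and the helper names are assumptions chosen for illustration only.

```python
import numpy as np

def fake_quant_per_tensor(x, bits):
    """Symmetric per-tensor quantize-then-dequantize with a single scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def rel_error(x, bits):
    """Mean relative error on all entries except index 0 (the outlier slot)."""
    x_hat = fake_quant_per_tensor(x, bits)
    body = slice(1, None)
    return float(np.abs(x - x_hat)[body].mean() / np.abs(x)[body].mean())

rng = np.random.default_rng(0)
acts = rng.normal(size=1024)
acts_outlier = acts.copy()
acts_outlier[0] = 100.0               # one extreme activation channel

err_clean = rel_error(acts, bits=4)
err_outlier = rel_error(acts_outlier, bits=4)
```

With the outlier present, the single scale is stretched so far that the ordinary values collapse toward zero, which is why outlier-aware schemes treat such channels separately.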
Core Entities
Models
- BERT
- RoBERTa
- GPT-2
- OPT
- LLaMA
- BLOOM
- DynaBERT
- ALBERT
- GPT-3
Metrics
- perplexity
- accuracy
- F1
- FLOPs reduction
- inference speedup
- memory footprint
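Of the metrics above, perplexity is the one most often used to compare a compressed language model with its original. It is the exponential of the mean negative log-likelihood per token. The sketch below assumes natural-log token probabilities are already available from the model; the function name and the sample values are illustrative.

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), using natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 1/4 to every token behaves like a
# uniform choice among four options.
uniform = [math.log(1 / 4)] * 10
ppl = perplexity(uniform)
```

Lower is better, and comparing the value before and after compression on the same evaluation text shows how much modeling quality was lost.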
Datasets
- MNLI
- QQP
- SQuAD 1.1
- WikiText2
Benchmarks
- perplexity (WikiText2)
- Accuracy
Context Entities
Models
- ZipLM
- SparseGPT
- OPTQ
- OWQ
- RPTQ
- LoRA
- Kprune
- KCM
- CoFi
Metrics
- Accuracy
Datasets
- small calibration sets (used in PTQ/PTP)
- task fine-tuning corpora (for QAT/KD)
Benchmarks
- MNLI, QQP, SQuAD, WikiText2 (as used in comparative tables)

