Survey of pruning, quantization, distillation and low-cost methods for compressing modern LLMs

January 27, 2024 | 7 min

Overview

Decision Snapshot: Ready For Pilot

The survey synthesizes many public results and highlights practical, tested algorithms (SparseGPT, OPTQ, LoRA); recommendations are evidence-based but depend on reported benchmarks and hardware specifics.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 90%

Production readiness: 70%

Novelty: 30%

Authors

Seungcheol Park, Jaehyeon Choi, Sojin Lee, U Kang

Why It Matters For Business

Compressing LLMs cuts hosting and inference costs and enables deployment on cheaper hardware; low-cost, post-training methods make this feasible without retraining large models.

Summary TLDR

This paper surveys methods to make large language models smaller and cheaper. It covers pruning, quantization, knowledge distillation, low-rank methods, parameter sharing and efficient architectures. The authors highlight two priorities for practical work: (1) focus on low-cost algorithms that work on huge LLMs, and (2) prefer iterative, task-aware optimization over single-shot layer reconstruction. They also analyze three representative algorithms (SparseGPT for pruning, OPTQ for quantization, LoRA for low-rank adaptation) and list open research directions for activation quantization, accurate LLM pruning, and combining methods.

Problem Statement

Large pretrained language models cost a lot to store and run. The field has many compression algorithms, but practitioners struggle to pick methods that scale to billion-parameter LLMs and keep accuracy while staying cheap to apply.
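
To make the storage cost concrete, here is a back-of-the-envelope sketch of weight memory as a function of parameter count and bits per weight. The function name `weight_memory_gb` and the bit widths shown are illustrative assumptions, not taken from the survey, and the estimate ignores quantization metadata (scales, zero points), activations, and the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter uses the same bit width and no extra
    metadata (scales, zero points) is stored alongside the weights.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical comparison for a 175B-parameter model.
    for bits in (16, 8, 4, 3):
        size = weight_memory_gb(175e9, bits)
        print(f"175B parameters at {bits}-bit: {size:.1f} GB")
```

Under these assumptions a 175B-parameter model needs about 350 GB for 16-bit weights and about 87.5 GB at 4 bits, which is the gap that makes compression attractive for deployment on cheaper hardware.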

Main Contribution

Wide taxonomy and summary of compression methods including pruning, quantization, distillation, low-rank approximation, parameter sharing, and architecture changes.

In-depth analyses of three representative, practical algorithms: SparseGPT (pruning), OPTQ (quantization), and LoRA (low-rank adaptors).

Key Findings

Low-cost, post-training methods now enable compression of very large LLMs without full retraining.

Numbers: SparseGPT prunes a 175B model in ~3 hours on one A100

Practical Use: If you lack resources to fine-tune, prefer low-cost columnwise Optimal Brain Surgeon (OBS)-based methods (e.g., SparseGPT) for large models; plan for iterative calibration for better accuracy.

Evidence Ref: Section 3.4
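
The OBS idea behind this finding can be sketched for a single weight row: zero one weight, then adjust the remaining weights so the layer output on calibration inputs changes as little as possible. This is a minimal sketch of the classic one-weight OBS step under a squared-error objective, not SparseGPT itself, which adds further approximations to scale to large models. All function names and the toy data are assumptions made for illustration.

```python
def solve(a, b):
    """Solve a @ x = b with Gauss-Jordan elimination and partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def hessian(xs):
    """H = sum of outer products x x^T over the calibration inputs xs."""
    d = len(xs[0])
    return [[sum(x[i] * x[j] for x in xs) for j in range(d)] for i in range(d)]


def obs_prune_one(w, h, q):
    """Zero weight q and compensate the others with the OBS update."""
    e_q = [1.0 if i == q else 0.0 for i in range(len(w))]
    z = solve(h, e_q)  # column q of the inverse Hessian
    new_w = [wi - (w[q] / z[q]) * zi for wi, zi in zip(w, z)]
    new_w[q] = 0.0  # remove floating-point residue at the pruned position
    return new_w


def output_error(w, w_new, xs):
    """Squared difference between the outputs of w and w_new on xs."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return sum((dot(w, x) - dot(w_new, x)) ** 2 for x in xs)


if __name__ == "__main__":
    # Toy calibration inputs (hypothetical); features 0 and 1 are correlated.
    xs = [[1.0, 0.9, 0.1], [0.5, 0.6, -0.3], [-1.0, -0.8, 0.4], [0.2, 0.1, 1.0]]
    w = [0.5, -0.3, 0.8]
    w_obs = obs_prune_one(w, hessian(xs), q=0)
    w_naive = [0.0] + w[1:]
    print("naive zeroing error:", output_error(w, w_naive, xs))
    print("OBS-compensated error:", output_error(w, w_obs, xs))
```

Because the compensation is the least-squares optimum among all rows with the chosen weight set to zero, its output error can never exceed that of simply zeroing the weight. The gain is largest when the pruned input is correlated with the inputs that remain.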

Weight-only post-training quantization (OPTQ family) compresses LLMs to low bit widths with small quality loss.

Numbers: OPTQ: OPT-175B perplexity 8.34 → 8.68 and ~3.2× speedup in the reported table results

Practical Use: Use OPTQ (or OWQ/RPTQ variants) to cut memory costs quickly; combine with outlier handling for better accuracy.

Evidence Ref: Section 4.4, Table 5
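
For orientation, weight-only quantization in its simplest form stores each weight row as low-bit integer codes plus one floating-point scale. The sketch below uses plain symmetric round-to-nearest, which is a baseline and not the OPTQ algorithm (OPTQ additionally compensates for rounding error). The function names and the 4-bit default are assumptions.

```python
def quantize_row(row, bits=4):
    """Symmetric round-to-nearest quantization of one weight row.

    Returns integer codes in [-2**(bits-1), 2**(bits-1) - 1] and the
    floating-point scale needed to map the codes back to weights.
    """
    qmax = 2 ** (bits - 1) - 1
    # Fall back to a scale of 1.0 so an all-zero row does not divide by zero.
    scale = (max(abs(v) for v in row) / qmax) or 1.0
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    row = [0.7, -0.35, 0.1, 0.0, -0.62]  # hypothetical weights
    codes, scale = quantize_row(row, bits=4)
    restored = dequantize_row(codes, scale)
    worst = max(abs(a - b) for a, b in zip(row, restored))
    print("codes:", codes)
    print("worst-case absolute error:", worst)
```

With this scheme the per-weight error is bounded by half the scale, so a single large weight in a row coarsens the grid for every other weight in that row. That is one reason the practical advice above pairs quantization with outlier handling.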

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Prune-scale runtime | Pruned OPT-175B in ~3 hours on single A100 (SparseGPT) | unpruned OPT-175B | runtime to prune only (not inference) | - | SparseGPT prunes 175B LLMs in 3 hours on one A100 | Section 3.4
Inference speedup (structured pruning) | up to 15× | uncompressed BERT | -3.97% avg accuracy (on reported tasks) | MNLI/QQP/SQuAD aggregate (Table 2) | ZipLM reports 15× speedup with a ~3.97% avg accuracy drop | Table 2

What To Try In 7 Days

Run weight-only OPTQ (also known as GPTQ) on a production model to cut memory use and measure task-level quality.

If fine-tuning is needed, test LoRA to reduce GPU memory and speed up iteration.

If inference cost dominates, prototype structured pruning (coarse granularity) on a dev model and measure real latency on target hardware.
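
To show why LoRA reduces fine-tuning memory, the sketch below counts trainable parameters and applies a low-rank update in a forward pass. It assumes the standard LoRA formulation y = W x + (alpha / r) B A x with W frozen and B initialized to zero; the function names and the layer sizes are illustrative assumptions, and no particular library API is implied.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. a rank-`rank` adapter."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # A is rank x d_in, B is d_out x rank
    return full, lora


def lora_forward(x, w, a, b, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); only A and B would be trained."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


if __name__ == "__main__":
    full, lora = lora_param_counts(4096, 4096, 8)  # hypothetical layer size
    print(f"full: {full} trainable weights, LoRA: {lora} ({full // lora}x fewer)")

    w = [[1.0, 2.0], [3.0, 4.0]]   # frozen pretrained weights
    a = [[0.5, 0.5]]               # rank-1 down-projection
    b_zero = [[0.0], [0.0]]        # zero init: adapter starts as a no-op
    print(lora_forward([1.0, -1.0], w, a, b_zero, alpha=2.0, rank=1))
```

Because B starts at zero, the adapted layer initially reproduces the pretrained output exactly, and optimizer state is only needed for the small A and B matrices. For a 4096×4096 layer at rank 8 that is 65,536 trainable weights instead of 16,777,216.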

Optimization Features

Token Efficiency
token pruning (PoWER-BERT, LTP)
Infra Optimization
A100-optimized sparsity patterns (2:4), mixed-precision & hardware-aware deployment
Model Optimization
pruning, quantization, low-rank approximation, parameter sharing, efficient architecture design
System Optimization
column-wise compensation (SparseGPT/OPTQ), block-wise and hardware-aware pruning (ZipLM)
Training Optimization
LoRA, knowledge distillation, quantization-aware training (QAT), lightweight QAT (partial parameter updates)
Inference Optimization
weight-only quantization, structured pruning (sublayer/layer, 2:4 sparsity), outlier-aware quantization (OWQ, RPTQ)
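
The 2:4 sparsity pattern listed above allows at most two nonzero values in every group of four consecutive weights. The sketch below builds such a pattern by keeping the two largest-magnitude weights per group. Magnitude selection is an assumption chosen for simplicity; the pruning methods covered in the survey use other selection criteria, and the function name is hypothetical.

```python
def prune_2_of_4(row):
    """Keep the two largest-magnitude weights in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two entries with the largest absolute value.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned


if __name__ == "__main__":
    row = [0.1, -0.9, 0.4, 0.05, 1.0, 0.2, -0.3, 0.0]  # hypothetical weights
    print(prune_2_of_4(row))
```

The fixed group structure is what lets hardware skip the zeros predictably, which is why real latency should still be measured on the target device rather than inferred from the sparsity ratio alone.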

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Survey compiles existing papers; no new experiments to compare all methods under a single protocol.

Emphasis on low-cost methods may under-cover some high-cost but high-accuracy approaches.

When Not To Use

If you require the best possible accuracy and can afford large-scale retraining, prefer specialized retraining pipelines based on quantization-aware training (QAT), knowledge distillation (KD), or low-rank approximation (LRA).

Do not rely on aggressive activation quantization (<8-bit) for critical tasks without validation.

Failure Modes

High compression can cause large task-specific accuracy drops, especially for decoder-only LLMs.

Activation outliers break naive PTQ, causing large performance regressions.
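
The outlier failure mode can be reproduced with a toy example. The sketch assumes the naive scheme is per-tensor absmax quantization to 8 bits, where one scale is shared by all activations; the function names and the activation values are made up for illustration.

```python
def fake_quantize(values, bits=8):
    """Quantize then dequantize using a single absmax scale for the tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


def outlier_demo(outlier=60.0):
    """Error on typical activations without and with one large outlier."""
    typical = [0.1 * i for i in range(-10, 11)]  # values in [-1, 1]
    clean = mean_abs_error(typical, fake_quantize(typical))
    # The outlier stretches the shared scale; measure error on the same values.
    restored = fake_quantize(typical + [outlier])[:len(typical)]
    dirty = mean_abs_error(typical, restored)
    return clean, dirty


if __name__ == "__main__":
    clean, dirty = outlier_demo()
    print("mean abs error without outlier:", clean)
    print("mean abs error with outlier:   ", dirty)
```

A single large value stretches the shared scale, so the ordinary activations collapse onto a handful of quantization levels and their error grows by more than an order of magnitude in this toy setup. Outlier-aware schemes target this effect by treating the extreme values separately.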

Core Entities

Models

BERT, RoBERTa, GPT-2, OPT, LLaMA, BLOOM, DynaBERT, ALBERT, GPT-3

Metrics

perplexity, Accuracy, F1, FLOPs reduction, inference speedup, memory footprint

Datasets

MNLI, QQP, SQuAD 1.1, WikiText2

Benchmarks

perplexity (WikiText2), Accuracy

Context Entities

Models

ZipLM, SparseGPT, OPTQ, OWQ, RPTQ, LoRA, Kprune, KCM, CoFi

Metrics

Accuracy

Datasets

small calibration sets (used in PTQ/PTP), task fine-tuning corpora (for QAT/KD)

Benchmarks

MNLI, QQP, SQuAD, WikiText2 (as used in comparative tables)