Survey of pruning, quantization, distillation and low-cost methods for compressing modern LLMs

Overview

Decision Snapshot: Ready For Pilot

The survey synthesizes many public results and highlights practical, tested algorithms (SparseGPT, OPTQ, LoRA); recommendations are evidence-based but depend on reported benchmarks and hardware specifics.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 90%

Production readiness: 70%

Novelty: 30%

Authors

Seungcheol Park, Jaehyeon Choi, Sojin Lee, U Kang

Why It Matters For Business

Compressing LLMs cuts hosting and inference costs and enables deployment on cheaper hardware; low-cost, post-training methods make this feasible without retraining large models.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Founder

Summary TLDR

This paper surveys methods to make large language models smaller and cheaper. It covers pruning, quantization, knowledge distillation, low-rank methods, parameter sharing and efficient architectures. The authors highlight two priorities for practical work: (1) focus on low-cost algorithms that work on huge LLMs, and (2) prefer iterative, task-aware optimization over single-shot layer reconstruction. They also analyze three representative algorithms (SparseGPT for pruning, OPTQ for quantization, LoRA for low-rank adaptation) and list open research directions for activation quantization, accurate LLM pruning, and combining methods.

Problem Statement

Large pretrained language models cost a lot to store and run. The field has many compression algorithms, but practitioners struggle to pick methods that scale to billion-parameter LLMs and keep accuracy while staying cheap to apply.

Main Contribution

Wide taxonomy and summary of compression methods including pruning, quantization, distillation, low-rank approximation, parameter sharing, and architecture changes.

In-depth analyses of three representative, practical algorithms: SparseGPT (pruning), OPTQ (quantization), and LoRA (low-rank adaptors).

Key Findings

Low-cost, post-training methods now enable compression of very large LLMs without full retraining.

Numbers: SparseGPT prunes a 175B model in ~3 hours on one A100

Practical Use: If you lack resources to fine-tune, prefer low-cost columnwise OBS-based methods (e.g., SparseGPT) for large models; plan for iterative calibration for better accuracy.

Evidence Ref: Section 3.4

Weight-only post-training quantization (OPTQ family) compresses LLMs to low bit widths with small quality loss.

Numbers: OPTQ on OPT-175B: perplexity 8.34→8.68 with a ~3.2× speedup, per the reported table results

Practical Use: Use OPTQ (or OWQ/RPTQ variants) to cut memory costs quickly; combine with outlier handling for better accuracy.

Evidence Ref: Section 4.4, Table 5
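To make the weight-only idea concrete, here is a minimal sketch of the simplest baseline: round-to-nearest quantization with one scale per group of weights. This is not OPTQ itself, which additionally compensates the rounding error column by column; the function names, the group size of 8, and the synthetic weights are all assumptions made for illustration. The memory helper counts weight storage only and ignores the per-group scales.

```python
import random

def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1                   # largest positive code, 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [round(w / scale) for w in weights]  # integer codes in [-qmax, qmax]
    return codes, scale

def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

def quantize_row(row, bits=4, group_size=8):
    """Quantize and dequantize a row group by group (hypothetical group size)."""
    out = []
    for start in range(0, len(row), group_size):
        codes, scale = quantize_group(row[start:start + group_size], bits)
        out.extend(dequantize_group(codes, scale))
    return out

def weight_memory_gb(n_params, bits):
    """Weight storage only; ignores scales, activations and KV cache."""
    return n_params * bits / 8 / 1e9

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(64)]   # synthetic weights
approx = quantize_row(row, bits=4, group_size=8)
max_err = max(abs(a - b) for a, b in zip(row, approx))
print(f"max abs rounding error: {max_err:.5f}")
print(f"175B params: {weight_memory_gb(175e9, 16):.1f} GB at 16-bit, "
      f"{weight_memory_gb(175e9, 4):.1f} GB at 4-bit")
```

Within each group the rounding error is at most half a quantization step, which is why smaller groups (and explicit outlier handling) tend to help accuracy at the cost of storing more scales.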

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Prune-scale runtime	Pruned OPT-175B in ~3 hours on single A100 (SparseGPT)	unpruned OPT-175B	runtime to prune only (not inference)	—	SparseGPT prunes 175B LLMs in 3 hours on one A100	Section 3.4
Inference speedup (structured pruning)	up to 15×	uncompressed BERT	-3.97% avg accuracy (on reported tasks)	MNLI/QQP/SQuAD aggregate (Table 2)	ZipLM reports 15× speedup with ~-3.97% avg accuracy drop	Table 2

What To Try In 7 Days

Run weight-only OPTQ (also known as GPTQ) on a production model to cut memory use, then measure task-level quality.

If fine-tuning is needed, test LoRA to reduce GPU memory and speed up iteration.

If inference cost dominates, prototype structured pruning (coarse granularity) on a dev model and measure real latency on target hardware.
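For the LoRA step above, the following is a minimal sketch of why it reduces memory during fine-tuning: the pretrained weight matrix stays frozen and only two small low-rank factors are trained. The class name `LoRALinear`, the helper names, and the toy matrix are hypothetical; a real pilot would use an existing fine-tuning library rather than this pure-Python illustration.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                                   # frozen pretrained weights
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))   # low-rank path, never forms B @ A
        return [b + self.scale * u for b, u in zip(base, update)]

W = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4]]             # toy 2x3 "pretrained" matrix
layer = LoRALinear(W, rank=1, alpha=2.0)
x = [1.0, 2.0, 3.0]
print(layer.forward(x) == matvec(W, x))              # True: B starts at zero

full = 4096 * 4096
lora = lora_param_count(4096, 4096, rank=8)
print(f"trainable: {lora} of {full} ({100 * lora / full:.2f}%)")
```

With a 4096×4096 layer and rank 8, the trainable factors hold 65,536 parameters against roughly 16.8 million in the full matrix, which is where the optimizer-state and gradient memory savings come from.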

Optimization Features

Token Efficiency

token pruning (PoWER-BERT, LTP)

Infra Optimization

A100-optimized sparsity patterns (2:4); mixed-precision & hardware-aware deployment

Model Optimization

pruning; quantization; low-rank approximation; parameter sharing; efficient architecture design

System Optimization

column-wise compensation (SparseGPT/OPTQ); block-wise and hardware-aware pruning (ZipLM)

Training Optimization

LoRA; knowledge distillation; quantization-aware training (QAT); lightweight QAT (partial parameter updates)

Inference Optimization

weight-only quantization; structured pruning (sublayer/layer, 2:4 sparsity); outlier-aware quantization (OWQ, RPTQ)
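The 2:4 sparsity pattern listed above means that every group of four consecutive weights contains at most two non-zero values. A minimal sketch, assuming a plain magnitude criterion to choose which two weights survive (SparseGPT selects and compensates weights differently; the function name `prune_2_of_4` is hypothetical):

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude weights in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude weights in this group.
        order = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:2])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out

row = [0.1, -0.9, 0.4, 0.05, 0.3, 0.2, -0.25, 0.0]
print(prune_2_of_4(row))
# [0.0, -0.9, 0.4, 0.0, 0.3, 0.0, -0.25, 0.0]
```

The pattern is fixed at exactly 50% sparsity, so any speedup depends on hardware and kernels that exploit it; measuring latency on the target device remains necessary.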

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Survey compiles existing papers; no new experiments to compare all methods under a single protocol.

Emphasis on low-cost methods may under-cover some high-cost but high-accuracy approaches.

When Not To Use

If you require best possible accuracy and can afford large-scale retraining, prefer specialized QAT/KD/LRA retraining pipelines.

Do not rely on aggressive activation quantization (<8-bit) for critical tasks without validation.

Failure Modes

High compression can cause large task-specific accuracy drops, especially for decoder-only LLMs.

Activation outliers break naive PTQ, causing large performance regressions.
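The outlier failure mode can be illustrated with a small sketch, assuming per-tensor absmax quantization as the "naive" scheme and synthetic activation values: a single large activation stretches the quantization scale, so the ordinary values lose almost all of their resolution. The function names and the numbers are illustrative only.

```python
def quantize_absmax(values, bits=8):
    """Naive per-tensor quantization: one scale set by the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    return [round(v / scale) * scale for v in values]

def mean_abs_error(original, approx, count):
    """Mean absolute error over the first `count` entries."""
    pairs = list(zip(original, approx))[:count]
    return sum(abs(a - b) for a, b in pairs) / count

typical = [(i - 8) / 10 for i in range(17)]      # synthetic activations in [-0.8, 0.8]
with_outlier = typical + [60.0]                  # one hypothetical outlier channel

err_clean = mean_abs_error(typical, quantize_absmax(typical), len(typical))
err_outlier = mean_abs_error(with_outlier, quantize_absmax(with_outlier), len(typical))

print(f"error on typical values, no outlier:   {err_clean:.5f}")
print(f"error on typical values, with outlier: {err_outlier:.5f}")
```

The error on the ordinary values grows by more than an order of magnitude once the outlier sets the scale, which is the motivation for the outlier-aware schemes (OWQ, RPTQ) mentioned in this summary.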

Core Entities

Models

BERT, RoBERTa, GPT-2, OPT, LLaMA, BLOOM, DynaBERT, ALBERT, GPT-3

Metrics

perplexity, accuracy, F1, FLOPs reduction, inference speedup, memory footprint

Datasets

MNLI, QQP, SQuAD 1.1, WikiText2

Benchmarks

perplexity (WikiText2), accuracy

Context Entities

Models

ZipLM, SparseGPT, OPTQ, OWQ, RPTQ, LoRA, Kprune, KCM, CoFi

Metrics

Accuracy

Datasets

small calibration sets (used in PTQ/PTP); task fine-tuning corpora (for QAT/KD)

Benchmarks

MNLI, QQP, SQuAD, WikiText2 (as used in comparative tables)
