Survey of pruning, quantization, distillation and low-cost methods for compressing modern LLMs

January 27, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.3

Cost Impact Score

0.9

Citation Count

3

Authors

Seungcheol Park, Jaehyeon Choi, Sojin Lee, U Kang

Why It Matters For Business

Compressing LLMs cuts hosting and inference costs and enables deployment on cheaper hardware; low-cost, post-training methods make this feasible without retraining large models.

Summary TLDR

This paper surveys methods to make large language models smaller and cheaper. It covers pruning, quantization, knowledge distillation, low-rank methods, parameter sharing and efficient architectures. The authors highlight two priorities for practical work: (1) focus on low-cost algorithms that work on huge LLMs, and (2) prefer iterative, task-aware optimization over single-shot layer reconstruction. They also analyze three representative algorithms (SparseGPT for pruning, OPTQ for quantization, LoRA for low-rank adaptation) and list open research directions for activation quantization, accurate LLM pruning, and combining methods.

Problem Statement

Large pretrained language models cost a lot to store and run. The field has many compression algorithms, but practitioners struggle to pick methods that scale to billion-parameter LLMs and keep accuracy while staying cheap to apply.

Main Contribution

Wide taxonomy and summary of compression methods including pruning, quantization, distillation, low-rank approximation, parameter sharing, and architecture changes.

In-depth analyses of three representative, practical algorithms: SparseGPT (pruning), OPTQ (quantization), and LoRA (low-rank adaptors).

Discussion of desired properties for low-cost compression and a list of research directions for LLM-focused compression.

Key Findings

Low-cost, post-training methods now enable compression of very large LLMs without full retraining.

Numbers: SparseGPT prunes a 175B model in ~3 hours on one A100

Weight-only post-training quantization (OPTQ family) compresses LLMs to low bit widths with small quality loss.

Numbers: OPTQ raises OPT-175B perplexity from 8.34 to 8.68 with a ~3.2× speedup in the reported results
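
To make "weight-only quantization" concrete, here is a minimal sketch of symmetric round-to-nearest quantization applied to one row of weights. This is the simple baseline, not OPTQ itself: OPTQ additionally compensates the remaining weights for each rounding error, which this sketch omits. The function names and the sample weights are illustrative assumptions.

```python
from typing import List, Tuple


def quantize_row(row: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one weight row.

    Returns integer codes and the per-row scale needed to dequantize.
    """
    qmax = 2 ** (bits - 1) - 1                     # e.g. 7 for 4-bit
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Illustrative weights, not taken from any real model.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]

for bits in (8, 4, 3):
    codes, scale = quantize_row(weights, bits)
    err = mean_abs_error(weights, dequantize_row(codes, scale))
    print(f"{bits}-bit: codes={codes} mean abs error={err:.4f}")
```

The reconstruction error grows as the bit width shrinks, which is why methods in the OPTQ family spend extra effort on error compensation at 3 to 4 bits.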

Pruning can give large runtime speedups, but pruning granularity trades accuracy against hardware acceleration: coarser structures are easier to accelerate and tend to cost more accuracy.

Numbers: ZipLM reports up to 15× speedup with an average accuracy drop of ~3.97% on encoder models

Low-rank adaptors (LoRA/PEFT) dramatically cut fine-tuning memory while preserving accuracy.

Numbers: LoRA reduced fine-tuning memory for GPT-3 175B from ~1.2TB to ~350GB
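
The memory saving follows from how few parameters LoRA trains. The sketch below, a plain-Python illustration rather than the reference implementation, freezes a weight matrix W and adds a trainable low-rank update, computing h = Wx + (alpha / r) · B(Ax). The helper names, the rank, and the 12288×12288 layer size are assumptions chosen for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def lora_param_counts(d_in: int, d_out: int, rank: int) -> Tuple[int, int]:
    """Trainable parameters: full fine-tuning vs. a rank-`rank` LoRA adapter."""
    full = d_in * d_out               # every entry of W is trainable
    lora = rank * (d_in + d_out)      # A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(m: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector,
                 alpha: float, rank: int) -> Vector:
    """h = W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]


# Hypothetical layer size and rank, for the arithmetic only.
full, lora = lora_param_counts(12288, 12288, rank=8)
print(f"full: {full:,} trainable params, LoRA: {lora:,} ({full // lora}x fewer)")
```

Because B is typically initialized to zero, the adapted layer starts out identical to the frozen one, and optimizer state is only needed for the small A and B matrices.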

Activation quantization below 8 bits remains fragile and causes major quality drops without special handling.

Numbers: Multiple PTQ studies report activation bit-width as a main unresolved issue; RPTQ/OWQ partially address outliers

Results

Prune-scale runtime

Value: Pruned OPT-175B in ~3 hours on a single A100 (SparseGPT)

Baseline: unpruned OPT-175B

Inference speedup (structured pruning)

Value: up to 15×

Baseline: uncompressed BERT

Weight-only quantization performance

Value: OPT-175B perplexity 8.34 → 8.68; ~3.2× speedup

Baseline: OPT-175B full-precision perplexity 8.34

Fine-tuning memory reduction

Value: GPT-3 175B fine-tuning memory 1.2TB → 350GB using LoRA

Baseline: standard full-parameter fine-tuning

Extreme low-rank compression

Value: up to 97.9% parameter reduction reported

Baseline: original parameter count

Who Should Care

What To Try In 7 Days

Run weight-only OPTQ (also published as GPTQ) on a production model to cut memory use and measure task-level quality.

If fine-tuning is needed, test LoRA to reduce GPU memory and speed up iteration.

If inference cost dominates, prototype structured pruning (coarse granularity) on a dev model and measure real latency on target hardware.
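
Before running any of these experiments, a back-of-envelope estimate shows how much weight memory each bit width could save. The helper below is a rough sketch: it counts weight storage only and ignores quantization scales, zero points, activations, and the KV cache. The function name and the 175B parameter count are illustrative assumptions.

```python
def weight_memory_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores quantization metadata (scales, zero points), activations,
    and the KV cache, so real deployments need more than this.
    """
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 175_000_000_000  # illustrative 175B-parameter model

for bits in (16, 8, 4, 3):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, 16-bit weights take about 350 GB and 4-bit weights about 87.5 GB, which is the difference between a multi-node deployment and a handful of accelerators.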

Optimization Features

Token Efficiency

  • token pruning (PoWER-BERT, LTP)

Infra Optimization

  • A100-optimized sparsity patterns (2:4)
  • mixed-precision & hardware-aware deployment
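
The 2:4 pattern listed above keeps at most two nonzero weights in every group of four consecutive weights. A minimal sketch of building such a pattern is shown below; it selects survivors by magnitude, which is a simple stand-in and not the selection rule SparseGPT uses. The function name and sample row are illustrative assumptions.

```python
from typing import List


def prune_2_of_4(row: List[float]) -> List[float]:
    """Zero the two smallest-magnitude weights in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned: List[float] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Illustrative weights, not taken from any real model.
row = [0.1, -0.9, 0.5, 0.05, 1.0, 2.0, -3.0, 0.0]
print(prune_2_of_4(row))
```

The result is exactly 50% sparse with a regular layout, which is what lets supporting hardware skip the zeros without the indexing overhead of unstructured sparsity.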

Model Optimization

  • pruning
  • quantization
  • low-rank approximation
  • parameter sharing
  • efficient architecture design

System Optimization

  • column-wise compensation (SparseGPT/OPTQ)
  • block-wise and hardware-aware pruning (ZipLM)

Training Optimization

  • LoRA
  • knowledge distillation
  • quantization-aware training (QAT)
  • lightweight QAT (partial parameter updates)

Inference Optimization

  • weight-only quantization
  • structured pruning (sublayer/layer, 2:4 sparsity)
  • outlier-aware quantization (OWQ, RPTQ)

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Survey compiles existing papers; no new experiments to compare all methods under a single protocol.
  • Emphasis on low-cost methods may under-cover some high-cost but high-accuracy approaches.
  • Hardware-specific speedups depend on target accelerator and software stack.

When Not To Use

  • If you require the best possible accuracy and can afford large-scale retraining, prefer specialized retraining pipelines built on QAT, KD, or low-rank approximation (LRA).
  • Do not rely on aggressive activation quantization (<8-bit) for critical tasks without validation.
  • Avoid one-shot low-cost pruning for high-compression targets without iterative compensation.

Failure Modes

  • High compression can cause large task-specific accuracy drops, especially for decoder-only LLMs.
  • Activation outliers break naive PTQ, causing large performance regressions.
  • Mixed-precision schemes can hurt inference throughput on certain hardware stacks.
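
The activation-outlier failure mode can be reproduced with a toy example. The sketch below applies naive per-tensor 4-bit quantization to a small activation vector, once as-is and once with a single large outlier appended; the values and function names are illustrative assumptions, and this is not the method of any specific paper.

```python
from typing import List


def fake_quantize(values: List[float], bits: int) -> List[float]:
    """Quantize and immediately dequantize with one shared symmetric scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(v / scale) * scale for v in values]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Illustrative activations; 40.0 plays the role of an outlier channel.
normal = [0.4, -0.7, 0.2, 0.9, -0.3, 0.6, -0.8, 0.1]
with_outlier = normal + [40.0]

err_clean = mean_abs_error(normal, fake_quantize(normal, 4))
# Error on the same eight values when the outlier sets the shared scale.
err_outlier = mean_abs_error(normal, fake_quantize(with_outlier, 4)[:len(normal)])

print(f"error without outlier: {err_clean:.4f}")
print(f"error with outlier:    {err_outlier:.4f}")
```

With the outlier present, the shared scale becomes so coarse that every ordinary value rounds to zero. Outlier-aware schemes avoid this by treating the outlier dimensions separately from the rest.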

Core Entities

Models

  • BERT
  • RoBERTa
  • GPT-2
  • OPT
  • LLaMA
  • BLOOM
  • DynaBERT
  • ALBERT
  • GPT-3

Metrics

  • perplexity
  • accuracy
  • F1
  • FLOPs reduction
  • inference speedup
  • memory footprint

Datasets

  • MNLI
  • QQP
  • SQuAD 1.1
  • WikiText2

Benchmarks

  • perplexity (WikiText2)
  • accuracy

Context Entities

Models

  • ZipLM
  • SparseGPT
  • OPTQ
  • OWQ
  • RPTQ
  • LoRA
  • Kprune
  • KCM
  • CoFi

Metrics

  • Accuracy

Datasets

  • small calibration sets (used in PTQ/PTP)
  • task fine-tuning corpora (for QAT/KD)

Benchmarks

  • MNLI, QQP, SQuAD, WikiText2 (as used in comparative tables)