Practical review of quantization, pruning, distillation and low-rank compression for LLMs

Overview

Decision Snapshot: Needs Validation

This is a survey compiling many reproducible results; individual methods vary in maturity. PTQ and pruning show reproducible speed/memory gains, but deployment depends on calibration data and hardware kernel support.

Citations: 37

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 30%

Authors

Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang

Why It Matters For Business

Compression cuts model memory, cost and inference latency so LLMs can run on fewer GPUs or at lower cloud cost; pick compression that fits your accuracy budget and hardware.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

This is a focused survey of model compression methods for large language models (LLMs). It summarizes quantization (post-training quantization, PTQ, and quantization-aware training, QAT), pruning (unstructured, structured, and N:M semi-structured), knowledge distillation (black-box and white-box), and low-rank factorization. The paper collects common metrics, benchmarks, representative results (tables for quantization and pruning), and practical challenges such as activation outliers, calibration data, and hardware support. It highlights that modern PTQ and pruning methods can yield multi-fold memory or speed gains with small measured accuracy gaps on standard benchmarks, but trade-offs remain, and deployment details (calibration datasets, hardware kernels) are key.

Problem Statement

LLMs deliver strong results but are huge and costly to run. The field lacks a focused summary of how to make LLMs smaller and faster in practice. This survey collects and compares quantization, pruning, distillation and low-rank methods, their metrics, and deployment issues so practitioners can pick methods and understand trade-offs.

Main Contribution

Organizes LLM compression methods into quantization, pruning, distillation, and low-rank factorization with subcategories and examples.

Summarizes evaluation metrics and benchmarks used for compressed LLMs, including model size, FLOPs, MFU, latency, speedup and compression ratio.
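The size and speed metrics named above reduce to simple ratios. The following is a minimal sketch, assuming the usual definitions (compression ratio as original size over compressed size, speedup as baseline latency over compressed latency, MFU as achieved over peak FLOPS). The function names and the 13B-parameter example are illustrative and not taken from the paper.

```python
def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Storage needed for the weights alone, ignoring quantization metadata."""
    return num_params * bits_per_weight / 8


def compression_ratio(original_bytes: float, compressed_bytes: float) -> float:
    """How many times smaller the compressed model is."""
    return original_bytes / compressed_bytes


def speedup(baseline_latency_s: float, compressed_latency_s: float) -> float:
    """How many times faster the compressed model runs."""
    return baseline_latency_s / compressed_latency_s


def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Fraction of the hardware's peak throughput actually used."""
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical example: a 13B-parameter model stored at 16 bits vs 4 bits.
fp16_size = weight_bytes(13e9, 16)
int4_size = weight_bytes(13e9, 4)
ratio = compression_ratio(fp16_size, int4_size)
```

Measured speedup is usually smaller than the compression ratio, because latency also depends on kernel support and on whether the workload is memory-bound or compute-bound.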

Key Findings

Weight-only post-training quantization (e.g., GPTQ) can reduce weight precision to 3 bits with small accuracy loss and large runtime gains.

Numbers: WikiText-2 perplexity +0.34; speedup 3.24× (GPTQ, Table 1)

Practical Use: If you need faster inference with little accuracy hit, try 3-bit GPTQ-style weight-only PTQ on similar models and measure perplexity on your task.

Evidence Ref: Table 1, Section 3.2.1
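To make "weight-only quantization to 3 bits" concrete, here is a minimal sketch of plain round-to-nearest quantization of a single weight row. This is not GPTQ: GPTQ additionally uses approximate second-order information to compensate for rounding error, which this sketch omits. The function names and the sample row are hypothetical.

```python
def quantize_row(weights, bits=3):
    """Asymmetric uniform round-to-nearest quantization of one weight row."""
    levels = 2 ** bits - 1                 # 3 bits -> integer codes 0..7
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_row(codes, scale, lo):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + lo for c in codes]


row = [-0.8, -0.1, 0.0, 0.25, 0.6]        # illustrative values only
codes, scale, zero = quantize_row(row, bits=3)
restored = dequantize_row(codes, scale, zero)
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

With round-to-nearest, the per-weight error is bounded by half the step size `scale`, so the error grows quickly as the bit width shrinks or the value range widens.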

Activation-aware 8-bit methods can keep accuracy almost unchanged while reducing memory.

Numbers: OPT-13B LLM.int8(): C4 perplexity Δ=0.00; speedup 1.22× (Table 1)

Practical Use: Use 8-bit weight+activation PTQ (LLM.int8()/SmoothQuant) to cut memory with near-zero accuracy loss before trying aggressive low-bit schemes.

Evidence Ref: Table 1, Section 3.2.2
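One way activation-aware methods tame outlier channels is to rescale activations and weights so the product stays the same while the activation range shrinks. The sketch below follows the SmoothQuant-style per-channel factor s_j = max|X_j|^α / max|W_j|^(1-α). The matrices, the default α = 0.5, and the helper names are illustrative assumptions, and the code assumes no channel is entirely zero.

```python
def smoothing_scales(X, W, alpha=0.5):
    """Per-input-channel factors; X is tokens x in, W is in x out."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)   # activation range of channel j
        w_max = max(abs(v) for v in W[j])         # weight range of channel j
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def smooth(X, W, scales):
    """Divide activations and multiply weights by the same per-channel factor."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in row] for row, s in zip(W, scales)]
    return X_s, W_s


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


X = [[0.5, 40.0], [-0.3, -60.0]]     # second channel carries the outliers
W = [[0.2, -0.1], [0.05, 0.3]]
X_s, W_s = smooth(X, W, smoothing_scales(X, W))
before = max(abs(v) for row in X for v in row)
after = max(abs(v) for row in X_s for v in row)
```

The output of `matmul(X_s, W_s)` matches `matmul(X, W)` up to floating-point error, while the largest activation magnitude drops, which makes the activations easier to quantize to 8 bits.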

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity difference (Wikitext-2) and speedup	GPTQ: +0.34 perplexity; 3.24× speedup	full-precision LLM	+0.34 perp; 3.24×	WikiText-2	Table 1 reports GPTQ 3-bit on OPT-175B with Δperp 0.34 and 3.24× speedup	Table 1, Section 3.2.1
Perplexity difference (C4) and speedup	LLM.int8(): 0.00 perplexity change; 1.22× speedup	FP16 model	0.00 perp; 1.22×	C4	Table 1 lists LLM.int8() on OPT-13B with C4 Δ=0.00 and speedup 1.22×	Table 1, Section 3.2.2

What To Try In 7 Days

Run 8-bit weight+activation PTQ (LLM.int8() or SmoothQuant) on a 7–13B model to validate near-zero accuracy loss.

Apply GPTQ 3-bit PTQ to a noncritical task and measure latency and perplexity to estimate speed/accuracy trade-offs.

If serving long contexts, quantize KV cache (KVQuant/KIVI) to free memory and test throughput gains locally.
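Before testing KV cache quantization, it helps to estimate how much memory the cache takes. The sketch below uses the standard count of one key and one value vector per layer, head, and token. The model shape is a hypothetical example, and the estimate ignores the scale and zero-point metadata that real quantized caches also store.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bits=16):
    """Bytes held by the KV cache: keys and values for every layer and token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return elements * bits // 8


# Hypothetical model shape: 32 layers, 32 KV heads of dimension 128, 4096 tokens.
fp16_cache = kv_cache_bytes(32, 32, 128, 4096, bits=16)
int4_cache = kv_cache_bytes(32, 32, 128, 4096, bits=4)
saved_gib = (fp16_cache - int4_cache) / 2 ** 30
```

Because the cache grows linearly with sequence length and batch size, the memory freed by a lower bit width can be spent on longer contexts or larger batches.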

Optimization Features

Token Efficiency

KV cache quantization reduces per-token memory; compression ratio reduces storage and transfer costs

Infra Optimization

exploit hardware support for N:M sparsity (Ampere 2:4); use LUT-based GEMM or optimized kernels for quantized matmuls
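The 2:4 pattern keeps at most two non-zero values in every group of four consecutive weights. Here is a minimal sketch that builds such a pattern by magnitude. Magnitude is used only as the simplest selection criterion, and the function name and sample weights are hypothetical.

```python
def prune_n_m(weights, n=2, m=4):
    """Keep the n largest-magnitude values in each group of m; zero the rest."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.1, -0.9, 0.4, 0.05, 0.7, 0.2, -0.3, 0.0]
sparse_row = prune_n_m(row)   # n=2, m=4 gives the 2:4 pattern
```

The fixed group structure is what lets sparse hardware kernels skip the zeros, unlike unstructured sparsity, where zeros can fall anywhere.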

Model Optimization

quantization, pruning, knowledge distillation, low-rank factorization, semi-structured N:M sparsity
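For low-rank factorization, the saving comes from replacing one dense matrix with two thin factors. The sketch below is a parameter count under that assumption, for a d_out × d_in matrix approximated by a d_out × r factor times an r × d_in factor. The dimensions and rank are illustrative.

```python
def dense_params(d_out: int, d_in: int) -> int:
    """Parameters in a full d_out x d_in weight matrix."""
    return d_out * d_in


def low_rank_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two factors (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer: 4096 x 4096 weights approximated at rank 64.
full = dense_params(4096, 4096)
factored = low_rank_params(4096, 4096, 64)
reduction = full / factored
```

The factorization only saves parameters when the rank is below d_out · d_in / (d_out + d_in), and whether accuracy holds at that rank has to be measured on the target task.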

System Optimization

use Roofline Model to assess hardware bottlenecks; match pruning/quantization format to GPU sparse kernels
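The Roofline Model bounds attainable throughput by the smaller of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). In the minimal sketch below, the hardware numbers are hypothetical, and the intensity values are a rough illustration that counts only weight traffic and ignores activations and dequantization cost.

```python
def attainable_flops(peak_flops: float, bandwidth_bytes_s: float, intensity: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, bandwidth_bytes_s * intensity)


def ridge_point(peak_flops: float, bandwidth_bytes_s: float) -> float:
    """Intensity above which the workload becomes compute-bound."""
    return peak_flops / bandwidth_bytes_s


PEAK = 100e12        # hypothetical accelerator: 100 TFLOP/s
BANDWIDTH = 1e12     # hypothetical memory bandwidth: 1 TB/s

# Rough decode-time intensities: 2 FLOPs per weight, 2 bytes (FP16) vs 0.5 bytes (4-bit).
fp16_bound = attainable_flops(PEAK, BANDWIDTH, 2 / 2.0)
int4_bound = attainable_flops(PEAK, BANDWIDTH, 2 / 0.5)
```

Under these assumptions both cases sit far below the ridge point, so the workload is memory-bound, and shrinking the bytes moved per weight raises the bound roughly in proportion.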

Training Optimization

Quantization-Aware Training (QAT), LoRA, layer-wise/task-aware distillation

Inference Optimization

weight-only PTQ for smaller memory footprint; KV cache quantization to increase context length; structured pruning for hardware-friendly speedups

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Survey-level work: no new method or code provided in paper.

Compression results depend heavily on calibration data and evaluation tasks.

When Not To Use

When absolute maximum accuracy is required (safety-critical or high-stakes decisions).

If you lack a good calibration dataset for PTQ or pruning.

Failure Modes

Accuracy drops from aggressive low-bit quantization or deep pruning.

Activation outliers can cause large quantization errors if not handled.
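The outlier failure mode can be reproduced in a few lines. The sketch below uses symmetric absmax quantization at 4 bits, so a single large value stretches the scale until the remaining values all round to zero. The values and the bit width are illustrative assumptions.

```python
def absmax_quantize(values, bits=4):
    """Symmetric per-tensor quantization: scale set by the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1             # 4 bits -> codes in -7..7
    scale = max(abs(v) for v in values) / qmax
    codes = [round(v / scale) for v in values]
    return codes, scale


small = [0.1, 0.2, -0.15, 0.05]            # typical activations (illustrative)
codes_clean, scale_clean = absmax_quantize(small)
codes_outlier, scale_outlier = absmax_quantize(small + [20.0])

# Error on the small values in each case.
err_clean = max(abs(v - c * scale_clean) for v, c in zip(small, codes_clean))
err_outlier = max(abs(v - c * scale_outlier) for v, c in zip(small, codes_outlier))
```

With the outlier present, every small value is encoded as 0, so its information is lost entirely. Handling outliers separately or rescaling channels before quantizing avoids this.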

Core Entities

Models

GPT-175B, LLaMA (various sizes), LLaMA2, OPT-175B, BLOOM-176B, GPT-J-6B, OPT-13B

Metrics

model size, FLOPs, Mean FLOPS Utilization (MFU), inference time (latency), speedup ratio, compression ratio, perplexity

Datasets

WikiText-2, C4, PTB, LAMBADA, PIQA, OpenBookQA, GSM8K, CommonsenseQA, StrategyQA, Vicuna-Instructions, User-Oriented-Instructions, EleutherAI LM Harness, BIG-Bench

Benchmarks

BIG-Bench, EleutherAI LM Harness, WikiText-2 perplexity, C4 perplexity, zero-shot instruction benchmarks (Vicuna, User-Oriented)

Context Entities

Models

GPT-4, ChatGPT (gpt-3.5-turbo), T5/FlanT5, GPT2
