Practical review of quantization, pruning, distillation and low-rank compression for LLMs

August 15, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

This is a survey compiling many reproducible results; individual methods vary in maturity. Post-training quantization (PTQ) and pruning show reproducible speed and memory gains, but deployment depends on calibration data and hardware kernel support.

Citations: 37

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 30%

Authors

Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang

Why It Matters For Business

Compression cuts model memory, cost and inference latency so LLMs can run on fewer GPUs or at lower cloud cost; pick compression that fits your accuracy budget and hardware.

Summary (TL;DR)

This is a focused survey of model compression methods for large language models (LLMs). It summarizes quantization (post-training quantization, PTQ, and quantization-aware training, QAT), pruning (unstructured, structured, and semi-structured N:M), knowledge distillation (black-box and white-box), and low-rank factorization. The paper collects common metrics, benchmarks, representative results (tables for quantization and pruning), and practical challenges such as activation outliers, calibration data, and hardware support. It highlights that modern PTQ and pruning methods can yield multi-fold memory or speed gains with small measured accuracy gaps on standard benchmarks, but trade-offs remain, and deployment details (calibration datasets, hardware kernels) are key.

Problem Statement

LLMs deliver strong results but are huge and costly to run. The field lacks a focused summary of how to make LLMs smaller and faster in practice. This survey collects and compares quantization, pruning, distillation and low-rank methods, their metrics, and deployment issues so practitioners can pick methods and understand trade-offs.

Main Contribution

Organizes LLM compression methods into quantization, pruning, distillation, and low-rank factorization with subcategories and examples.

Summarizes evaluation metrics and benchmarks used for compressed LLMs, including model size, FLOPs, MFU, latency, speedup and compression ratio.
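The size and speed metrics named above reduce to simple ratios. The sketch below uses common definitions, which may differ in detail from the survey's exact formulas. The 13B-parameter model, the bit widths, and the latencies are hypothetical inputs chosen only for illustration.

```python
def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Storage for the weights alone, ignoring scales, zero-points and metadata."""
    return num_params * bits_per_weight / 8


def compression_ratio(original_bytes: float, compressed_bytes: float) -> float:
    """Original size over compressed size; 4.0 means the model is 4x smaller."""
    return original_bytes / compressed_bytes


def speedup_ratio(baseline_latency_s: float, compressed_latency_s: float) -> float:
    """Baseline latency over compressed latency; above 1.0 means faster."""
    return baseline_latency_s / compressed_latency_s


def flops_utilization(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Fraction of the hardware's peak throughput that the workload reaches."""
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical example: 13B parameters stored in FP16 versus 4-bit weights.
fp16_bytes = weight_bytes(13_000_000_000, 16)
int4_bytes = weight_bytes(13_000_000_000, 4)
ratio = compression_ratio(fp16_bytes, int4_bytes)

# Hypothetical latencies in seconds, measured on the same hardware and prompt.
speedup = speedup_ratio(baseline_latency_s=0.120, compressed_latency_s=0.050)

print(f"compression ratio: {ratio:.1f}x, speedup: {speedup:.2f}x")
```

Reporting both ratios together is useful because a smaller model is not automatically a faster one; the speedup depends on whether the hardware has kernels for the compressed format.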

Key Findings

Weight-only post-training quantization (e.g., GPTQ) can reduce weight precision to 3 bits with small accuracy loss and large runtime gains.

Numbers: WikiText-2 perplexity +0.34; speedup 3.24× (GPTQ, Table 1)

Practical Use: If you need faster inference with little accuracy hit, try 3-bit GPTQ-style weight-only PTQ on similar models and measure perplexity on your task.

Evidence Ref: Table 1, Section 3.2.1
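To make "3-bit weight-only quantization" concrete, here is a minimal sketch of plain round-to-nearest quantization on one group of weights. This is not GPTQ: GPTQ additionally compensates for rounding error using second-order information from calibration data. The function names and the example weights are hypothetical.

```python
def quantize_group(weights, bits=3):
    """Asymmetric round-to-nearest quantization of one group of weights.

    Returns integer codes in [0, 2**bits - 1] plus the scale and minimum
    needed to map the codes back to floating point.
    """
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to a unit scale when all weights in the group are equal.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, w_min):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + w_min for c in codes]


# Hypothetical group of eight weights.
weights = [-0.42, -0.10, 0.03, 0.27, 0.55, -0.31, 0.12, 0.49]
codes, scale, w_min = quantize_group(weights, bits=3)
restored = dequantize_group(codes, scale, w_min)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:", codes)
print(f"scale: {scale:.4f}, worst-case error: {worst_error:.4f}")
```

Each weight is stored as a 3-bit code, with one scale and one minimum shared per group. The rounding error per weight is bounded by half the scale, which is why smaller groups (tighter value ranges) usually cost less accuracy at the price of more metadata.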

Activation-aware 8-bit methods can keep accuracy almost unchanged while reducing memory.

Numbers: LLM.int8() on OPT-13B: C4 perplexity Δ = 0.00; speedup 1.22× (Table 1)

Practical Use: Use 8-bit weight+activation PTQ (LLM.int8()/SmoothQuant) to cut memory with near-zero accuracy loss before trying aggressive low-bit schemes.

Evidence Ref: Table 1, Section 3.2.2
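One way activation-aware methods tame outliers is to move part of the activation range into the weights before quantizing. The sketch below follows the SmoothQuant-style per-channel scaling idea, s_j = max|X_j|^α / max|W_j|^(1−α), on tiny hand-made matrices. It is an illustration in pure Python, not the SmoothQuant or LLM.int8() library API (LLM.int8() uses a different, mixed-precision approach); all names and values are hypothetical.

```python
def smoothing_factors(X, W, alpha=0.5):
    """Per-input-channel factor s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha).

    X is tokens x channels, W is channels x outputs; both are lists of rows.
    Assumes every channel has at least one nonzero activation and weight.
    """
    factors = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        weight_max = max(abs(w) for w in W[j])
        factors.append(act_max ** alpha / weight_max ** (1 - alpha))
    return factors


def apply_smoothing(X, W, s):
    """Divide activation channel j by s[j] and multiply weight row j by s[j]."""
    X_s = [[x / s[j] for j, x in enumerate(row)] for row in X]
    W_s = [[w * s[j] for w in row] for j, row in enumerate(W)]
    return X_s, W_s


def matmul(A, B):
    """Plain list-of-rows matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Hypothetical activations: channel 1 carries a large outlier.
X = [[0.5, 60.0, -0.3], [-0.4, -55.0, 0.2]]
W = [[0.2, -0.1], [0.05, 0.04], [-0.3, 0.25]]

s = smoothing_factors(X, W)
X_s, W_s = apply_smoothing(X, W, s)
Y_ref, Y_s = matmul(X, W), matmul(X_s, W_s)

largest_before = max(abs(x) for row in X for x in row)
largest_after = max(abs(x) for row in X_s for x in row)
print(f"largest activation before: {largest_before:.2f}, after: {largest_after:.2f}")
```

The layer output is mathematically unchanged because the scaling cancels inside the matrix product, while the largest activation shrinks, which leaves more of the 8-bit range for ordinary values.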

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Perplexity difference (WikiText-2) and speedup | GPTQ: +0.34 perplexity; 3.24× speedup | full-precision LLM | +0.34 perplexity; 3.24× | WikiText-2 | Table 1 reports GPTQ 3-bit on OPT-175B with Δ perplexity 0.34 and 3.24× speedup | Table 1, Section 3.2.1
Perplexity difference (C4) and speedup | LLM.int8(): 0.00 perplexity change; 1.22× speedup | FP16 model | 0.00 perplexity; 1.22× | C4 | Table 1 lists LLM.int8() on OPT-13B with C4 Δ = 0.00 and speedup 1.22× | Table 1, Section 3.2.2

What To Try In 7 Days

Run 8-bit weight+activation PTQ (LLM.int8() or SmoothQuant) on a 7–13B model to validate near-zero accuracy loss.

Apply GPTQ 3-bit PTQ to a noncritical task and measure latency and perplexity to estimate speed/accuracy trade-offs.

If serving long contexts, quantize the KV cache (e.g., KVQuant or KIVI) to free memory, and test throughput gains locally.
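Before running the KV cache experiment, a back-of-the-envelope estimate shows how much memory is at stake. The sketch below counts keys and values for every layer, head, and cached token. The configuration (32 layers, 32 KV heads, head dimension 128, 4096 tokens) is a hypothetical example, and the estimate ignores the scales and zero-points that a real quantized cache also stores.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size=1, bits=16):
    """Approximate KV cache size: keys plus values for every layer, head and token."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits / 8


# Hypothetical model configuration and context length.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

fp16 = kv_cache_bytes(**config, bits=16)
int4 = kv_cache_bytes(**config, bits=4)

print(f"FP16: {fp16 / 2**30:.2f} GiB, 4-bit: {int4 / 2**30:.2f} GiB")
# prints "FP16: 2.00 GiB, 4-bit: 0.50 GiB"
```

The cache grows linearly with both context length and batch size, so the freed memory can be spent on longer contexts or on more concurrent requests.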

Optimization Features

Token Efficiency: KV cache quantization reduces per-token memory; compression ratio reduces storage and transfer costs
Infra Optimization: exploit hardware support for N:M sparsity (Ampere 2:4); use LUT-based GEMM or optimized kernels for quantized matmuls
Model Optimization: quantization; pruning; knowledge distillation; low-rank factorization; semi-structured N:M sparsity
System Optimization: use Roofline Model to assess hardware bottlenecks; match pruning/quantization format to GPU sparse kernels
Training Optimization: Quantization-Aware Training (QAT); LoRA; layer-wise/task-aware distillation
Inference Optimization: weight-only PTQ for a smaller memory footprint; KV cache quantization to increase context length; structured pruning for hardware-friendly speedups
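The N:M pattern mentioned above keeps at most N nonzero values in every consecutive group of M weights; 2:4 is the case with hardware support on Ampere GPUs. The sketch below builds that pattern by keeping the largest-magnitude values in each group. Magnitude is an assumed selection rule used here for simplicity; published pruning methods often use other importance scores. The function name and example row are hypothetical.

```python
def prune_n_of_m(row, n=2, m=4):
    """Keep the n largest-magnitude values in every group of m; zero the rest."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n entries with the largest absolute value in this group.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


# Hypothetical weight row with two groups of four.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
pruned = prune_n_of_m(row)
print(pruned)
# prints [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.25, 0.0]
```

Because the zeros follow a fixed per-group pattern, the format can be stored compactly and executed by sparse kernels, which unstructured pruning at the same sparsity generally cannot exploit as easily.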

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Survey-level work: the paper provides no new method or code.

Compression results depend heavily on calibration data and evaluation tasks.

When Not To Use

When absolute maximum accuracy is required (safety-critical or high-stakes decisions).

If you lack a good calibration dataset for PTQ or pruning.

Failure Modes

Accuracy drops from aggressive low-bit quantization or deep pruning.

Activation outliers can cause large quantization errors if not handled.
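The outlier failure mode is easy to reproduce with a toy example. The sketch below applies simple per-tensor absmax int8 quantization (an assumed baseline scheme, not any specific method from the survey) to a handful of hypothetical activations, first without and then with one large outlier, and compares the error on the ordinary values.

```python
def absmax_int8_roundtrip(values):
    """Quantize to int8 with one shared absmax scale, then dequantize."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Hypothetical activations: five ordinary values, then the same five plus an outlier.
normal = [0.12, -0.07, 0.31, -0.22, 0.05]
with_outlier = normal + [40.0]

error_clean = mean_abs_error(normal, absmax_int8_roundtrip(normal))
# Measure error only on the ordinary values so the two cases are comparable.
restored = absmax_int8_roundtrip(with_outlier)[: len(normal)]
error_outlier = mean_abs_error(normal, restored)

print(f"error without outlier: {error_clean:.5f}")
print(f"error with outlier:    {error_outlier:.5f}")
```

A single large value stretches the shared scale, so the ordinary values collapse onto a few integer codes. This is the motivation for handling outlier channels separately or rescaling them before quantization.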

Core Entities

Models

GPT-175B, LLaMA (various sizes), LLaMA2, OPT-175B, BLOOM-176B, GPT-J-6B, OPT-13B

Metrics

model size, FLOPs, Mean FLOPS Utilization (MFU), inference time (latency), speedup ratio, compression ratio, perplexity

Datasets

WikiText-2, C4, PTB, LAMBADA, PIQA, OpenBookQA, GSM8K, CommonsenseQA, StrategyQA, Vicuna-Instructions, User-Oriented-Instructions, EleutherAI LM Harness, BIG-Bench

Benchmarks

BIG-Bench, EleutherAI LM Harness, WikiText-2 perplexity, C4 perplexity, zero-shot instruction benchmarks (Vicuna, User-Oriented)

Context Entities

Models

GPT-4, ChatGPT (gpt-3.5-turbo), T5/FlanT5, GPT-2