Overview
This is a survey compiling many reproducible results; individual methods vary in maturity. Post-training quantization (PTQ) and pruning show reproducible speed and memory gains, but deployment depends on calibration data and hardware kernel support.
Citations: 37
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 60%
Novelty: 30%
Why It Matters For Business
Compression cuts model memory, serving cost, and inference latency, so LLMs can run on fewer GPUs or at lower cloud cost. Pick the compression method that fits your accuracy budget and target hardware.
Summary TLDR
This is a focused survey of model compression methods for large language models (LLMs). It summarizes quantization (post-training, PTQ, and quantization-aware training, QAT), pruning (unstructured, structured, N:M), knowledge distillation (black-box and white-box), and low-rank factorization. The paper collects common metrics, benchmarks, representative results (tables for quantization and pruning), and practical challenges such as activation outliers, calibration data, and hardware support. It highlights that modern PTQ and pruning methods can yield severalfold memory or speed gains with small measured accuracy gaps on standard benchmarks, but trade-offs remain and deployment details (calibration datasets, hardware kernels) are key.
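As a rough illustration of the N:M pruning pattern mentioned above, the sketch below keeps the n largest-magnitude weights in every group of m consecutive weights and zeroes the rest. The magnitude criterion, the function name `prune_n_m`, and the sample row are assumptions made for illustration; the survey covers methods that may choose which weights to drop differently.

```python
def prune_n_m(weights, n=2, m=4):
    """Keep the n largest-magnitude values in each group of m consecutive weights.

    Assumption: plain magnitude-based selection, used here only to show the
    N:M sparsity pattern (at most n nonzeros per group of m).
    """
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Positions of the n largest magnitudes inside this group.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


# Hypothetical weight row, pruned to the 2:4 pattern.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse_row = prune_n_m(row, n=2, m=4)
print(sparse_row)
```

Whether a pattern like 2:4 turns into real speedups depends on hardware kernel support, which is one of the deployment caveats the survey raises.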
Problem Statement
LLMs deliver strong results but are huge and costly to run. The field lacks a focused summary of how to make LLMs smaller and faster in practice. This survey collects and compares quantization, pruning, distillation and low-rank methods, their metrics, and deployment issues so practitioners can pick methods and understand trade-offs.
Main Contribution
Organizes LLM compression methods into quantization, pruning, distillation, and low-rank factorization with subcategories and examples.
Summarizes evaluation metrics and benchmarks used for compressed LLMs, including model size, FLOPs, MFU, latency, speedup and compression ratio.
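Several of the metrics listed above are simple ratios. The sketch below uses their usual definitions (compression ratio as original size over compressed size, speedup as baseline latency over compressed latency, MFU as achieved over peak FLOPs per second). The function names and the 13B-parameter example are assumptions, and the size estimate ignores quantization metadata such as scales and zero points.

```python
def model_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB, ignoring scales, zero points, and other metadata."""
    return num_params * bits_per_weight / 8 / 1e9


def compression_ratio(original_size, compressed_size):
    """Original size divided by compressed size (higher means smaller model)."""
    return original_size / compressed_size


def speedup(baseline_latency, compressed_latency):
    """Baseline latency divided by compressed-model latency."""
    return baseline_latency / compressed_latency


def mfu(achieved_flops_per_s, peak_flops_per_s):
    """Model FLOPs utilization: achieved throughput as a fraction of hardware peak."""
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical 13B-parameter model stored at 16 bits versus 4 bits per weight.
fp16_gb = model_size_gb(13e9, 16)
int4_gb = model_size_gb(13e9, 4)
ratio = compression_ratio(fp16_gb, int4_gb)
print(f"{fp16_gb:.1f} GB -> {int4_gb:.1f} GB, compression ratio {ratio:.1f}x")
```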
Key Findings
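Weight-only post-training quantization (e.g., GPTQ) can reduce weight precision to 3 bits with small accuracy loss and large runtime gains: Table 1 of the survey reports GPTQ 3-bit on OPT-175B at +0.34 perplexity on WikiText-2 with a 3.24× speedup.
Activation-aware 8-bit methods can keep accuracy almost unchanged while reducing memory: Table 1 lists LLM.int8() on OPT-13B with no measured perplexity change on C4 (Δ = 0.00) and a 1.22× speedup.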
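To make the bit-width trade-off concrete, here is a minimal sketch of symmetric round-to-nearest weight quantization. This is deliberately the simplest baseline and not GPTQ or LLM.int8(), which add error compensation and outlier handling on top; the function names and sample weights are assumptions.

```python
def quantize_rtn(weights, bits):
    """Symmetric round-to-nearest quantization with one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits, 3 for 3 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        scale = 1.0                        # all-zero tensor: any scale works
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical weights; compare reconstruction error at 8 bits and 3 bits.
weights = [0.12, -0.53, 0.97, -0.08, 0.41, -0.76, 0.33, 0.05]
errors = {}
for bits in (8, 3):
    codes, scale = quantize_rtn(weights, bits)
    errors[bits] = mean_abs_error(weights, dequantize(codes, scale))
    print(f"{bits}-bit mean abs error: {errors[bits]:.4f}")
```

The gap between the two error values is the reason low-bit methods need more than plain rounding to stay close to the full-precision model.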
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Perplexity difference (WikiText-2) and speedup | GPTQ: +0.34 perplexity; 3.24× speedup | Full-precision LLM | +0.34 perplexity; 3.24× | WikiText-2 | Table 1 reports GPTQ 3-bit on OPT-175B with Δ perplexity 0.34 and 3.24× speedup | Table 1, Section 3.2.1 |
| Perplexity difference (C4) and speedup | LLM.int8(): 0.00 perplexity change; 1.22× speedup | FP16 model | 0.00 perplexity; 1.22× | C4 | Table 1 lists LLM.int8() on OPT-13B with C4 Δ perplexity 0.00 and 1.22× speedup | Table 1, Section 3.2.2 |
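The perplexity deltas in the table compare a compressed model against its baseline on the same text. As a reminder of how such a delta is computed, the sketch below uses the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood. The per-token values are invented for illustration and are not taken from the survey.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Hypothetical per-token losses from a baseline and a quantized model on the same tokens.
baseline_nlls = [2.1, 1.8, 2.4, 2.0]
quantized_nlls = [2.2, 1.9, 2.4, 2.1]

delta = perplexity(quantized_nlls) - perplexity(baseline_nlls)
print(f"perplexity delta: {delta:+.2f}")
```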
What To Try In 7 Days
Run 8-bit weight+activation PTQ (LLM.int8() or SmoothQuant) on a 7–13B model to validate near-zero accuracy loss.
Apply GPTQ 3-bit PTQ to a noncritical task and measure latency and perplexity to estimate speed/accuracy trade-offs.
If serving long contexts, quantize KV cache (KVQuant/KIVI) to free memory and test throughput gains locally.
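Before running the KV cache experiment, it can help to estimate how much memory is at stake. The sketch below uses the usual sizing arithmetic (keys and values, per layer, per head, per token). The model shape and the 4-bit setting are hypothetical, and the estimate ignores the per-group scale and zero-point overhead that real KV cache quantizers add.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bits):
    """Approximate KV cache size: 2 tensors (K and V) per layer, per token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return elements * bits // 8


# Hypothetical model shape: 32 layers, 32 KV heads of dimension 128, 4096-token context.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
fp16_bytes = kv_cache_bytes(bits=16, **shape)
int4_bytes = kv_cache_bytes(bits=4, **shape)

gib = 1024 ** 3
print(f"FP16 KV cache:  {fp16_bytes / gib:.2f} GiB")
print(f"4-bit KV cache: {int4_bytes / gib:.2f} GiB")
```

The cache grows linearly with context length and batch size, so the absolute savings scale with both.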
Risks & Boundaries
Limitations
Survey-level work: no new method or code is provided in the paper.
Compression results depend heavily on calibration data and evaluation tasks.
When Not To Use
When absolute maximum accuracy is required (safety-critical or high-stakes decisions).
If you lack a good calibration dataset for PTQ or pruning.
Failure Modes
Accuracy drops from aggressive low-bit quantization or deep pruning.
Activation outliers can cause large quantization errors if not handled.
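The outlier failure mode can be reproduced in a few lines. In the sketch below, one large activation stretches the shared scale of a per-tensor absmax quantizer, so the remaining values land on a much coarser grid. The sample values and function names are assumptions; methods such as LLM.int8() or SmoothQuant exist to avoid this effect.

```python
def absmax_quantize(values, bits=8):
    """Quantize and dequantize with a single absmax scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)                # all zeros: nothing to quantize
    return [round(v / scale) * scale for v in values]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical activations, then the same values plus one large outlier.
normal = [0.5, -0.3, 0.8, -0.6, 0.1, 0.4, -0.2, 0.7]
with_outlier = normal + [60.0]

err_clean = mean_abs_error(normal, absmax_quantize(normal))
# Measure error only on the original values, after the outlier set the scale.
recon = absmax_quantize(with_outlier)
err_outlier = mean_abs_error(normal, recon[:len(normal)])

print(f"error without outlier: {err_clean:.4f}")
print(f"error with outlier:    {err_outlier:.4f}")
```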

