Practical review of quantization, pruning, distillation and low-rank compression for LLMs
Compression reduces model memory and inference latency, so LLMs can run on fewer GPUs or at lower cloud cost; pick the technique that fits your accuracy budget and hardware.
Key finding
Weight-only post-training quantization (e.g., GPTQ) can reduce weight precision to 3 bits with small accuracy loss and large runtime gains.
Numbers: Wikitext-2 perplexity +0.34; speedup 3.24× (GPTQ, Table 1)
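To make "weight-only quantization to 3 bits" concrete, here is a minimal sketch in plain Python. It is not GPTQ: it uses simple round-to-nearest with a per-group min/max scale, which is the baseline that GPTQ improves on by correcting rounding error layer by layer. The function names, the group size of 8, and the random test weights are all assumptions made for illustration.

```python
import random

def quantize_rtn(weights, bits=3, group_size=8):
    """Round-to-nearest, asymmetric min/max quantization per group.

    Illustrative baseline only; GPTQ itself adds error compensation.
    """
    levels = (1 << bits) - 1          # 3 bits -> integer codes 0..7
    codes, params = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # Guard against a constant group, where hi == lo.
        scale = (hi - lo) / levels if hi > lo else 1.0
        params.append((scale, lo))    # one (scale, offset) pair per group
        codes.extend(round((w - lo) / scale) for w in group)
    return codes, params

def dequantize(codes, params, group_size=8):
    """Map integer codes back to approximate floating-point weights."""
    out = []
    for i, code in enumerate(codes):
        scale, lo = params[i // group_size]
        out.append(lo + code * scale)
    return out

def weight_bytes(n_params, bits):
    """Storage for the weights alone, ignoring scale/offset overhead."""
    return n_params * bits / 8

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]  # hypothetical weights
codes, params = quantize_rtn(weights)
restored = dequantize(codes, params)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("max reconstruction error:", max_err)
print("16-bit vs 3-bit storage ratio:", weight_bytes(10**9, 16) / weight_bytes(10**9, 3))
```

Each weight is reconstructed to within half a quantization step of its group, and moving from 16-bit to 3-bit weights shrinks weight storage by 16/3, about 5.3×, before the per-group scale and offset are counted. Smaller groups lower the error but add more of that overhead.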

