Overview
The survey aggregates many practical methods; quantization and optimized kernels are immediately usable, while extreme low-bit quantization, aggressive pruning, and MoE require careful testing and infrastructure.
Citations: 13
Evidence Strength: 0.86
Confidence: 0.88
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 3/3
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 85%
Production readiness: 80%
Novelty: 50%
Why It Matters For Business
Compression and better kernels let teams run large LLMs on fewer GPUs or even on single workstations, cutting hosting costs and enabling edge/embedded use cases without losing core capabilities.
Summary TLDR
This 47-page survey reviews methods to shrink and speed up large language models for inference. It groups techniques into quantization, pruning, distillation, compact architectures (faster attention and NAS), and dynamic networks (early-exit, cascades, MoE). Key practical points: post-training quantization (PTQ) makes large models much smaller without retraining; second-order one-shot pruning and small calibration sets let you prune huge models quickly; distillation using synthetic instruction or chain-of-thought datasets can transfer LLM behavior to smaller students. The paper also surveys inference frameworks and kernels (FlashAttention, DeepSpeed, FlexGen, PowerInfer) that matter in real‑
Problem Statement
Transformer LLMs have high memory and compute needs that block deployment on constrained hardware. Two challenges are specific to LLMs: retraining or finetuning is very expensive, and compressed models must keep their broad task generality and emergent abilities. The survey asks which compression and serving methods work in practice for large models (well above 1B parameters).
Main Contribution
A taxonomy and plain-language review of LLM compression: quantization, pruning, distillation, compact architectures, and dynamic networks.
A focused treatment of LLM-specific challenges: tuning-free PTQ, preserving generality/emergent abilities, and low-cost PEFT approaches.
Key Findings
Quantizing FP32 weights to 4-bit cuts model size to roughly one-eighth (32 bits down to 4 bits per weight).
Post-training, layer-aware rounding methods (e.g., GPTQ/OPTQ) can quantize very large models in reasonable time.
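The one-eighth figure follows directly from the bit widths. The sketch below works through the arithmetic for a hypothetical 13B-parameter model; the parameter count and the helper name `weight_bytes` are illustrative assumptions, and the estimate ignores the per-group scales and zero-points that real quantized formats store, which is why the saving is only "roughly" one-eighth in practice.

```python
def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8


# Hypothetical model size, chosen only for illustration.
n_params = 13_000_000_000

fp32 = weight_bytes(n_params, 32)
fp16 = weight_bytes(n_params, 16)
int4 = weight_bytes(n_params, 4)

print(f"FP32: {fp32 / 1e9:.1f} GB")        # FP32: 52.0 GB
print(f"FP16: {fp16 / 1e9:.1f} GB")        # FP16: 26.0 GB
print(f"INT4: {int4 / 1e9:.1f} GB")        # INT4: 6.5 GB
print(f"INT4 / FP32 = {int4 / fp32}")      # INT4 / FP32 = 0.125
```

Relative to an FP16 checkpoint, which is the more common starting point, the same calculation gives a ratio of one-quarter.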
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Perplexity (Wikitext-2) | FP16 10.09 → AWQ 10.46 (OPT-66B) | FP16 model perplexity 10.09 | +0.37 | Wikitext-2 | Table 2: AWQ on OPT-66B | Table 2 |
| Perplexity (Wikitext-2) | FP16 8.34 → GPTQ 8.68 (OPT-175B) | FP16 model perplexity 8.34 | +0.34 | Wikitext-2 | Table 2: OPTQ/GPTQ on OPT-175B | Table 2 |
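The deltas in the table can be put on a common scale by expressing them relative to the FP16 baseline. The sketch below uses only the perplexity values from the two rows above; the helper name `ppl_change` is an illustrative assumption.

```python
def ppl_change(baseline: float, quantized: float) -> tuple[float, float]:
    """Return (absolute delta, percent increase over baseline), rounded."""
    delta = quantized - baseline
    return round(delta, 2), round(100 * delta / baseline, 1)


# (label, FP16 perplexity, quantized perplexity) taken from the table rows.
rows = [
    ("AWQ, OPT-66B", 10.09, 10.46),
    ("GPTQ, OPT-175B", 8.34, 8.68),
]

for label, base, quant in rows:
    delta, pct = ppl_change(base, quant)
    print(f"{label}: +{delta} perplexity ({pct}% over FP16)")
# AWQ, OPT-66B: +0.37 perplexity (3.7% over FP16)
# GPTQ, OPT-175B: +0.34 perplexity (4.1% over FP16)
```

The two rows come from different models and methods, so the percentages describe each result against its own baseline and are not a head-to-head comparison of AWQ and GPTQ.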
What To Try In 7 Days
Run GPTQ (OPTQ) post-training quantization on a medium-sized LLM and measure the change in memory use and perplexity.
Try 4-bit weight-only quantization (LoRC/QLoRA path) for a finetuning workflow on a 13B model.
Benchmark FlashAttention or DeepSpeed Inference for your service to reduce latency before changing models.
Risks & Boundaries
Limitations
Many methods rely on small calibration sets; results depend on calibration quality.
Pruning and aggressive low-bit schemes may harm generative or emergent abilities if not validated.
When Not To Use
Avoid extreme low-bit (2-bit) quantization without retraining for tasks that require reliable step-by-step generation or reasoning.
Do not use one-shot pruning when you need real speedups on commodity hardware, unless you target an N:M sparsity pattern that the hardware supports.
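An N:M pattern keeps at most N non-zero weights in every consecutive group of M, which is what lets hardware skip the zeros predictably. The sketch below shows the pattern for 2:4 on a single weight row. Selecting by magnitude is an assumption made here for simplicity; the one-shot methods the survey covers use second-order information to choose and adjust weights, and the function name `prune_n_of_m` is hypothetical.

```python
def prune_n_of_m(weights: list[float], n: int = 2, m: int = 4) -> list[float]:
    """Keep the n largest-magnitude values in each group of m; zero the rest."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
print(prune_n_of_m(row))
# [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Every group of four ends up with exactly two zeros, so the sparsity is 50% with a regular layout. Unstructured pruning at the same 50% rate would leave zeros in arbitrary positions, which general-purpose kernels cannot exploit as easily.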
Failure Modes
Activation outliers can cause quantization to collapse; during generation the resulting errors accumulate token by token and degrade the output.
MoE routing imbalance leaves some experts undertrained and can collapse routing onto a few experts.
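The activation-outlier failure mode can be illustrated with a toy example. The sketch below assumes simple symmetric per-tensor ("absmax") 8-bit quantization, which is a simplification and not a method taken from the survey; the activation values and function names are made up for illustration. A single large value stretches the quantization scale so that the ordinary values lose almost all of their resolution.

```python
def absmax_quantize(values: list[float], bits: int = 8) -> list[float]:
    """Symmetric per-tensor quantization, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax      # one scale for all values
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original: list[float], restored: list[float]) -> float:
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up activations: small typical values, then the same with one outlier.
typical = [0.12, -0.30, 0.25, -0.08, 0.41, -0.19, 0.33, -0.27]
with_outlier = typical + [60.0]

err_clean = mean_abs_error(typical, absmax_quantize(typical))
restored = absmax_quantize(with_outlier)
err_outlier = mean_abs_error(typical, restored[:len(typical)])

print(f"error on typical values, no outlier:   {err_clean:.5f}")
print(f"error on typical values, with outlier: {err_outlier:.5f}")
```

With the outlier present, the step size grows from about 0.003 to about 0.47, so several of the typical values round to zero. Finer-grained scales (per channel or per group) and outlier-aware schemes are the usual ways to limit this effect.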

