A practical survey of compression and speed tricks to run large language models on limited hardware

February 15, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The survey aggregates many practical methods; quantization and optimized kernels are immediately usable, while extreme low-bit quantization, aggressive pruning, and MoE require careful testing and infrastructure.

Citations: 13

Evidence Strength: 0.86

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 85%

Production readiness: 80%

Novelty: 50%

Authors

Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He

Links

Abstract / PDF

Why It Matters For Business

Compression and better kernels let teams run large LLMs on fewer GPUs or even on single workstations, cutting hosting costs and enabling edge/embedded use cases without losing core capabilities.

Summary TLDR

This 47-page survey reviews methods to shrink and speed up large language models for inference. It groups techniques into quantization, pruning, distillation, compact architectures (faster attention and NAS), and dynamic networks (early-exit, cascades, MoE). Key practical points: post-training quantization (PTQ) makes large models much smaller without retraining; second-order one-shot pruning and small calibration sets let you prune huge models quickly; distillation using synthetic instruction or chain-of-thought datasets can transfer LLM behavior to smaller students. The paper also surveys inference frameworks and kernels (FlashAttention, DeepSpeed, FlexGen, PowerInfer) that matter in real‑

Problem Statement

Transformer LLMs have high memory and compute needs that block deployment on constrained hardware. Two challenges are specific to LLMs: retraining or finetuning is very expensive, and compressed models must keep their broad task generality and emergent abilities. The survey asks which compression and serving methods work in practice for large models (well over 1B parameters).

Main Contribution

A taxonomy and plain-language review of LLM compression: quantization, pruning, distillation, compact architectures, and dynamic networks.

A focused treatment of LLM-specific challenges: tuning-free PTQ, preserving generality/emergent abilities, and low-cost PEFT approaches.

Key Findings

Quantizing FP32 weights to 4-bit cuts model size to roughly one-eighth.

Numbers: ≈1/8 model size when FP32→INT4

Practical Use: Use 4-bit weight quantization to reduce memory pressure on inference devices; expect large memory savings but test downstream accuracy before deployment.

Evidence Ref: Section 3 (Quantization basic concepts)
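The one-eighth figure follows directly from the bit widths (4 / 32). A minimal sketch of the arithmetic, assuming a hypothetical 7B-parameter model (the parameter count is illustrative, not a figure from the survey) and ignoring the small overhead of quantization scales and zero-points:

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1e9 bytes), ignoring scales and zero-points."""
    return num_params * bits_per_weight / 8 / 1e9

# Illustrative parameter count (an assumption, not taken from the survey).
params = 7e9

fp32_gb = weight_memory_gb(params, 32)  # 4 bytes per weight
int4_gb = weight_memory_gb(params, 4)   # half a byte per weight
ratio = int4_gb / fp32_gb

print(f"FP32: {fp32_gb:.1f} GB, INT4: {int4_gb:.1f} GB, ratio: {ratio:.3f}")
```

In practice the real footprint is somewhat larger than this estimate, because group-wise scales, activations, and the KV cache also consume memory.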

Post-training, layer-aware rounding methods (e.g., GPTQ/OPTQ) can quantize very large models in reasonable time.

Numbers: GPTQ quantizes OPT‑175B in ~4 hours on one A100

Practical Use: If you lack resources to retrain, try GPTQ-style PTQ to get practical low-bit models without full QAT.

Evidence Ref: Section 3.3.1 (GPTQ description)
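To make the idea of tuning-free PTQ concrete, here is a minimal NumPy sketch of group-wise round-to-nearest (RTN) 4-bit weight quantization. This is deliberately the simple baseline, not GPTQ itself: GPTQ goes further by using approximate second-order information to compensate for rounding error. The function names, group size, and random test matrix are all assumptions chosen for illustration.

```python
import numpy as np

def quantize_rtn_groupwise(w, bits=4, group_size=128):
    """Asymmetric round-to-nearest quantization over groups along the last axis."""
    rows, cols = w.shape
    assert cols % group_size == 0, "cols must be divisible by group_size"
    groups = w.reshape(rows, cols // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    qmax = 2**bits - 1
    scale = (w_max - w_min) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant groups
    q = np.clip(np.round((groups - w_min) / scale), 0, qmax).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, w_min, shape):
    """Map integer codes back to floating-point weights."""
    return (q.astype(np.float32) * scale + w_min).reshape(shape)

# Synthetic weight matrix standing in for one linear layer (illustrative only).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 256)).astype(np.float32)

q, scale, zero = quantize_rtn_groupwise(w)
w_hat = dequantize(q, scale, zero, w.shape)
max_err = float(np.abs(w - w_hat).max())
print(f"max abs reconstruction error: {max_err:.5f}")
```

With round-to-nearest, the per-weight error is bounded by half the group's step size, which is why smaller groups (at the cost of more stored scales) tend to reduce error.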

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Perplexity (Wikitext-2) | FP16 10.09 → AWQ 10.46 (OPT-66B) | FP16 model perplexity 10.09 | +0.37 | Wikitext-2 | Table 2: AWQ on OPT-66B | Table 2
Perplexity (Wikitext-2) | FP16 8.34 → GPTQ 8.68 (OPT-175B) | FP16 model perplexity 8.34 | +0.34 | Wikitext-2 | Table 2: OPTQ/GPTQ on OPT-175B | Table 2
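The deltas above are easier to compare across models when expressed relative to the FP16 baseline. A small sketch using only the perplexity values reported in the results; the helper name `perplexity_increase` is hypothetical:

```python
def perplexity_increase(baseline: float, quantized: float):
    """Return (absolute delta, relative increase in percent) versus the baseline."""
    delta = quantized - baseline
    return delta, 100.0 * delta / baseline

# (FP16 perplexity, quantized perplexity) on Wikitext-2, as reported above.
results = {
    "AWQ, OPT-66B": (10.09, 10.46),
    "GPTQ, OPT-175B": (8.34, 8.68),
}

summary = {name: perplexity_increase(fp16, quant) for name, (fp16, quant) in results.items()}
for name, (delta, rel) in summary.items():
    print(f"{name}: +{delta:.2f} perplexity ({rel:.1f}% relative)")
```

Both rows work out to a relative increase of roughly 4%, which is a perplexity measurement only and does not by itself guarantee equivalent downstream task accuracy.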

What To Try In 7 Days

Run GPTQ (OPTQ) PTQ on a medium LLM to measure memory and perplexity change.

Try 4-bit weight-only quantization (LoRC/QLoRA path) for a finetuning workflow on a 13B model.

Benchmark FlashAttention or DeepSpeed Inference for your service to reduce latency before changing models.
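For the benchmarking step, a framework-agnostic timing harness is usually enough to get a first before/after comparison. The sketch below assumes a hypothetical `generate(prompt, n_tokens)` callable that wraps whatever serving stack you use; `fake_generate` is a stand-in so the example runs without a model.

```python
import statistics
import time

def benchmark(generate, prompt, n_tokens, warmup=2, runs=5):
    """Time a text-generation callable; return median latency (s) and tokens/s."""
    for _ in range(warmup):
        generate(prompt, n_tokens)  # warm caches and lazy initialization
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt, n_tokens)
        latencies.append(time.perf_counter() - start)
    median = statistics.median(latencies)
    return {"median_latency_s": median, "tokens_per_s": n_tokens / median}

def fake_generate(prompt, n_tokens):
    """Stand-in for a real model call (assumption): sleeps ~1 ms per token."""
    time.sleep(0.001 * n_tokens)
    return "x" * n_tokens

stats = benchmark(fake_generate, "hello", n_tokens=20)
print(stats)
```

When timing real GPU inference, make sure the wrapped call blocks until the output is ready (for example by synchronizing the device), since GPU kernels are launched asynchronously and wall-clock timings can otherwise be misleadingly short.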

Optimization Features

Token Efficiency
context compression and token pruning; early-exit and cascade inference
Infra Optimization
tensor/pipeline/expert parallelism; NVMe and CPU offload for huge models; latency-aware deployment (DeepSpeed Inference)
Model Optimization
quantization (PTQ and QAT); unstructured and structured pruning; knowledge distillation; MoE
System Optimization
GPU memory aggregation; GPU-CPU hybrid serving (PowerInfer); specialized kernels (LUT-GEMM)
Training Optimization
quantization-aware training (QAT); LoRA; layerwise or stage-wise distillation
Inference Optimization
operator fusion; FlashAttention kernels; weight-only vs weight+activation tradeoffs; offloading (FlexGen, DeepSpeed heterogeneous)

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Many methods rely on small calibration sets; results depend on calibration quality.

Pruning and aggressive low-bit schemes may harm generative or emergent abilities if not validated.

When Not To Use

Without retraining, avoid extreme low-bit (2-bit) quantization for tasks that require reliable step-by-step generation or reasoning.

Do not use one-shot pruning when you require structured speedups on commodity hardware unless you target N:M sparsity supported by hardware.
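To illustrate what an N:M pattern means, the sketch below applies 2:4 sparsity to a small matrix: in every group of four consecutive weights, only the two with the largest magnitude are kept. Magnitude-based selection is a simplifying assumption made here for clarity; it is not the second-order pruning criterion discussed in the survey, and the function name `prune_n_m` is hypothetical.

```python
import numpy as np

def prune_n_m(w, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m along the last axis."""
    rows, cols = w.shape
    assert cols % m == 0, "cols must be divisible by m"
    groups = w.reshape(rows, cols // m, m)
    # Indices of the (m - n) smallest magnitudes in each group.
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols), mask.reshape(rows, cols)

w = np.array([[0.1, -0.9, 0.3, 0.05, 0.7, -0.2, 0.0, 0.4]])
pruned, mask = prune_n_m(w)
print(pruned)
```

Because every group has exactly the same number of nonzeros, the layout is regular enough for hardware with N:M support to skip the zeros, which unstructured sparsity at the same 50% ratio generally does not allow.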

Failure Modes

Activation outliers cause quantization collapse and degrade generation, because errors accumulate token by token.

MoE routing imbalance leads to undertrained experts and collapse to a few experts.
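The activation-outlier failure mode can be reproduced on synthetic data. In this sketch, a single large value forces a per-tensor absmax INT8 scale to stretch, so every other value is rounded much more coarsely. The data, the outlier magnitude of 60.0, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def int8_absmax_roundtrip(x):
    """Symmetric per-tensor INT8 quantization followed by dequantization."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

def mean_abs_error(x):
    """Average reconstruction error after the INT8 round trip."""
    return float(np.abs(x - int8_absmax_roundtrip(x)).mean())

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=4096)  # synthetic, well-behaved activations

with_outlier = acts.copy()
with_outlier[0] = 60.0  # one synthetic outlier (illustrative magnitude)

err_clean = mean_abs_error(acts)
err_outlier = mean_abs_error(with_outlier)
print(f"clean: {err_clean:.4f}, with outlier: {err_outlier:.4f}")
```

The error grows roughly in proportion to the scale, which is one reason finer-grained scales (per-channel or per-group) and outlier-aware schemes are commonly used for activations.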

Core Entities

Models

BERT, GPT-2, GPT-3, OPT, BLOOM, GLM-130B, LLaMA, LLaMA-65B, GPT-4 (referenced), Mixtral / Mistral variants

Metrics

Perplexity (lower is better), Accuracy, Throughput (tokens/s), Latency (ms), Model size (GB)

Datasets

Wikitext-2, C4, LAMBADA, instruction/CoT synthetic datasets (various sizes)

Benchmarks

Perplexity, Accuracy, Downstream task finetune metrics