A practical survey of compression and speed tricks to run large language models on limited hardware

Overview

Decision Snapshot: Ready For Pilot

The survey aggregates many practical methods; quantization and optimized kernels are immediately usable, while extreme low-bit quantization, aggressive pruning, and MoE require careful testing and infrastructure.

Citations: 13

Evidence Strength: 0.86

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 85%

Production readiness: 80%

Novelty: 50%

Authors

Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He

Why It Matters For Business

Compression and better kernels let teams run large LLMs on fewer GPUs or even on single workstations, cutting hosting costs and enabling edge/embedded use cases without losing core capabilities.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead

Summary TLDR

This 47-page survey reviews methods to shrink and speed up large language models for inference. It groups techniques into quantization, pruning, distillation, compact architectures (faster attention and NAS), and dynamic networks (early-exit, cascades, MoE). Key practical points: post-training quantization (PTQ) makes large models much smaller without retraining; second-order one-shot pruning and small calibration sets let you prune huge models quickly; distillation using synthetic instruction or chain-of-thought datasets can transfer LLM behavior to smaller students. The paper also surveys inference frameworks and kernels (FlashAttention, DeepSpeed, FlexGen, PowerInfer) that matter in real‑

Problem Statement

Transformer LLMs have high memory and compute needs that block deployment on constrained hardware. Two LLM-specific challenges stand out: retraining or finetuning is very expensive, and compressed models must keep broad task generality and emergent abilities. The survey asks which compression and serving methods work in practice for large models (well above 1B parameters).

Main Contribution

A taxonomy and plain-language review of LLM compression: quantization, pruning, distillation, compact architectures, and dynamic networks.

A focused treatment of LLM-specific challenges: tuning-free PTQ, preserving generality/emergent abilities, and low-cost PEFT approaches.

Key Findings

Quantizing FP32 weights to 4-bit cuts model size to roughly one-eighth.

Numbers: ≈1/8 model size when FP32→INT4

Practical Use: Use 4-bit weight quantization to reduce memory pressure on inference devices; expect large memory savings but test downstream accuracy before deployment.

Evidence Ref: Section 3 (Quantization basic concepts)
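The one-eighth figure follows directly from the bit widths (4 / 32). A minimal sketch of the arithmetic, assuming an illustrative 13B-parameter model and ignoring quantization metadata (scales, zero-points) and activation memory; the parameter count and the helper name `weight_memory_gb` are hypothetical, not taken from the survey:

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores per-group scales/zero-points and any activation or KV-cache memory.
    """
    return num_params * bits_per_weight / 8 / 1e9


# Illustrative parameter count (an assumption, not a figure from the survey).
params = 13e9

fp32 = weight_memory_gb(params, 32)
int4 = weight_memory_gb(params, 4)
ratio = int4 / fp32

print(f"FP32: {fp32:.1f} GB, INT4: {int4:.1f} GB, ratio: {ratio:.3f}")
```

The same arithmetic gives one-quarter when the starting point is FP16 rather than FP32, which is worth keeping in mind when comparing against an FP16 serving baseline.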

Post-training, layer-aware rounding methods (e.g., GPTQ/OPTQ) can quantize very large models in reasonable time.

Numbers: GPTQ quantizes OPT‑175B in ~4 hours on one A100

Practical Use: If you lack resources to retrain, try GPTQ-style PTQ to get practical low-bit models without full QAT.

Evidence Ref: Section 3.3.1 (GPTQ description)
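For orientation, the sketch below shows plain group-wise round-to-nearest (RTN) 4-bit quantization, the simple baseline that GPTQ-style methods improve on. It is not GPTQ itself: GPTQ additionally uses approximate second-order information to adjust the remaining weights after each rounding step. The group size, the synthetic weights, and all function names here are assumptions for illustration.

```python
import random


def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax
    if scale == 0.0:  # constant group: any positive scale works
        scale = 1.0
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # synthetic weight row
group_size = 64  # hypothetical group size

recon, all_codes, max_scale = [], [], 0.0
for start in range(0, len(row), group_size):
    codes, scale, lo = quantize_group(row[start:start + group_size])
    all_codes.extend(codes)
    max_scale = max(max_scale, scale)
    recon.extend(dequantize_group(codes, scale, lo))

max_err = max(abs(a - b) for a, b in zip(row, recon))
print(f"max abs reconstruction error: {max_err:.5f}")
```

With RTN, the per-weight error is bounded by half a quantization step within each group; smaller groups tighten that bound at the cost of storing more scales and offsets.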

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity (Wikitext-2)	FP16 10.09 → AWQ 10.46 (OPT-66B)	FP16 model perplexity 10.09	+0.37	Wikitext-2	Table 2: AWQ on OPT-66B	Table 2
Perplexity (Wikitext-2)	FP16 8.34 → GPTQ 8.68 (OPT-175B)	FP16 model perplexity 8.34	+0.34	Wikitext-2	Table 2: OPTQ/GPTQ on OPT-175B	Table 2
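To put the deltas in context, a short sketch of how perplexity is computed from per-token negative log-likelihoods and how the reported gaps translate into relative increases. The perplexity values are the ones in the table above; the function names are hypothetical.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLL in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_increase(baseline, quantized):
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (quantized - baseline) / baseline


# Reported Wikitext-2 perplexities (FP16 baseline, quantized) from the table above.
rows = {
    "AWQ / OPT-66B": (10.09, 10.46),
    "GPTQ / OPT-175B": (8.34, 8.68),
}

for name, (fp16, quant) in rows.items():
    delta = quant - fp16
    rel = relative_increase(fp16, quant)
    print(f"{name}: delta={delta:+.2f}, relative={rel:.1%}")
```

Both rows work out to a relative perplexity increase of roughly 4%, which is why a small absolute delta on one model should not be compared directly with the same delta on a model with a different baseline.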

What To Try In 7 Days

Run GPTQ (OPTQ) PTQ on a medium LLM to measure memory and perplexity change.

Try 4-bit weight-only quantization (LoRC/QLoRA path) for a finetuning workflow on a 13B model.

Benchmark FlashAttention or DeepSpeed Inference for your service to reduce latency before changing models.
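For the benchmarking step, a minimal latency harness can help compare a baseline against an optimized kernel or runtime under identical conditions. This is a sketch under stated assumptions: `fake_generate` is a hypothetical stand-in for your own model call, and the warmup and run counts are arbitrary defaults.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=20):
    """Return median and p95 wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):  # warm caches, JIT, allocator before timing
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return {"median_ms": statistics.median(samples), "p95_ms": p95}


def fake_generate():
    """Hypothetical stand-in for a model inference call."""
    sum(i * i for i in range(10_000))


stats = benchmark(fake_generate)
print(stats)
```

When timing GPU work, make sure the device has finished before reading the clock (for example with `torch.cuda.synchronize()` in PyTorch); otherwise asynchronous kernel launches make latencies look smaller than they are.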

Optimization Features

Token Efficiency

context compression and token pruning; early-exit and cascade inference

Infra Optimization

tensor/pipeline/expert parallelism; NVMe and CPU offload for huge models; latency-aware deployment (DeepSpeed Inference)

Model Optimization

quantization (PTQ and QAT); unstructured and structured pruning; knowledge distillation; MoE

System Optimization

GPU memory aggregation; GPU-CPU hybrid serving (PowerInfer); specialized kernels (LUT-GEMM)

Training Optimization

quantization-aware training (QAT); LoRA; layerwise or stage-wise distillation

Inference Optimization

operator fusion; FlashAttention kernels; weight-only vs weight+activation tradeoffs; offloading (FlexGen, DeepSpeed heterogeneous)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Many methods rely on small calibration sets; results depend on calibration quality.

Pruning and aggressive low-bit schemes may harm generative or emergent abilities if not validated.

When Not To Use

Avoid extreme low-bit (2-bit) quantization for tasks requiring reliable step-by-step generation or reasoning without re-training.

Do not use one-shot pruning when you require structured speedups on commodity hardware unless you target N:M sparsity supported by hardware.
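To make the N:M constraint concrete, the sketch below applies a 2:4 pattern: in every block of four consecutive weights, the two largest by magnitude are kept and the rest are zeroed. The magnitude criterion is a simple stand-in chosen for illustration; the one-shot methods the survey covers select weights using second-order information instead. The function name and sample values are hypothetical.

```python
def prune_n_of_m(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in every block of m; zero the rest."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        block = weights[start:start + m]
        # Indices of the n largest-magnitude entries in this block.
        keep = sorted(range(m), key=lambda i: abs(block[i]), reverse=True)[:n]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(block))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse = prune_n_of_m(row)
print(sparse)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.25, 0.0]
```

Because every block has exactly the same number of nonzeros, hardware with N:M sparse support can skip the zeros predictably, which unstructured sparsity at the same overall ratio does not guarantee.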

Failure Modes

Activation outliers cause quantization collapse and degrade generation (token-by-token accumulation).

MoE routing imbalance leads to undertrained experts and collapse to a few experts.
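The activation-outlier failure mode can be illustrated with a toy experiment: with per-tensor scaling, a single large value stretches the quantization range and leaves very few levels for the typical values. This is a synthetic sketch; the Gaussian activations, the outlier magnitude of 60, and the 8-bit symmetric scheme are all assumptions chosen for illustration.

```python
import random


def rtn_error(values, bits=8):
    """Mean absolute error of symmetric per-tensor round-to-nearest quantization."""
    qmax = (1 << (bits - 1)) - 1
    scale = max(abs(v) for v in values) / qmax
    recon = [round(v / scale) * scale for v in values]
    return sum(abs(a - b) for a, b in zip(values, recon)) / len(values)


random.seed(0)
acts = [random.gauss(0.0, 1.0) for _ in range(1024)]  # synthetic activations
with_outlier = acts + [60.0]  # one synthetic outlier channel value

err_clean = rtn_error(acts)
err_outlier = rtn_error(with_outlier)
print(f"clean: {err_clean:.4f}, with outlier: {err_outlier:.4f}")
```

The error on ordinary values grows by roughly the factor by which the outlier widens the range, which is the motivation for finer-grained (per-channel or per-group) scaling and for methods that handle outliers separately.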

Core Entities

Models

BERT, GPT-2, GPT-3, OPT, BLOOM, GLM-130B, LLaMA, LLaMA-65B, GPT-4 (referenced), Mixtral / Mistral variants

Metrics

Perplexity (lower is better), Accuracy, Throughput (tokens/s), Latency (ms), Model size (GB)

Datasets

Wikitext-2, C4, LAMBADA, instruction/CoT synthetic datasets (various sizes)

Benchmarks

Perplexity, Accuracy, Downstream task finetune metrics

You May Also Want to Read

Quantizing large multilingual LLMs often hides big drops for non‑Latin languages and hard tasks

Systematic benchmark shows small models can reason if trained and compressed carefully

Pipeline that combines synthetic-data distillation, LoRA, Muon and GPTQ to make task-specialized LLMs fit on edge devices

Large-scale empirical benchmark showing how attention variants, PEFT, MoE, and int4 quantization trade performance for memory, latency, and

Survey: how to run reasoning-capable LLMs and autonomous agents on memory- and power-limited edge devices
