Practical survey of quantization, pruning, distillation, and decoding tricks to make LLMs cheaper and faster

August 6, 2024 · 6 min read

Overview

Production Readiness

0.6

Novelty Score

0.3

Cost Impact Score

0.8

Citation Count

0

Authors

Leo Donisch, Sigurd Schacht, Carsten Lanquillon

Why It Matters For Business

Optimizations can cut memory and latency costs enough to enable local hosting, cheaper cloud inference, or higher throughput. Choose a method based on the hardware you control and the quality you must preserve.

Summary TLDR

This is a focused literature review of practical inference optimizations for large language models. It compares quantization, pruning, knowledge distillation, and architectural/decoding changes. The paper explains where each method saves memory, latency, or cost, lists main failure modes (outliers, hardware limits, copied hallucinations), and gives practical constraints such as hardware kernel support and compute needed for compression. It aims to help engineers pick a method by trade-offs rather than present new experiments.

Problem Statement

Modern large language models are powerful but costly to run. Teams need practical ways to reduce memory, latency, and inference cost while preserving model quality. The paper surveys techniques (quantization, pruning, distillation, attention/decoding optimizations) and compares their practical trade-offs and deployment constraints.

Main Contribution

Taxonomy of inference optimizations covering quantization, pruning, knowledge distillation, and architectural/decoding tricks

Concise practical discussion of pros/cons, resource needs, and deployment constraints for each technique

Collection of representative methods and implementation notes (e.g., LLM.int8(), GPTQ, AWQ, ZeroQuant, SparseGPT, vLLM, speculative decoding)

Key Findings

Speculative decoding can cut inference latency substantially by proposing tokens from a smaller model and checking them with the target model.

Numbers: 2x–3x latency improvement reported
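
The propose-then-verify loop can be sketched with toy stand-ins. This is a minimal greedy variant under stated assumptions, not the sampling-based acceptance rule of published methods: `target_next`, `draft_next`, and `speculative_decode` are hypothetical names, and the "models" are plain functions over integer tokens.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def target_next(prefix: List[int]) -> int:
    """Toy 'large' model (assumption): next token is (last + 1) mod 10."""
    return (prefix[-1] + 1) % 10


def draft_next(prefix: List[int]) -> int:
    """Toy 'small' model (assumption): agrees with the target except after a 4."""
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


def speculative_decode(
    prompt: List[int], target: NextToken, draft: NextToken,
    new_tokens: int, k: int = 4,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target checks them."""
    out = list(prompt)
    goal = len(prompt) + new_tokens
    target_passes = 0
    while len(out) < goal:
        # 1) The draft proposes up to k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target checks every proposed position. A real system does this
        #    in one batched forward pass; here it is simulated with a loop.
        target_passes += 1
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # First mismatch: keep the target's token, discard the rest.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every proposed token was accepted
    return out[:goal], target_passes


tokens, passes = speculative_decode([0], target_next, draft_next, new_tokens=12, k=4)
print(tokens, passes)
```

In this toy run the output is identical to plain greedy decoding with the target, while the target is consulted 4 times instead of 12. Real gains depend on how often the draft is accepted and on how cheap the draft model is relative to the target.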

Unstructured pruning at scale is feasible using recent algorithms.

Numbers: GPT-style 175B model pruned in ≈4 hours (SparseGPT)
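
To make "unstructured" concrete, the sketch below zeroes the smallest-magnitude weights wherever they sit. This is plain magnitude pruning, a much simpler baseline than SparseGPT, which also adjusts the remaining weights to compensate. `magnitude_prune` is a hypothetical helper and the weights are made up.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero the `sparsity` fraction of weights with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by magnitude; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


row = [0.8, -0.05, 0.01, 0.3, -0.6, 0.02, -0.1, 0.4]  # illustrative values
pruned = magnitude_prune(row, sparsity=0.5)
print(pruned)
```

The zeros land wherever the small weights happen to be, so the result only saves time or memory if the inference stack can exploit that kind of sparsity.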

4-bit quantization often hits a sweet spot between compression and quality on evaluated models.

Numbers: 4-bit described as 'almost universally optimal' in prior work

Mixed-precision 8-bit approaches can reduce memory but do not always speed up inference, because of dequantization costs.

Numbers: LLM.int8() shows a memory cut but limited speed-up without optimized kernels

Activation quantization is harder than weight quantization and often hurts quality more.

Numbers: Multiple works recommend keeping activations at higher bitwidth (e.g., weights 4-bit, activations 8-bit)
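
A small sketch of why bit-width and outliers matter. It uses generic round-to-nearest symmetric quantization with a single absmax scale per tensor, which is an assumption for illustration and not the scheme of GPTQ, AWQ, or any other method named here. All function names and values are hypothetical.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float], bits: int) -> Tuple[List[int], float]:
    """Round-to-nearest symmetric quantization with one absmax scale per tensor."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [x * scale for x in q]


def max_abs_error(values: List[float], bits: int) -> float:
    """Largest reconstruction error after a quantize/dequantize round trip."""
    q, scale = quantize_symmetric(values, bits)
    return max(abs(v - d) for v, d in zip(values, dequantize(q, scale)))


weights = [0.12, -0.40, 0.33, -0.07, 0.25, -0.18]  # illustrative values
with_outlier = weights + [6.0]                     # one large value stretches the scale

print(max_abs_error(weights, 8), max_abs_error(weights, 4))
print(max_abs_error(with_outlier, 4))
```

With a single outlier in the tensor, the 4-bit scale becomes so coarse that most small values round to zero. Tensors with occasional large values are therefore the hard case for low bit-widths, which is one motivation for keeping part of the computation at higher precision.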

Some PTQ and sparse-quantized methods can compress very large models in hours using optimized implementations.

Numbers: ZeroQuant and SpQR compress ~175B models in hours

Results

Model FP32 memory example

Value: ≈334 GB for a 175B-parameter model in FP32
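
Weight memory scales linearly with bit-width, which is why the bit targets matter. A back-of-the-envelope helper, assuming weights only and decimal GB (`weight_memory_gb` is a hypothetical name):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for weights only, in decimal GB (1e9 bytes).

    Excludes KV cache, activations, and per-group quantization metadata.
    """
    return n_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(175e9, bits):.1f} GB")
```

For 175B parameters this arithmetic gives 700 GB at 32-bit, 350 GB at 16-bit, 175 GB at 8-bit, and 87.5 GB at 4-bit, so the ≈334 GB figure quoted above appears closer to a 16-bit footprint than a 32-bit one.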

Speculative decoding latency

Value: 2x–3x speedup

Baseline: standard autoregressive decoding

Pruning time at scale

Value: ≈4 hours to prune a 175B GPT-style model

Quantization bit targets

Value: common targets are 8-bit, 4-bit, and 3–4 bits (SpQR)

Baseline: FP16/FP32

Who Should Care

What To Try In 7 Days

Run 8-bit post-training quantization (PTQ) on a smaller model to measure memory savings and baseline accuracy

Benchmark speculative decoding on a generation pipeline to check whether the reported 2x–3x latency gains hold for your workload

Test weight-only 4-bit PTQ on a representative workload before attempting activation quantization

Optimization Features

Token Efficiency

  • sliding-window attention
  • attention sinks
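
The two items combine naturally: keep the keys/values of the last few tokens (the sliding window) plus the first few tokens of the sequence (the attention sinks). A minimal sketch of which cache positions survive; `kept_positions` and its parameters are hypothetical names, not any library's API.

```python
from typing import List


def kept_positions(seq_len: int, window: int, n_sinks: int = 0) -> List[int]:
    """Token positions whose keys/values stay in the cache for the next step."""
    sinks = list(range(min(n_sinks, seq_len)))       # first tokens, always kept
    recent_start = max(len(sinks), seq_len - window)  # avoid double-counting sinks
    return sinks + list(range(recent_start, seq_len))


print(kept_positions(10, window=4))             # window only
print(kept_positions(10, window=4, n_sinks=2))  # window plus two sink tokens
```

The cache size is bounded by `window + n_sinks` regardless of sequence length, which is where the memory saving comes from.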

Infra Optimization

  • hardware kernel support
  • CUTLASS kernel usage
  • avoid dequantize overhead

Model Optimization

  • quantization
  • pruning
  • knowledge distillation

System Optimization

  • custom GPU kernels
  • memory block allocation (paged attention)
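
The idea behind paged block allocation can be sketched as a toy allocator: KV-cache memory is split into fixed-size blocks, and each sequence holds a table of block ids that grows only when its last block is full. `PagedKVAllocator` is a hypothetical class for illustration and does not reflect vLLM's actual interfaces.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Toy block allocator: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free: List[int] = list(range(num_blocks))
        self.tables: Dict[str, List[int]] = {}  # sequence id -> physical block ids
        self.lengths: Dict[str, int] = {}       # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> int:
        """Reserve a slot for one more token; returns the physical block used."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:            # last block is full (or none yet)
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop(0))
        self.lengths[seq_id] = n + 1
        return table[-1]

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")   # three tokens need two blocks
alloc.append_token("b")       # a second sequence takes the next free block
print(alloc.tables, alloc.free)
```

Because blocks are handed out on demand rather than reserved for the maximum sequence length, unused capacity stays in the free pool and can serve other requests.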

Training Optimization

  • quantization-aware training
  • distillation training
  • LoRA

Inference Optimization

  • KV cache / paged attention
  • windowed attention
  • FlashAttention
  • speculative decoding

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Survey-only: no new experiments or cross-method head-to-head benchmarks
  • Practical claims depend on cited implementations and hardware support
  • Coverage varies by topic; some methods lack uniform evaluation guidance

When Not To Use

  • When you lack hardware/kernel support for a given quantization format
  • When absolute parity with the original model's outputs is required
  • When you cannot afford the compute and memory overhead of a distillation training pipeline

Failure Modes

  • Quantization can produce outlier-driven errors and quality drops without mixed precision
  • Pruning may yield irregular sparsity incompatible with target inference hardware
  • Distillation can transfer hallucinations, bias, or undesirable behaviors from teacher to student

Core Entities

Models

  • BERT
  • BLOOM
  • Llama2
  • CodeLlama
  • GPT-style models

Metrics

  • latency
  • memory consumption
  • bit-width
  • inference throughput

Datasets

  • SQuAD
  • SST-2

Context Entities

Models

  • LLM.int8()
  • GPTQ
  • AWQ
  • OWQ
  • SpQR
  • ZeroQuant
  • ZeroQuantV2
  • SmoothQuant
  • OmniQuant
  • SparseGPT
  • LoRA
  • LLM-Pruner
  • Prune and Tune
  • Wanda
  • MiniLLM
  • LaMini-LM
  • vLLM
  • Speculative Decoding
  • FlashAttention

Metrics

  • bits (8,4,3)
  • hours-to-compress
  • memory-GB
  • acceptance-rate (speculative decoding)