Practical survey of quantization, pruning, distillation, and decoding tricks to make LLMs cheaper and faster

August 6, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

The paper compiles recent, practical findings on deployment-ready techniques (post-training quantization, pruning, decoding tricks) but reports no original experiments; verify hardware compatibility before applying any method.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 30%

Authors

Leo Donisch, Sigurd Schacht, Carsten Lanquillon

Why It Matters For Business

Optimizations can cut memory and latency costs enough to enable local hosting, cheaper cloud inference, or higher throughput; choose a method based on the hardware you control and the quality you must preserve.

Summary TLDR

This is a focused literature review of practical inference optimizations for large language models. It compares quantization, pruning, knowledge distillation, and architectural/decoding changes. The paper explains where each method saves memory, latency, or cost, lists the main failure modes (activation outliers, hardware limits, hallucinations copied from a teacher model during distillation), and gives practical constraints such as hardware kernel support and the compute needed for compression. It aims to help engineers choose a method by its trade-offs rather than to present new experiments.

Problem Statement

Modern large language models are powerful but costly to run. Teams need practical ways to reduce memory, latency, and inference cost while preserving model quality. The paper surveys techniques (quantization, pruning, distillation, attention/decoding optimizations) and compares their practical trade-offs and deployment constraints.

Main Contribution

Taxonomy of inference optimizations covering quantization, pruning, knowledge distillation, and architectural/decoding tricks

Concise practical discussion of pros/cons, resource needs, and deployment constraints for each technique

Key Findings

Speculative decoding can cut inference latency substantially by proposing tokens from a smaller model and checking them with the target model.

Numbers: 2x–3x latency improvement reported

Practical Use: Try speculative decoding for generation-heavy workloads; it can cut latency to one-half or one-third if running a two-model (draft plus target) flow is acceptable.

Evidence Ref: [83]
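
The propose-then-check loop described above can be sketched as follows. This is a minimal greedy variant under stated assumptions, not the implementation from [83]: the `NextToken` interface and both toy models are hypothetical, and the per-position check that a real system performs in one batched target forward pass is simulated here with a loop.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a model maps a token prefix to its next token id.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (sequence, verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1) The small draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2) The target checks each proposed position (one batched pass in practice).
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
        else:
            accepted.append(target(seq + accepted))  # all accepted: one bonus token
        seq.extend(accepted)
    return seq[:goal], passes


# Toy models, made up for illustration only.
def target_model(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 17


def draft_model(seq: List[int]) -> int:
    # Agrees with the target unless the last token is a multiple of 5.
    return 0 if seq[-1] % 5 == 0 else (seq[-1] * 3 + 1) % 17


prompt = [2]
baseline = greedy_decode(target_model, prompt, 12)
spec, passes = speculative_decode(target_model, draft_model, prompt, 12, k=4)
print(f"identical output: {spec == baseline}, verification passes: {passes}")
```

Because every emitted token is either confirmed or supplied by the target, the greedy variant reproduces the target's own output; the saving comes from needing fewer target passes than generated tokens, which only pays off when the draft model is much cheaper than the target.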

Unstructured pruning at scale is feasible using recent algorithms.

Numbers: GPT-style 175B model pruned in ≈4 hours (SparseGPT)

Practical Use: Use one-shot pruning methods for large models when you need fast compression; plan for significant CPU/GPU work during pruning.

Evidence Ref: [50]
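
To make "one-shot unstructured pruning" concrete, the sketch below zeroes the smallest-magnitude weights of a small matrix in a single pass, with no retraining. It is plain magnitude pruning, a deliberately simpler stand-in and not the SparseGPT algorithm; the function name and the example matrix are illustrative assumptions.

```python
from typing import List


def magnitude_prune(weights: List[List[float]], sparsity: float) -> List[List[float]]:
    """Return a copy of `weights` with the smallest-|w| fraction set to zero."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    rows, cols = len(weights), len(weights[0])
    # Rank every (row, col) position by the magnitude of its weight.
    order = sorted(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: abs(weights[rc[0]][rc[1]]),
    )
    n_prune = int(rows * cols * sparsity)
    pruned = [row[:] for row in weights]
    for r, c in order[:n_prune]:
        pruned[r][c] = 0.0  # zeros may land anywhere: the pattern is unstructured
    return pruned


W = [[0.9, -0.05, 0.3, 0.01],
     [-0.7, 0.2, -0.02, 0.6]]
print(magnitude_prune(W, sparsity=0.5))
```

The zeros fall wherever the small weights happen to be, which is why unstructured sparsity needs kernel or hardware support before it translates into real speedups.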

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Model FP32 memory example | ≈334 GB for a 175B-parameter model in FP32 | - | - | - | Memory estimate given for 175B FP32 -> ~334 GB | -
Speculative decoding latency | 2x–3x speedup | standard autoregressive decoding | 2–3× lower latency | generation, translation, summarization, dialogue | Reported 2x–3x latency improvement without quality loss | [83]
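
A quick way to sanity-check memory figures like the one in the table is the weights-only rule: parameters times bytes per parameter. The helper below is a rough sketch that assumes decimal gigabytes and ignores activations, KV cache, and framework overhead. By that rule, FP32 (4 bytes per parameter) gives about 700 GB for 175B parameters and 16-bit storage gives about 350 GB, so the ≈334 GB figure reported above is best treated as the paper's own estimate and rechecked against your model before capacity planning.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(175e9, bits):7.1f} GB")
```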

What To Try In 7 Days

Run 8-bit post-training quantization (PTQ) on a smaller model to measure memory savings and establish baseline accuracy

Benchmark speculative decoding on a generation pipeline to check 2x–3x latency gains

Test weight-only 4-bit PTQ on a representative workload before attempting activation quantization
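
Before reaching for a library, it can help to see what weight-only PTQ does numerically. The sketch below uses symmetric per-tensor absmax quantization, one common scheme chosen here as an assumption and not a method prescribed by the paper; the weight values are made up.

```python
from typing import List, Tuple


def quantize_absmax(values: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization onto signed integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [qi * scale for qi in q]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


weights = [0.12, -0.48, 0.33, 0.05, -0.91, 0.77, -0.26, 0.64]
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    err = mean_abs_error(weights, dequantize(q, scale))
    print(f"{bits}-bit: scale={scale:.4f}, mean abs error={err:.4f}")
```

Going from 8 to 4 bits halves the storage again but coarsens the grid, which is why measuring accuracy on a representative workload comes before any move to activation quantization.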

Optimization Features

Token Efficiency
sliding-window attention, attention sinks
Infra Optimization
hardware kernel support, CUTLASS kernel usage, avoid dequantize overhead
Model Optimization
quantization, pruning, knowledge distillation
System Optimization
custom GPU kernels, memory block allocation (paged attention)
Training Optimization
quantization-aware training, distillation training, LoRA
Inference Optimization
KV cache / paged attention, windowed attention, FlashAttention, speculative decoding
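
For the sliding-window attention and attention sinks entries, the sketch below shows which token positions stay in the KV cache. It assumes the common formulation in which the cache keeps a few initial tokens (the sinks) plus a window of the most recent tokens; the function and parameter names are hypothetical.

```python
from typing import List


def retained_positions(seq_len: int, window: int, n_sink: int = 0) -> List[int]:
    """Token positions kept in the KV cache after `seq_len` tokens were processed."""
    sinks = list(range(min(n_sink, seq_len)))          # first tokens, never evicted
    recent_start = max(len(sinks), seq_len - window)   # avoid double-counting sinks
    return sinks + list(range(recent_start, seq_len))


print(retained_positions(seq_len=10, window=4))            # plain sliding window
print(retained_positions(seq_len=10, window=4, n_sink=2))  # window plus sinks
```

Either way the cache holds at most `n_sink + window` entries, so memory stays constant as the sequence grows.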

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Survey-only: no new experiments or cross-method head-to-head benchmarks

Practical claims depend on cited implementations and hardware support

When Not To Use

When you lack hardware/kernel support for a given quantization format

When absolute parity with the original model's outputs is required

Failure Modes

Quantization can produce outlier-driven errors and quality drops without mixed precision

Pruning may yield irregular sparsity incompatible with target inference hardware
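
The outlier failure mode is easy to reproduce on a toy vector. In the sketch below, one large value inflates the absmax scale so that every small value rounds to zero; keeping values above a threshold in full precision and quantizing only the rest restores accuracy. This illustrates the general mixed-precision idea under assumed values and an assumed threshold, not any specific library's implementation.

```python
from typing import List


def quantize_roundtrip(values: List[float], bits: int = 8) -> List[float]:
    """Absmax-quantize then dequantize; returns the reconstructed values."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max((abs(v) for v in values), default=0.0)
    scale = absmax / qmax if absmax > 0 else 1.0
    return [round(v / scale) * scale for v in values]


def mixed_precision_roundtrip(values: List[float], threshold: float,
                              bits: int = 8) -> List[float]:
    """Keep |v| >= threshold untouched; quantize only the remaining values."""
    inliers = [v for v in values if abs(v) < threshold]
    restored = iter(quantize_roundtrip(inliers, bits))
    return [v if abs(v) >= threshold else next(restored) for v in values]


def max_abs_error(a: List[float], b: List[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


acts = [0.02, -0.03, 0.01, 0.04, -0.02, 60.0]  # made-up values, one large outlier
plain = quantize_roundtrip(acts)
mixed = mixed_precision_roundtrip(acts, threshold=6.0)
print("plain 8-bit max error:", max_abs_error(acts, plain))
print("mixed max error      :", max_abs_error(acts, mixed))
```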

Core Entities

Models

BERT, BLOOM, Llama2, CodeLlama, GPT-style models

Metrics

latency, memory consumption, bit-width, inference throughput

Datasets

SQuAD, SST-2

Context Entities

Models

LLM.int8(), GPTQ, AWQ, OWQ, SpQR, ZeroQuant, ZeroQuantV2, SmoothQuant, OmniQuant, SparseGPT, LoRA, LLM-Pruner, Prune and Tune, Wanda, MiniLLM, LaMini-LM, vLLM, Speculative Decoding, FlashAttention

Metrics

bits (8, 4, 3), hours-to-compress, memory-GB, acceptance-rate (speculative decoding)
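
The acceptance-rate metric links directly to how much speculative decoding can help. Assuming each drafted token is accepted independently with probability `alpha` (a simplifying assumption) and `k` tokens are drafted per step, the expected number of tokens emitted per target pass is (1 - alpha^(k+1)) / (1 - alpha). The helper name below is illustrative.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens per target pass for acceptance rate `alpha` and `k` drafted tokens."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for alpha in (0.5, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, k=4):.2f} tokens per pass")
```

This counts tokens per target pass, not wall-clock speedup; the time spent running the draft model reduces the realized gain.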