Practical survey of quantization, pruning, distillation, and decoding tricks to make LLMs cheaper and faster

Overview

Decision Snapshot: Needs Validation

The paper compiles recent, practical findings on deployment-ready techniques (PTQ, pruning, decoding tricks) but reports no original experiments; check hardware compatibility before applying any method.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 30%

Authors

Leo Donisch, Sigurd Schacht, Carsten Lanquillon

Links

Abstract / PDF

Why It Matters For Business

Optimizations can cut memory and latency costs enough to enable local hosting, cheaper cloud inference, or higher throughput; pick a method by the hardware you control and quality you must preserve.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

This is a focused literature review of practical inference optimizations for large language models. It compares quantization, pruning, knowledge distillation, and architectural/decoding changes. The paper explains where each method saves memory, latency, or cost; lists the main failure modes (outliers, hardware limits, copied hallucinations); and notes practical constraints such as hardware kernel support and the compute needed for compression. It aims to help engineers pick a method by its trade-offs rather than to present new experiments.

Problem Statement

Modern large language models are powerful but costly to run. Teams need practical ways to reduce memory, latency, and inference cost while preserving model quality. The paper surveys techniques (quantization, pruning, distillation, attention/decoding optimizations) and compares their practical trade-offs and deployment constraints.

Main Contribution

Taxonomy of inference optimizations covering quantization, pruning, knowledge distillation, and architectural/decoding tricks

Concise practical discussion of pros/cons, resource needs, and deployment constraints for each technique

Key Findings

Speculative decoding can cut inference latency substantially by proposing tokens from a smaller model and checking them with the target model.

Numbers: 2x–3x latency improvement reported

Practical Use: Try speculative decoding for generation-heavy workloads; it can cut latency to roughly a half or a third if running a two-model pipeline is acceptable.

Evidence Ref: [83]
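The propose-and-check loop described in this finding can be sketched as follows. This is a minimal illustration of the greedy variant only, not the cited implementation: `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are toy functions over integer tokens, and a real system would verify all drafted positions in one batched forward pass of the target model rather than in a Python loop.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def greedy_decode(model: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new_tokens: int,
                       k: int = 4) -> List[Token]:
    """Greedy speculative decoding; output matches target-only greedy decoding."""
    out = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(out) < goal:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target checks each proposed position in order.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # First mismatch: keep the accepted prefix plus the target's token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every drafted token was accepted.
            out.extend(proposal)
    return out


# Toy stand-ins (assumptions for illustration, not real language models).
def toy_target(ctx: List[Token]) -> Token:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[Token]) -> Token:
    guess = toy_target(ctx)
    # Deliberately wrong whenever the context length is a multiple of 3.
    return guess if len(ctx) % 3 else (guess + 1) % 50


result = speculative_decode(toy_target, toy_draft, [1, 2, 3], max_new_tokens=20)
```

Because a drafted token is kept only when the target would have produced the same token, the result is identical to plain greedy decoding with the target; the speedup in practice depends on how often the draft is accepted and on how cheaply the target can verify several positions at once.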

Unstructured pruning at scale is feasible using recent algorithms.

Numbers: GPT-style 175B model pruned in ≈4 hours (SparseGPT)

Practical Use: Use one-shot pruning methods for large models when you need fast compression; plan for significant CPU/GPU work during pruning.

Evidence Ref: [50]
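To make "unstructured pruning" concrete, here is a minimal sketch of plain magnitude pruning on a flat list of weights. It is a much simpler stand-in and is not SparseGPT, which is more involved; the function name `magnitude_prune` and the example weights are illustrative assumptions.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending |w|; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.1, -0.9]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

The zeros land wherever the small weights happen to be, which is what makes the sparsity pattern irregular; whether that pattern yields a real speedup depends on the target hardware and kernels.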

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Model FP32 memory example	≈334 GB for a 175B-parameter model in FP32	—	—	—	Memory estimate given for 175B FP32 -> ~334 GB	—
Speculative decoding latency	2x–3x speedup	standard autoregressive decoding	2×–3× lower latency	evaluated on generation, translation, summarization, dialogue	Reported 2x–3x latency improvement without quality loss	[83]
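A back-of-the-envelope check of weight memory can be sketched as below. It assumes weights only (no KV cache, activations, or runtime overhead) and decimal gigabytes; `weight_memory_gb` is a hypothetical helper. Under this simple count, 4 bytes per parameter gives about 700 GB for 175B parameters and 2 bytes gives about 350 GB, so the ≈334 GB figure in the table may correspond to a 16-bit format or a different accounting; it is worth checking against the paper before relying on it.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(175e9, bits):7.1f} GB")
```

The same helper makes the appeal of low-bit formats easy to see: each halving of the bit-width halves the weight footprint.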

What To Try In 7 Days

Run 8-bit post-training quantization (PTQ) on a smaller model to measure memory savings and baseline accuracy

Benchmark speculative decoding on a generation pipeline to check 2x–3x latency gains

Test weight-only 4-bit PTQ on a representative workload before attempting activation quantization
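Before running the PTQ experiments above on a real model, it can help to see the basic mechanics on a handful of numbers. The sketch below shows symmetric per-tensor "absmax" quantization, one common and simple PTQ scheme; it is an illustrative assumption, not the method of any specific library or of the surveyed papers, and the helper names and weights are made up.

```python
from typing import List, Tuple


def quantize_absmax(weights: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: w is approximated by q * scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    err = mean_abs_error(weights, dequantize(q, scale))
    print(f"{bits}-bit: scale={scale:.5f}, mean abs error={err:.5f}")
```

On a real model the quantity to track is task accuracy rather than per-weight error, but the same trade-off applies: fewer bits mean a coarser grid and a larger reconstruction error.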

Optimization Features

Token Efficiency

sliding-window attention, attention sinks
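As a rough sketch of how these two ideas combine, the helper below lists which cached positions stay visible to the next token when a sliding window is paired with attention sinks, that is, the first few tokens are always kept alongside the most recent ones. The function name and parameters are hypothetical, and real implementations operate on KV-cache tensors rather than index lists.

```python
from typing import List


def visible_positions(seq_len: int, window: int, n_sinks: int) -> List[int]:
    """Cache positions kept for the next token: first n_sinks plus last `window`."""
    sinks = range(min(n_sinks, seq_len))
    recent = range(max(seq_len - window, 0), seq_len)
    # Merge, deduplicate, and return in ascending order.
    return sorted(set(sinks) | set(recent))


print(visible_positions(seq_len=10, window=4, n_sinks=2))
print(visible_positions(seq_len=3, window=4, n_sinks=2))
```

The cache size is bounded by `window + n_sinks` regardless of sequence length, which is where the memory saving comes from.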

Infra Optimization

hardware kernel support, CUTLASS kernel usage, avoid dequantize overhead

Model Optimization

quantization, pruning, knowledge distillation

System Optimization

custom GPU kernels, memory block allocation (paged attention)

Training Optimization

quantization-aware training, distillation training, LoRA

Inference Optimization

KV cache / paged attention, windowed attention, FlashAttention, speculative decoding

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Survey-only: no new experiments or cross-method head-to-head benchmarks

Practical claims depend on cited implementations and hardware support

When Not To Use

When you lack hardware/kernel support for a given quantization format

When absolute parity with the original model's outputs is required

Failure Modes

Quantization can produce outlier-driven errors and quality drops without mixed precision

Pruning may yield irregular sparsity incompatible with target inference hardware
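The outlier failure mode can be illustrated with a small numeric sketch. It assumes per-tensor absmax quantization and a simple mixed-precision fallback that leaves values above a threshold unquantized; the helper names, the threshold of 6.0, and the sample values are all illustrative assumptions rather than any library's actual behavior.

```python
from typing import List


def fake_quant(values: List[float], bits: int = 8) -> List[float]:
    """Quantize with a per-tensor absmax scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax
    return [round(v / scale) * scale for v in values]


def mixed_precision_quant(values: List[float], threshold: float,
                          bits: int = 8) -> List[float]:
    """Keep |v| >= threshold in full precision; quantize the rest together."""
    inliers = [v for v in values if abs(v) < threshold]
    quantized = iter(fake_quant(inliers, bits))
    return [v if abs(v) >= threshold else next(quantized) for v in values]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


values = [0.11, -0.42, 0.27, 0.05, -0.33, 60.0, 0.18, -0.09]  # one outlier
naive = mean_abs_error(values, fake_quant(values))
mixed = mean_abs_error(values, mixed_precision_quant(values, threshold=6.0))
print(f"per-tensor 8-bit error:          {naive:.4f}")
print(f"outlier kept in full precision:  {mixed:.4f}")
```

A single large value stretches the quantization grid so far that most small values round to zero or to the nearest coarse step; separating the outlier restores a fine grid for everything else, at the cost of a second, higher-precision code path.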

Core Entities

Models

BERT, BLOOM, Llama2, CodeLlama, GPT-style models

Metrics

latency, memory consumption, bit-width, inference throughput

Datasets

SQuAD, SST-2

Context Entities

Models

LLM.int8(), GPTQ, AWQ, OWQ, SpQR, ZeroQuant, ZeroQuantV2, SmoothQuant, OmniQuant, SparseGPT, LoRA, LLM-Pruner, Prune and Tune, Wanda, MiniLLM, LaMini-LM, vLLM, Speculative Decoding, FlashAttention

Metrics

bits (8, 4, 3), hours-to-compress, memory-GB, acceptance-rate (speculative decoding)
