Overview
The paper compiles recent, practical evidence on deployment-ready techniques such as post-training quantization (PTQ), pruning, and decoding tricks, but reports no original experiments. Check hardware compatibility before applying any of these methods.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 2/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 80%
Production readiness: 60%
Novelty: 30%
Why It Matters For Business
Optimizations can cut memory and latency costs enough to enable local hosting, cheaper cloud inference, or higher throughput. Choose a method based on the hardware you control and the quality you must preserve.
Summary TLDR
This is a focused literature review of practical inference optimizations for large language models. It compares quantization, pruning, knowledge distillation, and architectural/decoding changes. The paper explains where each method saves memory, latency, or cost, lists the main failure modes (outliers, hardware limits, hallucinations that a distilled model copies from its teacher), and gives practical constraints such as hardware kernel support and the compute needed for compression. It aims to help engineers choose a method by its trade-offs; it does not present new experiments.
Problem Statement
Modern large language models are powerful but costly to run. Teams need practical ways to reduce memory, latency, and inference cost while preserving model quality. The paper surveys techniques (quantization, pruning, distillation, attention/decoding optimizations) and compares their practical trade-offs and deployment constraints.
Main Contribution
Taxonomy of inference optimizations covering quantization, pruning, knowledge distillation, and architectural/decoding tricks
Concise practical discussion of pros/cons, resource needs, and deployment constraints for each technique
Key Findings
Speculative decoding can cut inference latency by a reported 2x–3x [83]: a smaller draft model proposes tokens and the target model verifies them.
Unstructured pruning at scale is feasible using recent algorithms.
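The propose-and-verify loop behind speculative decoding can be sketched in a few lines. This is a minimal greedy variant under stated assumptions, not the method of any cited implementation: `target_model` and `draft_model` are hypothetical toy functions standing in for real networks, and the sketch omits the extra "bonus" token that practical systems take from the target when every draft token is accepted.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]

def target_model(prefix: List[int]) -> int:
    # Hypothetical large model: an arbitrary deterministic toy rule.
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_model(prefix: List[int]) -> int:
    # Hypothetical small model: agrees with the target except when the
    # prefix length is a multiple of 4, where it is deliberately wrong.
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50

def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks the proposal. A real system scores all
        #    positions in one batched forward pass; we count it as one pass.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # fix the first mismatch, drop the rest
                break
        out.extend(accepted)
    return out, target_passes

tokens, passes = speculative_decode(target_model, draft_model, [1, 2, 3], n_new=12)
print(tokens == greedy_decode(target_model, [1, 2, 3], 12), passes)
```

Because every accepted token is checked against the target's own choice, the output matches plain greedy decoding with the target; the saving comes from needing fewer target passes than generated tokens. How large the saving is in practice depends on how often the draft agrees with the target.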
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Model FP32 memory example | ≈334 GB for a 175B-parameter model in FP32 (as reported) | — | — | — | Reported estimate; at 4 bytes per parameter, 175B FP32 weights alone come to ≈700 GB, so ≈334 GB is closer to a 16-bit footprint | — |
| Speculative decoding latency | 2x–3x speedup | standard autoregressive decoding | 2x–3x lower latency | generation, translation, summarization, dialogue | Reported 2x–3x latency improvement without quality loss | [83] |
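The memory row can be sanity-checked with simple arithmetic. The sketch below counts weights only (no activations, KV cache, or runtime overhead) and uses decimal gigabytes; the function name and the dtype table are illustrative assumptions. By this count, 175B parameters in FP32 need about 700 GB, so the ≈334 GB figure above sits closer to what a 16-bit format would give.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(175e9, dtype):.1f} GB")
# fp32: 700.0 GB
# fp16: 350.0 GB
# int8: 175.0 GB
# int4: 87.5 GB
```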
What To Try In 7 Days
Run 8-bit post-training quantization (PTQ) on a smaller model to measure memory savings and establish an accuracy baseline
Benchmark speculative decoding on a generation pipeline to check whether the reported 2x–3x latency gains hold on your workload
Test weight-only 4-bit PTQ on a representative workload before attempting activation quantization
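Before reaching for a quantization library, it can help to see what weight-only PTQ does to a tensor. The following is a minimal sketch, assuming symmetric round-to-nearest quantization with a single scale per tensor; production toolchains typically use per-channel or per-group scales and calibration data, and the random weights here are synthetic stand-ins for a real layer.

```python
import random

def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8, 7 for int4
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [v * scale for v in q]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
# Synthetic weights: small, roughly Gaussian values (an assumption for the demo).
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

for bits in (8, 4):
    q, scale = quantize_symmetric(weights, bits)
    err = mean_abs_error(weights, dequantize(q, scale))
    print(f"int{bits}: {32 // bits}x smaller than fp32, mean abs error {err:.6f}")
```

The 4-bit run stores half as many bits as the 8-bit run but has a much coarser grid, which is why the list above suggests validating weight-only 4-bit on a representative workload before going further.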
Risks & Boundaries
Limitations
Survey-only: no new experiments or cross-method head-to-head benchmarks
Practical claims depend on cited implementations and hardware support
When Not To Use
When you lack hardware/kernel support for a given quantization format
When absolute parity with the original model's outputs is required
Failure Modes
Without mixed precision, quantization can produce outlier-driven errors and quality drops
Pruning may yield irregular sparsity incompatible with target inference hardware
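The outlier failure mode is easy to reproduce on a toy tensor. In the sketch below, one large value stretches the shared quantization scale so far that most of the small values collapse to zero; keeping the outlier in full precision and quantizing only the rest restores accuracy. The data, the threshold of 6.0, and the helper names are illustrative assumptions, and the approach is a simplified stand-in for mixed-precision schemes, not any specific published method.

```python
def quantize_dequantize(values, bits=8):
    """Symmetric round-to-nearest with one scale for the whole tensor."""
    if not values:
        return []
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    return [round(v / scale) * scale for v in values]

def mixed_precision(values, threshold, bits=8):
    """Keep |v| > threshold in full precision; quantize everything else."""
    inliers = [v for v in values if abs(v) <= threshold]
    restored = iter(quantize_dequantize(inliers, bits))
    return [v if abs(v) > threshold else next(restored) for v in values]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# 101 small values in [-0.5, 0.5] plus a single large outlier.
values = [0.01 * i for i in range(-50, 51)] + [60.0]

plain = quantize_dequantize(values)
mixed = mixed_precision(values, threshold=6.0)

print(f"plain int8 error: {mean_abs_error(values, plain):.5f}")
print(f"mixed error:      {mean_abs_error(values, mixed):.5f}")
```

The mixed variant costs extra storage and needs kernel support for the split representation, which ties back to the hardware caveats listed under When Not To Use.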

