Overview
Production Readiness
0.6
Novelty Score
0.3
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
Inference optimizations can cut memory and latency costs enough to enable local hosting, cheaper cloud inference, or higher throughput. Choose a method based on the hardware you control and the quality you must preserve.
Summary TLDR
This is a focused literature review of practical inference optimizations for large language models. It compares quantization, pruning, knowledge distillation, and architectural/decoding changes. The paper explains where each method saves memory, latency, or cost, lists the main failure modes (quantization outliers, hardware limits, hallucinations copied from a teacher model), and gives practical constraints such as hardware kernel support and the compute needed for compression. It aims to help engineers pick a method by its trade-offs rather than to present new experiments.
Problem Statement
Modern large language models are powerful but costly to run. Teams need practical ways to reduce memory, latency, and inference cost while preserving model quality. The paper surveys techniques (quantization, pruning, distillation, attention/decoding optimizations) and compares their practical trade-offs and deployment constraints.
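As a rough illustration of why memory dominates the cost discussion, the weight footprint of a model scales linearly with bit-width. The sketch below is a back-of-the-envelope estimate only. The 7-billion-parameter size is a hypothetical example, not a figure from the paper, and the estimate ignores activations, the KV cache, and quantization metadata such as scales.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 7B-parameter model at common bit-widths.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(7e9, bits):.1f} GB")
```

Under these assumptions, going from 32-bit to 4-bit weights reduces the weight footprint by a factor of eight, which is the kind of gap that decides whether a model fits on a single device.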
Main Contribution
Taxonomy of inference optimizations covering quantization, pruning, knowledge distillation, and architectural/decoding tricks
Concise practical discussion of pros/cons, resource needs, and deployment constraints for each technique
Collection of representative methods and implementation notes (e.g., LLM.int8, GPTQ, AWQ, ZeroQuant, SparseGPT, vLLM, speculative decoding)
Key Findings
Speculative decoding can cut inference latency substantially by proposing tokens from a smaller model and checking them with the target model.
Unstructured pruning at scale is feasible using recent algorithms.
4-bit quantization often hits a sweet spot between compression and quality on evaluated models.
8-bit mixed-precision approaches reduce memory but do not always speed up inference, because dequantization adds overhead.
Activation quantization is harder than weight quantization and often hurts quality more.
Some PTQ and sparse-quantized methods can compress very large models in hours when run with optimized implementations.
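The speculative decoding finding above can be sketched in its simplest (greedy) form. This is a toy illustration, not any library's API: `target_model` and `draft_model` are hypothetical stand-ins that map a token list to the next token, and the verification loop calls the target once per token, whereas a real system would score all drafted positions in one batched forward pass. The sampling variant of the technique uses a rejection-sampling rule that is not shown here.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain autoregressive greedy decoding, used as the reference."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], float]:
    """Greedy speculative decoding: the draft proposes k tokens, the target checks them."""
    tokens = list(prompt)
    proposed = accepted = 0
    while len(tokens) - len(prompt) < max_new:
        # 1) The cheap draft model proposes k tokens autoregressively.
        ctx = list(tokens)
        drafted = []
        for _ in range(k):
            drafted.append(draft(ctx))
            ctx.append(drafted[-1])
        proposed += k
        # 2) The target verifies the proposals in order.
        for tok in drafted:
            expected = target(tokens)
            if tok == expected:
                tokens.append(tok)        # accepted draft token
                accepted += 1
            else:
                tokens.append(expected)   # first mismatch: take the target's token
                break
        else:
            tokens.append(target(tokens))  # all accepted: one extra target token
    tokens = tokens[:len(prompt) + max_new]
    rate = accepted / proposed if proposed else 0.0
    return tokens, rate


# Hypothetical toy models: the draft agrees with the target most of the time.
def target_model(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 11


def draft_model(ctx: List[int]) -> int:
    guess = target_model(ctx)
    return guess if ctx[-1] % 4 else (guess + 1) % 11


output, acceptance_rate = speculative_decode(target_model, draft_model, [2], 12)
print(output, f"acceptance rate: {acceptance_rate:.2f}")
```

In this greedy form the output matches target-only decoding token for token, so any latency gain comes from how many drafted tokens are accepted per target pass rather than from a change in quality.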
Results
Model FP32 memory example
Speculative decoding latency
Pruning time at scale
Quantization bit targets
Who Should Care
What To Try In 7 Days
Run post-training 8-bit PTQ on a smaller model to measure memory savings and baseline accuracy
Benchmark speculative decoding on a generation pipeline to check whether you see 2x–3x latency gains on your workload
Test weight-only 4-bit PTQ on a representative workload before attempting activation quantization
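Before running a real PTQ toolchain, the effect of bit-width on weight error can be explored with a minimal round-to-nearest sketch. This is an assumption-laden toy: it uses symmetric per-tensor absmax scaling on synthetic Gaussian weights, which is simpler than the methods the paper surveys (GPTQ, AWQ, and others use calibration data and finer-grained scales), and it stores codes as Python integers rather than packed bytes.

```python
import random


def quantize_absmax(weights, bits):
    """Symmetric round-to-nearest quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def roundtrip_error(weights, bits):
    """Mean absolute error after quantizing and dequantizing."""
    codes, scale = quantize_absmax(weights, bits)
    restored = dequantize(codes, scale)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic weights
for bits in (8, 4):
    print(f"{bits}-bit mean abs error: {roundtrip_error(weights, bits):.6f}")
```

A weight-level error like this is only a proxy. The items above still call for measuring task accuracy on a representative workload, since small weight errors do not always translate into small quality changes.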
Optimization Features
Token Efficiency
- sliding-window attention
- attention sinks
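The two token-efficiency ideas above can be combined in a single attention mask. The sketch below is a minimal illustration under assumed parameters (`window` and `num_sinks` are hypothetical names): each query attends causally to the most recent `window` tokens plus the first `num_sinks` tokens of the sequence, which are kept as attention sinks.

```python
def attention_mask(seq_len, window, num_sinks):
    """Return mask[q][k], True when query position q may attend to key position k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q                # never attend to future tokens
            in_window = q - k < window     # recent tokens only
            is_sink = k < num_sinks        # always keep the first tokens
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


# Visualize a small mask: "x" marks an allowed query/key pair.
for row in attention_mask(seq_len=8, window=3, num_sinks=1):
    print("".join("x" if allowed else "." for allowed in row))
```

With this mask, each query attends to at most `window + num_sinks` keys, so the KV cache that must be retained stays bounded as the sequence grows.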
Infra Optimization
- hardware kernel support
- CUTLASS kernel usage
- avoid dequantize overhead
Model Optimization
- quantization
- pruning
- knowledge distillation
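For pruning, the sparsity pattern matters as much as the sparsity level. The sketch below contrasts plain magnitude pruning (unstructured) with a 2-out-of-4 pattern, the semi-structured format that some GPUs accelerate. It is a toy that ranks weights by magnitude only, an assumption made for brevity; methods such as SparseGPT and Wanda use more informed importance criteria.

```python
def prune_unstructured(weights, sparsity):
    """Zero the smallest-magnitude fraction of weights, wherever they are."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def prune_2_of_4(weights):
    """Keep the 2 largest-magnitude weights in every consecutive group of 4."""
    out = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        ranked = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:2])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out


w = [0.9, -0.8, 0.7, 0.6, 0.05, -0.04, 0.03, 0.02]
print(prune_unstructured(w, 0.5))  # zeros cluster wherever the small weights are
print(prune_2_of_4(w))             # exactly two zeros in each group of four
```

Both results are 50% sparse, but only the second has a regular layout. The unstructured result keeps more of the large weights, which is the quality-versus-hardware-compatibility trade-off in miniature.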
System Optimization
- custom GPU kernels
- memory block allocation (paged attention)
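The block-allocation idea behind paged attention can be sketched as a small bookkeeping class. This is an illustrative toy, not vLLM's implementation or API: `PagedKVCache` and its methods are hypothetical names, and only block ownership is tracked, not the key/value tensors themselves.

```python
class PagedKVCache:
    """Toy allocator: each sequence owns a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> [block ids]
        self.lengths = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token and return (block id, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:           # current block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    print(cache.append_token("request-a"))
cache.free("request-a")
print(len(cache.free_blocks))
```

Because blocks are allocated on demand instead of reserving the maximum sequence length up front, memory that a short request never uses stays available to other requests.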
Training Optimization
- quantization-aware training
- distillation training
- LoRA
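For distillation training, the classic objective compares the student's and teacher's output distributions after softening both with a temperature. The sketch below shows that loss for a single token position in pure Python. It assumes the standard forward-KL formulation with the usual temperature-squared scaling, and the logits are made-up values. Individual methods covered by the survey may use different objectives.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]   # hypothetical teacher logits
student = [2.5, 1.5, 0.2]   # hypothetical student logits
print(f"distillation loss: {distillation_loss(student, teacher):.4f}")
```

The loss is zero only when the student reproduces the teacher's distribution exactly, which also explains how a teacher's mistakes can carry over to the student.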
Inference Optimization
- KV cache / paged attention
- windowed attention
- FlashAttention
- speculative decoding
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Survey-only: no new experiments or cross-method head-to-head benchmarks
- Practical claims depend on cited implementations and hardware support
- Coverage varies by topic; some methods lack uniform evaluation guidance
When Not To Use
- When you lack hardware/kernel support for a given quantization format
- When absolute parity with the original model's outputs is required
- When you cannot afford the compute and memory overhead of a distillation training pipeline
Failure Modes
- Quantization can produce outlier-driven errors and quality drops without mixed precision
- Pruning may yield irregular sparsity incompatible with target inference hardware
- Distillation can transfer hallucinations, bias, or undesirable behaviors from teacher to student
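The first failure mode above can be reproduced in a few lines. In the sketch below, one large value forces a coarse quantization scale that rounds every small value to zero, while a mixed-precision variant that keeps the outlier in full precision avoids the problem. This is a simplified illustration on a made-up vector with a hypothetical `threshold` parameter. Real mixed-precision schemes such as LLM.int8() operate on outlier feature dimensions inside matrix multiplications rather than on individual values.

```python
def absmax_roundtrip(values, bits=8):
    """Quantize and dequantize with a single absmax scale."""
    if not values:
        return []
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) * scale for v in values]


def mixed_roundtrip(values, threshold, bits=8):
    """Keep values above the threshold in full precision, quantize the rest."""
    inliers = [v for v in values if abs(v) <= threshold]
    restored = iter(absmax_roundtrip(inliers, bits))
    return [v if abs(v) > threshold else next(restored) for v in values]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


values = [0.01, -0.02, 0.015, 0.03, -0.025, 60.0]  # made-up data, one outlier
plain = absmax_roundtrip(values)
mixed = mixed_roundtrip(values, threshold=6.0)
print("plain 8-bit error:", mean_abs_error(values, plain))
print("mixed error:      ", mean_abs_error(values, mixed))
```

The cost of the mixed approach is that two numeric paths must be handled at inference time, which is one reason such schemes can save memory without improving speed.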
Core Entities
Models
- BERT
- BLOOM
- Llama2
- CodeLlama
- GPT-style models
Metrics
- latency
- memory consumption
- bit-width
- inference throughput
Datasets
- SQuAD
- SST-2
Context Entities
Models
- LLM.int8()
- GPTQ
- AWQ
- OWQ
- SpQR
- ZeroQuant
- ZeroQuantV2
- SmoothQuant
- OmniQuant
- SparseGPT
- LoRA
- LLM-Pruner
- Prune and Tune
- Wanda
- MiniLLM
- LaMini-LM
- vLLM
- Speculative Decoding
- FlashAttention
Metrics
- bits (8,4,3)
- hours-to-compress
- memory-GB
- acceptance-rate (speculative decoding)
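The acceptance-rate metric connects directly to the speedup speculative decoding can deliver. A commonly used estimate, assuming each drafted token is accepted independently with probability `alpha` and that `gamma` tokens are drafted per step, is the geometric sum computed below. Both parameter names and the sample values are illustrative assumptions, and the independence assumption rarely holds exactly in practice.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model pass.

    Sums alpha**i for i = 0..gamma, which equals
    (1 - alpha**(gamma + 1)) / (1 - alpha) when alpha < 1.
    """
    return sum(alpha ** i for i in range(gamma + 1))


for alpha in (0.5, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, gamma=4):.2f} tokens per step")
```

This counts tokens per target pass only. The end-to-end latency gain is smaller because running the draft model also takes time.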

