Overview
Production Readiness
0.8
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
23
Why It Matters For Business
SqueezeLLM cuts model storage and single‑request latency by ~2× while keeping near‑FP16 quality, enabling cheaper and faster on‑prem or cloud inference for generative LLMs.
Summary TLDR
SqueezeLLM is a post‑training quantization method that combines sensitivity‑aware non‑uniform quantization with a dense‑and‑sparse weight decomposition. It compresses LLM weights to ~3 bits with near‑lossless generation quality (e.g., LLaMA‑7B perplexity 7.75 vs FP16 7.08) while cutting memory and speeding single‑batch inference (up to ~2.4× on A6000). The method stores a tiny fraction (≈0.45%) of weights in full precision as sparse outliers/sensitive values and quantizes the rest via weighted k‑means centroids guided by Fisher information.
Problem Statement
Generative LLM inference is memory‑bandwidth bound: loading weights limits single‑batch latency. Uniform low‑bit quantization either hurts accuracy or fails to reduce end‑to‑end latency. The paper asks: can we quantize weights to ultra‑low bits (3–4 bit) with minimal quality loss and real latency gains on GPUs?
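The bandwidth argument above can be made concrete with a back-of-the-envelope calculation. This is a minimal sketch, not a measurement: the parameter count (a round 7e9) and the memory bandwidth (768 GB/s, the nominal figure for an A6000) are assumptions, and the helper names are hypothetical. It assumes every weight is read once per generated token and ignores activations, the KV cache, dequantization work, and kernel overheads, so measured speedups will be lower than the ideal ratio it prints.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights at a given precision."""
    return num_params * bits_per_weight / 8


def bandwidth_bound_latency_s(num_params, bits_per_weight, bandwidth_bytes_per_s):
    """Lower bound on per-token latency if every weight is read once per token."""
    return weight_bytes(num_params, bits_per_weight) / bandwidth_bytes_per_s


NUM_PARAMS = 7e9    # assumption: round figure for a "7B" model
BANDWIDTH = 768e9   # assumption: nominal A6000 memory bandwidth in bytes/s

fp16 = bandwidth_bound_latency_s(NUM_PARAMS, 16, BANDWIDTH)
int3 = bandwidth_bound_latency_s(NUM_PARAMS, 3, BANDWIDTH)

print(f"FP16 : {weight_bytes(NUM_PARAMS, 16) / 1e9:.2f} GB, >= {fp16 * 1e3:.1f} ms/token")
print(f"3-bit: {weight_bytes(NUM_PARAMS, 3) / 1e9:.2f} GB, >= {int3 * 1e3:.1f} ms/token")
print(f"ideal speedup from weight traffic alone: {fp16 / int3:.2f}x")
```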
Main Contribution
Sensitivity‑based non‑uniform quantization: weighted k‑means using Fisher info to place quantization centroids near high‑impact weights.
Dense‑and‑Sparse decomposition: extract a tiny fraction (~0.45%) of outlier and sensitive weights and keep them in sparse FP16 storage, which narrows the value range the dense part has to cover.
Practical kernels: LUT‑based CUDA kernels and balanced CSR sparse kernels to dequantize and run mixed dense+sparse matvec efficiently.
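The sensitivity‑based quantization in the first contribution can be sketched as a one‑dimensional weighted k‑means. This is an illustrative toy, not the paper's implementation: it assumes the per‑weight sensitivity scores (for example, Fisher‑based ones) are already computed and passed in, and the function name, the quantile initialisation, and the fixed iteration count are all choices made for this sketch.

```python
def weighted_kmeans_1d(values, sensitivities, k, iters=25):
    """Lloyd's algorithm in 1D where each value carries an importance weight.

    Returns (centroids, codes): codes[i] is the centroid index for values[i].
    """
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: initialise centroids at evenly spaced quantiles.
    centroids = [ordered[min(n - 1, int((i + 0.5) * n / k))] for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda j: (v - centroids[j]) ** 2)

    for _ in range(iters):
        weighted_sum = [0.0] * k
        weight_total = [0.0] * k
        for v, s in zip(values, sensitivities):
            j = nearest(v)
            weighted_sum[j] += s * v
            weight_total[j] += s
        # Each centroid moves to the sensitivity-weighted mean of its cluster;
        # empty clusters keep their previous position.
        centroids = [
            weighted_sum[j] / weight_total[j] if weight_total[j] > 0 else centroids[j]
            for j in range(k)
        ]

    codes = [nearest(v) for v in values]
    return centroids, codes


weights = [0.0, 0.1, 1.0, 1.1]
uniform_c, _ = weighted_kmeans_1d(weights, [1, 1, 1, 1], k=2)
skewed_c, codes = weighted_kmeans_1d(weights, [1, 1, 1, 9], k=2)
print(uniform_c, skewed_c, codes)
```

With uniform sensitivities the second centroid sits midway between 1.0 and 1.1; giving the last weight a larger sensitivity pulls the centroid toward it, which is the effect the method relies on.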
Key Findings
3‑bit dense SqueezeLLM on LLaMA‑7B achieves perplexity 7.75 on C4 versus FP16 7.08 and GPTQ 9.55.
Keeping 0.45% of weights as FP16 sparse outliers reduces perplexity further from 7.75 to 7.56 on LLaMA‑7B (3‑bit).
On an A6000 GPU, 3‑bit SqueezeLLM yields up to 2.4× single‑batch speedup vs FP16 (LLaMA‑7B: 3.2s → 1.5s for 128 tokens).
Sensitivity weighting is essential: non‑uniform k‑means without sensitivity gives PPL 18.08 vs sensitivity‑based 7.75 (3‑bit LLaMA‑7B).
Calibration data needs are small: ~10 examples often suffice to compute Fisher information for effective quantization.
Results
Perplexity (C4)
Perplexity (C4) with sparsity
Latency (128 tokens) on A6000
Accuracy
Who Should Care
What To Try In 7 Days
Run SqueezeLLM quantization on a small model (7B) with 10–100 calibration samples and compare perplexity to your current PTQ.
Measure single‑batch latency and peak GPU memory before/after; target A6000/A5000 for similar gains.
Test adding 0.05–0.45% sparse FP16 extraction to trade small memory for improved accuracy.
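Before running the sparsity experiment, it can help to estimate how much storage the extracted FP16 values add. The sketch below is a rough estimate under stated assumptions, not the paper's accounting: each extracted weight is assumed to cost a 16‑bit value plus a 32‑bit column index, the dense matrix is assumed to keep its full shape, and row pointers and lookup tables are ignored. The function name is hypothetical.

```python
def average_bits_per_weight(dense_bits, sparse_fraction, value_bits=16, index_bits=32):
    """Approximate storage cost per weight for a dense + sparse layout.

    Assumptions: the dense part keeps every position at `dense_bits`, and each
    extracted weight additionally stores an FP16 value and one column index.
    """
    return dense_bits + sparse_fraction * (value_bits + index_bits)


for fraction in (0.0, 0.0005, 0.0045):
    bits = average_bits_per_weight(3, fraction)
    print(f"sparse fraction {fraction:.2%}: ~{bits:.3f} bits/weight")
```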
Optimization Features
Infra Optimization
- focus on reducing memory bandwidth (weight‑only quantization)
Model Optimization
- sensitivity‑based non‑uniform quantization (weighted k‑means)
- dense‑and‑sparse weight decomposition
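The dense‑and‑sparse decomposition can be illustrated with a small magnitude‑based split. This is a simplified sketch: it extracts only the largest‑magnitude entries, whereas the method described here also extracts sensitive values, and the function name and the plain CSR triple `(data, indices, indptr)` are assumptions made for illustration.

```python
def dense_sparse_split(matrix, sparse_fraction):
    """Move the largest-magnitude entries of `matrix` into a CSR sparse part.

    Returns (dense, (data, indices, indptr)); extracted positions are zeroed
    in `dense`, so dense + sparse reproduces the original matrix.
    """
    positions = [(r, c) for r, row in enumerate(matrix) for c in range(len(row))]
    n_sparse = int(len(positions) * sparse_fraction)
    positions.sort(key=lambda rc: abs(matrix[rc[0]][rc[1]]), reverse=True)
    outliers = set(positions[:n_sparse])

    dense = [
        [0.0 if (r, c) in outliers else x for c, x in enumerate(row)]
        for r, row in enumerate(matrix)
    ]

    data, indices, indptr = [], [], [0]
    for r, row in enumerate(matrix):
        for c, x in enumerate(row):
            if (r, c) in outliers:
                data.append(x)      # kept at full precision
                indices.append(c)   # column index of the outlier
        indptr.append(len(data))    # row r spans data[indptr[r]:indptr[r + 1]]
    return dense, (data, indices, indptr)


W = [[0.1, -5.0, 0.2],
     [0.3, 0.0, 4.0]]
dense, (data, indices, indptr) = dense_sparse_split(W, sparse_fraction=0.34)
dense_range = max(abs(x) for row in dense for x in row)
print(dense, data, indices, indptr, dense_range)
```

In this toy matrix, removing two outliers shrinks the largest dense magnitude from 5.0 to 0.3, so the remaining values can be covered by far fewer quantization levels.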
System Optimization
- overlapped dense + sparse matvec
- channel‑wise lookup tables
Training Optimization
- post‑training quantization (no retraining)
Inference Optimization
- LUT dequantization kernels (3/4‑bit)
- balanced CSR sparse matvec kernels
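The two kernel types listed above combine into a single matrix‑vector product: dequantize the dense codes through a lookup table, then add the sparse contribution. The sketch below shows only the arithmetic, in plain Python. It assumes one lookup table per output row and unpacked integer codes; real kernels pack 3/4‑bit codes, balance work across threads, and run on the GPU, none of which is modelled here. All function names are hypothetical.

```python
def lut_matvec(codes, luts, x):
    """Dense part: codes[r][c] indexes luts[r], the lookup table for row r."""
    return [
        sum(luts[r][code] * xc for code, xc in zip(row, x))
        for r, row in enumerate(codes)
    ]


def csr_matvec(data, indices, indptr, x):
    """Sparse part: standard CSR matrix-vector product."""
    y = [0.0] * (len(indptr) - 1)
    for r in range(len(y)):
        for k in range(indptr[r], indptr[r + 1]):
            y[r] += data[k] * x[indices[k]]
    return y


def dense_sparse_matvec(codes, luts, csr, x):
    """y = dequantize(codes) @ x + sparse @ x."""
    dense_y = lut_matvec(codes, luts, x)
    sparse_y = csr_matvec(*csr, x)
    return [d + s for d, s in zip(dense_y, sparse_y)]


codes = [[0, 1], [1, 0]]              # 1-bit codes for brevity
luts = [[-1.0, 1.0], [0.5, 2.0]]      # one table of centroids per row
csr = ([10.0], [0], [0, 0, 1])        # a single FP16 outlier at row 1, column 0
x = [2.0, 3.0]
y = dense_sparse_matvec(codes, luts, csr, x)
print(y)
```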
Reproducibility
Data Urls
- C4
- WikiText2
- MMLU
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments focus on decoder/generation tasks (single‑batch); encoder and encoder‑decoder use cases are not fully evaluated.
- Hardware modeling uses a roofline/simulation assumption; real gains vary by GPU and kernel stack.
- Computing Fisher information and running k‑means clustering adds a one‑time quantization cost (from minutes up to ~80 min for 65B).
When Not To Use
- If your workload is compute‑bound or large‑batch inference where arithmetic intensity is high, weight‑only quantization gives less benefit.
- When you cannot tolerate any quality change: even near‑lossless results show small perplexity/accuracy gaps.
- If you lack the ability to run custom CUDA kernels on target hardware.
Failure Modes
- Excessive sparsity can increase runtime and memory because of irregular sparse kernels, so the sparse fraction needs careful tuning.
- 2‑bit dense quantization without outlier handling can catastrophically degrade perplexity; a tiny FP16 sparse fraction is necessary.
- Activation ordering/grouping (used by other methods) can cause memory access patterns that hurt latency if permutation costs are high.
Core Entities
Models
- LLaMA
- LLaMA2
- OPT
- Vicuna
Metrics
- perplexity
- latency (s)
- peak GPU memory (GB)
- Accuracy
Datasets
- C4
- WikiText2
- MMLU
Benchmarks
- MMLU
- Vicuna evaluation (GPT‑4 ranking)
- Perplexity on C4 and WikiText2

