Overview
The paper provides concrete numbers across multiple models, ablations that isolate key components, and an implementation with kernels and latency measurements; this supports practical adoption for memory‑bound LLM inference.
Citations: 23
Evidence Strength: 0.90
Confidence: 0.90
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 80%
Novelty: 70%
Why It Matters For Business
SqueezeLLM cuts model storage and single‑request latency by ~2× while keeping near‑FP16 quality, enabling cheaper and faster on‑prem or cloud inference for generative LLMs.
Summary TLDR
SqueezeLLM is a post‑training quantization method that combines sensitivity‑aware non‑uniform quantization with a dense‑and‑sparse weight decomposition. It compresses LLM weights to ~3 bits with near‑lossless generation quality (e.g., LLaMA‑7B perplexity 7.75 vs FP16 7.08) while cutting memory and speeding single‑batch inference (up to ~2.4× on A6000). The method stores a tiny fraction (≈0.45%) of weights in full precision as sparse outliers/sensitive values and quantizes the rest via weighted k‑means centroids guided by Fisher information.
Problem Statement
Generative LLM inference is memory‑bandwidth bound: loading weights limits single‑batch latency. Uniform low‑bit quantization either hurts accuracy or fails to reduce end‑to‑end latency. The paper asks: can we quantize weights to ultra‑low bits (3–4 bit) with minimal quality loss and real latency gains on GPUs?
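To make the memory-bandwidth argument concrete, here is a back-of-the-envelope sketch that treats per-token latency as the time needed to read every weight once from GPU memory. The parameter count (7e9) and the bandwidth figure (768 GB/s) are illustrative assumptions, not numbers taken from the paper; substitute the values for your own model and GPU.

```python
def weight_load_time_ms(n_params: float, bits_per_weight: float,
                        bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency if each weight is read once from DRAM."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / (bandwidth_gb_s * 1e9) * 1e3


# Assumed values for illustration only.
N_PARAMS = 7e9      # roughly a 7B-parameter model
BANDWIDTH = 768.0   # GB/s; replace with your GPU's memory bandwidth

fp16_ms = weight_load_time_ms(N_PARAMS, 16, BANDWIDTH)
q3_ms = weight_load_time_ms(N_PARAMS, 3, BANDWIDTH)
ideal_speedup = fp16_ms / q3_ms  # equals 16 / 3 under this simple model

print(f"FP16: {fp16_ms:.2f} ms, 3-bit: {q3_ms:.2f} ms, "
      f"ideal speedup: {ideal_speedup:.2f}x")
```

This is an upper bound on the achievable gain. The speedups reported in this summary (up to ~2.4×) are well below it, so the model should be read as an explanation of why fewer bits can help, not as a latency prediction.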
Main Contribution
Sensitivity‑based non‑uniform quantization: weighted k‑means that uses Fisher information to place quantization centroids near high‑impact weights.
Dense‑and‑Sparse decomposition: extract a tiny fraction of outlier and sensitive weights (~0.45%) and keep them in FP16 sparse storage, which shrinks the value range the dense part has to cover.
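The first contribution can be illustrated with a minimal sketch of weighted 1‑D k‑means, where each weight carries an importance score that pulls centroids toward it. This is not the authors' implementation: the function names are hypothetical, the random `sensitivity` values are a stand-in for a Fisher-information estimate, and a 3‑bit code is modeled simply as 2**3 = 8 centroids.

```python
import random


def weighted_kmeans_1d(values, weights, k, iters=25, seed=0):
    """Lloyd's algorithm in 1-D where every point has an importance weight."""
    rng = random.Random(seed)
    centroids = sorted(rng.sample(values, k))
    for _ in range(iters):
        sums = [0.0] * k
        mass = [0.0] * k
        for v, w in zip(values, weights):
            # Assign the point to its nearest centroid.
            j = min(range(k), key=lambda c: (v - centroids[c]) ** 2)
            sums[j] += w * v
            mass[j] += w
        # Move each centroid to the importance-weighted mean of its points;
        # keep the old position if no weight was assigned to it.
        centroids = [sums[j] / mass[j] if mass[j] > 0 else centroids[j]
                     for j in range(k)]
    return sorted(centroids)


def quantize(values, centroids):
    """Return, for each value, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: (v - centroids[c]) ** 2)
            for v in values]


rng = random.Random(1)
toy_weights = [rng.gauss(0.0, 1.0) for _ in range(512)]
# Hypothetical sensitivities; a real run would derive these from gradients.
sensitivity = [rng.random() ** 4 for _ in toy_weights]

centroids = weighted_kmeans_1d(toy_weights, sensitivity, k=8)  # 3 bits
codes = quantize(toy_weights, centroids)
dequant = [centroids[c] for c in codes]
```

Only `codes` (3 bits each) and the 8‑entry `centroids` table would need to be stored; `dequant` is what a kernel would reconstruct at inference time.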
Key Findings
3‑bit dense SqueezeLLM on LLaMA‑7B achieves perplexity 7.75 on C4 versus FP16 7.08 and GPTQ 9.55.
Keeping 0.45% of weights as FP16 sparse outliers reduces perplexity further from 7.75 to 7.56 on LLaMA‑7B (3‑bit).
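The outlier extraction behind the second finding can be sketched as follows. This is a simplified illustration under stated assumptions: it selects entries by magnitude only (the method described here also extracts sensitive weights), the function name `dense_sparse_split` is hypothetical, and the toy weights with three planted outliers are synthetic.

```python
import random


def dense_sparse_split(weights, sparse_fraction):
    """Move the largest-magnitude entries to a sparse dict, zeroing them in the dense part."""
    n_sparse = round(len(weights) * sparse_fraction)
    by_magnitude = sorted(range(len(weights)),
                          key=lambda i: abs(weights[i]), reverse=True)
    sparse_idx = set(by_magnitude[:n_sparse])
    sparse = {i: weights[i] for i in sparse_idx}          # kept in full precision
    dense = [0.0 if i in sparse_idx else w                # to be quantized
             for i, w in enumerate(weights)]
    return dense, sparse


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]
# Plant a few synthetic outliers to mimic a heavy-tailed weight distribution.
weights[7], weights[1234], weights[9999] = 1.5, -2.0, 1.0

dense, sparse = dense_sparse_split(weights, sparse_fraction=0.0045)

full_range = max(weights) - min(weights)
dense_range = max(dense) - min(dense)
print(f"sparse entries: {len(sparse)}, "
      f"range before: {full_range:.3f}, after: {dense_range:.3f}")
```

With the outliers removed, the same number of centroids covers a much narrower interval, which is the intuition for why a small sparse fraction can lower perplexity at a fixed bit width.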
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Perplexity (C4) | LLaMA‑7B 3‑bit SqueezeLLM PPL 7.75 (avg bit 3.02) | FP16 PPL 7.08; GPTQ PPL 9.55 | ≈ +0.67 vs FP16; −1.80 vs GPTQ | C4 | Table 1 (LLaMA‑7B 3‑bit) | Table 1 |
| Perplexity (C4) with sparsity | LLaMA‑7B 3‑bit SqueezeLLM (0.45% sparsity) PPL 7.56 | SqueezeLLM dense PPL 7.75 | −0.19 | C4 | Sec. 4.2; Table 1 | Table 1 |
What To Try In 7 Days
Run SqueezeLLM quantization on a small model (7B) with 10–100 calibration samples and compare perplexity to your current PTQ.
Measure single‑batch latency and peak GPU memory before/after; target A6000/A5000 for similar gains.
Test adding 0.05–0.45% sparse FP16 extraction to trade small memory for improved accuracy.
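Before running the sparsity experiment, it can help to estimate how much storage the sparse FP16 fraction adds. The sketch below assumes each sparse entry costs a 16‑bit value plus a 32‑bit index and ignores row pointers and codebook tables; these layout details are assumptions made for illustration, not figures from the paper, so the results will not match the reported average bit widths exactly.

```python
def average_bits(dense_bits: float, sparse_fraction: float,
                 value_bits: int = 16, index_bits: int = 32) -> float:
    """Average bits per weight when a fraction of entries is also stored sparsely."""
    return dense_bits + sparse_fraction * (value_bits + index_bits)


def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Total weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # assumed: roughly a 7B-parameter model

avg_low = average_bits(3, 0.0005)    # 0.05% sparse
avg_high = average_bits(3, 0.0045)   # 0.45% sparse

for label, bits in (("0.05%", avg_low), ("0.45%", avg_high)):
    print(f"{label} sparse: {bits:.3f} bits/weight, "
          f"{model_size_gb(N_PARAMS, bits):.2f} GB")
```

Under these assumptions the extra storage stays within a few percent of the dense 3‑bit size across the suggested range, so the main cost to measure is sparse-kernel runtime, not memory.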
Risks & Boundaries
Limitations
Experiments focus on decoder/generation tasks (single‑batch); encoder and encoder‑decoder use is not fully evaluated.
Hardware modeling uses a roofline/simulation assumption; real gains vary by GPU and kernel stack.
When Not To Use
If your workload is compute‑bound or large‑batch inference where arithmetic intensity is high, weight‑only quantization gives less benefit.
When you cannot tolerate any quality change: even near‑lossless results show small perplexity/accuracy gaps.
Failure Modes
Excessive sparsity can increase runtime and memory because sparse kernels access memory irregularly, so the sparse fraction needs careful tuning.
2‑bit dense quantization without outlier handling can catastrophically degrade perplexity; a tiny FP16 sparse fraction is necessary.

