Lossless 3-bit LLM quantization with dense-and-sparse weights

June 13, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides concrete numbers across multiple models, ablations that isolate key components, and an implementation with kernels and latency measurements; this supports practical adoption for memory‑bound LLM inference.

Citations: 23

Evidence Strength: 0.90

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 70%

Authors

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

Why It Matters For Business

SqueezeLLM cuts model storage and single‑request latency by ~2× while keeping near‑FP16 quality, enabling cheaper and faster on‑prem or cloud inference for generative LLMs.

Summary TLDR

SqueezeLLM is a post‑training quantization method that combines sensitivity‑aware non‑uniform quantization with a dense‑and‑sparse weight decomposition. It compresses LLM weights to ~3 bits with near‑lossless generation quality (e.g., LLaMA‑7B perplexity 7.75 vs FP16 7.08) while cutting memory and speeding single‑batch inference (up to ~2.4× on A6000). The method stores a tiny fraction (≈0.45%) of weights in full precision as sparse outliers/sensitive values and quantizes the rest via weighted k‑means centroids guided by Fisher information.

Problem Statement

Generative LLM inference is memory‑bandwidth bound: loading weights limits single‑batch latency. Uniform low‑bit quantization either hurts accuracy or fails to reduce end‑to‑end latency. The paper asks: can we quantize weights to ultra‑low bits (3–4 bit) with minimal quality loss and real latency gains on GPUs?
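The memory-bound argument can be made concrete with a back-of-envelope bound: at batch size 1, every generated token has to read all weights once, so per-token time cannot be lower than weight bytes divided by memory bandwidth. The sketch below is only illustrative; the 7e9 parameter count and the 768 GB/s bandwidth (the nominal A6000 figure) are assumptions, not measurements from the paper.

```python
def min_token_latency_ms(n_params, bits_per_weight, bandwidth_gb_s):
    """Lower bound on per-token latency if loading weights were the only cost."""
    weight_bytes = n_params * bits_per_weight / 8
    seconds = weight_bytes / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


N_PARAMS = 7e9          # assumption: a "7B" model
BANDWIDTH_GB_S = 768.0  # assumption: nominal A6000 memory bandwidth

fp16_ms = min_token_latency_ms(N_PARAMS, 16, BANDWIDTH_GB_S)
q3_ms = min_token_latency_ms(N_PARAMS, 3, BANDWIDTH_GB_S)

print(f"FP16 bound:    {fp16_ms:.1f} ms/token")
print(f"3-bit bound:   {q3_ms:.1f} ms/token")
print(f"ideal speedup: {fp16_ms / q3_ms:.2f}x")
```

The ideal ratio is simply 16/3. Measured speedups are expected to be smaller than this bound because dequantization, activations, and other non-weight traffic also cost time.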

Main Contribution

Sensitivity‑based non‑uniform quantization: weighted k‑means using Fisher info to place quantization centroids near high‑impact weights.

Dense‑and‑Sparse decomposition: extract a tiny fraction of outlier and sensitive weights (~0.45%) and keep them in FP16 sparse storage, which shrinks the value range the dense low‑bit part has to cover.
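The sensitivity-weighted clustering idea can be sketched as a 1-D weighted k-means, where each weight pulls its centroid in proportion to a sensitivity score. This is a minimal pure-Python sketch, not the paper's implementation: the function names are hypothetical, the quantile initialisation is an arbitrary choice, and the `fisher` values are random stand-ins for a real diagonal Fisher estimate.

```python
import random


def weighted_kmeans_1d(weights, sensitivities, k, iters=25):
    """Lloyd's algorithm in 1-D; each point counts in proportion to its sensitivity."""
    ordered = sorted(weights)
    # Initialise centroids at evenly spaced quantiles of the weight values.
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        mass = [0.0] * k
        for w, s in zip(weights, sensitivities):
            j = min(range(k), key=lambda c: (w - centroids[c]) ** 2)
            sums[j] += s * w
            mass[j] += s
        # Sensitivity-weighted mean; keep the old centroid if a cluster is empty.
        centroids = [sums[j] / mass[j] if mass[j] > 0 else centroids[j]
                     for j in range(k)]
    assign = [min(range(k), key=lambda c: (w - centroids[c]) ** 2) for w in weights]
    return centroids, assign


def weighted_error(weights, sensitivities, centroids, assign):
    """Sensitivity-weighted squared quantization error."""
    return sum(s * (w - centroids[a]) ** 2
               for w, s, a in zip(weights, sensitivities, assign))


random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(2000)]
# Hypothetical sensitivities standing in for diagonal Fisher values.
fisher = [random.gauss(0.0, 1.0) ** 2 + 1e-6 for _ in w]

k = 8  # 3 bits -> 2**3 centroids
centroids, assign = weighted_kmeans_1d(w, fisher, k)
print("centroids:", [round(c, 3) for c in sorted(centroids)])
print("weighted error:", weighted_error(w, fisher, centroids, assign))
```

Each weight is then stored as a 3-bit index into the centroid list. With uniform sensitivities the same routine reduces to ordinary k-means.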

Key Findings

3‑bit dense SqueezeLLM on LLaMA‑7B achieves perplexity 7.75 on C4 versus FP16 7.08 and GPTQ 9.55.

Numbers: LLaMA‑7B (3‑bit): SqueezeLLM PPL 7.75, FP16 7.08, GPTQ 9.55

Practical Use: You can quantize a 7B model to ~3 bits with a perplexity increase of about 0.67 on C4 versus FP16 (7.75 vs 7.08) and substantially lower perplexity than GPTQ (9.55); try SqueezeLLM instead of GPTQ when accuracy matters.

Evidence Ref: Table 1

Keeping 0.45% of weights as FP16 sparse outliers reduces perplexity further from 7.75 to 7.56 on LLaMA‑7B (3‑bit).

Numbers: PPL drop 7.75 → 7.56 (−0.19)

Practical Use: Extracting a tiny fraction of weights (≤0.5%) as sparse FP16 often recovers low‑bit quality at negligible memory and speed cost. Use this sparsity knob when 3‑bit accuracy still lags.

Evidence Ref: Sec. 4.2; Table 1
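The dense-and-sparse split described in this finding can be sketched as follows. This is a simplified illustration under stated assumptions: it selects entries by magnitude only (the paper also extracts sensitive values), it leaves the dense part unquantized to keep the example short, and the function names and the CSR-style tuple layout are hypothetical.

```python
import random


def split_dense_sparse(matrix, sparse_fraction):
    """Move the largest-magnitude entries into a CSR-style sparse part."""
    flat = sorted((abs(v) for row in matrix for v in row), reverse=True)
    n_sparse = max(1, int(len(flat) * sparse_fraction))
    threshold = flat[n_sparse - 1]
    dense, values, col_idx, row_ptr = [], [], [], [0]
    for row in matrix:
        dense_row = []
        for j, v in enumerate(row):
            if abs(v) >= threshold:
                values.append(v)       # kept at full precision
                col_idx.append(j)
                dense_row.append(0.0)  # removed from the dense part
            else:
                dense_row.append(v)
        dense.append(dense_row)
        row_ptr.append(len(values))
    return dense, (values, col_idx, row_ptr)


def dense_sparse_matvec(dense, sparse, x):
    """y = D @ x + S @ x, with S stored as (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = sparse
    out = []
    for i, row in enumerate(dense):
        acc = sum(w * xj for w, xj in zip(row, x))
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[p] * x[col_idx[p]]
        out.append(acc)
    return out


def max_abs(matrix):
    return max(abs(v) for row in matrix for v in row)


random.seed(0)
rows, cols = 32, 32
W = [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]
W[3][7] = 0.9     # injected outliers, for illustration only
W[20][1] = -1.1

dense, sparse = split_dense_sparse(W, sparse_fraction=0.0045)
x = [random.gauss(0.0, 1.0) for _ in range(cols)]
reference = [sum(w * xj for w, xj in zip(row, x)) for row in W]
combined = dense_sparse_matvec(dense, sparse, x)

print("original range:", max_abs(W))
print("dense range:   ", max_abs(dense))
print("sparse entries:", len(sparse[0]))
```

Removing a handful of outliers narrows the range the low-bit dense part must represent, which is why the few quantization levels available at 3 bits can be spent on the bulk of the distribution.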

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Perplexity (C4) | LLaMA‑7B 3‑bit SqueezeLLM PPL 7.75 (avg bit 3.02) | FP16 PPL 7.08 | ≈ +0.67 vs FP16; −1.80 vs GPTQ | C4 | Table 1 (LLaMA‑7B 3‑bit) | Table 1
Perplexity (C4) with sparsity | LLaMA‑7B 3‑bit SqueezeLLM (0.45% sparsity) PPL 7.56 | SqueezeLLM dense PPL 7.75 | −0.19 | C4 | Sec. 4.2; Table 1 | Table 1

What To Try In 7 Days

Run SqueezeLLM quantization on a small model (7B) with 10–100 calibration samples and compare perplexity to your current PTQ.

Measure single‑batch latency and peak GPU memory before/after; target A6000/A5000 for similar gains.

Test adding 0.05–0.45% sparse FP16 extraction to trade small memory for improved accuracy.
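For the perplexity comparison in the first step, the metric itself is just the exponential of the mean negative log-likelihood per token. A minimal sketch, assuming you can already obtain per-token natural-log probabilities from each model on the same evaluation text; the probability values below are made up for illustration.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# Hypothetical per-token probabilities from two model variants on the same text.
baseline_lp = [math.log(p) for p in (0.25, 0.10, 0.40, 0.05)]
quantized_lp = [math.log(p) for p in (0.22, 0.09, 0.38, 0.05)]

ppl_base = perplexity(baseline_lp)
ppl_quant = perplexity(quantized_lp)
print(f"baseline PPL:  {ppl_base:.3f}")
print(f"quantized PPL: {ppl_quant:.3f}")
print(f"delta:         {ppl_quant - ppl_base:+.3f}")
```

Compare models only on identical tokenization, context length, and dataset split; otherwise the deltas are not comparable with the figures reported above.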

Optimization Features

Infra Optimization: memory bandwidth reduction focus (weight‑only quantization)
Model Optimization: sensitivity‑based non‑uniform quantization (weighted k‑means); dense‑and‑sparse weight decomposition
System Optimization: overlapped dense + sparse matvec; channel‑wise lookup tables
Training Optimization: post‑training quantization (no retraining)
Inference Optimization: LUT dequantization kernels (3/4‑bit); balanced CSR sparse matvec kernels
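The lookup-table dequantization listed above can be illustrated in a few lines: each weight is stored as a 3-bit code, and each output channel has its own 8-entry table of centroid values. The sketch below is a plain-Python stand-in for what a GPU kernel would do; packing a whole row into one Python integer (rather than fixed 32-bit words) and all function names are simplifying assumptions.

```python
def pack_3bit(codes):
    """Pack 3-bit codes (0..7) into one integer, least significant code first."""
    packed = 0
    for i, c in enumerate(codes):
        if not 0 <= c < 8:
            raise ValueError("code out of 3-bit range")
        packed |= c << (3 * i)
    return packed


def unpack_3bit(packed, count):
    """Recover `count` 3-bit codes from a packed integer."""
    return [(packed >> (3 * i)) & 0b111 for i in range(count)]


def lut_matvec(packed_rows, luts, n_cols, x):
    """y[i] = sum_j LUT_i[code_ij] * x[j], with one 8-entry table per output channel."""
    y = []
    for packed, lut in zip(packed_rows, luts):
        codes = unpack_3bit(packed, n_cols)
        y.append(sum(lut[c] * xj for c, xj in zip(codes, x)))
    return y


codes = [[0, 7, 3, 3], [1, 2, 4, 5]]
luts = [
    [-1.0, -0.5, -0.2, 0.0, 0.1, 0.3, 0.6, 1.2],  # channel 0 centroids
    [-2.0, -1.0, -0.5, -0.1, 0.1, 0.5, 1.0, 2.0],  # channel 1 centroids
]
packed_rows = [pack_3bit(row) for row in codes]
x = [1.0, 2.0, 3.0, 4.0]
y = lut_matvec(packed_rows, luts, n_cols=4, x=x)
print(y)
```

Because the tables are non-uniform, dequantization is a table lookup rather than a scale-and-shift, which is what allows the centroids to sit wherever the clustering step placed them.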

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

C4, WikiText2, MMLU

Risks & Boundaries

Limitations

Experiments focus on decoder/generation tasks (single‑batch); encoder and encoder‑decoder use cases are not fully evaluated.

Hardware modeling uses a roofline/simulation assumption; real gains vary by GPU and kernel stack.

When Not To Use

If your workload is compute‑bound or large‑batch inference where arithmetic intensity is high, weight‑only quantization gives less benefit.
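A rough roofline check helps decide whether a workload falls into this category. For a batched linear layer, each weight contributes 2 FLOPs per batch item, so arithmetic intensity relative to weight traffic grows linearly with batch size. The sketch below ignores activation traffic, and the peak throughput and bandwidth values are hypothetical placeholders rather than specifications of any particular GPU.

```python
def arithmetic_intensity(batch_size, bits_per_weight):
    """FLOPs per byte of weight traffic for a (batch x d_in) @ (d_in x d_out) product.

    Activation traffic is ignored, so the layer dimensions cancel out.
    """
    flops_per_weight = 2 * batch_size
    bytes_per_weight = bits_per_weight / 8
    return flops_per_weight / bytes_per_weight


def is_memory_bound(batch_size, bits_per_weight, peak_flops, bandwidth_bytes_s):
    """True when intensity is below the roofline ridge point."""
    ridge = peak_flops / bandwidth_bytes_s
    return arithmetic_intensity(batch_size, bits_per_weight) < ridge


PEAK_FLOPS = 100e12  # hypothetical peak throughput, FLOP/s
BANDWIDTH = 768e9    # hypothetical memory bandwidth, bytes/s

for batch in (1, 16, 256):
    bound = is_memory_bound(batch, 16, PEAK_FLOPS, BANDWIDTH)
    label = "memory-bound" if bound else "compute-bound"
    print(f"batch {batch:>3}: intensity {arithmetic_intensity(batch, 16):.0f} FLOP/byte, {label}")
```

Once a workload sits on the compute-bound side of the ridge, shrinking the weights no longer shortens the critical path, which is the situation this caveat describes.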

When you cannot tolerate any quality change: even near‑lossless results show small perplexity/accuracy gaps.

Failure Modes

Excessive sparsity can increase runtime and memory due to irregular sparse kernels—careful tuning required.

2‑bit dense quantization without outlier handling can catastrophically degrade perplexity; a tiny FP16 sparse fraction is necessary at that bit width.
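The memory side of the sparsity trade-off can be estimated with simple arithmetic: every extracted weight costs its FP16 value plus an index, on top of the dense code that is still stored for that position. The sketch below assumes a 16-bit value and a 32-bit column index per sparse entry and ignores row-pointer overhead; these storage widths are assumptions, while the 3.02 dense average comes from the results table above.

```python
def effective_bits(dense_bits, sparse_fraction, value_bits=16, index_bits=32):
    """Average bits per weight when a fraction of weights is also stored sparsely."""
    return dense_bits + sparse_fraction * (value_bits + index_bits)


DENSE_BITS = 3.02  # average bits of the dense 3-bit model, from the results table

for fraction in (0.0005, 0.0045, 0.05):
    bits = effective_bits(DENSE_BITS, fraction)
    print(f"sparsity {fraction:.2%}: ~{bits:.3f} bits/weight")
```

Under these assumptions, 0.45% sparsity adds roughly 0.22 bits per weight, whereas 5% would add 2.4 bits and erase much of the benefit of going to 3 bits, before even counting the runtime cost of irregular sparse kernels.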

Core Entities

Models

LLaMA, LLaMA2, OPT, Vicuna

Metrics

perplexity, latency (s), peak GPU memory (GB), accuracy

Datasets

C4, WikiText2, MMLU

Benchmarks

MMLU, Vicuna evaluation (GPT‑4 ranking), perplexity on C4 and WikiText2