Lossless 3-bit LLM quantization with dense-and-sparse weights

June 13, 2023 · 8 min

Overview

Production Readiness

0.8

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

23

Authors

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

Links

Abstract / PDF

Why It Matters For Business

SqueezeLLM cuts model storage and single‑request latency by ~2× while keeping near‑FP16 quality, enabling cheaper and faster on‑prem or cloud inference for generative LLMs.

Summary TLDR

SqueezeLLM is a post‑training quantization method that combines sensitivity‑aware non‑uniform quantization with a dense‑and‑sparse weight decomposition. It compresses LLM weights to ~3 bits with near‑lossless generation quality (e.g., LLaMA‑7B perplexity 7.75 vs FP16 7.08) while cutting memory and speeding single‑batch inference (up to ~2.4× on A6000). The method stores a tiny fraction (≈0.45%) of weights in full precision as sparse outliers/sensitive values and quantizes the rest via weighted k‑means centroids guided by Fisher information.

Problem Statement

Generative LLM inference is memory‑bandwidth bound: loading weights limits single‑batch latency. Uniform low‑bit quantization either hurts accuracy or fails to reduce end‑to‑end latency. The paper asks: can we quantize weights to ultra‑low bits (3–4 bit) with minimal quality loss and real latency gains on GPUs?
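The memory‑bandwidth argument can be made concrete with a back‑of‑the‑envelope bound: if every weight must be read once per generated token, per‑token latency cannot be lower than weight bytes divided by memory bandwidth. The sketch below is only an illustration; the parameter count and the bandwidth figure are assumed inputs, not values taken from the paper, and the function names are hypothetical.

```python
def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store the weights at a given average bit-width."""
    return num_params * bits_per_weight / 8.0


def min_token_latency_s(num_params: float, bits_per_weight: float,
                        bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency if every weight is read once per token."""
    return weight_bytes(num_params, bits_per_weight) / bandwidth_bytes_per_s


# Assumed, illustrative inputs (not taken from the paper).
NUM_PARAMS = 7e9      # nominal "7B" parameter count
BANDWIDTH = 768e9     # bytes/s, assumed peak DRAM bandwidth of the GPU

fp16 = min_token_latency_s(NUM_PARAMS, 16, BANDWIDTH)
q3 = min_token_latency_s(NUM_PARAMS, 3, BANDWIDTH)
print(f"FP16 : {fp16 * 1e3:.1f} ms/token")
print(f"3-bit: {q3 * 1e3:.1f} ms/token")
print(f"ideal speedup: {fp16 / q3:.2f}x")
```

The ideal ratio is simply 16/3. Measured speedups are expected to be lower, since this bound ignores dequantization work, activation traffic, and any other kernel overhead.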

Main Contribution

Sensitivity‑based non‑uniform quantization: weighted k‑means using Fisher info to place quantization centroids near high‑impact weights.

Dense‑and‑Sparse decomposition: extract a tiny fraction of outlier and sensitive weights (~0.45%) and keep them in sparse FP16 storage, which narrows the value range the dense quantizer has to cover.

Practical kernels: LUT‑based CUDA kernels and balanced CSR sparse kernels to dequantize and run mixed dense+sparse matvec efficiently.
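To illustrate the first contribution, here is a minimal sketch of sensitivity‑weighted k‑means in one dimension. It is not the authors' implementation: the sensitivities are assumed to be given (for example, a diagonal Fisher estimate), the demo values are made up, and the function names are hypothetical. The only difference from plain k‑means is that each centroid is updated to the sensitivity‑weighted mean of its cluster.

```python
import random


def nearest(centroids, w):
    """Index of the centroid closest to w."""
    return min(range(len(centroids)), key=lambda j: (w - centroids[j]) ** 2)


def weighted_kmeans_1d(weights, sensitivities, k, iters=50, seed=0):
    """Lloyd's algorithm in 1-D where each weight carries a sensitivity."""
    rng = random.Random(seed)
    centroids = sorted(rng.sample(weights, k))
    for _ in range(iters):
        # Assignment step: nearest centroid for every weight.
        assign = [nearest(centroids, w) for w in weights]
        # Update step: sensitivity-weighted mean of each cluster.
        updated = list(centroids)
        for j in range(k):
            num = sum(s * w for w, s, a in zip(weights, sensitivities, assign) if a == j)
            den = sum(s for s, a in zip(sensitivities, assign) if a == j)
            if den > 0:  # leave empty or zero-sensitivity clusters where they are
                updated[j] = num / den
        if updated == centroids:
            break
        centroids = updated
    codes = [nearest(centroids, w) for w in weights]
    return centroids, codes


def weighted_error(weights, sensitivities, centroids, codes):
    """Sensitivity-weighted squared quantization error."""
    return sum(s * (w - centroids[c]) ** 2
               for w, s, c in zip(weights, sensitivities, codes))


rng = random.Random(1)
w = [rng.gauss(0.0, 1.0) for _ in range(512)]
fisher = [rng.random() ** 4 for _ in w]  # made-up sensitivities, a few large ones
centroids, codes = weighted_kmeans_1d(w, fisher, k=8)  # 8 centroids = 3-bit codes
print("centroids:", [round(c, 3) for c in centroids])
print("weighted error:", round(weighted_error(w, fisher, centroids, codes), 4))
```

Passing a constant sensitivity for every weight reduces this to ordinary k‑means, which gives a simple way to compare the two variants on the same data.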

Key Findings

3‑bit dense SqueezeLLM on LLaMA‑7B achieves perplexity 7.75 on C4 versus FP16 7.08 and GPTQ 9.55.

Numbers: LLaMA‑7B (3‑bit): SqueezeLLM PPL 7.75, FP16 7.08, GPTQ 9.55

Keeping 0.45% of weights as FP16 sparse outliers reduces perplexity further from 7.75 to 7.56 on LLaMA‑7B (3‑bit).

Numbers: PPL drop 7.75 → 7.56 (0.19)

On an A6000 GPU, 3‑bit SqueezeLLM yields up to 2.4× single‑batch speedup vs FP16 (LLaMA‑7B: 3.2s → 1.5s for 128 tokens).

Numbers: Latency 3.2s (FP16) → 1.5s (SqueezeLLM), speedup 2.1–2.4×

Sensitivity weighting is essential: non‑uniform k‑means without sensitivity gives PPL 18.08 vs sensitivity‑based 7.75 (3‑bit LLaMA‑7B).

Numbers: PPL 18.08 → 7.75 with sensitivity

Calibration data needs are small: ~10 examples often suffice to compute Fisher information for effective quantization.

Numbers: Perplexity stable from 10 → 100 examples (Table E.7)
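The dense‑and‑sparse finding above rests on a simple mechanism: pull a handful of extreme values out of the matrix so the remaining dense part spans a much narrower range. The sketch below is a simplified illustration under stated assumptions: it selects entries by magnitude only (selecting sensitive values as well is not shown), stores the extracted entries as a CSR‑style triple, and uses hypothetical names and toy data.

```python
def dense_sparse_split(matrix, sparse_fraction):
    """Split a row-major matrix into a dense part and a CSR-style sparse part.

    Roughly the largest-magnitude `sparse_fraction` of entries move to the
    sparse part and are zeroed in the dense part (ties at the threshold may
    move a few extra entries).
    """
    flat = sorted((abs(v) for row in matrix for v in row), reverse=True)
    n_keep = int(round(sparse_fraction * len(flat)))
    if n_keep == 0:
        return [list(r) for r in matrix], ([], [], [0] * (len(matrix) + 1))
    threshold = flat[n_keep - 1]
    dense, values, col_idx, row_ptr = [], [], [], [0]
    for row in matrix:
        dense_row = []
        for j, v in enumerate(row):
            if abs(v) >= threshold:
                values.append(v)       # kept in full precision
                col_idx.append(j)
                dense_row.append(0.0)  # removed from the dense part
            else:
                dense_row.append(v)
        dense.append(dense_row)
        row_ptr.append(len(values))
    return dense, (values, col_idx, row_ptr)


# Toy matrix with two obvious outliers.
W = [[0.1, -5.0, 0.2],
     [0.3, 0.05, 4.0]]
dense, (values, col_idx, row_ptr) = dense_sparse_split(W, sparse_fraction=2 / 6)
dense_range = max(abs(v) for row in dense for v in row)
print("dense part :", dense)
print("sparse part:", values, col_idx, row_ptr)
print("dense max magnitude:", dense_range)
```

After the split, the dense part is what a low‑bit quantizer would see, and the sparse triple is what a sparse kernel would multiply separately.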

Results

Perplexity (C4)

Value: LLaMA‑7B 3‑bit SqueezeLLM PPL 7.75 (avg bit 3.02)

Baseline: FP16 PPL 7.08

Perplexity (C4) with sparsity

Value: LLaMA‑7B 3‑bit SqueezeLLM (0.45% sparsity) PPL 7.56

Baseline: SqueezeLLM dense PPL 7.75

Latency (128 tokens) on A6000

Value: LLaMA‑7B FP16 3.2s → SqueezeLLM 1.5s (3‑bit)

Baseline: FP16 latency 3.2s

Accuracy

Value: Quantized Vicuna (3/4‑bit) stays within ~1–2 points of FP16 accuracy

Baseline: Vicuna FP16 per model (e.g., 39.1% avg for baseline in Table 2)

Who Should Care

What To Try In 7 Days

Run SqueezeLLM quantization on a small model (7B) with 10–100 calibration samples and compare perplexity to your current PTQ.

Measure single‑batch latency and peak GPU memory before/after; target A6000/A5000 for similar gains.

Test adding 0.05–0.45% sparse FP16 extraction to trade small memory for improved accuracy.
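When sweeping the sparse fraction, it helps to estimate the storage cost of each setting before running anything. The sketch below is a rough model, not the paper's accounting: it assumes every weight still occupies a dense code, each extracted entry additionally stores a 16‑bit value and a 32‑bit column index, and row pointers and lookup tables are negligible unless passed in explicitly. The function name and defaults are hypothetical.

```python
def average_bits(dense_bits, sparse_fraction, value_bits=16, index_bits=32,
                 lut_overhead_bits=0.0):
    """Approximate average storage bits per weight for dense + sparse storage.

    Assumes each sparse entry costs one value plus one column index on top
    of the dense code; CSR row pointers are ignored.
    """
    sparse_cost = sparse_fraction * (value_bits + index_bits)
    return dense_bits + lut_overhead_bits + sparse_cost


# Sweep the range suggested above: 0.05% to 0.45% of weights kept sparse.
for frac in (0.0005, 0.0045):
    print(f"{frac:.2%} sparse -> {average_bits(3, frac):.3f} bits/weight")
```

Under these assumptions the sparse part adds a fraction of a bit per weight, which is the "small memory" being traded for accuracy.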

Optimization Features

Infra Optimization

  • memory bandwidth reduction focus (weight only quantization)

Model Optimization

  • sensitivity‑based non‑uniform quantization (weighted k‑means)
  • dense‑and‑sparse weight decomposition

System Optimization

  • overlapped dense + sparse matvec
  • channel‑wise lookup tables

Training Optimization

  • post‑training quantization (no retraining)

Inference Optimization

  • LUT dequantization kernels (3/4‑bit)
  • balanced CSR sparse matvec kernels
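The two inference kernels listed above can be mimicked in a few lines of plain Python to show what they compute. This is a functional sketch only, not the CUDA kernels: codes are stored as ordinary integers rather than packed 3/4‑bit words, one lookup table per output row is an assumed reading of "channel‑wise", and the load balancing of the CSR kernel is not modeled. All names are hypothetical.

```python
def lut_matvec(codes, luts, x):
    """y[i] = sum_j luts[i][codes[i][j]] * x[j], with one lookup table per row."""
    return [sum(lut[c] * xj for c, xj in zip(row, x))
            for row, lut in zip(codes, luts)]


def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matvec over a CSR triple (values, col_idx, row_ptr)."""
    return [sum(values[p] * x[col_idx[p]]
                for p in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]


def dense_sparse_matvec(codes, luts, sparse, x):
    """Sum of the dequantized dense product and the full-precision sparse product."""
    dense_y = lut_matvec(codes, luts, x)
    sparse_y = csr_matvec(*sparse, x)
    return [d + s for d, s in zip(dense_y, sparse_y)]


# Toy example: 2x2 matrix, 1-bit codes for brevity (a 3-bit LUT has 8 entries).
codes = [[0, 1],
         [1, 0]]
luts = [[0.0, 2.0],    # row 0 centroids
        [-1.0, 1.0]]   # row 1 centroids
sparse = ([10.0], [0], [0, 1, 1])  # one FP16-style outlier at (row 0, col 0)
x = [1.0, 3.0]
y = dense_sparse_matvec(codes, luts, sparse, x)
print(y)
```

Because the dense and sparse products are independent until the final sum, they can be computed concurrently, which is the idea behind overlapping the two kernels.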

Reproducibility

Data Urls

  • C4
  • WikiText2
  • MMLU

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments focus on decoder/generation tasks (single‑batch); encoder and encoder‑decoder use cases are not fully evaluated.
  • Hardware modeling uses a roofline/simulation assumption; real gains vary by GPU and kernel stack.
  • Computing Fisher information and running k‑means clustering add a one‑time quantization cost (minutes to ~80 min for 65B).

When Not To Use

  • If your workload is compute‑bound or large‑batch inference where arithmetic intensity is high, weight‑only quantization gives less benefit.
  • When you cannot tolerate any quality change: even near‑lossless results show small perplexity/accuracy gaps.
  • If you lack the ability to run custom CUDA kernels on target hardware.
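The compute‑bound caveat above can be checked with a simple arithmetic‑intensity estimate. The sketch below counts two FLOPs per weight per input and one read of each weight, ignores activation traffic, and compares the result with a machine balance (peak FLOP/s divided by bandwidth). The hardware numbers are hypothetical placeholders, not specifications of any real GPU, and the function names are assumptions.

```python
def arithmetic_intensity(batch_size, bits_per_weight):
    """FLOPs per byte of weight traffic for a matrix product with `batch_size` inputs.

    Each weight contributes 2 FLOPs (multiply + add) per input and is read
    once; activation traffic is ignored.
    """
    return 2.0 * batch_size / (bits_per_weight / 8.0)


def is_memory_bound(batch_size, bits_per_weight, peak_flops, bandwidth_bytes_per_s):
    """True if the estimated intensity is below the machine balance."""
    machine_balance = peak_flops / bandwidth_bytes_per_s
    return arithmetic_intensity(batch_size, bits_per_weight) < machine_balance


# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
BANDWIDTH = 1e12

for batch in (1, 16, 128):
    bound = "memory" if is_memory_bound(batch, 16, PEAK_FLOPS, BANDWIDTH) else "compute"
    print(f"batch {batch:>3}: FP16 intensity "
          f"{arithmetic_intensity(batch, 16):.0f} FLOP/byte -> {bound}-bound")
```

Once a workload lands on the compute‑bound side of this estimate, shrinking the weights no longer addresses the bottleneck, which is why weight‑only quantization helps less at large batch sizes.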

Failure Modes

  • Excessive sparsity can increase runtime and memory because sparse kernels access memory irregularly, so the sparsity level needs careful tuning.
  • 2‑bit dense quantization without outlier handling can catastrophically degrade perplexity; a tiny FP16 sparse fraction is necessary.
  • Activation ordering/grouping (used by other methods) can cause memory access patterns that hurt latency if permutation costs are high.

Core Entities

Models

  • LLaMA
  • LLaMA2
  • OPT
  • Vicuna

Metrics

  • perplexity
  • latency (s)
  • peak GPU memory (GB)
  • Accuracy

Datasets

  • C4
  • WikiText2
  • MMLU

Benchmarks

  • MMLU
  • Vicuna evaluation (GPT‑4 ranking)
  • Perplexity on C4 and WikiText2