Lossless 3-bit LLM quantization with dense-and-sparse weights

Overview

Decision Snapshot: Ready For Pilot

The paper provides concrete numbers across multiple models, ablations that isolate key components, and an implementation with kernels and latency measurements; this supports practical adoption for memory‑bound LLM inference.

Citations: 23

Evidence Strength: 0.90

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 70%

Authors

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

Why It Matters For Business

SqueezeLLM cuts model storage and single‑request latency by ~2× while keeping near‑FP16 quality, enabling cheaper and faster on‑prem or cloud inference for generative LLMs.

Who Should Care

CTO, ML Engineer, Engineering Lead, Data Scientist, Product Manager

Summary TLDR

SqueezeLLM is a post‑training quantization method that combines sensitivity‑aware non‑uniform quantization with a dense‑and‑sparse weight decomposition. It compresses LLM weights to ~3 bits with near‑lossless generation quality (e.g., LLaMA‑7B perplexity 7.75 vs FP16 7.08) while cutting memory and speeding single‑batch inference (up to ~2.4× on A6000). The method stores a tiny fraction (≈0.45%) of weights in full precision as sparse outliers/sensitive values and quantizes the rest via weighted k‑means centroids guided by Fisher information.

Problem Statement

Generative LLM inference is memory‑bandwidth bound: loading weights limits single‑batch latency. Uniform low‑bit quantization either hurts accuracy or fails to reduce end‑to‑end latency. The paper asks: can we quantize weights to ultra‑low bits (3–4 bit) with minimal quality loss and real latency gains on GPUs?
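A back-of-envelope sketch of the memory-bound argument: if every weight must be streamed from GPU memory once per generated token, the time to read the weights is a floor on single-batch latency. The parameter count, the bandwidth figure, and the helper names below are illustrative assumptions, not values from the paper; the 3.02 average bit width is the figure reported for the 3-bit LLaMA-7B result.

```python
def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store all weights at the given average bit width."""
    return num_params * bits_per_weight / 8.0


def min_token_latency_ms(num_params: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency if every weight is read once per token."""
    seconds = weight_bytes(num_params, bits_per_weight) / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


NUM_PARAMS = 7e9          # assumption: round figure for a "7B" model
BANDWIDTH_GB_S = 768.0    # assumption: nominal GPU memory bandwidth

fp16_ms = min_token_latency_ms(NUM_PARAMS, 16.0, BANDWIDTH_GB_S)
q3_ms = min_token_latency_ms(NUM_PARAMS, 3.02, BANDWIDTH_GB_S)
ideal_speedup = fp16_ms / q3_ms

print(f"FP16: {fp16_ms:.1f} ms/token, 3.02-bit: {q3_ms:.1f} ms/token, "
      f"ideal speedup: {ideal_speedup:.1f}x")
```

The ideal ratio here is simply 16 / 3.02. The reported speedup (up to ~2.4× on A6000) is smaller, which is plausible because this estimate ignores activations, KV-cache traffic, and kernel overhead; treat it as an upper bound.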

Main Contribution

Sensitivity‑based non‑uniform quantization: weighted k‑means that uses Fisher information as per‑weight importance, so quantization centroids land near high‑impact weights.

Dense‑and‑Sparse decomposition: extract a tiny fraction of outlier and sensitive weights (~0.45%) and keep them in FP16 sparse storage, which shrinks the value range the dense part has to cover.
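The first contribution can be illustrated with a minimal 1-D weighted k-means, assuming each weight comes with a non-negative sensitivity score. This is a sketch, not the authors' implementation: the function names, the evenly spaced initialisation, and the toy sensitivities are assumptions, and how the Fisher-based scores are estimated is out of scope here. A 3-bit quantizer would use k = 8 centroids; k = 2 keeps the example readable.

```python
def nearest(value, centroids):
    """Index of the centroid closest to value (ties go to the lower index)."""
    return min(range(len(centroids)), key=lambda j: (value - centroids[j]) ** 2)


def weighted_kmeans_1d(values, sensitivities, k, iters=50):
    """Lloyd's algorithm in 1-D where each point has an importance weight."""
    lo, hi = min(values), max(values)
    # Assumption: start from centroids spread evenly over the value range.
    centroids = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        mass = [0.0] * k
        for v, s in zip(values, sensitivities):
            j = nearest(v, centroids)
            sums[j] += s * v
            mass[j] += s
        # Each centroid moves to the sensitivity-weighted mean of its cluster.
        updated = [sums[j] / mass[j] if mass[j] > 0 else centroids[j]
                   for j in range(k)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def quantize(values, centroids):
    """Replace each value by the index of its nearest centroid."""
    return [nearest(v, centroids) for v in values]


values = [0.0, 0.1, 1.0, 1.1]
uniform = weighted_kmeans_1d(values, [1.0, 1.0, 1.0, 1.0], k=2)
sensitive = weighted_kmeans_1d(values, [1.0, 1.0, 1.0, 9.0], k=2)
codes = quantize(values, sensitive)
print(uniform, sensitive, codes)
```

With uniform scores the upper centroid sits midway between 1.0 and 1.1; giving the last weight a larger score pulls that centroid toward 1.1, so the high-impact weight is reconstructed with less error.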

Key Findings

3‑bit dense SqueezeLLM on LLaMA‑7B achieves perplexity 7.75 on C4 versus FP16 7.08 and GPTQ 9.55.

Numbers: LLaMA‑7B (3‑bit): SqueezeLLM PPL 7.75, FP16 7.08, GPTQ 9.55

Practical Use: You can quantize a 7B model to ~3 bits at a cost of about 0.7 perplexity vs FP16 on C4 (7.75 vs 7.08), well below GPTQ's 9.55; try SqueezeLLM instead of GPTQ when accuracy matters.

Evidence Ref: Table 1

Keeping 0.45% of weights as FP16 sparse outliers reduces perplexity further from 7.75 to 7.56 on LLaMA‑7B (3‑bit).

Numbers: PPL drop 7.75 → 7.56 (0.19)

Practical Use: Extracting a tiny fraction of weights (≤0.5%) as sparse FP16 often recovers low‑bit quality at negligible memory and speed cost; use this sparsity knob when 3‑bit accuracy still lags.

Evidence Ref: Sec. 4.2; Table 1
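A minimal sketch of the dense-and-sparse split behind this finding. It selects outliers by magnitude only, which is a simplification (the method also keeps sensitive values); the function name, the synthetic weights, and the dict-based sparse storage are assumptions made for illustration.

```python
import math
import random


def dense_and_sparse_split(weights, sparse_fraction):
    """Move the largest-magnitude weights into a sparse dict; zero them in dense."""
    n_sparse = math.ceil(len(weights) * sparse_fraction)
    by_magnitude = sorted(range(len(weights)),
                          key=lambda i: abs(weights[i]), reverse=True)
    outlier_idx = set(by_magnitude[:n_sparse])
    sparse = {i: weights[i] for i in outlier_idx}           # kept in full precision
    dense = [0.0 if i in outlier_idx else w                 # to be quantized
             for i, w in enumerate(weights)]
    return dense, sparse


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1000)]
weights[10], weights[500] = 0.9, -1.2   # two synthetic outliers

dense, sparse = dense_and_sparse_split(weights, 0.0045)
full_range = max(weights) - min(weights)
dense_range = max(dense) - min(dense)
print(f"kept {len(sparse)} sparse values; "
      f"range {full_range:.2f} -> {dense_range:.2f}")
```

Removing a handful of outliers narrows the range the dense part has to cover, so the same number of quantization levels can be spent on the bulk of the distribution. Adding the dense and sparse parts back together reproduces the original weights exactly.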

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity (C4)	LLaMA‑7B 3‑bit SqueezeLLM PPL 7.75 (avg bit 3.02)	FP16 PPL 7.08	≈ +0.67 vs FP16; −1.80 vs GPTQ	C4	Table 1 (LLaMA‑7B 3‑bit)	Table 1
Perplexity (C4) with sparsity	LLaMA‑7B 3‑bit SqueezeLLM (0.45% sparsity) PPL 7.56	SqueezeLLM dense PPL 7.75	−0.19	C4	Sec. 4.2; Table 1	Table 1
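For reference, perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better. The sketch below defines it and recomputes the deltas in the table from the reported values; the helper name and the dictionary layout are illustrative, and the toy log-likelihoods are not from the paper.

```python
import math


def perplexity(token_nlls):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Toy check: a model that assigns probability 1/4 to every token has PPL 4.
toy_ppl = perplexity([math.log(4.0)] * 3)

# Reported C4 perplexities for LLaMA-7B, as listed in the table.
reported = {
    "FP16": 7.08,
    "GPTQ": 9.55,
    "SqueezeLLM dense": 7.75,
    "SqueezeLLM 0.45% sparse": 7.56,
}

delta_vs_fp16 = {name: round(ppl - reported["FP16"], 2)
                 for name, ppl in reported.items()}
delta_vs_gptq = round(reported["SqueezeLLM dense"] - reported["GPTQ"], 2)
delta_sparse = round(reported["SqueezeLLM 0.45% sparse"]
                     - reported["SqueezeLLM dense"], 2)

print(delta_vs_fp16, delta_vs_gptq, delta_sparse)
```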

What To Try In 7 Days

Run SqueezeLLM quantization on a small model (7B) with 10–100 calibration samples and compare perplexity to your current PTQ.

Measure single‑batch latency and peak GPU memory before/after; target A6000/A5000 for similar gains.

Test adding 0.05–0.45% sparse FP16 extraction to trade small memory for improved accuracy.
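When sweeping the sparse fraction, it helps to estimate the storage overhead up front. The sketch below assumes each sparse entry costs one FP16 value plus one 32-bit index and ignores row-pointer overhead; the index width and the function name are assumptions, so treat the results as rough estimates rather than reported figures.

```python
def effective_bits(dense_bits, sparse_fraction, value_bits=16, index_bits=32):
    """Average bits per weight once a sparse FP16 fraction is added.

    The dense matrix keeps its full shape, so the sparse entries are pure
    overhead on top of the dense bit width.
    """
    return dense_bits + sparse_fraction * (value_bits + index_bits)


DENSE_BITS = 3.02  # average bit width reported for the 3-bit dense model

sweep = {frac: effective_bits(DENSE_BITS, frac)
         for frac in (0.0, 0.0005, 0.0045)}

for frac, bits in sweep.items():
    print(f"sparse fraction {frac:.2%}: ~{bits:.3f} bits/weight")
```

Under these assumptions the full 0.45% setting adds roughly 0.2 bits per weight, which is why the sparse fraction is a cheap knob in memory terms.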

Optimization Features

Infra Optimization

memory bandwidth reduction focus (weight only quantization)

Model Optimization

sensitivity‑based non‑uniform quantization (weighted k‑means)

dense‑and‑sparse weight decomposition

System Optimization

overlapped dense + sparse matvec

channel‑wise lookup tables

Training Optimization

post‑training quantization (no retraining)

Inference Optimization

LUT dequantization kernels (3/4‑bit)

balanced CSR sparse matvec kernels
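How the inference-side pieces fit together can be sketched in plain Python: the dense part stores small integer codes that index a per-row lookup table, the sparse part is a CSR matrix of full-precision values, and the two matvec results are summed. This is a reference sketch only; it assumes "channel-wise" means one table per output row, uses 2-bit codes for brevity, and does none of the load balancing or overlap that the actual GPU kernels perform. All names are hypothetical.

```python
def lut_matvec(codes, luts, x):
    """Dense part: codes[r][c] indexes luts[r], the lookup table of row r."""
    return [sum(luts[r][code] * xc for code, xc in zip(row, x))
            for r, row in enumerate(codes)]


def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse part in CSR form: row r owns entries row_ptr[r]:row_ptr[r + 1]."""
    return [sum(values[k] * x[col_idx[k]]
                for k in range(row_ptr[r], row_ptr[r + 1]))
            for r in range(len(row_ptr) - 1)]


def dense_and_sparse_matvec(codes, luts, values, col_idx, row_ptr, x):
    """y = W_dense @ x + W_sparse @ x."""
    dense_y = lut_matvec(codes, luts, x)
    sparse_y = csr_matvec(values, col_idx, row_ptr, x)
    return [d + s for d, s in zip(dense_y, sparse_y)]


# 2 x 3 example with 4-entry (2-bit) tables.
luts = [[-1.0, 0.0, 1.0, 2.0],
        [0.0, 0.5, 1.0, 1.5]]
codes = [[2, 1, 3],      # dequantizes to [1.0, 0.0, 2.0]
         [1, 0, 2]]      # dequantizes to [0.5, 0.0, 1.0]
# One full-precision outlier of 8.0 at row 0, column 1.
values, col_idx, row_ptr = [8.0], [1], [0, 1, 1]
x = [1.0, 2.0, 3.0]

y = dense_and_sparse_matvec(codes, luts, values, col_idx, row_ptr, x)
print(y)
```

Note that the dense code at the outlier's position maps to 0.0, so the outlier's contribution comes only from the sparse part.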

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/SqueezeAILab/SqueezeLLM

Data URLs

C4, WikiText2, MMLU

Risks & Boundaries

Limitations

Experiments focus on decoder/generation tasks (single‑batch); encoder and encoder‑decoder use is not fully evaluated.

Hardware modeling uses a roofline/simulation assumption; real gains vary by GPU and kernel stack.

When Not To Use

If your workload is compute‑bound or large‑batch inference where arithmetic intensity is high, weight‑only quantization gives less benefit.

When you cannot tolerate any quality change: even near‑lossless results show small perplexity/accuracy gaps.

Failure Modes

Excessive sparsity can increase runtime and memory because sparse kernels access memory irregularly, so the sparse fraction needs careful tuning.

2‑bit dense quantization without outlier handling can catastrophically degrade perplexity; a tiny FP16 sparse fraction is necessary.

Core Entities

Models

LLaMA, LLaMA2, OPT, Vicuna

Metrics

perplexity, latency (s), peak GPU memory (GB), accuracy

Datasets

C4, WikiText2, MMLU

Benchmarks

MMLU, Vicuna evaluation (GPT‑4 ranking), perplexity on C4 and WikiText2
