Learn quantization grids that pay attention to loss sensitivity, enabling accurate 2–4-bit LLM compression at large scale

July 14, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The method is practical: it reuses standard quantization formats, includes a fast fused kernel, and shows reproducible gains on many LLMs and real benchmarks. The remaining risks are 2-bit stability and the need to implement inference kernels for non-uniform formats.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Tianyi Zhang, Anshumali Shrivastava

Why It Matters For Business

LeanQuant lets teams compress state-of-the-art LLMs to 2–4 bits with less accuracy loss than the evaluated baselines, using common 48GB GPUs for quantization. This cuts model memory and inference cost while remaining compatible with standard kernels.

Summary TLDR

LeanQuant changes how post-training quantization chooses grid points: instead of min–max affine grids, it learns grids that prioritize weights with large loss sensitivity (inverse Hessian outliers). The method works for both affine and non-uniform formats, reduces quantization-induced loss error, and scales to very large models (quantized Llama-3.1-405B on two 48GB GPUs in ~21 hours). It is compatible with standard inference kernels and comes with a fused GPU kernel to make grid learning practical.

Problem Statement

Popular iterative loss-error quantizers use min–max affine grids that miss critical outlier weights (large inverse-Hessian diagonals). This causes high loss error and quality drop, and many accurate alternatives need custom data formats or large hardware, limiting compatibility and scalability for very large LLMs.
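To make the problem concrete, here is a minimal sketch of a min–max affine grid in plain Python. The function name and the toy weights are illustrative assumptions, not taken from the paper or its code. The grid depends only on the minimum and maximum of the weights, so it cannot give extra precision to the weights that matter most for the loss.

```python
def minmax_affine_quantize(weights, bits):
    """Round each weight to the nearest point of a uniform min-max grid."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    out = []
    for w in weights:
        q = round((w - lo) / scale)      # integer code
        q = max(0, min(levels, q))       # clamp to [0, levels]
        out.append(lo + q * scale)       # dequantized value
    return out

# Hypothetical toy row: five small weights and one large one.
weights = [-0.10, -0.05, 0.00, 0.04, 0.08, 2.00]
dequant = minmax_affine_quantize(weights, bits=2)
# The 2-bit grid is spread evenly over [-0.1, 2.0], so all five small
# weights collapse onto the same grid point (-0.1).
```

With 2 bits, three of the four grid points are spent on a range where this toy row has almost no weights, which is the kind of waste a learned grid is meant to avoid.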

Main Contribution

Identify that min–max affine grids poorly preserve precision for inverse-Hessian outliers and cause high loss error during iterative quantization.

Introduce loss-error-aware grid learning that weights parameters by inverse-Hessian diagonals to place grid points where they reduce task loss most.
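One way to picture loss-error-aware grid learning for a non-uniform format is as a weighted 1-D k-means, sketched below. This is an illustration under assumptions: the `sensitivity` values are made up, how they are derived from inverse-Hessian diagonals is left outside the sketch, and the initialization and iteration count are arbitrary choices rather than the paper's settings.

```python
def nearest(grid, w):
    """Index of the grid point closest to w."""
    return min(range(len(grid)), key=lambda j: (w - grid[j]) ** 2)

def weighted_error(weights, sensitivity, grid):
    """Sensitivity-weighted squared rounding error for a given grid."""
    return sum(s * (w - grid[nearest(grid, w)]) ** 2
               for w, s in zip(weights, sensitivity))

def learn_grid(weights, sensitivity, bits, iters=25):
    """Weighted 1-D k-means (Lloyd's algorithm) over one row or group."""
    k = 2 ** bits
    order = sorted(weights)
    n = len(order)
    # Initialise grid points at evenly spaced order statistics of the weights.
    grid = [order[(n - 1) * j // (k - 1)] for j in range(k)]
    for _ in range(iters):
        sums, mass = [0.0] * k, [0.0] * k
        for w, s in zip(weights, sensitivity):
            j = nearest(grid, w)
            sums[j] += s * w
            mass[j] += s
        # Each point moves to the weighted mean of its members;
        # points with no members stay where they are.
        grid = [sums[j] / mass[j] if mass[j] > 0 else grid[j] for j in range(k)]
    return sorted(grid)

weights = [-0.10, -0.05, 0.00, 0.04, 0.08, 2.00]
sensitivity = [1.0, 1.0, 50.0, 50.0, 1.0, 1.0]   # made-up importance values
uniform = [-0.1 + 0.7 * j for j in range(4)]      # 2-bit min-max grid
learned = learn_grid(weights, sensitivity, bits=2)
```

On this toy row the learned grid places a point close to the two high-sensitivity weights, so its weighted error is far below that of the uniform min–max grid.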

Key Findings

LeanQuant reduces task loss error and improves accuracy compared to GPTQ and other baselines in low-bit regimes.

Numbers: 3-bit Llama-3-8B avg zero-shot accuracy +18.38% vs GPTQ; +17.18% vs OmniQuant (Table 1)

Practical Use: Use LeanQuant for 3-bit quantization to recover significantly more accuracy than GPTQ/OmniQuant on evaluated benchmarks.

Evidence Ref: Table 1, Sec. 4.1

LeanQuant scales to very large LLMs with practical GPU resources and time.

Numbers: Quantized Llama-3.1-405B in ~21 hours using two Quadro RTX 8000 (48GB) GPUs (Sec. 1, Table 3)

Practical Use: You can quantize models up to 405B on commodity 48GB GPU setups for deployment, without requiring clusters of 80GB GPUs.

Evidence Ref: Sec. 1, Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | LeanQuant aff +18.38% vs GPTQ; +17.18% vs OmniQuant | GPTQ / OmniQuant | +18.38% / +17.18% | 11 tasks in Table 1 | Sec. 4.1, Table 1 | Table 1
Perplexity (4-bit avg across models) | LeanQuant aff avg 5.904 (lower is better) | GPTQ avg 6.008 | -0.104 (avg perplexity) | WikiText2 & C4 (Table 7) | Table 7, Sec. 4.1 | Table 7
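
As a quick check of the perplexity row, the delta is the LeanQuant average minus the GPTQ average. The relative figure is derived here from the two table values and is not reported in the source.

```latex
\Delta_{\text{ppl}} = 5.904 - 6.008 = -0.104,
\qquad
\frac{\Delta_{\text{ppl}}}{6.008} \approx -1.7\%
```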

What To Try In 7 Days

Run LeanQuant aff on a 4-bit copy of your model and compare zero-shot accuracy and perplexity to GPTQ.

If using non-uniform quantization, try LeanQuant nu and measure decoding latency with your inference stack.

Adopt the fused GPU kernel or use the provided implementation to cut quantization time from days to hours on 48GB GPUs.
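
For the perplexity comparison, a small helper like the one below may be enough. It is a sketch that assumes you already have per-token negative log-likelihoods (natural log) from each quantized model on the same evaluation text; the function names and the numbers are hypothetical.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def compare(baseline_nlls, candidate_nlls):
    """Report both perplexities and the candidate-minus-baseline delta."""
    base = perplexity(baseline_nlls)
    cand = perplexity(candidate_nlls)
    return {"baseline": base, "candidate": cand, "delta": cand - base}

# Made-up per-token losses for two quantized copies of the same model.
gptq_nlls = [1.82, 1.75, 1.90, 1.78]
leanquant_nlls = [1.80, 1.73, 1.88, 1.77]
report = compare(gptq_nlls, leanquant_nlls)
```

A negative `delta` means the candidate has lower perplexity than the baseline on that text. Use the same tokenizer, context length, and dataset split for both runs, or the numbers are not comparable.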

Optimization Features

Token Efficiency
Reduced memory reads lower decoding latency
Infra Optimization
Scales to 123B on a single 48GB GPU and 405B on 2×48GB; reduces the need for 80GB GPU clusters
Model Optimization
Quantization; loss-error-aware grid learning; group-wise and row-wise quantization
System Optimization
Works with standard affine and non-uniform formats; enumerative affine search for robust S, Z selection
Inference Optimization
Compatibility with affine and non-uniform kernels; fused GPU kernel for grid search; dedicated CUDA kernel for non-uniform inference
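
The enumerative search over scale and zero-point (S, Z) can be pictured as trying a set of candidate clipping ranges and keeping the one with the lowest sensitivity-weighted error. The sketch below assumes a simple search space (shrinking each end of the min–max range), made-up sensitivity values, and a real-valued zero offset; the paper's actual candidate set and its fused GPU kernel are not reproduced here.

```python
def weighted_error(weights, sensitivity, lo, scale, levels):
    """Sensitivity-weighted squared error of an affine grid starting at lo."""
    total = 0.0
    for w, s in zip(weights, sensitivity):
        q = max(0, min(levels, round((w - lo) / scale)))
        total += s * (w - (lo + q * scale)) ** 2
    return total

def search_scale_zero(weights, sensitivity, bits, steps=16):
    """Enumerate clipping ranges and return (error, S, Z) of the best one."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    span = w_max - w_min          # assumed > 0 (weights are not all equal)
    best = None
    for i in range(steps):
        lo = w_min + span * i / (2 * steps)       # raise the lower end
        for j in range(steps):
            hi = w_max - span * j / (2 * steps)   # lower the upper end
            scale = (hi - lo) / levels
            err = weighted_error(weights, sensitivity, lo, scale, levels)
            if best is None or err < best[0]:
                # Z is kept real-valued here; integer zero-points need rounding.
                best = (err, scale, -lo / scale)
    return best

weights = [-0.10, -0.05, 0.00, 0.04, 0.08, 2.00]
sensitivity = [1.0, 1.0, 50.0, 50.0, 1.0, 1.0]   # made-up importance values
best_err, S, Z = search_scale_zero(weights, sensitivity, bits=2)
minmax_err = weighted_error(weights, sensitivity, -0.10, (2.00 - -0.10) / 3, 3)
```

Because the plain min–max range is one of the candidates (`i = 0`, `j = 0`), the search result is never worse than min–max under this error measure.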

Risks & Boundaries

Limitations

2-bit quantization remains challenging and sometimes yields high perplexity or unstable results (Table 7, Table 14).

Non-uniform formats need lookup-style kernels for efficient inference; extra kernel work is required for deployment (Sec. 3.2.1, J).
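
What "lookup-style" means in practice: each weight is stored as a small integer code, and the kernel reads the real value from a per-row codebook before multiplying. The plain-Python sketch below only illustrates that data flow; the function names and the row-wise codebook layout are assumptions, and a real deployment would do this inside a GPU kernel.

```python
def pack_codes(row, codebook):
    """Map each weight to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda j: abs(w - codebook[j]))
            for w in row]

def lut_matvec(codes, codebooks, x):
    """Compute y = W x where W[r][c] = codebooks[r][codes[r][c]]."""
    y = []
    for row_codes, book in zip(codes, codebooks):
        # Dequantize on the fly: one table lookup per weight.
        y.append(sum(book[c] * xi for c, xi in zip(row_codes, x)))
    return y

# Hypothetical 2-bit codebook (4 entries) for a single weight row.
codebooks = [[-1.0, 0.0, 0.5, 2.0]]
codes = [pack_codes([1.9, 0.1, 0.4], codebooks[0])]   # -> [[3, 1, 2]]
y = lut_matvec(codes, codebooks, [1.0, 5.0, 2.0])     # 2.0*1 + 0.0*5 + 0.5*2
```

An affine format needs only a multiply and an add per weight, while this path needs a table read, which is why non-uniform formats call for dedicated kernel work.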

When Not To Use

When you require extreme 1–2 bit quantization with guaranteed stability; LeanQuant shows mixed 2-bit results.

When you cannot run a small calibration pass; the method needs one to compute inverse-Hessian diagonals.

Failure Modes

Noisy or unrepresentative calibration inputs produce poor Hessian diagonal estimates and suboptimal grids.

Very low bit widths (2-bit) can still cause large loss increases and high perplexity.
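
The calibration failure mode is easier to reason about with the computation in view. The sketch below assumes the GPTQ-style convention of a layer Hessian H = 2·Σ x xᵀ over calibration inputs plus a small diagonal damping term; whether LeanQuant uses exactly this form is an assumption here, and the pure-Python inverse is only suitable for tiny examples.

```python
def hessian(samples, damp=0.01):
    """H = 2 * sum_n x_n x_n^T over calibration inputs, plus damping."""
    d = len(samples[0])
    h = [[2.0 * sum(x[i] * x[j] for x in samples) for j in range(d)]
         for i in range(d)]
    lam = damp * sum(h[i][i] for i in range(d)) / d   # fraction of mean diagonal
    for i in range(d):
        h[i][i] += lam
    return h

def inverse_diagonal(h):
    """Diagonal of H^{-1} via Gauss-Jordan elimination on [H | I]."""
    d = len(h)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(d)]
         for i, row in enumerate(h)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(d):
            if r != col:
                f = a[r][col]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][d + i] for i in range(d)]

# Two hypothetical calibration sets for a 2-input layer.
set_a = [[1.0, 0.0], [0.0, 2.0]]
set_b = [[1.0, 0.0], [0.0, 0.2]]   # second input barely exercised
diag_a = inverse_diagonal(hessian(set_a, damp=0.0))
diag_b = inverse_diagonal(hessian(set_b, damp=0.0))
```

The two sets give very different diagonal entries for the second input, so a grid learned from one set can be badly matched to inputs that look like the other.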

Core Entities

Models

Llama-3.1-405B, Llama-3-8B, Llama-2-7B, Llama-2-13B, Llama-3-70B, Mistral-Large-123B, Mistral-7B, LLaMA-7B, LLaMA-13B, BERT (experiments)

Metrics

Accuracy, perplexity, F1 (SQuAD), sum of loss errors (ϵ), GPU memory (peak), quantization time

Datasets

C4, WikiText2, SQuAD, LM-evaluation-harness tasks (ARC, LAMBADA, MMLU, HellaSwag, PIQA, WinoGrande), MT-Bench (judged by GPT-4o)

Benchmarks

ARC, LAMBADA, MMLU, HellaSwag, PIQA, WinoGrande, MT-Bench