Learn quantization grids that pay attention to loss sensitivity, enabling accurate 2–4-bit LLM compression at large scale

July 14, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Tianyi Zhang, Anshumali Shrivastava

Links

Abstract / PDF

Why It Matters For Business

LeanQuant lets teams compress state-of-the-art LLMs to 2–4 bits with less accuracy loss than baselines such as GPTQ, using common 48GB GPUs. This cuts model memory and inference cost while staying compatible with standard inference kernels.

Summary TLDR

LeanQuant changes how post-training quantization chooses grid points: instead of min–max affine grids, it learns grids that prioritize weights with large loss sensitivity (inverse-Hessian outliers). The method works for both affine and non-uniform formats, reduces quantization-induced loss error, and scales to very large models: it quantized Llama-3.1-405B on two 48GB GPUs in ~21 hours. It is compatible with standard inference kernels and comes with a fused GPU kernel that makes grid learning practical.

Problem Statement

Popular iterative loss-error quantizers use min–max affine grids that miss critical outlier weights (large inverse-Hessian diagonals). This causes high loss error and quality drop, and many accurate alternatives need custom data formats or large hardware, limiting compatibility and scalability for very large LLMs.
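To make the problem concrete, here is a minimal NumPy sketch of a min–max affine grid on a toy weight vector. The `sensitivity` array is a hypothetical stand-in for per-weight loss sensitivity, and the values are made up for illustration; none of this is taken from the LeanQuant code.

```python
import numpy as np

def minmax_affine_quantize(w, bits):
    """Round each weight to the nearest point of a min-max affine grid."""
    levels = 2 ** bits - 1
    scale = (w.max() - w.min()) / levels    # step between adjacent grid points
    zero = w.min()                           # grid starts at the smallest weight
    codes = np.clip(np.round((w - zero) / scale), 0, levels)
    return codes * scale + zero

def weighted_error(w, w_hat, sensitivity):
    """Sensitivity-weighted squared error, a toy proxy for loss error."""
    return float(np.sum(sensitivity * (w - w_hat) ** 2))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=256)
w[0] = 0.5                       # one large-magnitude weight stretches the range
sensitivity = np.ones_like(w)    # hypothetical sensitivities, not real Hessian data
sensitivity[1:9] = 100.0         # a few weights assumed to matter much more

step = (w.max() - w.min()) / (2 ** 3 - 1)
w_hat = minmax_affine_quantize(w, bits=3)
print("grid step:", step)
print("weighted error:", weighted_error(w, w_hat, sensitivity))
```

Because the grid is spaced evenly between the minimum and maximum, a single extreme value widens every step, and the grid has no way to spend extra precision on the weights marked as sensitive.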

Main Contribution

Identify that min–max affine grids poorly preserve precision for inverse-Hessian outliers and cause high loss error during iterative quantization.

Introduce loss-error-aware grid learning that weights parameters by inverse-Hessian diagonals to place grid points where they reduce task loss most.

Provide algorithms for both non-uniform (weighted k-means with uniform initialization) and affine (enumerative search) loss-error-aware grids plus a fused GPU kernel to make grid search fast.

Demonstrate accuracy and scalability: outperforms or matches strong baselines across 2–4 bits and quantizes Llama-3.1-405B using two 48GB GPUs.
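The non-uniform variant described above (weighted k-means with uniform initialization) can be sketched roughly as follows. This is an illustrative 1-D version under stated assumptions: `sensitivity` is taken as given per-weight weights derived in some way from inverse-Hessian diagonals (the exact transformation is not specified in this summary), and the function names are hypothetical.

```python
import numpy as np

def nearest(w, grid):
    """Index of the closest grid point for every weight."""
    return np.argmin(np.abs(w[:, None] - grid[None, :]), axis=1)

def weighted_kmeans_grid(w, sensitivity, bits, iters=20):
    """Learn 2**bits grid points by sensitivity-weighted 1-D k-means."""
    k = 2 ** bits
    grid = np.linspace(w.min(), w.max(), k)        # uniform initialization
    for _ in range(iters):
        assign = nearest(w, grid)
        for j in range(k):
            mask = assign == j
            if mask.any():                          # empty clusters stay in place
                grid[j] = np.average(w[mask], weights=sensitivity[mask])
    return grid

def weighted_error(w, grid, sensitivity):
    w_hat = grid[nearest(w, grid)]
    return float(np.sum(sensitivity * (w - w_hat) ** 2))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=512)
sensitivity = rng.uniform(0.1, 1.0, size=512)      # made-up sensitivities
sensitivity[:16] = 100.0                            # a few assumed outliers

uniform_grid = np.linspace(w.min(), w.max(), 2 ** 3)
learned_grid = weighted_kmeans_grid(w, sensitivity, bits=3)
err_uniform = weighted_error(w, uniform_grid, sensitivity)
err_learned = weighted_error(w, learned_grid, sensitivity)
print("uniform grid error:", err_uniform)
print("learned grid error:", err_learned)
```

Each k-means step can only lower the weighted objective, so starting from the uniform grid means the learned grid is never worse than that starting point on this measure.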

Key Findings

LeanQuant reduces task loss error and improves accuracy compared to GPTQ and other baselines in low-bit regimes.

Numbers: 3-bit Llama-3-8B avg zero-shot accuracy +18.38% vs GPTQ; +17.18% vs OmniQuant (Table 1)

LeanQuant scales to very large LLMs with practical GPU resources and time.

Numbers: Quantized Llama-3.1-405B in ~21 hours using 2× Quadro RTX 8000-48GB GPUs (Sec. 1, Table 3)

A fused GPU kernel accelerates LeanQuant's affine-grid learning massively.

Numbers: End-to-end quantization of Llama-3-8B 4-bit: 15.1 hrs → 0.27 hrs (>50× speedup) (Table 5)

LeanQuant achieves lower or comparable perplexity to strong baselines at the same bit widths.

Numbers: 4-bit avg perplexity LeanQuant aff 5.904 vs GPTQ 6.008; LeanQuant nu avg 5.824 ≈ SqueezeLLM 5.818 (Table 7)

Results

Accuracy

Value: LeanQuant aff +18.38% vs GPTQ; +17.18% vs OmniQuant

Baseline: GPTQ / OmniQuant

Perplexity (4-bit avg across models)

Value: LeanQuant aff avg 5.904 (lower is better)

Baseline: GPTQ avg 6.008

Quantization time (Llama-3.1-405B)

Value: ≈ 20.7–21 hours

Baseline: GPTQ OOM / OmniQuant OOM

GPU memory peak (4-bit)

Value: LeanQuant 65.4 GB for Llama-3.1-405B

Baseline: GPTQ OOM; OmniQuant OOM; SqueezeLLM OOM

Fused kernel speedup

Value: >50× end-to-end speedup for grid learning

Baseline: Without fused kernel

Who Should Care

What To Try In 7 Days

Run LeanQuant aff on a 4-bit copy of your model and compare zero-shot accuracy and perplexity to GPTQ.

If using non-uniform quantization, try LeanQuant nu and measure decoding latency with your inference stack.

Adopt the fused GPU kernel or use the provided implementation to cut quantization time from days to hours on 48GB GPUs.
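For the perplexity comparison in the first step, a small helper like the one below may be enough. It is a sketch that assumes you already have natural-log probabilities of the reference tokens from each quantized model; how you obtain them depends on your evaluation stack and is not shown here.

```python
import numpy as np

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return float(np.exp(-np.mean(token_log_probs)))

def relative_reduction(baseline_ppl, new_ppl):
    """Fractional perplexity reduction of the new model over the baseline."""
    return (baseline_ppl - new_ppl) / baseline_ppl

# Sanity check: a model that assigns probability 1/4 to every token
# has a perplexity close to 4.0.
uniform_ppl = perplexity(np.log(np.full(10, 0.25)))
print(uniform_ppl)

# The 4-bit averages reported above (GPTQ 6.008, LeanQuant aff 5.904).
print(relative_reduction(6.008, 5.904))
```

Evaluate both quantized models on the same text with the same tokenizer and context length; otherwise the two perplexities are not comparable.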

Optimization Features

Token Efficiency

  • fewer memory reads lower decoding latency

Infra Optimization

  • scales to 123B on a single 48GB GPU and 405B on 2×48GB
  • reduces need for 80GB GPU clusters

Model Optimization

  • quantization
  • loss-error-aware grid learning
  • group-wise and row-wise quantization

System Optimization

  • works with standard affine and non-uniform formats
  • enumerative affine search for robust scale and zero-point (S, Z) selection
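One way to picture an enumerative affine search is the sketch below: try a set of candidate ranges, quantize with each, and keep the one with the lowest sensitivity-weighted error. The candidate set used here (trimming a fraction off each end of the min–max range) and the `sensitivity` values are assumptions for illustration, not the search space or data used by LeanQuant.

```python
import numpy as np

def affine_quantize(w, lo, hi, bits):
    """Quantize to an evenly spaced grid on [lo, hi], clipping values outside."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    codes = np.clip(np.round((w - lo) / scale), 0, levels)
    return codes * scale + lo

def weighted_error(w, w_hat, sensitivity):
    return float(np.sum(sensitivity * (w - w_hat) ** 2))

def search_affine_grid(w, sensitivity, bits, steps=10):
    """Enumerate clipped (lo, hi) ranges and keep the lowest weighted error."""
    w_min, w_max = float(w.min()), float(w.max())
    span = w_max - w_min
    best = None
    for a in np.linspace(0.0, 0.45, steps):        # fraction trimmed at the bottom
        for b in np.linspace(0.0, 0.45, steps):    # fraction trimmed at the top
            lo, hi = w_min + a * span, w_max - b * span
            err = weighted_error(w, affine_quantize(w, lo, hi, bits), sensitivity)
            if best is None or err < best[0]:
                best = (err, lo, hi)
    return best

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=256)
w[0] = 0.5                                          # large but assumed insensitive
sensitivity = np.ones_like(w)
sensitivity[0] = 0.01
sensitivity[1:9] = 100.0                            # assumed loss-sensitive weights

minmax_err = weighted_error(w, affine_quantize(w, w.min(), w.max(), 3), sensitivity)
best_err, lo, hi = search_affine_grid(w, sensitivity, bits=3)
print("min-max error:", minmax_err)
print("searched error:", best_err, "range:", (lo, hi))
```

The untrimmed min–max range is one of the candidates, so the search result is never worse than min–max on this measure. The result stays a plain affine grid, with scale `(hi - lo) / (2**bits - 1)` and offset `lo`, which is why it remains usable with standard affine kernels.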

Inference Optimization

  • compatibility with affine and non-uniform kernels
  • fused GPU kernel for grid search
  • dedicated CUDA kernel for non-uniform inference

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • 2-bit quantization remains challenging and sometimes yields high perplexity or unstable results (Table 7, Table 14).
  • Non-uniform formats need lookup-style kernels for efficient inference; extra kernel work is required for deployment (Sec. 3.2.1, J).
  • Method relies on Hessian-diagonal estimates from a calibration set; small calibration sets could lead to noisy estimates.

When Not To Use

  • When you require extreme 1–2 bit quantization with guaranteed stability; LeanQuant shows mixed 2-bit results.
  • If you cannot run a small calibration pass; the method needs calibration data to compute inverse-Hessian diagonals.
  • When your inference stack supports neither affine formats nor the non-uniform lookup formats LeanQuant targets, and you cannot add kernels.

Failure Modes

  • Noisy or unrepresentative calibration inputs produce poor Hessian diagonal estimates and suboptimal grids.
  • Very low bit widths (2-bit) can still cause large loss increases and high perplexity.
  • Lack of optimized inference kernels for chosen non-uniform formats can negate runtime benefits.
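The calibration-related failure mode can be illustrated with a toy experiment. The sketch assumes a GPTQ-style layer-wise proxy Hessian, H = 2·XᵀX/n plus a small damping term, where X holds the calibration inputs to one linear layer; whether LeanQuant computes it exactly this way is an assumption, and the data is synthetic.

```python
import numpy as np

def inverse_hessian_diagonal(X, damp=0.01):
    """X: (num_samples, in_features) calibration inputs to one linear layer."""
    H = 2.0 * X.T @ X / X.shape[0]                  # assumed proxy Hessian
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])   # damping for stability
    return np.diag(np.linalg.inv(H))

rng = np.random.default_rng(0)
feature_scales = rng.uniform(0.5, 2.0, size=32)     # synthetic input statistics

def sample_inputs(n):
    return rng.normal(size=(n, 32)) * feature_scales

reference = inverse_hessian_diagonal(sample_inputs(100_000))

def median_relative_error(n):
    estimate = inverse_hessian_diagonal(sample_inputs(n))
    return float(np.median(np.abs(estimate - reference) / reference))

err_small = median_relative_error(16)       # fewer samples than input features
err_large = median_relative_error(4096)
print("relative error, 16 samples:", err_small)
print("relative error, 4096 samples:", err_large)
```

With fewer samples than input features the proxy Hessian is rank-deficient before damping, so the inverse diagonals, and any grid weights derived from them, are dominated by the damping term rather than by the data.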

Core Entities

Models

  • Llama-3.1-405B
  • Llama-3-8B
  • Llama-2-7B
  • Llama-2-13B
  • Llama-3-70B
  • Mistral-Large-123B
  • Mistral-7B
  • LLaMA-7B
  • LLaMA-13B
  • BERT (experiments)

Metrics

  • Accuracy
  • perplexity
  • F1 (SQuAD)
  • sum of loss errors (ϵ)
  • GPU memory (peak)
  • quantization time

Datasets

  • C4
  • WikiText2
  • SQuAD
  • LM-evaluation-harness tasks (ARC, LAMBADA, MMLU, HellaSwag, PIQA, WinoGrande)
  • MT-Bench (judged by GPT-4o)

Benchmarks

  • ARC
  • LAMBADA
  • MMLU
  • HellaSwag
  • PIQA
  • WinoGrande
  • MT-Bench