Learn quantization grids that pay attention to loss sensitivity, enabling accurate 2–4-bit LLM compression at large scale

Overview

Decision Snapshot: Needs Validation

The method is practical: it reuses standard quant formats, includes a fast fused kernel, and shows reproducible gains on many LLMs and real benchmarks; remaining risks are 2-bit stability and implementation of inference kernels for non-uniform formats.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Tianyi Zhang, Anshumali Shrivastava

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LeanQuant lets teams compress state-of-the-art LLMs to 2–4 bits with less accuracy loss than baselines such as GPTQ, and the quantization run itself fits on common 48GB GPUs. This cuts model memory and inference cost while remaining compatible with standard kernels.
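The memory saving can be estimated with simple arithmetic. The sketch below is a rough back-of-the-envelope helper, not something from the paper: the function name `weight_memory_gb` is hypothetical, and the estimate counts weight storage only, ignoring scales, zero-points, activations, and the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) at a given bit width.

    Assumption: every parameter is stored at exactly bits_per_weight bits,
    with no overhead for quantization metadata.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Compare 16-bit weights with the 2-4-bit range discussed above
    # for a 405B-parameter model.
    for bits in (16, 4, 3, 2):
        print(f"{bits:>2}-bit: {weight_memory_gb(405e9, bits):.1f} GB")
```

Real deployments will land somewhat above these figures, since group-wise formats store extra per-group parameters.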

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager, Founder, Data Scientist

Summary TLDR

LeanQuant changes how post-training quantization chooses grid points: instead of min–max affine grids, it learns grids that prioritize weights with large loss sensitivity (inverse Hessian outliers). The method works for both affine and non-uniform formats, reduces quantization-induced loss error, and scales to very large models (quantized Llama-3.1-405B on two 48GB GPUs in ~21 hours). It is compatible with standard inference kernels and comes with a fused GPU kernel to make grid learning practical.

Problem Statement

Popular iterative loss-error-based quantizers use min–max affine grids that miss critical outlier weights (large inverse-Hessian diagonals). The result is high loss error and degraded quality. Many more accurate alternatives need custom data formats or large hardware, which limits compatibility and scalability for very large LLMs.
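To make the baseline concrete, here is a minimal NumPy sketch of a min–max affine grid. It is an illustration under stated assumptions, not the paper's code: the function names and the `sensitivity` array (a stand-in for whatever per-weight importance is derived from the inverse Hessian) are hypothetical.

```python
import numpy as np


def min_max_affine_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Round w onto a uniform grid spanning [w.min(), w.max()], then dequantize."""
    levels = 2 ** bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    # Guard against a constant tensor, where the range collapses to zero.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = np.clip(np.round((w - w_min) / scale), 0, levels)
    return codes * scale + w_min


def weighted_sq_error(w, w_hat, sensitivity):
    """Sensitivity-weighted squared error, used here as a proxy for loss error."""
    return float(np.sum(sensitivity * (w - w_hat) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 1.0, size=1024)
    sensitivity = np.ones_like(w)
    sensitivity[:16] = 100.0  # hypothetical: a few highly sensitive weights
    w_hat = min_max_affine_quantize(w, bits=3)
    # The grid depends only on the min and max of w, never on sensitivity.
    print("grid points:", np.unique(w_hat))
    print("weighted error:", weighted_sq_error(w, w_hat, sensitivity))
```

The grid is fixed by the two extreme values alone, so changing `sensitivity` changes the measured error but never the grid. That blindness is the gap the problem statement describes.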

Main Contribution

Identify that min–max affine grids poorly preserve precision for inverse-Hessian outliers and cause high loss error during iterative quantization.

Introduce loss-error-aware grid learning that weights parameters by inverse-Hessian diagonals to place grid points where they reduce task loss most.
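One way to picture loss-error-aware grid learning is as a weighted 1-D k-means, where each weight pulls grid points toward itself in proportion to its sensitivity. The sketch below assumes that framing; this summary does not spell out the authors' exact procedure, and `learn_weighted_grid`, `quantize_to_grid`, and the `sensitivity` input are hypothetical names.

```python
import numpy as np


def quantize_to_grid(w: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Map each weight to its nearest grid point."""
    nearest = np.argmin(np.abs(w[:, None] - grid[None, :]), axis=1)
    return grid[nearest]


def learn_weighted_grid(w, sensitivity, bits, iters=25):
    """Weighted 1-D k-means (Lloyd's algorithm) over the weights.

    Assumption: sensitivity is strictly positive and larger values mean the
    weight matters more to the loss.
    """
    k = 2 ** bits
    # Start from a uniform grid over the weight range (the min-max layout).
    grid = np.linspace(w.min(), w.max(), k)
    for _ in range(iters):
        assign = np.argmin(np.abs(w[:, None] - grid[None, :]), axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                # Centroid = sensitivity-weighted mean of the assigned weights.
                grid[j] = np.average(w[mask], weights=sensitivity[mask])
    return np.sort(grid)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=512)
    s = rng.uniform(0.1, 1.0, size=512)
    s[:8] = 100.0  # hypothetical sensitivity outliers
    uniform = np.linspace(w.min(), w.max(), 8)
    learned = learn_weighted_grid(w, s, bits=3)
    for name, grid in (("uniform", uniform), ("learned", learned)):
        err = np.sum(s * (w - quantize_to_grid(w, grid)) ** 2)
        print(f"{name} grid weighted error: {err:.3f}")
```

Because each Lloyd step can only lower the weighted error, a grid started from the uniform layout ends at or below the uniform grid's weighted error on the same data.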

Key Findings

LeanQuant reduces task loss error and improves accuracy compared to GPTQ and other baselines in low-bit regimes.

Numbers: 3-bit Llama-3-8B avg zero-shot accuracy +18.38% vs GPTQ; +17.18% vs OmniQuant (Table 1)

Practical Use: Use LeanQuant for 3-bit quantization to recover significantly more accuracy than GPTQ/OmniQuant on evaluated benchmarks.

Evidence Ref: Table 1, Sec. 4.1

LeanQuant scales to very large LLMs with practical GPU resources and time.

Numbers: Quantized Llama-3.1-405B in ~21 hours using 2× Quadro RTX 8000-48GB GPUs (Sec. 1, Table 3)

Practical Use: You can quantize models up to 405B on commodity 48GB-GPU setups for deployment without requiring clusters of 80GB GPUs.

Evidence Ref: Sec. 1, Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	LeanQuant aff +18.38% vs GPTQ; +17.18% vs OmniQuant	GPTQ / OmniQuant	+18.38% / +17.18%	11 tasks in Table 1	Sec. 4.1, Table 1	Table 1
Perplexity (4-bit avg across models)	LeanQuant aff avg 5.904 (lower is better)	GPTQ avg 6.008	-0.104 (avg perplexity)	WikiText2 & C4 (Table 7)	Table 7, Sec. 4.1	Table 7

What To Try In 7 Days

Run LeanQuant aff on a 4-bit copy of your model and compare zero-shot accuracy and perplexity to GPTQ.

If using non-uniform quantization, try LeanQuant nu and measure decoding latency with your inference stack.

Adopt the fused GPU kernel or use the provided implementation to cut quantization time from days to hours on 48GB GPUs.
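For the perplexity comparison in the first step, the metric itself is easy to compute once per-token negative log-likelihoods are available from an evaluation run. The helper below is a minimal sketch: `perplexity` and `relative_change` are hypothetical names, and collecting the log-likelihoods from your model is left to your own evaluation stack.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_change(candidate: float, baseline: float) -> float:
    """Signed relative difference; negative means the candidate is lower."""
    return (candidate - baseline) / baseline


if __name__ == "__main__":
    # Averages reported in the Results table above (4-bit, lower is better).
    leanquant_ppl, gptq_ppl = 5.904, 6.008
    print(f"absolute delta: {leanquant_ppl - gptq_ppl:+.3f}")
    print(f"relative delta: {relative_change(leanquant_ppl, gptq_ppl):+.2%}")
```

On the reported 4-bit averages this works out to roughly a 1.7% lower perplexity than GPTQ, which is a useful yardstick for what a meaningful gap looks like on your own model.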

Optimization Features

Token Efficiency

reduced memory reads lower decoding latency

Infra Optimization

scales to 123B on a single 48GB GPU and 405B on 2×48GB

reduces need for 80GB GPU clusters

Model Optimization

quantization

loss-error-aware grid learning

group-wise and row-wise quantization

System Optimization

works with standard affine and non-uniform formats

enumerative affine search for robust S, Z selection
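An enumerative search over affine parameters can be sketched as trying many candidate (scale, zero-point) pairs and keeping the one with the lowest sensitivity-weighted error. This summary does not describe the actual candidate set, so the one below (shrinking the min–max range from either end) is purely an assumption, as are the names `affine_dequant`, `search_scale_zero`, and `sensitivity`.

```python
import numpy as np


def affine_dequant(w, scale, zero, bits):
    """Quantize with an affine (scale, zero-point) grid and dequantize."""
    codes = np.clip(np.round(w / scale) + zero, 0, 2 ** bits - 1)
    return (codes - zero) * scale


def search_scale_zero(w, sensitivity, bits, steps=16, max_shrink=0.4):
    """Enumerate clipped ranges; return (scale, zero, weighted_error) of the best.

    Assumption: candidates shrink the min-max range by up to max_shrink of its
    span at each end. The first candidate is the plain min-max grid.
    """
    levels = 2 ** bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    span = w_max - w_min
    if span == 0.0:
        raise ValueError("constant tensor: no range to search")
    best = (np.inf, None, None)
    for a in np.linspace(0.0, max_shrink, steps):
        for b in np.linspace(0.0, max_shrink, steps):
            lo, hi = w_min + a * span, w_max - b * span
            scale = (hi - lo) / levels
            zero = float(np.round(-lo / scale))
            w_hat = affine_dequant(w, scale, zero, bits)
            err = float(np.sum(sensitivity * (w - w_hat) ** 2))
            if err < best[0]:
                best = (err, scale, zero)
    err, scale, zero = best
    return scale, zero, err


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w = rng.normal(size=256)
    s = rng.uniform(0.1, 1.0, size=256)
    scale, zero, err = search_scale_zero(w, s, bits=3)
    print(f"scale={scale:.4f} zero={zero:.0f} weighted error={err:.4f}")
```

Since the plain min–max grid is always among the candidates, the selected pair is never worse than min–max under the same error measure. The result stays in a standard affine format, which is what keeps it compatible with existing kernels.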

Inference Optimization

compatibility with affine and non-uniform kernels

fused GPU kernel for grid search

dedicated CUDA kernel for non-uniform inference

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/LeanModels/LeanQuant

Data URLs

https://www.tensorflow.org/datasets/catalog/c4

https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/

Risks & Boundaries

Limitations

2-bit quantization remains challenging and sometimes yields high perplexity or unstable results (Table 7, Table 14).

Non-uniform formats need lookup-style kernels for efficient inference; extra kernel work is required for deployment (Sec. 3.2.1, J).

When Not To Use

When you require extreme 1–2 bit quantization with guaranteed stability; LeanQuant shows mixed 2-bit results.

When you cannot run a small calibration pass; the method needs calibration data to compute inverse-Hessian diagonals.

Failure Modes

Noisy or unrepresentative calibration inputs produce poor Hessian diagonal estimates and suboptimal grids.

Very low bit widths (2-bit) can still cause large loss increases and high perplexity.
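The first failure mode is easy to reproduce in isolation. The sketch below estimates inverse-Hessian diagonals for one linear layer from calibration activations, using a damped layer-wise proxy Hessian in the style commonly used by GPTQ-like methods. Treat the formula, the damping constant, and the name `inverse_hessian_diag` as assumptions of this sketch rather than details confirmed by the summary.

```python
import numpy as np


def inverse_hessian_diag(x: np.ndarray, damp: float = 0.01) -> np.ndarray:
    """Estimate diag(H^-1) from calibration activations.

    x has shape (n_samples, in_features). Assumption: H = 2 * X^T X / n is the
    layer-wise proxy Hessian, and damping proportional to the mean diagonal
    keeps it invertible when samples are scarce.
    """
    n, d = x.shape
    h = 2.0 * (x.T @ x) / n
    h += damp * np.mean(np.diag(h)) * np.eye(d)
    return np.diag(np.linalg.inv(h))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 32
    few = inverse_hessian_diag(rng.normal(size=(8, d)))      # fewer samples than features
    many = inverse_hessian_diag(rng.normal(size=(4096, d)))  # well-conditioned estimate
    # With unit-variance features the undamped diagonal should be near 0.5.
    print("mean abs deviation from 0.5, 8 samples:   ", np.abs(few - 0.5).mean())
    print("mean abs deviation from 0.5, 4096 samples:", np.abs(many - 0.5).mean())
```

With fewer samples than input features the proxy Hessian is rank-deficient, so the damping term dominates the inverse and the diagonals say little about the data. Grids learned from such estimates inherit that noise.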

Core Entities

Models

Llama-3.1-405B, Llama-3-8B, Llama-2-7B, Llama-2-13B, Llama-3-70B, Mistral-Large-123B, Mistral-7B, LLaMA-7B, LLaMA-13B, BERT (experiments)

Metrics

Accuracy, perplexity, F1 (SQuAD), sum of loss errors (ϵ), GPU memory (peak), quantization time

Datasets

C4, WikiText2, SQuAD, LM-evaluation-harness tasks (ARC, LAMBADA, MMLU, HellaSwag, PIQA, WinoGrande), MT-Bench (judged by GPT-4o)

Benchmarks

ARC, LAMBADA, MMLU, HellaSwag, PIQA, WinoGrande, MT-Bench
