Overview
The method is practical: it reuses standard quantization formats, includes a fast fused GPU kernel, and reports reproducible gains across many LLMs and standard benchmarks. The remaining risks are 2-bit stability and the need to implement inference kernels for non-uniform formats.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.88
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
LeanQuant lets teams compress state-of-the-art LLMs to 2–4 bits with less accuracy loss, running the quantization itself on common 48GB GPUs. This cuts model memory and inference cost while remaining compatible with standard kernels.
Summary TLDR
LeanQuant changes how post-training quantization chooses grid points: instead of min–max affine grids, it learns grids that prioritize weights with large loss sensitivity (inverse-Hessian outliers). The method works for both affine and non-uniform formats, reduces quantization-induced loss error, and scales to very large models (Llama-3.1-405B was quantized on two 48GB GPUs in about 21 hours). It is compatible with standard inference kernels and comes with a fused GPU kernel that makes grid learning practical.
Problem Statement
Popular iterative loss-error quantizers use min–max affine grids, which miss critical outlier weights (those flagged by large inverse-Hessian diagonals). This causes high loss error and a drop in quality. Many more accurate alternatives need custom data formats or large hardware, which limits compatibility and scalability for very large LLMs.
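To make the min–max affine grid concrete, here is a minimal NumPy sketch. The weights are synthetic and the function names (`minmax_affine_grid`, `quantize_to_grid`) are hypothetical, not taken from the LeanQuant code. It shows that the grid is fixed by the value range alone, so every weight gets the same rounding-error bound of half a step, however sensitive the loss is to it.

```python
import numpy as np

def minmax_affine_grid(w: np.ndarray, bits: int) -> np.ndarray:
    """Return 2**bits evenly spaced grid points spanning [min(w), max(w)]."""
    return np.linspace(w.min(), w.max(), 2 ** bits)

def quantize_to_grid(w: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Round every weight to its nearest grid point."""
    idx = np.abs(w[:, None] - grid[None, :]).argmin(axis=1)
    return grid[idx]

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1024)      # synthetic weights (assumption)

grid = minmax_affine_grid(w, bits=3)
step = grid[1] - grid[0]                  # uniform spacing, set only by min and max
w_q = quantize_to_grid(w, grid)

print("grid points:", grid.size)
print("step size:", step)
print("max abs rounding error:", np.abs(w - w_q).max())
```

Nothing in this construction looks at loss sensitivity, which is the gap the summary attributes to min–max grids.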
Main Contribution
Identify that min–max affine grids poorly preserve precision for inverse-Hessian outliers and cause high loss error during iterative quantization.
Introduce loss-error-aware grid learning that weights parameters by inverse-Hessian diagonals to place grid points where they reduce task loss most.
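The idea of placing grid points where they reduce loss most can be sketched as a sensitivity-weighted 1-D k-means. This is an illustration under stated assumptions, not the paper's implementation: the `sensitivity` vector is synthetic and stands in for whatever per-weight importance is derived from the inverse-Hessian diagonals, and `learn_grid` is a hypothetical helper name. The sketch starts from the min–max grid and applies Lloyd-style updates, which never increase the weighted squared error.

```python
import numpy as np

def nearest(w: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Index of the closest grid point for every weight."""
    return np.abs(w[:, None] - grid[None, :]).argmin(axis=1)

def weighted_error(w: np.ndarray, grid: np.ndarray, s: np.ndarray) -> float:
    """Sensitivity-weighted squared rounding error."""
    q = grid[nearest(w, grid)]
    return float(np.sum(s * (w - q) ** 2))

def learn_grid(w: np.ndarray, s: np.ndarray, bits: int, iters: int = 25) -> np.ndarray:
    """Weighted Lloyd iterations, initialised from the min-max grid."""
    grid = np.linspace(w.min(), w.max(), 2 ** bits)
    for _ in range(iters):
        idx = nearest(w, grid)
        for k in range(grid.size):
            mask = idx == k
            if mask.any():  # leave empty clusters where they are
                grid[k] = np.sum(s[mask] * w[mask]) / np.sum(s[mask])
    return grid

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=2048)          # synthetic weights (assumption)
sensitivity = rng.lognormal(size=w.size)      # synthetic, strictly positive (assumption)

minmax_grid = np.linspace(w.min(), w.max(), 2 ** 3)
learned_grid = learn_grid(w, sensitivity, bits=3)

err_minmax = weighted_error(w, minmax_grid, sensitivity)
err_learned = weighted_error(w, learned_grid, sensitivity)
print("weighted error, min-max grid:", err_minmax)
print("weighted error, learned grid:", err_learned)
```

Because the learned grid is still a small table of values, the same routine covers the non-uniform case directly. An affine variant would additionally constrain the points to be evenly spaced, which this sketch does not do.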
Key Findings
LeanQuant reduces task loss error and improves accuracy compared to GPTQ and other baselines in low-bit regimes.
LeanQuant scales to very large LLMs with practical GPU resources and time: Llama-3.1-405B was quantized on two 48GB GPUs in about 21 hours.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | LeanQuant aff +18.38% vs GPTQ; +17.18% vs OmniQuant | GPTQ / OmniQuant | +18.38% / +17.18% | 11 tasks in Table 1 | Sec. 4.1, Table 1 | Table 1 |
| Perplexity (4-bit avg across models) | LeanQuant aff avg 5.904 (lower is better) | GPTQ avg 6.008 | -0.104 (avg perplexity) | WikiText2 & C4 (Table 7) | Table 7, Sec. 4.1 | Table 7 |
What To Try In 7 Days
Quantize your model to 4 bits with LeanQuant aff and compare zero-shot accuracy and perplexity against a GPTQ-quantized copy.
If using non-uniform quantization, try LeanQuant nu and measure decoding latency with your inference stack.
Adopt the fused GPU kernel or use the provided implementation to cut quantization time from days to hours on 48GB GPUs.
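For the perplexity comparison in the first step, a small helper like the following may be enough. It is a sketch that assumes you already have per-token natural-log probabilities from your own evaluation harness. The names `perplexity` and `delta` are hypothetical, and the two averages in the usage lines are the ones reported in the Results table.

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), natural log assumed."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def delta(candidate: float, baseline: float) -> float:
    """Negative values mean the candidate has lower (better) perplexity."""
    return candidate - baseline

# Sanity check: a model that assigns probability 0.25 to every token
# has perplexity 4.
uniform_log_probs = [math.log(0.25)] * 4
print(perplexity(uniform_log_probs))

# Averages from the Results table (LeanQuant aff vs GPTQ, 4-bit).
print(round(delta(5.904, 6.008), 3))
```

Run both quantized models on the same evaluation text and tokenization, otherwise the delta is not comparable.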
Risks & Boundaries
Limitations
2-bit quantization remains challenging and sometimes yields high perplexity or unstable results (Table 7, Table 14).
Non-uniform formats need lookup-style kernels for efficient inference; extra kernel work is required for deployment (Sec. 3.2.1, J).
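The difference between the two dequantization paths can be illustrated in NumPy. This is a conceptual sketch only: the grid values, codes, and function names are made up for illustration, and a real deployment would do this inside a GPU kernel. An affine format reconstructs a weight with one multiply and one subtract, while a non-uniform format has to index into a per-group table of grid values.

```python
import numpy as np

# Hypothetical 3-bit non-uniform grid (8 entries) and some stored codes.
grid = np.array([-0.31, -0.12, -0.04, 0.0, 0.03, 0.09, 0.2, 0.45], dtype=np.float32)
codes = np.array([[0, 3, 7], [5, 5, 1]], dtype=np.uint8)

def dequantize_lookup(codes: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Non-uniform format: each code is an index into the grid table."""
    return grid[codes]

def dequantize_affine(codes: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Affine format: weights are recovered arithmetically, no table needed."""
    return scale * (codes.astype(np.float32) - zero_point)

print(dequantize_lookup(codes, grid))
print(dequantize_affine(codes, scale=0.1, zero_point=3))
```

The table lookup is the part that standard affine kernels do not provide, which is why the non-uniform variant needs extra kernel work.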
When Not To Use
When you require extreme 1–2 bit quantization with guaranteed stability; LeanQuant shows mixed 2-bit results.
When you cannot run a small calibration pass; the method needs calibration data to compute the inverse-Hessian diagonals.
Failure Modes
Noisy or unrepresentative calibration inputs produce poor Hessian diagonal estimates and suboptimal grids.
Very low bit widths (2-bit) can still cause large loss increases and high perplexity.
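The calibration sensitivity in the first failure mode can be probed with a sketch like the one below. It assumes the layer-wise proxy Hessian commonly used by GPTQ-style quantizers, proportional to XᵀX over the calibration inputs of one linear layer plus a small damping term. The data is synthetic and `inv_hessian_diag` is a hypothetical helper, so treat it as a way to reason about sample size rather than as the paper's procedure.

```python
import numpy as np

def inv_hessian_diag(X: np.ndarray, damp: float = 0.01) -> np.ndarray:
    """Diagonal of the inverse proxy Hessian for one linear layer.

    X has shape (n_samples, d_in). Dividing by n_samples only rescales
    the result; the damping term keeps the matrix invertible.
    """
    H = 2.0 * X.T @ X / X.shape[0]
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    return np.diag(np.linalg.inv(H))

rng = np.random.default_rng(0)
d_in = 16
feature_scales = np.linspace(0.5, 2.0, d_in)               # synthetic (assumption)
X_full = rng.normal(size=(4096, d_in)) * feature_scales    # large calibration set
X_small = X_full[:32]                                      # tiny calibration set

ref = inv_hessian_diag(X_full)
est = inv_hessian_diag(X_small)
rel_dev = np.abs(est - ref) / ref
print("max relative deviation with 32 samples:", rel_dev.max())
```

If the deviation is large for your layers, the learned grid is being fitted to noisy importance estimates, so more or better-matched calibration data is worth trying first.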

