Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
LeanQuant lets teams compress state-of-the-art LLMs to 2–4 bits with less accuracy loss than prior methods, using commonly available 48GB GPUs. This cuts model memory and inference cost while staying compatible with standard inference kernels.
Summary TLDR
LeanQuant changes how post-training quantization chooses grid points: instead of min–max affine grids, it learns grids that prioritize weights with large loss sensitivity (inverse-Hessian outliers). The method works for both affine and non-uniform formats, reduces quantization-induced loss error, and scales to very large models (Llama-3.1-405B was quantized on two 48GB GPUs in ~21 hours). It is compatible with standard inference kernels and comes with a fused GPU kernel that makes grid learning practical.
Problem Statement
Popular iterative loss-error quantizers use min–max affine grids that miss critical outlier weights (large inverse-Hessian diagonals). This causes high loss error and quality drop, and many accurate alternatives need custom data formats or large hardware, limiting compatibility and scalability for very large LLMs.
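The problem above can be illustrated with a minimal NumPy sketch. It is not LeanQuant or GPTQ code: the weight values and the `sensitivity` vector are made up for illustration, and `minmax_affine_quantize` is a hypothetical helper. It only shows that a min–max grid is fixed by the two extreme weights, so a small but loss-sensitive weight can land far from any grid point.

```python
import numpy as np

def minmax_affine_quantize(w, bits):
    """Round-to-nearest onto a uniform grid spanning [w.min(), w.max()]."""
    levels = 2 ** bits - 1
    zero = w.min()
    scale = (w.max() - zero) / levels
    q = np.clip(np.round((w - zero) / scale), 0, levels)
    return q * scale + zero

# Toy weight row: two extreme values stretch the grid over [-0.9, 1.0].
w = np.array([-0.9, -0.05, -0.02, 0.0, 0.01, 0.03, 0.04, 1.0])
# Hypothetical per-weight loss sensitivities; index 4 is marked as critical.
sensitivity = np.array([1.0, 1.0, 1.0, 1.0, 50.0, 1.0, 1.0, 1.0])

w_hat = minmax_affine_quantize(w, bits=2)
weighted_err = sensitivity * (w - w_hat) ** 2

print("grid-snapped weights:", w_hat)
print("index of largest weighted error:", int(np.argmax(weighted_err)))
```

The grid never looks at `sensitivity`, which is the gap that loss-error-aware grid learning is meant to close.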
Main Contribution
Identify that min–max affine grids poorly preserve precision for inverse-Hessian outliers and cause high loss error during iterative quantization.
Introduce loss-error-aware grid learning that weights parameters by inverse-Hessian diagonals to place grid points where they reduce task loss most.
Provide algorithms for both non-uniform (weighted k-means with uniform initialization) and affine (enumerative search) loss-error-aware grids plus a fused GPU kernel to make grid search fast.
Demonstrate accuracy and scalability: outperforms or matches strong baselines across 2–4 bits and quantizes Llama-3.1-405B using two 48GB GPUs.
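As a rough sketch of the non-uniform case named above (weighted k-means with uniform initialization), the following NumPy code learns a 1-D grid for one group of weights. The `importance` argument stands in for whatever weighting is derived from the inverse-Hessian diagonals; this summary does not give the exact formula, so the values here are random placeholders, and the function names and iteration count are assumptions rather than the authors' implementation.

```python
import numpy as np

def quantize_to_grid(w, grid):
    """Snap each weight to its nearest grid point."""
    return grid[np.abs(w[:, None] - grid[None, :]).argmin(axis=1)]

def weighted_kmeans_grid(w, importance, bits, iters=20):
    """Learn 2**bits grid points by importance-weighted 1-D k-means."""
    k = 2 ** bits
    # Uniform initialization: start from the min-max grid.
    centers = np.linspace(w.min(), w.max(), k)
    for _ in range(iters):
        assign = np.abs(w[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():  # empty clusters keep their previous center
                centers[j] = np.average(w[mask], weights=importance[mask])
    return centers

rng = np.random.default_rng(0)
w = rng.normal(size=256)
importance = rng.uniform(0.1, 10.0, size=256)  # placeholder sensitivities

grid = weighted_kmeans_grid(w, importance, bits=3)
uniform = np.linspace(w.min(), w.max(), 8)

def weighted_error(g):
    return float(np.sum(importance * (w - quantize_to_grid(w, g)) ** 2))

print("uniform grid error:", weighted_error(uniform))
print("learned grid error:", weighted_error(grid))
```

Because each k-means step cannot increase the weighted squared error, the learned grid is never worse than its uniform starting point under this objective.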
Key Findings
LeanQuant reduces task loss error and improves accuracy compared to GPTQ and other baselines in low-bit regimes.
LeanQuant scales to very large LLMs with practical GPU resources and time.
A fused GPU kernel accelerates LeanQuant's affine-grid learning massively.
LeanQuant achieves lower or comparable perplexity to strong baselines at the same bit widths.
Results
Accuracy
Perplexity (4-bit avg across models)
Quantization time (Llama-3.1-405B)
GPU memory peak (4-bit)
Fused kernel speedup
Who Should Care
What To Try In 7 Days
Quantize your model to 4 bits with LeanQuant_aff (the affine variant) and compare its zero-shot accuracy and perplexity against a GPTQ-quantized copy.
If using non-uniform quantization, try LeanQuant_nu (the non-uniform variant) and measure decoding latency with your inference stack.
Adopt the fused GPU kernel or use the provided implementation to cut quantization time from days to hours on 48GB GPUs.
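For the perplexity comparison suggested above, a small helper like the following may be enough. It assumes you already have per-token negative log-likelihoods (natural log) from your own evaluation loop; the function names are hypothetical and the numbers in the example are placeholders, not results from the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_degradation(ppl_quantized, ppl_baseline):
    """Fractional perplexity increase of a quantized model over a baseline."""
    return (ppl_quantized - ppl_baseline) / ppl_baseline

# Placeholder NLLs standing in for two quantized copies of the same model.
nlls_method_a = [2.10, 1.95, 2.30, 2.05]
nlls_method_b = [2.20, 2.00, 2.35, 2.15]

ppl_a = perplexity(nlls_method_a)
ppl_b = perplexity(nlls_method_b)
print("method A perplexity:", ppl_a)
print("method B perplexity:", ppl_b)
print("B relative to A:", relative_degradation(ppl_b, ppl_a))
```

Evaluate both quantized copies on the same tokenized text so the token counts match.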
Optimization Features
Token Efficiency
- reduced memory reads lower decoding latency
Infra Optimization
- scales to 123B on a single 48GB GPU and 405B on 2×48GB
- reduces need for 80GB GPU clusters
Model Optimization
- quantization
- loss-error-aware grid learning
- group-wise and row-wise quantization
System Optimization
- works with standard affine and non-uniform formats
- enumerative affine search for robust selection of the scale S and zero-point Z
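One way to picture an enumerative affine search is the sketch below: it tries a small set of candidate clipping ranges, derives a scale and zero-point for each, and keeps the pair with the lowest importance-weighted error. The candidate set (shrinking each end of the min–max range in `steps` increments), the integer zero-point, and the `importance` values are all assumptions made for illustration; the search space LeanQuant actually enumerates is not described in this summary.

```python
import numpy as np

def affine_quantize(w, scale, zero, bits):
    """Quantize to integers in [0, 2**bits - 1], then dequantize."""
    levels = 2 ** bits - 1
    q = np.clip(np.round(w / scale) + zero, 0, levels)
    return (q - zero) * scale

def candidate(lo, hi, bits):
    """Scale and integer zero-point for a grid covering roughly [lo, hi]."""
    scale = (hi - lo) / (2 ** bits - 1)
    zero = np.round(-lo / scale)
    return scale, zero

def search_affine_grid(w, importance, bits, steps=16):
    """Enumerate clipped ranges; return (error, scale, zero) of the best."""
    lo0, hi0 = float(w.min()), float(w.max())
    span = hi0 - lo0  # assumes w is not constant
    best = None
    for i in range(steps):
        for j in range(steps):
            lo = lo0 + 0.5 * span * i / steps
            hi = hi0 - 0.5 * span * j / steps
            scale, zero = candidate(lo, hi, bits)
            w_hat = affine_quantize(w, scale, zero, bits)
            err = float(np.sum(importance * (w - w_hat) ** 2))
            if best is None or err < best[0]:
                best = (err, scale, zero)
    return best

rng = np.random.default_rng(0)
w = rng.normal(size=128)
importance = rng.uniform(0.1, 10.0, size=128)  # placeholder sensitivities
err, scale, zero = search_affine_grid(w, importance, bits=3)
print("best weighted error:", err, "scale:", scale, "zero-point:", zero)
```

Since the plain min–max range is one of the candidates (`i = j = 0`), the search result is never worse than min–max under this weighted objective. The nested loop is also the kind of embarrassingly parallel work that a fused GPU kernel would be expected to speed up.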
Inference Optimization
- compatibility with affine and non-uniform kernels
- fused GPU kernel for grid search
- dedicated CUDA kernel for non-uniform inference
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- 2-bit quantization remains challenging and sometimes yields high perplexity or unstable results (Table 7, Table 14).
- Non-uniform formats need lookup-style kernels for efficient inference; extra kernel work is required for deployment (Sec. 3.2.1, J).
- Method relies on Hessian-diagonal estimates from a calibration set; small calibration sets could lead to noisy estimates.
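The calibration dependence can be sketched as follows, assuming the layer-wise Hessian used by GPTQ-style quantizers, H = 2XXᵀ (averaged over samples here), where X holds the calibration inputs to one linear layer. The damping constant, the synthetic data, and the function name are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def inverse_hessian_diagonal(X, damp=0.01):
    """Diagonal of H^-1 for H = (2/n) X X^T plus a small damping term.

    X has shape (in_features, n_samples).
    """
    n = X.shape[1]
    H = (2.0 / n) * (X @ X.T)
    # Damping keeps H well-conditioned when samples are scarce.
    H = H + damp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    return np.diag(np.linalg.inv(H)).copy()

rng = np.random.default_rng(0)
d = 16
# Synthetic activations with different per-feature magnitudes.
feature_scale = np.linspace(0.5, 2.0, d)[:, None]
X_large = feature_scale * rng.normal(size=(d, 4096))
X_small = X_large[:, :32]  # a much smaller calibration set

ref = inverse_hessian_diagonal(X_large)
est = inverse_hessian_diagonal(X_small)
rel_dev = np.abs(est - ref) / ref
print("max relative deviation with 32 samples:", float(rel_dev.max()))
```

Comparing the two estimates on your own activations gives a quick sense of whether the calibration set is large enough for stable sensitivities.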
When Not To Use
- When you require extreme 1–2 bit quantization with guaranteed stability; LeanQuant shows mixed 2-bit results.
- If you cannot add small calibration runs (they compute inverse-Hessian diagonals).
- When your inference stack cannot support affine or supported non-uniform lookup formats and you cannot add kernels.
Failure Modes
- Noisy or unrepresentative calibration inputs produce poor Hessian diagonal estimates and suboptimal grids.
- Very low bit widths (2-bit) can still cause large loss increases and high perplexity.
- Lack of optimized inference kernels for chosen non-uniform formats can negate runtime benefits.
Core Entities
Models
- Llama-3.1-405B
- Llama-3-8B
- Llama-2-7B
- Llama-2-13B
- Llama-3-70B
- Mistral-Large-123B
- Mistral-7B
- LLaMA-7B
- LLaMA-13B
- BERT (experiments)
Metrics
- Accuracy
- perplexity
- F1 (SQuAD)
- sum of loss errors (ϵ)
- GPU memory (peak)
- quantization time
Datasets
- C4
- WikiText2
- SQuAD
- LM-evaluation-harness tasks (ARC, LAMBADA, MMLU, HellaSwag, PIQA, WinoGrande)
- MT-Bench (judged by GPT-4o)
Benchmarks
- ARC
- LAMBADA
- MMLU
- HellaSwag
- PIQA
- WinoGrande
- MT-Bench

