Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
2
Why It Matters For Business
FineQuant reduces LLM memory and inference cost with little accuracy loss. That lowers GPU counts and raises throughput, directly cutting serving cost for large models on A100-class nodes.
Summary TLDR
FineQuant is a simple post-training, weight-only quantization approach paired with custom GPU kernels. It adaptively chooses quantization group sizes per weight matrix to avoid accuracy collapse from outliers, supports int8/int4/int3 weights with fp16/bf16 activations, and ships fused dequantize+GEMM kernels. On large models (OPT-175B) it reduces memory enough to run the model on 2 GPUs and raises throughput by up to ~3.65× on A100 nodes, with a small impact on accuracy on the evaluated tasks.
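The 2-GPU figure can be sanity-checked with rough arithmetic. The sketch below is an estimate, not a number taken from the paper. It assumes a nominal 175e9 parameters, one fp16 scale per 64-weight block, and the 80 GB A100 variant, and it ignores activations, KV cache, and any layers left unquantized. `weight_footprint_gb` is a hypothetical helper written for this illustration.

```python
def weight_footprint_gb(n_params, weight_bits, group_size=None, scale_bits=16):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    weight_bytes = n_params * weight_bits / 8
    # One scale per group of `group_size` weights; none for unquantized weights.
    scale_bytes = 0.0
    if group_size is not None:
        scale_bytes = (n_params / group_size) * scale_bits / 8
    return (weight_bytes + scale_bytes) / 1e9


N_PARAMS = 175e9      # assumption: nominal OPT-175B parameter count
GPU_MEMORY_GB = 80    # assumption: 80 GB A100 variant

fp16_gb = weight_footprint_gb(N_PARAMS, 16)
int4_gb = weight_footprint_gb(N_PARAMS, 4, group_size=64)

print(f"fp16 weights: {fp16_gb:.1f} GB")
print(f"int4 weights (block 64, fp16 scales): {int4_gb:.1f} GB")
print(f"weights fit in 2 x {GPU_MEMORY_GB} GB: {int4_gb < 2 * GPU_MEMORY_GB}")
```

Under these assumptions the fp16 weights need about 350 GB and the int4 weights about 93 GB, so the weights alone fit within two 80 GB devices, with the remainder available for activations and KV cache.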
Problem Statement
LLMs are memory-bound at inference: reading weights dominates memory bandwidth during auto-regressive decoding. Existing quantization methods either need costly calibration or training, or they lose accuracy. We need a simple, scalable weight-only quantization that keeps quality, reduces memory, and speeds up real GPU inference without extra training.
Main Contribution
Comprehensive analysis of low-bit, weight-only quantization behaviors in LLMs, including failure modes from outliers.
Adaptive fine-grained quantization: a heuristic to pick group sizes per matrix to avoid catastrophic accuracy drops.
Efficient fused GPU kernels (fused dequantize + GEMM) supporting fp16/bf16 activations with int8/int4 weights and block-wise scales.
Demonstration on large models (OPT up to 175B and an internal MoE): large memory reduction and up to 3.65× throughput on A100 nodes, with small accuracy loss on the tested benchmarks.
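To make the idea of per-matrix group selection concrete, here is a minimal sketch. It is not the paper's heuristic, which this summary does not spell out. It assumes symmetric round-to-nearest int4 quantization and a hypothetical rule: pick the coarsest candidate group size whose worst-column relative reconstruction error stays under a tolerance. All function names, the tolerance, and the toy data are illustrative assumptions.

```python
def quantize_group(values, bits=4):
    """Symmetric round-to-nearest quantization of one group; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for int4
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0] * len(values)
    scale = amax / qmax
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def relative_error(column, group_size, bits=4):
    """Squared reconstruction error of one column, relative to its energy."""
    err = 0.0
    for start in range(0, len(column), group_size):
        group = column[start:start + group_size]
        dequantized = quantize_group(group, bits)
        err += sum((w - d) ** 2 for w, d in zip(group, dequantized))
    energy = sum(w * w for w in column)
    return err / energy if energy else 0.0


def choose_group_size(columns, candidates, tolerance, bits=4):
    """Hypothetical rule: coarsest group size whose worst column is within tolerance."""
    for group_size in sorted(candidates, reverse=True):
        worst = max(relative_error(col, group_size, bits) for col in columns)
        if worst <= tolerance:
            return group_size
    return min(candidates)                          # fall back to the finest granularity


def make_column(length=256):
    """Deterministic toy weights in {-0.03, ..., 0.03}; illustrative data only."""
    return [0.01 * ((i % 7) - 3) for i in range(length)]


clean_matrix = [make_column() for _ in range(4)]
outlier_matrix = [make_column() for _ in range(4)]
outlier_matrix[0][0] = 1.0                          # a single large-magnitude weight

CANDIDATES = [256, 128, 64, 32]
clean_choice = choose_group_size(clean_matrix, CANDIDATES, tolerance=0.03)
outlier_choice = choose_group_size(outlier_matrix, CANDIDATES, tolerance=0.03)
print("clean matrix group size:", clean_choice)
print("outlier matrix group size:", outlier_choice)
```

The point of the toy data is that one outlier inflates the scale of whatever group contains it, which rounds the group's small weights to zero. Shrinking the group limits how many weights share that inflated scale, so the matrix with the outlier ends up with a finer granularity than the clean one.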
Key Findings
Adaptive fine-grained grouping prevents catastrophic accuracy drops from low-bit quantization.
INT4 block quantization (block=64) yields large memory cuts with small accuracy loss on dense models.
Fused int4/int8 weight × fp16/bf16 activation kernels speed memory-bound GEMMs.
End-to-end, weight-only quantization enables higher throughput and fewer GPUs for large models.
Results
Throughput (generated tokens/sec)
Model footprint (GB)
BLEU change (MoE 5.3B)
Perplexity (OPT-175B)
GEMM speedup (matrix multiply)
What To Try In 7 Days
Run per-column INT8 weight-only quantization on a production model and compare accuracy and memory against the FP16 baseline.
Apply INT4 block quantization (block=64) with adaptive grouping on a single large model and measure decoder latency and throughput.
Replace FP16 GEMMs with the open-source fused dequant+GEMM kernels for memory-bound decoding steps and measure speedup.
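As a starting point for the first experiment, the comparison can be prototyped on a single matrix before touching a real model. The sketch below assumes symmetric per-column int8 quantization with one fp16 scale per column, and uses random toy data in place of real weights. The helper names and the error metric (relative L2 error of the layer output) are choices made for this illustration, not the paper's evaluation protocol.

```python
import math
import random


def quantize_per_column_int8(weight):
    """Symmetric per-column int8 quantization of a K x N matrix (list of rows)."""
    n_rows, n_cols = len(weight), len(weight[0])
    scales = []
    for j in range(n_cols):
        amax = max(abs(weight[i][j]) for i in range(n_rows))
        scales.append(amax / 127 if amax else 1.0)
    codes = [[max(-127, min(127, round(weight[i][j] / scales[j])))
              for j in range(n_cols)] for i in range(n_rows)]
    return codes, scales


def matvec(x, weight):
    """Reference y = x @ W in floating point."""
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(len(weight[0]))]


def matvec_int8(x, codes, scales):
    """y_j = scale_j * sum_i x_i * code_ij (scale applied once per output column)."""
    return [scales[j] * sum(x[i] * codes[i][j] for i in range(len(x)))
            for j in range(len(scales))]


def relative_l2_error(reference, approx):
    num = math.sqrt(sum((r - a) ** 2 for r, a in zip(reference, approx)))
    den = math.sqrt(sum(r * r for r in reference))
    return num / den


rng = random.Random(0)                  # fixed seed so the run is repeatable
K, N = 64, 8                            # toy layer shape
weight = [[rng.uniform(-1.0, 1.0) for _ in range(N)] for _ in range(K)]
x = [rng.uniform(0.0, 1.0) for _ in range(K)]

codes, scales = quantize_per_column_int8(weight)
error = relative_l2_error(matvec(x, weight), matvec_int8(x, codes, scales))
fp16_bytes = K * N * 2
int8_bytes = K * N * 1 + N * 2          # codes plus one fp16 scale per column (assumed)

print(f"relative output error: {error:.5f}")
print(f"weight bytes: fp16={fp16_bytes}, int8={int8_bytes}")
```

On a real model the same comparison would be made with task metrics such as perplexity or BLEU rather than a single layer's output error.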
Agent Features
Memory
- weight-only compression reduces weight HBM traffic
Frameworks
- FasterTransformer
- CUTLASS
Architectures
- dense
- MoE
Optimization Features
Infra Optimization
- serving 4 instances of OPT-175B on an 8-GPU A100 node (with INT4)
Model Optimization
- weight-only post-training quantization
- per-column scaling
- block-wise (group) quantization (block size 64)
- adaptive fine-grained group selection
System Optimization
- CUTLASS-based kernels optimized for A100
- replication strategy to increase per-node throughput
Inference Optimization
- fused on-the-fly dequantize + GEMM GPU kernels
- support for fp16/bf16 activations × int8/int4 weights
- reduce GPU count by lowering weight memory footprint
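The semantics of a fused dequantize + GEMM step can be shown with a CPU reference. This is only a sketch of what such a kernel computes, not the CUTLASS or FasterTransformer implementation. The packing layout (two int4 codes per byte, low nibble first, stored with a +8 offset) and all function names are assumptions made for this example.

```python
import math


def pack_int4(codes):
    """Pack signed int4 codes (-8..7) two per byte, low nibble first (assumed layout)."""
    assert len(codes) % 2 == 0
    return bytes((codes[i] + 8) | ((codes[i + 1] + 8) << 4)
                 for i in range(0, len(codes), 2))


def unpack_int4(packed, index):
    """Read the signed int4 code at position `index` from the packed buffer."""
    byte = packed[index // 2]
    nibble = (byte >> 4) if index % 2 else (byte & 0x0F)
    return nibble - 8


def quantize_column(column, group_size=64):
    """Block-wise symmetric int4 quantization: one scale per `group_size` weights."""
    codes, scales = [], []
    for start in range(0, len(column), group_size):
        group = column[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = amax / 7 if amax else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(v / scale))) for v in group)
    return pack_int4(codes), scales


def fused_dequant_dot(x, packed, scales, group_size=64):
    """Dot product with one packed column; weights are dequantized as they are read,
    so the full-precision weight matrix is never materialized."""
    acc = 0.0
    for k, xk in enumerate(x):
        acc += xk * (unpack_int4(packed, k) * scales[k // group_size])
    return acc


# Toy data: one weight column of length 128 and a matching activation vector.
column = [0.05 * math.sin(0.1 * k) for k in range(128)]
x = [1.0 + 0.01 * k for k in range(128)]

packed, scales = quantize_column(column)
fused = fused_dequant_dot(x, packed, scales)
exact = sum(xk * wk for xk, wk in zip(x, column))
print(f"packed bytes: {len(packed)}, scales: {len(scales)}")
print(f"fused result: {fused:.6f}, full-precision result: {exact:.6f}")
```

Only the packed codes and the per-block scales are read from memory, which is where the bandwidth saving in a memory-bound decoding step comes from. The arithmetic itself still runs in floating point.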
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Optimized GPU kernels currently target block size 64 only.
- Benchmarks run on NVIDIA A100; results may differ on other GPUs.
- Fused kernels dequantize to fp16/bf16 and do not exploit integer compute paths.
- Some models suffer catastrophic collapse with naive per-column INT4; adaptive grouping is required, and its heuristic was tuned in the paper.
When Not To Use
- Compute-bound workloads where activation-side compute dominates.
- Hardware whose tensor-core behavior differs from the A100's, or where integer instructions are preferred.
- When you cannot deploy custom CUDA kernels in your serving environment.
Failure Modes
- Catastrophic accuracy collapse from per-column INT4 when outliers exist (OPT-66B example).
- Quality-sensitive matrices need finer groups; the wrong granularity causes major BLEU loss.
- Compute-bound phases (context creation) can be slower because of dequantization overhead.
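Whether a phase is memory-bound or compute-bound can be estimated with a roofline-style calculation. The sketch below is a rough model, not a measurement. The A100 figures (about 312 TFLOP/s dense fp16 tensor-core peak and about 2.0 TB/s HBM bandwidth for the 80 GB variant) are approximate, the 12288 layer width and 2048-token prompt are illustrative values, and scale storage and cache effects are ignored.

```python
def gemm_arithmetic_intensity(m_tokens, k, n, weight_bits, act_bytes=2):
    """FLOPs per byte moved for an (m x k) @ (k x n) GEMM under a simple traffic model."""
    flops = 2 * m_tokens * k * n
    weight_traffic = k * n * weight_bits / 8
    activation_traffic = act_bytes * m_tokens * (k + n)    # read inputs, write outputs
    return flops / (weight_traffic + activation_traffic)


def is_memory_bound(intensity, peak_flops, mem_bandwidth):
    """Memory-bound when intensity is below the roofline ridge point."""
    return intensity < peak_flops / mem_bandwidth


PEAK_FLOPS = 312e12       # assumption: approximate A100 fp16 tensor-core peak
MEM_BANDWIDTH = 2.0e12    # assumption: approximate A100 80 GB HBM bandwidth
HIDDEN = 12288            # illustrative layer width

decode_fp16 = gemm_arithmetic_intensity(1, HIDDEN, HIDDEN, weight_bits=16)
decode_int4 = gemm_arithmetic_intensity(1, HIDDEN, HIDDEN, weight_bits=4)
prefill_int4 = gemm_arithmetic_intensity(2048, HIDDEN, HIDDEN, weight_bits=4)

for name, value in [("decode fp16", decode_fp16),
                    ("decode int4", decode_int4),
                    ("prefill int4 (2048 tokens)", prefill_int4)]:
    bound = "memory" if is_memory_bound(value, PEAK_FLOPS, MEM_BANDWIDTH) else "compute"
    print(f"{name}: {value:.1f} FLOP/byte, {bound}-bound")
```

In this model a single-token decode step stays far below the ridge point even with int4 weights, so shrinking the weights directly shortens the step. A long prompt pushes the GEMM above the ridge point, where weight traffic no longer limits speed and the extra dequantization work is pure overhead.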
Core Entities
Models
- OPT-175B
- OPT-66B
- OPT-30B
- OPT-13B
- GPT2-XL
- OPT-IML (30B, 175B)
- Internal MoE 5.3B
Metrics
- BLEU
- perplexity
- tokens/sec
- GB model footprint
- ms per decoder step
- GEMM speedup
Datasets
- WMT16 (De-En)
- WMT2016
- lm-evaluation-harness tasks (LAMBADA, HellaSwag, PiQA, WinoGrande, OpenBookQA, RTE, COPA)
- wikitext
Benchmarks
- BLEU
- perplexity
- throughput (tokens/sec)
- GEMM speedup (×)
- avg decoder-step latency (ms)

