Fine-grained, weight-only quantization that cuts model size and boosts LLM throughput up to 3.65×

Overview

Decision Snapshot: Ready For Pilot

The method is post-training and uses open kernels. Results show meaningful memory and throughput gains on A100 GPUs, but the kernels are currently tuned for block size 64 and A100 only, so expect engineering work to generalize to other block sizes and hardware.

Citations: 2

Evidence Strength: 0.75

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

Why It Matters For Business

FineQuant reduces LLM memory and inference cost with little accuracy loss. That lowers GPU counts and raises throughput, directly cutting serving cost for large models on A100-class nodes.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Data Scientist

Summary TLDR

FineQuant is a simple post-training, weight-only quantization approach paired with GPU kernels. It adaptively chooses quantization group sizes per weight matrix to avoid accuracy collapse from outliers, supports int8/int4/int3 weights with fp16/bf16 activations, and ships fused dequantize+GEMM kernels. On large models (OPT-175B) it reduces memory enough to run the model on 2 GPUs and raises throughput up to ~3.65× on A100 nodes, with little impact on accuracy for the evaluated tasks.
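To make the idea of group-wise, weight-only quantization concrete, here is a minimal NumPy sketch. It assumes symmetric quantization with one scale per group of consecutive input rows in each output column; the function names, the (in_features, out_features) layout, and the choice of symmetric scaling are illustrative assumptions, not the paper's kernels or API.

```python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=64):
    """Symmetric weight-only quantization, one scale per (group, column).

    w has shape (in_features, out_features); in_features must be divisible
    by group_size. Returns integer codes (stored as int8) and the scales.
    """
    k, n = w.shape
    assert k % group_size == 0, "in_features must be divisible by group_size"
    qmax = 2 ** (bits - 1) - 1                      # 7 for int4, 127 for int8
    groups = w.reshape(k // group_size, group_size, n)
    absmax = np.abs(groups).max(axis=1, keepdims=True)
    scales = np.where(absmax > 0, absmax / qmax, 1.0)
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q.reshape(k, n), scales.squeeze(1)       # scales: (k // group_size, n)

def dequantize_groupwise(q, scales, group_size=64):
    """Rebuild approximate float weights from codes and per-group scales."""
    k, n = q.shape
    groups = q.reshape(k // group_size, group_size, n).astype(np.float32)
    return (groups * scales[:, None, :]).reshape(k, n)
```

Setting group_size equal to the number of input rows gives plain per-column quantization, so the same sketch covers both ends of the granularity range discussed here.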

Problem Statement

Large LLMs are memory-bound at inference: weights dominate bandwidth during auto-regressive decoding. Existing quantization either needs costly calibration/training or loses accuracy. We need a simple, scalable weight-only quantization that keeps quality, reduces memory, and speeds up real GPU inference without extra training.

Main Contribution

Comprehensive analysis of low-bit, weight-only quantization behaviors in LLMs, including failure modes from outliers.

Adaptive fine-grained quantization: a heuristic to pick group sizes per matrix to avoid catastrophic accuracy drops.

Key Findings

Adaptive fine-grained grouping prevents catastrophic accuracy drops from low-bit quantization.

Numbers: Recovered >94% of the lost BLEU by doubling granularity for four matrices (Section 3.3, Fig. 3)

Practical Use: When quantizing, measure per-matrix range spread and increase group granularity for outlier-affected matrices to avoid quality collapse.

Evidence Ref: Section 3.3, Fig. 3
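One way to act on this finding is sketched below. It is a hypothetical heuristic, not the selection rule from the paper: it starts at per-column granularity and keeps halving the group size while halving still shrinks some group's value range by more than a threshold. The names `group_absmax` and `pick_group_size`, the `shrink_threshold` value, and the floor of 64 (chosen to match the block size the kernels support) are all assumptions.

```python
import numpy as np

def group_absmax(w, group_size):
    """Per-group absolute maximum along the input dimension, shape (k/g, n)."""
    k, n = w.shape
    return np.abs(w.reshape(k // group_size, group_size, n)).max(axis=1)

def pick_group_size(w, min_group=64, shrink_threshold=4.0):
    """Start per-column; halve the group size while a split still shrinks
    some child group's range by more than shrink_threshold."""
    g = w.shape[0]
    while g % 2 == 0 and g // 2 >= min_group:
        parent = group_absmax(w, g)
        child = group_absmax(w, g // 2)
        # Each parent group splits into two consecutive child groups.
        shrink = np.repeat(parent, 2, axis=0) / np.maximum(child, 1e-12)
        if shrink.max() <= shrink_threshold:
            break                                   # finer groups would not help much
        g //= 2
    return g
```

On a matrix without outliers the loop exits immediately and keeps per-column scaling; a single large outlier keeps the ratio high and drives the group size down to the floor.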

INT4 block quantization (block=64) yields large memory cuts with small accuracy loss on dense models.

Numbers: Model size is ~26% of FP16 while the BLEU drop is ~0.1% (Fig. 3b)

Practical Use: Use block-wise INT4 (64) plus adaptive grouping to cut weight memory ≈4× while keeping task quality close to FP16 on evaluated tasks.

Evidence Ref: Fig. 3b and Table 5
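The ~26% figure can be sanity-checked with a short calculation. The sketch assumes each group of 64 weights stores one 16-bit scale and no zero-point; that storage layout is an assumption for illustration, and the exact format of the released kernels may differ.

```python
def bits_per_weight(weight_bits, group_size, scale_bits=16):
    """Effective storage per weight: the code itself plus its share of the scale."""
    return weight_bits + scale_bits / group_size

int4_block64 = bits_per_weight(4, 64)     # 4 + 16/64 = 4.25 bits
ratio_vs_fp16 = int4_block64 / 16         # 0.265625, i.e. about 26.6% of FP16
```

Under these assumptions the ratio lands close to the Table 5 ratio quoted in Results (86.23 GB / 324.16 GB); any remaining gap would come from tensors left unquantized, which this estimate ignores.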

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Throughput (generated tokens/sec)	INT4 (64): up to 91 tokens/sec vs FP16 25 tokens/sec (3.64×)	FP16 throughput per 8-GPU node	3.64×	Table 4, input 128 / output 128	Table 4 (128/128 row)	Table 4
Model footprint (GB)	OPT-IML Max 175B: FP16 324.16GB → INT4 (64) 86.23GB	FP16	~3.76× smaller (≈26%)	Table 5	Table 5 OPT-IML Max 175B sizes	Table 5

What To Try In 7 Days

Run per-column INT8 weight-only quantization on a production model and compare accuracy and memory.

Apply INT4 block quantization (block=64) with adaptive grouping on a single large model and measure decoder latency and throughput.

Replace FP16 GEMMs with the open-source fused dequant+GEMM kernels for memory-bound decoding steps and measure speedup.
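As a starting point for the first experiment, the following sketch quantizes a weight matrix per column to INT8 and compares the layer output and weight bytes against the unquantized version. It uses a random matrix as a stand-in for a real layer, and the unfused `dequant_matmul` is only a numerical reference for what a fused dequantize+GEMM kernel computes; both function names are illustrative.

```python
import numpy as np

def quantize_per_column_int8(w):
    """Symmetric INT8 quantization with one scale per output column."""
    absmax = np.abs(w).max(axis=0, keepdims=True)
    scales = np.where(absmax > 0, absmax / 127.0, 1.0)
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequant_matmul(x, q, scales):
    """Unfused reference for dequantize + GEMM: y = x @ (q * scales)."""
    return x @ (q.astype(np.float32) * scales)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(512, 256)).astype(np.float32)  # stand-in layer
x = rng.normal(0, 1.0, size=(4, 512)).astype(np.float32)     # stand-in activations

q, scales = quantize_per_column_int8(w)
y_ref = x @ w
y_q = dequant_matmul(x, q, scales)

rel_err = float(np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref))
bytes_fp16 = w.size * 2
bytes_int8 = q.size + scales.size * 2     # int8 codes plus fp16 scales
print(f"relative output error: {rel_err:.4f}")
print(f"weight bytes: fp16={bytes_fp16}, int8={bytes_int8}")
```

On a production model the same comparison would be run per layer with real weights and activations, followed by task-level metrics, since output error on random data says little about downstream accuracy.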

Agent Features

Memory

weight-only compression reduces weight HBM traffic

Frameworks

FasterTransformer, CUTLASS

Architectures

dense, MoE

Optimization Features

Infra Optimization

serving 4 instances of OPT-175B on an 8-GPU A100 node (with INT4)
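The four-instances-per-node figure follows from simple capacity arithmetic, sketched below with the Table 5 footprints quoted in Results. The 80 GB per-GPU memory is an assumption (the source does not state the A100 variant), and the estimate covers weights only, ignoring KV cache, activations, and tensor-parallel layout constraints.

```python
import math

def min_gpus_for_weights(weights_gb, gpu_mem_gb=80.0):
    """Lower bound on GPUs needed just to hold the weights."""
    return math.ceil(weights_gb / gpu_mem_gb)

fp16_gpus = min_gpus_for_weights(324.16)   # 5 GPUs for FP16 weights alone
int4_gpus = min_gpus_for_weights(86.23)    # 2 GPUs for INT4 (block 64) weights
instances_per_node = 8 // int4_gpus        # 4 model replicas on an 8-GPU node
```

Real deployments need headroom beyond this bound, so treat the result as a feasibility check rather than a sizing recommendation.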

Model Optimization

weight-only post-training quantization

per-column scaling

block-wise (group) quantization (block size 64)

adaptive fine-grained group selection

System Optimization

CUTLASS-based kernels optimized for A100

replication strategy to increase per-node throughput

Inference Optimization

fused on-the-fly dequantize + GEMM GPU kernels

support for fp16/bf16 activations × int8/int4 weights

reduce GPU count by lowering weight memory footprint

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/NVIDIA/FasterTransformer

Data URLs

https://statmt.org/wmt16/

https://github.com/EleutherAI/lm-evaluation-harness

https://github.com/mjpost/sacrebleu

Risks & Boundaries

Limitations

Optimized GPU kernels currently target block size 64 only.

Benchmarks run on NVIDIA A100; results may differ on other GPUs.

When Not To Use

Compute-bound workloads where activation-side compute dominates.

Hardware whose tensor cores behave differently from the A100's, or platforms where integer instructions are preferred.

Failure Modes

Catastrophic accuracy collapse from per-column INT4 when outliers exist (OPT-66B example).

Quality-sensitive matrices need finer groups; the wrong granularity causes major BLEU loss.
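The outlier failure mode is easy to reproduce on synthetic data. The sketch below builds one weight column with a single large outlier and compares INT4 reconstruction error for per-column scaling against block size 64. The data, the outlier magnitude, and the helper name `quant_mse` are made up for illustration; this is not the OPT-66B case itself.

```python
import numpy as np

def quant_mse(col, group_size, bits=4):
    """Mean squared error after symmetric quantize/dequantize of one column."""
    qmax = 2 ** (bits - 1) - 1
    groups = col.reshape(-1, group_size)
    absmax = np.abs(groups).max(axis=1, keepdims=True)
    scales = np.where(absmax > 0, absmax / qmax, 1.0)
    deq = np.clip(np.round(groups / scales), -qmax, qmax) * scales
    return float(np.mean((groups - deq) ** 2))

rng = np.random.default_rng(0)
col = rng.normal(0, 0.02, size=4096)   # typical small weights
col[10] = 2.0                          # one synthetic outlier

mse_per_column = quant_mse(col, group_size=4096)  # outlier sets the only scale
mse_block64 = quant_mse(col, group_size=64)       # outlier affects one block
print(f"per-column MSE: {mse_per_column:.2e}, block-64 MSE: {mse_block64:.2e}")
```

With per-column scaling the outlier stretches the single scale so far that the ordinary weights round to zero, while with blocks of 64 only the block containing the outlier pays that cost.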

Core Entities

Models

OPT-175B, OPT-66B, OPT-30B, OPT-13B, GPT2-XL, OPT-IML (30B, 175B), Internal MoE 5.3B

Metrics

BLEU, perplexity, tokens/sec, GB model footprint, ms per decoder step, GEMM speedup

Datasets

WMT16 (De-En), WMT2016, lm-evaluation-harness tasks (LAMBADA, HellaSwag, PiQA, WinoGrande, OpenBookQA, RTE, COPA), wikitext

Benchmarks

BLEU, perplexity, throughput (tokens/sec), GEMM speedup (×), avg decoder-step latency (ms)
