Fine-grained, weight-only quantization that cuts model size and boosts LLM throughput up to 3.65×

August 16, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Method is post-training and uses open kernels. Results show meaningful memory and throughput gains on A100 GPUs, but the kernels are currently tuned for block size 64 and A100, so expect engineering work to generalize.

Citations: 2

Evidence Strength: 0.75

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FineQuant reduces LLM memory and inference cost with little accuracy loss. That lowers GPU counts and raises throughput, directly cutting serving cost for large models on A100-class nodes.

Summary TLDR

FineQuant is a simple post-training, weight-only quantization approach paired with GPU kernels. It adaptively chooses the quantization group size for each weight matrix to avoid accuracy collapse from outliers, supports int8/int4/int3 weights with fp16/bf16 activations, and ships fused dequantize+GEMM kernels. On large models (OPT-175B) it reduces memory enough to run the model on 2 GPUs and raises throughput by up to ~3.65× on A100 nodes, with a small impact on accuracy for the evaluated tasks.
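To make the idea concrete, here is a minimal pure-Python sketch of group-wise symmetric quantization followed by a dot product that rescales integer codes group by group. It is an illustration only, not the paper's CUDA kernels: the function names, the symmetric max-abs scaling, and the toy data are assumptions.

```python
from typing import List, Tuple


def quantize_groups(weights: List[float], group_size: int,
                    bits: int = 4) -> Tuple[List[int], List[float]]:
    """Symmetric quantization with one scale per group of `group_size` values."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        largest = max(abs(w) for w in group)
        scale = largest / qmax if largest > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer code and clamp to the signed range.
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequant_dot(codes: List[int], scales: List[float],
                activations: List[float], group_size: int) -> float:
    """Reference for a fused dequantize + dot product.

    Codes are rescaled one group at a time while accumulating, so a
    full-precision copy of the weights is never materialized.
    """
    total = 0.0
    for g, scale in enumerate(scales):
        start = g * group_size
        stop = start + group_size
        partial = sum(c * a for c, a in zip(codes[start:stop], activations[start:stop]))
        total += scale * partial
    return total


# Toy data (hypothetical): one 128-element weight column and matching activations.
weights = [0.02 * ((i % 13) - 6) for i in range(128)]
activations = [0.5 + 0.01 * (i % 5) for i in range(128)]

codes, scales = quantize_groups(weights, group_size=64, bits=4)
exact = sum(w * a for w, a in zip(weights, activations))
approx = dequant_dot(codes, scales, activations, group_size=64)
print(f"exact={exact:.4f} approx={approx:.4f} groups={len(scales)}")
```

Because each weight is off by at most half a quantization step, the dot-product error is bounded by half the largest scale times the sum of absolute activations.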

Problem Statement

Large LLMs are memory-bound at inference: weights dominate bandwidth during auto-regressive decoding. Existing quantization either needs costly calibration/training or loses accuracy. We need a simple, scalable weight-only quantization that keeps quality, reduces memory, and speeds up real GPU inference without extra training.

Main Contribution

Comprehensive analysis of low-bit, weight-only quantization behaviors in LLMs, including failure modes from outliers.

Adaptive fine-grained quantization: a heuristic to pick group sizes per matrix to avoid catastrophic accuracy drops.
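The summary describes the adaptive heuristic only at a high level, so the following is a hedged sketch of one plausible form, not the authors' algorithm. It starts from per-column granularity and halves the group size while any half of a group has a much smaller value range than the group itself. The criterion, the threshold of 4.0, the minimum group of 64, and all function names are assumptions.

```python
from typing import List, Sequence


def max_abs(values: Sequence[float]) -> float:
    return max(abs(v) for v in values)


def needs_finer_groups(column: Sequence[float], group_size: int,
                       threshold: float) -> bool:
    """True if splitting some group in half would shrink a half's range a lot."""
    half = group_size // 2
    for start in range(0, len(column), group_size):
        parent = column[start:start + group_size]
        parent_range = max_abs(parent)
        if parent_range == 0:
            continue  # all-zero group, nothing to gain
        for child in (parent[:half], parent[half:]):
            if not child:
                continue  # ragged final group
            child_range = max_abs(child)
            if child_range == 0 or parent_range / child_range > threshold:
                return True
    return False


def choose_group_size(column: Sequence[float], min_group: int = 64,
                      threshold: float = 4.0) -> int:
    """Halve the group size until the range spread is acceptable."""
    group_size = len(column)  # start at per-column granularity
    while group_size // 2 >= min_group and needs_finer_groups(column, group_size, threshold):
        group_size //= 2
    return group_size


def choose_matrix_group_size(columns: List[Sequence[float]], **kwargs) -> int:
    """One group size per matrix: the finest size any column asks for."""
    return min(choose_group_size(col, **kwargs) for col in columns)


# Hypothetical columns: one well-behaved, one with a single outlier.
smooth = [0.01 if i % 2 else -0.01 for i in range(256)]
spiky = list(smooth)
spiky[0] = 1.0

print(choose_group_size(smooth), choose_group_size(spiky))  # 256 64
```

Picking the finest size requested by any column keeps a single group size per matrix, which matches the per-matrix granularity described above.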

Key Findings

Adaptive fine-grained grouping prevents catastrophic accuracy drops from low-bit quantization.

Numbers: Recovered >94% of the lost BLEU by doubling granularity for four matrices (Section 3.3, Fig. 3)

Practical Use: When quantizing, measure per-matrix range spread and increase group granularity for outlier-affected matrices to avoid quality collapse.

Evidence Ref: Section 3.3, Fig. 3

INT4 block quantization (block=64) yields large memory cuts with small accuracy loss on dense models.

Numbers: Model size ~26% of FP16 with a BLEU drop of ~0.1% (Fig. 3b)

Practical Use: Use block-wise INT4 (64) plus adaptive grouping to cut weight memory ≈4× while keeping task quality close to FP16 on evaluated tasks.

Evidence Ref: Fig. 3b and Table 5
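The ~26% figure can be sanity-checked with simple arithmetic. The sketch below assumes one 16-bit scale stored per group of 64 weights and no zero points, which the summary does not state explicitly, and the helper names are illustrative.

```python
def effective_bits(weight_bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Average bits stored per weight, including the per-group scale overhead."""
    return weight_bits + scale_bits / group_size


def size_ratio(weight_bits: int, group_size: int,
               baseline_bits: int = 16, scale_bits: int = 16) -> float:
    """Quantized size as a fraction of the baseline (FP16 by default)."""
    return effective_bits(weight_bits, group_size, scale_bits) / baseline_bits


print(effective_bits(4, 64))  # 4.25
print(size_ratio(4, 64))      # 0.265625

# Footprints reported for OPT-IML Max 175B (Table 5): 324.16 GB -> 86.23 GB.
print(round(86.23 / 324.16, 3))  # 0.266
```

Under these assumptions the estimate lands close to the reported footprint ratio; any parameters kept in higher precision would account for the small remaining gap.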

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Throughput (generated tokens/sec) | INT4 (64): up to 91 tokens/sec vs FP16 25 (3.64×) | FP16 throughput per 8-GPU node | 3.64× | Table 4, input 128 / output 128 | Table 4 (128/128 row) | Table 4 |
| Model footprint (GB) | OPT-IML Max 175B: FP16 324.16 GB → INT4 (64) 86.23 GB | FP16 | ~3.76× smaller (≈26%) | Table 5 | Table 5 OPT-IML Max 175B sizes | Table 5 |

What To Try In 7 Days

Run per-column INT8 weight-only quantization on a production model and compare accuracy and memory.

Apply INT4 block quantization (block=64) with adaptive grouping on a single large model and measure decoder latency and throughput.

Replace FP16 GEMMs with the open-source fused dequant+GEMM kernels for memory-bound decoding steps and measure speedup.
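For the measurement steps in these experiments, a small timing harness is usually enough. The one below is a generic sketch: `fp16_step` and `int4_step` are hypothetical CPU stand-ins for one decoder step of each variant, and on a GPU the step callable would need to block until the work has actually finished.

```python
import time
from typing import Callable


def tokens_per_second(step: Callable[[], None], n_tokens: int, warmup: int = 3) -> float:
    """Time `n_tokens` calls of `step` (one generated token each) after a warmup."""
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(n_tokens):
        step()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


def speedup(candidate_tps: float, baseline_tps: float) -> float:
    """Throughput ratio of the candidate over the baseline."""
    return candidate_tps / baseline_tps


# Hypothetical stand-ins for one decoder step of each model variant.
def fp16_step() -> None:
    sum(i * i for i in range(4000))


def int4_step() -> None:
    sum(i * i for i in range(1000))


baseline = tokens_per_second(fp16_step, n_tokens=200)
candidate = tokens_per_second(int4_step, n_tokens=200)
print(f"baseline={baseline:.0f} tok/s candidate={candidate:.0f} tok/s "
      f"speedup={speedup(candidate, baseline):.2f}x")

# The reported Table 4 figures (91 vs 25 tokens/sec) give the same kind of ratio.
print(round(speedup(91, 25), 2))  # 3.64
```

Running both variants with the same input and output lengths (for example 128/128, as in the reported row) keeps the comparison like for like.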

Agent Features

Memory
weight-only compression reduces weight HBM traffic
Frameworks
FasterTransformer, CUTLASS
Architectures
dense, MoE

Optimization Features

Infra Optimization
serving 4 instances of OPT-175B on an 8-GPU A100 node (with INT4)
Model Optimization
weight-only post-training quantization, per-column scaling, block-wise (group) quantization (block size 64), adaptive fine-grained group selection
System Optimization
CUTLASS-based kernels optimized for A100, replication strategy to increase per-node throughput
Inference Optimization
fused on-the-fly dequantize + GEMM GPU kernels, support for fp16/bf16 activations × int8/int4 weights, reduce GPU count by lowering weight memory footprint
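The replication claim (4 instances of a 175B model on one 8-GPU node) follows from a weights-only capacity estimate. The sketch below assumes 80 GB of memory per A100, which the summary does not state, reuses the Table 5 footprints for OPT-IML Max 175B, and ignores KV cache, activations, and parallelism constraints, so it gives lower bounds only.

```python
import math

GPU_MEMORY_GB = 80.0  # assumption: 80 GB A100; not stated in the summary


def gpus_needed(footprint_gb: float, gpu_memory_gb: float = GPU_MEMORY_GB) -> int:
    """Minimum GPUs required to hold the weights alone."""
    return math.ceil(footprint_gb / gpu_memory_gb)


def instances_per_node(footprint_gb: float, gpus_per_node: int = 8,
                       gpu_memory_gb: float = GPU_MEMORY_GB) -> int:
    """How many model replicas fit on one node, by weight memory only."""
    return gpus_per_node // gpus_needed(footprint_gb, gpu_memory_gb)


print(gpus_needed(86.23))          # 2  (INT4, block size 64)
print(instances_per_node(86.23))   # 4
print(gpus_needed(324.16))         # 5  (FP16)
print(instances_per_node(324.16))  # 1
```

Going from one replica to four per node is the replication strategy listed above for raising per-node throughput.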

Risks & Boundaries

Limitations

Optimized GPU kernels currently target block size 64 only.

Benchmarks run on NVIDIA A100; results may differ on other GPUs.

When Not To Use

Compute-bound workloads where activation-side compute dominates.

Hardware that lacks tensor-core behavior similar to the A100, or where integer instructions are preferred.

Failure Modes

Catastrophic accuracy collapse from per-column INT4 when outliers exist (OPT-66B example).

Quality-sensitive matrices need finer groups; the wrong granularity causes major BLEU loss.
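A toy example shows why a single outlier hurts per-column INT4 far more than block-wise INT4. The column values, the outlier position, and the symmetric max-abs scaling are assumptions chosen for illustration; this is not a reproduction of the OPT-66B case.

```python
from typing import List


def quantize_dequantize(weights: List[float], group_size: int, bits: int = 4) -> List[float]:
    """Quantize with one max-abs scale per group, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    restored: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        largest = max(abs(w) for w in group)
        scale = largest / qmax if largest > 0 else 1.0
        restored.extend(scale * max(-qmax, min(qmax, round(w / scale))) for w in group)
    return restored


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical column: small weights plus one large outlier.
column = [0.01 * ((i % 7) - 3) for i in range(256)]
column[0] = 1.0

# Per-column: the outlier sets the scale, so every small weight rounds to 0.
err_per_column = mean_abs_error(column, quantize_dequantize(column, group_size=len(column)))
# Block size 64: only the block containing the outlier is affected.
err_block64 = mean_abs_error(column, quantize_dequantize(column, group_size=64))

print(f"per-column error={err_per_column:.5f} block-64 error={err_block64:.5f}")
```

The finer the groups, the fewer weights share a scale with the outlier, which is the effect the adaptive granularity is meant to exploit.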

Core Entities

Models

OPT-175B, OPT-66B, OPT-30B, OPT-13B, GPT2-XL, OPT-IML (30B, 175B), Internal MoE 5.3B

Metrics

BLEU, perplexity, tokens/sec, GB model footprint, ms per decoder step, GEMM speedup

Datasets

WMT16 (De-En), WMT2016, lm-evaluation-harness tasks (LAMBADA, HellaSwag, PiQA, WinoGrande, OpenBookQA, RTE, COPA), wikitext

Benchmarks

BLEU, perplexity, throughput (tokens/sec), GEMM speedup (×), avg decoder-step latency (ms)