Fine-grained, weight-only quantization that cuts model size and boosts LLM throughput up to 3.65×

August 16, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

2

Authors

Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

Links

Abstract / PDF

Why It Matters For Business

FineQuant reduces LLM memory and inference cost with little accuracy loss. That lowers GPU counts and raises throughput, directly cutting serving cost for large models on A100-class nodes.

Summary TLDR

FineQuant is a simple post-training, weight-only quantization approach paired with GPU kernels. It adaptively chooses quantization group sizes per weight matrix to avoid accuracy collapse caused by outliers, supports int8/int4/int3 weights with fp16/bf16 activations, and ships fused dequantize+GEMM kernels. On large models (OPT-175B) it reduces memory enough to run the model on 2 GPUs and raises throughput by up to ~3.65× on A100 nodes, with a small impact on accuracy for the evaluated tasks.
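The adaptive group-size idea can be sketched in NumPy as below. This is a minimal illustration, not the paper's heuristic: the error-threshold rule, the names `quantize_dequantize` and `pick_group_size`, the `tol` and `min_group` parameters, and the layout (weights of shape (K, N) with groups running down each column) are all assumptions made for the example.

```python
import numpy as np

def quantize_dequantize(w, bits, group):
    """Symmetric round-to-nearest with one scale per `group` rows of each column."""
    k, n = w.shape
    qmax = 2 ** (bits - 1) - 1
    blocks = w.reshape(k // group, group, n)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero for all-zero groups
    q = np.clip(np.round(blocks / scale), -qmax - 1, qmax)
    return (q * scale).reshape(k, n)

def pick_group_size(w, bits=4, tol=0.1, min_group=64):
    """Halve the group size until the relative error is <= tol (hypothetical rule)."""
    group = w.shape[0]  # group == K is plain per-column quantization
    while True:
        err = np.linalg.norm(w - quantize_dequantize(w, bits, group)) / np.linalg.norm(w)
        if err <= tol or group // 2 < min_group:
            return group, err
        group //= 2

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 8))
w[0, :] = 50.0  # one synthetic outlier per column
group, err = pick_group_size(w)
print(f"chosen group size: {group}, relative error: {err:.3f}")
```

The loop assumes K is a power-of-two multiple of `min_group`, so every halving still divides the matrix evenly.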

Problem Statement

LLMs are memory-bound at inference: reading weights dominates memory bandwidth during auto-regressive decoding. Existing quantization methods either need costly calibration or training, or they lose accuracy. The goal is a simple, scalable weight-only quantization that preserves quality, reduces memory, and speeds up real GPU inference without extra training.
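A rough back-of-the-envelope calculation shows why shrinking weights matters when decoding is bandwidth-bound. The sketch below assumes each decoding step reads every weight byte once and ignores activations, KV cache, and kernel overhead. The two footprints come from the Results section; the bandwidth value and the function name `min_step_seconds` are hypothetical.

```python
# Weight footprints reported in the Results section for OPT-IML Max 175B.
FP16_GB = 324.16
INT4_G64_GB = 86.23

def min_step_seconds(weight_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on one decoding step if every weight byte is read once."""
    return weight_gb / bandwidth_gb_per_s

# Hypothetical aggregate memory bandwidth; not a figure from the source.
ASSUMED_BANDWIDTH_GB_PER_S = 1000.0

fp16_step = min_step_seconds(FP16_GB, ASSUMED_BANDWIDTH_GB_PER_S)
int4_step = min_step_seconds(INT4_G64_GB, ASSUMED_BANDWIDTH_GB_PER_S)

# The bandwidth cancels, so the ratio depends only on the two footprints.
bandwidth_bound_speedup = fp16_step / int4_step
print(f"bandwidth-bound speedup ceiling: {bandwidth_bound_speedup:.2f}x")  # prints 3.76x
```

This ceiling is only an intuition aid. The reported end-to-end throughput also reflects replication across GPUs and kernel overheads, so the two numbers are not directly comparable.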

Main Contribution

Comprehensive analysis of low-bit, weight-only quantization behaviors in LLMs, including failure modes from outliers.

Adaptive fine-grained quantization: a heuristic to pick group sizes per matrix to avoid catastrophic accuracy drops.

Efficient fused GPU kernels (fused dequantize + GEMM) supporting fp16/bf16 activations with int8/int4 weights and block-wise scales.

Demonstration on large models (OPT up to 175B and an internal MoE): large memory reduction and up to 3.65× higher throughput on A100 nodes, with small accuracy loss on the tested benchmarks.
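To make "fused dequantize + GEMM" concrete, here is a NumPy sketch of the arithmetic only. It is not the paper's CUDA kernel: the function names, the (K, N) weight layout, the float32 stand-in for fp16/bf16, and storing one INT4 code per int8 element are all simplifying assumptions. The point is that dequantizing block by block while accumulating gives the same result as first materializing the full dequantized matrix.

```python
import numpy as np

def dequant_then_gemm(x, q, scales, group):
    """Reference path: materialize the full dequantized weight matrix, then multiply."""
    w = q.astype(np.float32) * np.repeat(scales, group, axis=0)
    return x @ w

def fused_style_gemm(x, q, scales, group):
    """Dequantize one block of rows at a time and accumulate, as a fused kernel might."""
    out = np.zeros((x.shape[0], q.shape[1]), dtype=np.float32)
    for g in range(q.shape[0] // group):
        rows = slice(g * group, (g + 1) * group)
        w_block = q[rows].astype(np.float32) * scales[g]  # scales[g] broadcasts over the block
        out += x[:, rows] @ w_block
    return out

rng = np.random.default_rng(0)
m, k, n, group = 4, 256, 32, 64
x = rng.normal(size=(m, k)).astype(np.float32)       # activations (float32 stands in for fp16/bf16)
q = rng.integers(-8, 8, size=(k, n), dtype=np.int8)  # INT4 codes, one per int8 for clarity
scales = rng.uniform(0.01, 0.1, size=(k // group, n)).astype(np.float32)

same = np.allclose(
    dequant_then_gemm(x, q, scales, group),
    fused_style_gemm(x, q, scales, group),
    rtol=1e-3,
    atol=1e-4,
)
print("paths agree:", same)
```

In the block-wise path the full-precision weight matrix never exists in memory at once, which is the property a fused kernel relies on to cut weight traffic.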

Key Findings

Adaptive fine-grained grouping prevents catastrophic accuracy drops from low-bit quantization.

Numbers: Recovered >94% of lost BLEU by doubling granularity for four matrices (Section 3.3, Fig. 3)

INT4 block quantization (block=64) yields large memory cuts with small accuracy loss on dense models.

Numbers: Model size ~26% of FP16 while BLEU drop ~0.1% (Fig. 3b)

Fused int4/int8 weight × fp16/bf16 activation kernels speed memory-bound GEMMs.

Numbers: Up to 2.5× GEMM speedup for OPT-13B/30B when the number of activation rows is small (Fig. 4)

End-to-end, weight-only quantization enables higher throughput and fewer GPUs for large models.

Numbers: Up to 3.65× throughput on an 8-GPU node; OPT-175B deployable on 2 GPUs vs. 8 (Tables 3–4)
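The "~26% of FP16" figure in the findings above is consistent with simple storage arithmetic. The sketch assumes one 16-bit scale per group of 64 weights and no zero points; the source does not spell out the storage layout, and `effective_bits` is a name made up for this example.

```python
def effective_bits(weight_bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Average bits stored per weight when each group shares one scale (assumed layout)."""
    return weight_bits + scale_bits / group_size

int4_g64_bits = effective_bits(4, 64)  # 4 + 16/64 = 4.25 bits per weight
ratio = int4_g64_bits / 16             # relative to FP16
print(f"{int4_g64_bits} bits per weight, {ratio:.1%} of FP16")  # 4.25 bits per weight, 26.6% of FP16
```

The footprints reported under Results give nearly the same ratio: 86.23 GB / 324.16 GB is about 26.6%.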

Results

Throughput (generated tokens/sec)

Value: INT4 (64): up to 91 tokens/sec vs. 25 for FP16 (3.64×)

Baseline: FP16 throughput per 8-GPU node

Model footprint (GB)

Value: OPT-IML Max 175B: FP16 324.16 GB → INT4 (64) 86.23 GB

Baseline: FP16

BLEU change (MoE 5.3B)

Value: INT3 46.01 vs. FP16 46.35 (−0.34 BLEU)

Baseline: FP16 BLEU

Perplexity (OPT-175B)

Value: FP16 9.08 → INT4 per-column 11.08 → INT4 (64) 9.84

Baseline: FP16

GEMM speedup (matrix multiply)

Value: Up to 2.5× for OPT-13B/30B when the number of activation rows is small

Baseline: FP16 GEMM

Who Should Care

What To Try In 7 Days

Run per-column INT8 weight-only quantization on a production model and compare accuracy and memory.

Apply INT4 block quantization (block=64) with adaptive grouping on a single large model and measure decoder latency and throughput.

Replace FP16 GEMMs with the open-source fused dequant+GEMM kernels for memory-bound decoding steps and measure speedup.
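For the first experiment, a per-column INT8 baseline is small enough to prototype in NumPy before touching any kernels. The sketch below is one plausible formulation (symmetric, round-to-nearest, one scale per output column); the function names, the (K, N) layout, and the random stand-in matrix are assumptions, not the paper's code.

```python
import numpy as np

def quantize_per_column_int8(w):
    """Symmetric INT8 with one float32 scale per output column."""
    scale = (np.abs(w).max(axis=0) / 127.0).astype(np.float32)
    scale = np.where(scale == 0, np.float32(1.0), scale)  # guard all-zero columns
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = (0.02 * rng.normal(size=(512, 256))).astype(np.float32)  # stand-in for a real weight matrix

q, scale = quantize_per_column_int8(w)
w_hat = dequantize(q, scale)

fp16_bytes = w.astype(np.float16).nbytes
int8_bytes = q.nbytes + scale.nbytes
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"memory: {int8_bytes / fp16_bytes:.1%} of FP16, relative error: {rel_err:.4f}")
```

Weight reconstruction error is only a proxy. The comparison that matters is task accuracy (for example BLEU or perplexity) of the quantized model against the FP16 one.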

Agent Features

Memory

  • weight-only compression reduces weight HBM traffic

Frameworks

  • FasterTransformer
  • CUTLASS

Architectures

  • dense
  • MoE

Optimization Features

Infra Optimization

  • serving 4 instances of OPT-175B on an 8-GPU A100 node (with INT4)
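The replica count above can be reproduced with a rough sizing calculation. It rests on assumptions the source does not state: 80 GB of memory per A100, power-of-two tensor-parallel degrees, and a fit based on weights alone, ignoring activations and KV cache. The function names are hypothetical.

```python
import math

def gpus_for_weights(weight_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Smallest power-of-two GPU count whose combined memory holds the weights."""
    needed = math.ceil(weight_gb / gpu_memory_gb)
    return 1 << (needed - 1).bit_length()

def replicas_per_node(weight_gb: float, gpus_per_node: int = 8, gpu_memory_gb: float = 80.0) -> int:
    return gpus_per_node // gpus_for_weights(weight_gb, gpu_memory_gb)

# Footprints from the Results section (OPT-IML Max 175B).
for name, gb in [("FP16", 324.16), ("INT4 (64)", 86.23)]:
    print(name, gpus_for_weights(gb), "GPUs per replica,", replicas_per_node(gb), "replica(s) per node")
```

Under these assumptions FP16 needs all 8 GPUs for one replica, while INT4 (64) fits in 2, leaving room for 4 replicas on the node. A real deployment would also budget memory for KV cache and activations.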

Model Optimization

  • weight-only post-training quantization
  • per-column scaling
  • block-wise (group) quantization (block size 64)
  • adaptive fine-grained group selection
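Block-wise INT4 with block size 64 can be prototyped as follows. This is a sketch under assumptions rather than the paper's storage format: symmetric codes in [-8, 7], one fp16 scale per block of 64 rows in each column, and two codes packed per byte with even rows in the low nibble. All function names are made up for the example.

```python
import numpy as np

def quantize_int4_blocks(w, block=64):
    """Symmetric INT4 codes in [-8, 7] with one fp16 scale per block of rows per column."""
    k, n = w.shape
    blocks = w.reshape(k // block, block, n)
    scale = (np.abs(blocks).max(axis=1, keepdims=True) / 7.0).astype(np.float16)
    scale = np.where(scale == 0, np.float16(1.0), scale)  # guard all-zero blocks
    q = np.clip(np.round(blocks / scale.astype(np.float32)), -8, 7).astype(np.int8)
    return q.reshape(k, n), scale[:, 0, :]

def pack_int4(q):
    """Store two 4-bit codes per byte: even rows in the low nibble, odd rows in the high."""
    u = (q & 0x0F).astype(np.uint8)
    return u[0::2] | (u[1::2] << 4)

def unpack_int4(packed):
    q = np.empty((packed.shape[0] * 2, packed.shape[1]), dtype=np.int8)
    q[0::2] = packed & 0x0F
    q[1::2] = packed >> 4
    return (q ^ 8) - 8  # sign-extend each 4-bit value

rng = np.random.default_rng(0)
w = (0.02 * rng.normal(size=(256, 32))).astype(np.float32)  # stand-in weight matrix

q, scale = quantize_int4_blocks(w)
packed = pack_int4(q)
w_hat = unpack_int4(packed).astype(np.float32) * np.repeat(scale.astype(np.float32), 64, axis=0)

print("round trip exact:", np.array_equal(unpack_int4(packed), q))
print("bytes (codes + scales):", packed.nbytes + scale.nbytes, "vs FP16:", w.size * 2)
```

Quantizing against the fp16-rounded scale, rather than the full-precision one, keeps the stored codes consistent with what dequantization will actually multiply by.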

System Optimization

  • CUTLASS-based kernels optimized for A100
  • replication strategy to increase per-node throughput

Inference Optimization

  • fused on-the-fly dequantize + GEMM GPU kernels
  • support for fp16/bf16 activations × int8/int4 weights
  • reduce GPU count by lowering weight memory footprint

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Optimized GPU kernels currently target block size 64 only.
  • Benchmarks run on NVIDIA A100; results may differ on other GPUs.
  • Fused kernels dequantize to fp16/bf16 and do not exploit integer compute paths.
  • Some models suffer catastrophic collapse with naive per-column INT4; adaptive grouping is required, and its heuristic was tuned in the paper.

When Not To Use

  • Compute-bound workloads where activation-side compute dominates.
  • Hardware that lacks similar tensor-core behavior to A100 or where integer instructions are preferred.
  • When you cannot deploy custom CUDA kernels in your serving environment.

Failure Modes

  • Catastrophic accuracy collapse from per-column INT4 when outliers exist (OPT-66B example).
  • Quality-sensitive matrices need finer groups; the wrong granularity causes major BLEU loss.
  • Compute-bound phases (context creation) can be slower with dequant overhead.
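The outlier failure mode is easy to reproduce on synthetic data. The sketch below is an illustration only, not an experiment from the paper: it plants one large value in a single column of a random matrix and compares per-column INT4 against block-64 INT4. The matrix, the outlier magnitude, and the function names are assumptions.

```python
import numpy as np

def int4_roundtrip(w, group):
    """Quantize to symmetric INT4 with one scale per `group` rows per column, then dequantize."""
    blocks = w.reshape(w.shape[0] // group, group, w.shape[1])
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    return (np.clip(np.round(blocks / scale), -8, 7) * scale).reshape(w.shape)

def rel_error(w, w_hat):
    """Relative reconstruction error of each column."""
    return np.linalg.norm(w - w_hat, axis=0) / np.linalg.norm(w, axis=0)

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 2))
w[0, 1] = 50.0  # column 1 gets a single synthetic outlier; column 0 stays clean

per_column = rel_error(w, int4_roundtrip(w, group=1024))
blocked = rel_error(w, int4_roundtrip(w, group=64))
print("per-column INT4 error [clean, outlier]:", per_column.round(3))
print("block-64 INT4 error   [clean, outlier]:", blocked.round(3))
```

With a single scale per column, the outlier stretches the quantization step so far that most ordinary weights in that column round to zero. With blocks of 64, the damage is confined to the one block that contains the outlier.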

Core Entities

Models

  • OPT-175B
  • OPT-66B
  • OPT-30B
  • OPT-13B
  • GPT2-XL
  • OPT-IML (30B,175B)
  • Internal MoE 5.3B

Metrics

  • BLEU
  • perplexity
  • tokens/sec
  • GB model footprint
  • ms per decoder step
  • GEMM speedup

Datasets

  • WMT16 (De-En)
  • WMT2016
  • lm-evaluation-harness tasks (LAMBADA,HellaSwag,PiQA,WinoGrande,OpenBookQA,RTE,COPA)
  • wikitext

Benchmarks

  • BLEU
  • perplexity
  • throughput (tokens/sec)
  • GEMM speedup (×)
  • avg decoder-step latency (ms)