Atom: 4-bit weight+activation quantization that boosts LLM serving throughput up to 7.7× with minimal accuracy loss

October 29, 20238 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

23

Authors

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci

Links

Abstract / PDF

Why It Matters For Business

Atom can multiply token throughput per GPU and shrink KV-cache memory, lowering cloud GPU costs or increasing capacity without major task-accuracy loss.

Summary TLDR

Atom is a post-training quantization method and serving workflow that uses mixed-precision, per-group quantization, dynamic activation quantization, channel reordering, and KV-cache quantization. On Llama-family models Atom runs 4-bit weight+activation (W4A4) inference using custom fused GPU kernels and achieves up to 7.73× token throughput vs FP16 and 2.53× vs INT8 while keeping accuracy losses small (≤1.4% average zero-shot drop; ≈0.3 WikiText2 PPL rise on Llama-65B). Atom requires offline calibration and custom kernels but integrates end-to-end into a serving stack.

Problem Statement

LLM serving needs higher throughput and lower memory use. Existing weight-only quantization still forces FP math and can’t use low-bit hardware efficiently. Prior weight-activation methods at 4-bit lose accuracy. The problem: get practical, low-bit (e.g., INT4) weight+activation quantization that (1) runs on modern GPUs, (2) keeps high accuracy, and (3) increases serving throughput.

Main Contribution

Atom: a low-bit weight+activation quantization recipe combining mixed-precision (keep outlier channels at higher bits), fine-grained group quantization, dynamic activation quantization, and channel reordering to enable efficient mixed-precision kernels.

System and kernel co-design: fused GEMM and fused FlashInfer operators plus quantized KV-cache to reduce memory movement and exploit INT4/INT8 Tensor Cores.

Comprehensive evaluation on Llama-family models showing large throughput gains with small accuracy degradation, plus ablations quantifying each ingredient's impact.

Key Findings

Atom increases end-to-end serving throughput up to 7.73× vs FP16 and 2.53× vs INT8 under similar latency targets.

Numbers7.73× vs FP16; 2.53× vs INT8 (paper Figure 10 / §5.3.2)

Accuracy degradation at W4A4 is small: average zero-shot drop ≤1.4% and WikiText2 perplexity rise ≈0.3 for Llama-65B.

Numbers≤1.4% avg zero-shot drop; ≈0.3 WikiText2 PPL increase on Llama-65B (abstract, §5.2)

Kernel-level speedups: fused GEMM at batch 512 gives 3.4× over FP16 and 1.9× over INT8; self-attention at batch 128 gives 3.5× over FP16 and 1.8× over INT8.

NumbersGEMM: 3.4× (vs FP16) & 1.9× (vs INT8) at batch 512; Self-attn: 3.5× & 1.8× at batch 128 (§5.3.1)

Outlier handling and group quantization are the main accuracy drivers: keeping 128 outlier channels in high precision reduces W4A4 PPL from thousands to ≈11 on Llama-7B; group quantization then reduces it to ≈6.

NumbersW4A4 RTN PPL 2315 → keep 128 outliers FP16 PPL 11.34 → +group size 128 → PPL 6.22 (Table 3, §5.4.1)

Results

End-to-end throughput

ValueUp to 7.73× vs FP16; 2.53× vs INT8

BaselineFP16 / INT8

Accuracy

Value≤1.4% average drop

BaselineFP16

Perplexity change (Llama-65B, W4A4)

Value≈+0.3 on WikiText2

BaselineFP16

Kernel GEMM throughput

Value3.4× vs FP16; 1.9× vs INT8 at batch 512

BaselineFP16 / INT8 kernels

Self-attention speedup

Value3.5× vs FP16; 1.8× vs INT8 at batch 128

BaselineFP16 / INT8

Offline quantization time

Value≈4 hours for Llama-65B on RTX Ada 6000

Baselinen/a

Who Should Care

What To Try In 7 Days

Run Atom-style W4A4 post-training quantization on a small Llama model (7B) with 128-sample calibration to measure latency and accuracy changes.

Profile dense GEMM and attention kernels to compare FP16 vs INT8 vs W4A4 and confirm kernel-level speedups.

Quantize KV-cache and test larger batch sizes to see if you can serve more concurrent sessions within existing GPU RAM limits.

Optimization Features

Token Efficiency

  • larger batch sizes enabled by KV quantization
  • reduced memory movement for KV-cache

Infra Optimization

  • use INT4/INT8 Tensor Cores on modern NVIDIA GPUs
  • tested on RTX 4090 and RTX Ada 6000

Model Optimization

  • mixed-precision for outliers (keep 128 channels at INT8/FP16)
  • fine-grained per-group quantization (group size 128)
  • GPTQ for offline weight quantization

System Optimization

  • kernel fusion to hide quantization overhead
  • reorder weights offline to maintain regular memory access

Inference Optimization

  • dynamic activation quantization (runtime stats)
  • channel reordering to pack outliers
  • fused GEMM and fused FlashInfer kernels
  • quantized KV-cache (asymmetric for KV)

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires custom fused GPU kernels and kernel maintenance across GPU generations.
  • Group quantization fusion introduces nontrivial kernel overhead that needs hardware support to fully amortize.
  • Evaluation focused on NVIDIA hardware (RTX 4090, Ada); other accelerators untested.
  • Offline preprocessing can be slow for very large models (≈4 hours for Llama-65B).

When Not To Use

  • You cannot modify or ship custom GPU kernels in your deployment.
  • You need exact FP16 numeric fidelity for a downstream task.
  • Target hardware lacks INT4/INT8 tensor core support or equivalent primitives.

Failure Modes

  • If outliers are misidentified, quantization error can spike and break accuracy (W4A4 RTN showed huge PPL without outlier handling).
  • Group dequantization overhead can reduce theoretical kernel TOPS if not fused.
  • Dynamic quantization without fusion may increase latency.

Core Entities

Models

  • Llama-7B
  • Llama-13B
  • Llama-30B
  • Llama-65B
  • Llama-2 (7B/13B/70B)
  • Mixtral (8x7B)

Metrics

  • tokens per second
  • average decode latency per token (ms)
  • perplexity
  • Accuracy
  • TOPS (kernel throughput)

Datasets

  • WikiText2
  • PTB
  • C4
  • lm-eval (PIQA, ARC, BoolQ, HellaSwag, WinoGrande)
  • ShareGPT (workload traces)

Benchmarks

  • Accuracy
  • perplexity
  • throughput (tokens/s)
  • decode latency (ms/token)