Atom: 4-bit weight+activation quantization that boosts LLM serving throughput up to 7.7× with minimal accuracy loss

October 29, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

Implemented kernels and end-to-end serving results show practical gains on NVIDIA GPUs, but Atom needs custom fused kernels, calibration, and offline weight processing, all of which increase engineering effort.

Citations: 23

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci

Why It Matters For Business

Atom can multiply token throughput per GPU and shrink KV-cache memory, lowering cloud GPU costs or increasing capacity without major task-accuracy loss.

Who Should Care

Summary TLDR

Atom is a post-training quantization method and serving workflow that combines mixed precision, per-group quantization, dynamic activation quantization, channel reordering, and KV-cache quantization. On Llama-family models, Atom runs 4-bit weight+activation (W4A4) inference with custom fused GPU kernels and achieves up to 7.73× the token throughput of FP16 and 2.53× that of INT8, while keeping accuracy losses small (≤1.4% average zero-shot drop; ≈0.3 WikiText2 PPL rise on Llama-65B). Atom requires offline calibration and custom kernels, but it integrates end-to-end into a serving stack.

Problem Statement

LLM serving needs higher throughput and lower memory use. Existing weight-only quantization still forces FP math and can’t use low-bit hardware efficiently. Prior weight-activation methods at 4-bit lose accuracy. The problem: get practical, low-bit (e.g., INT4) weight+activation quantization that (1) runs on modern GPUs, (2) keeps high accuracy, and (3) increases serving throughput.

Main Contribution

Atom: a low-bit weight+activation quantization recipe combining mixed-precision (keep outlier channels at higher bits), fine-grained group quantization, dynamic activation quantization, and channel reordering to enable efficient mixed-precision kernels.

System and kernel co-design: fused GEMM and fused FlashInfer operators plus quantized KV-cache to reduce memory movement and exploit INT4/INT8 Tensor Cores.
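The per-group and dynamic quantization steps in the recipe above can be sketched in a few lines. This is a minimal pure-Python illustration, not Atom's fused GPU kernel: the function names, the symmetric round-to-nearest scheme, and the toy group size of 4 (the page lists 128) are assumptions made for readability.

```python
from typing import List, Tuple


def quantize_groups(values: List[float], group_size: int = 4,
                    bits: int = 4) -> Tuple[List[int], List[float]]:
    """Symmetric per-group quantization: one scale per `group_size` values."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        # "Dynamic": the scale is computed from the values seen at call time.
        max_abs = max(abs(v) for v in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for v in group:
            q = round(v / scale)
            codes.append(max(-qmax - 1, min(qmax, q)))  # clamp to the INT4 range
    return codes, scales


def dequantize_groups(codes: List[int], scales: List[float],
                      group_size: int = 4) -> List[float]:
    """Map integer codes back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Hypothetical activation row: a small-valued group followed by a large-valued one.
activations = [0.10, -0.20, 0.05, 0.15, 3.00, -2.50, 1.00, 0.50]
codes, scales = quantize_groups(activations, group_size=4)
restored = dequantize_groups(codes, scales, group_size=4)
max_error = max(abs(a - r) for a, r in zip(activations, restored))
print(codes, scales, max_error)
```

Because each group gets its own scale, the small-valued group is not forced onto the coarse grid that the large-valued group needs, which is the motivation for fine-grained groups.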

Key Findings

Atom increases end-to-end serving throughput up to 7.73× vs FP16 and 2.53× vs INT8 under similar latency targets.

Numbers: 7.73× vs FP16; 2.53× vs INT8 (paper Figure 10 / §5.3.2)

Practical Use: Deploying Atom's W4A4 kernels can raise token-generation throughput several-fold on supported GPUs, letting you serve more requests per GPU or reduce GPU count.

Evidence Ref: §5.3.2, Figure 10

Accuracy degradation at W4A4 is small: average zero-shot drop ≤1.4% and WikiText2 perplexity rise ≈0.3 for Llama-65B.

Numbers: ≤1.4% avg zero-shot drop; ≈0.3 WikiText2 PPL increase on Llama-65B (abstract, §5.2)

Practical Use: You can quantize large Llama models to 4-bit and keep near-baseline accuracy on many zero-shot tasks.

Evidence Ref: Abstract; §5.2, Table 1–2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| End-to-end throughput | Up to 7.73× vs FP16; 2.53× vs INT8 | FP16 / INT8 | 7.73× and 2.53× | Serving workloads (ShareGPT traces) (§5.3.2) | Figure 10; §5.3.2 | Figure 10 |
| Accuracy | ≤1.4% average drop | FP16 | ≤1.4% absolute | lm-eval tasks (PIQA, ARC, BoolQ, HellaSwag, WinoGrande) (§5.2, Table 1) | Table 1; §5.2 | Table 1 |

What To Try In 7 Days

Run Atom-style W4A4 post-training quantization on a small Llama model (7B) with 128-sample calibration to measure latency and accuracy changes.

Profile dense GEMM and attention kernels to compare FP16 vs INT8 vs W4A4 and confirm kernel-level speedups.

Quantize KV-cache and test larger batch sizes to see if you can serve more concurrent sessions within existing GPU RAM limits.
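For the KV-cache experiment, a back-of-the-envelope estimate helps set expectations before profiling. The sketch below assumes the commonly cited Llama-7B shape (32 layers, hidden size 4096, 32 heads), 4 bytes of scale and zero-point metadata per head, and a hypothetical 8 GiB cache budget. None of these figures come from the paper, and the helper name is illustrative.

```python
def kv_bytes_per_token(layers: int, hidden: int, heads: int, bits: int,
                       meta_bytes_per_head: int = 0) -> int:
    """Bytes of KV-cache that one token occupies across all layers."""
    payload = 2 * layers * hidden * bits // 8              # keys and values
    metadata = 2 * layers * heads * meta_bytes_per_head    # assumed scale + zero-point
    return payload + metadata


LAYERS, HIDDEN, HEADS = 32, 4096, 32   # assumed Llama-7B shape
fp16_bytes = kv_bytes_per_token(LAYERS, HIDDEN, HEADS, bits=16)
int4_bytes = kv_bytes_per_token(LAYERS, HIDDEN, HEADS, bits=4,
                                meta_bytes_per_head=4)

budget = 8 * 1024 ** 3                 # hypothetical 8 GiB reserved for the cache
fp16_tokens = budget // fp16_bytes
int4_tokens = budget // int4_bytes
print(fp16_bytes, int4_bytes, fp16_tokens, int4_tokens)
```

Under these assumptions the same budget holds a bit under four times as many cached tokens at 4 bits, which is the headroom that larger batch sizes would draw on. Measured gains depend on the actual metadata layout and allocator behavior.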

Optimization Features

Token Efficiency
larger batch sizes enabled by KV quantization; reduced memory movement for KV-cache
Infra Optimization
use INT4/INT8 Tensor Cores on modern NVIDIA GPUs; tested on RTX 4090 and RTX Ada 6000
Model Optimization
mixed-precision for outliers (keep 128 channels at INT8/FP16); fine-grained per-group quantization (group size 128); GPTQ for offline weight quantization
System Optimization
kernel fusion to hide quantization overhead; reorder weights offline to maintain regular memory access
Inference Optimization
dynamic activation quantization (runtime stats); channel reordering to pack outliers; fused GEMM and fused FlashInfer kernels; quantized KV-cache (asymmetric for KV)
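Channel reordering, listed above, can be illustrated with a small sketch: pick outlier channels from calibration activations, move them to the end so they sit contiguously, and apply the same permutation to the weight rows so the matrix product is unchanged. The outlier score used here (sum of squared activations per channel), the function names, and the toy data are assumptions for illustration, not Atom's implementation.

```python
from typing import List

Matrix = List[List[float]]


def channel_order(calib: Matrix, num_outliers: int) -> List[int]:
    """Return a channel permutation that places the highest-energy channels last."""
    n_channels = len(calib[0])
    # Assumed outlier score: sum of squared activations over calibration samples.
    energy = [sum(row[c] ** 2 for row in calib) for c in range(n_channels)]
    ranked = sorted(range(n_channels), key=lambda c: energy[c])
    outliers = set(ranked[n_channels - num_outliers:])
    normal = [c for c in range(n_channels) if c not in outliers]
    return normal + sorted(outliers)


def reorder_columns(x: Matrix, order: List[int]) -> Matrix:
    """Permute activation channels (columns)."""
    return [[row[c] for c in order] for row in x]


def reorder_rows(w: Matrix, order: List[int]) -> Matrix:
    """Permute weight input channels (rows) with the same order; done offline."""
    return [w[c] for c in order]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a_row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for a_row in a]


# Hypothetical calibration activations: channel 1 carries the outliers.
calib = [[0.1, 9.0, -0.2, 0.3], [0.2, -8.5, 0.1, -0.1]]
weights = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.0]]

order = channel_order(calib, num_outliers=1)
x = [calib[0]]
reference = matmul(x, weights)
reordered = matmul(reorder_columns(x, order), reorder_rows(weights, order))
print(order, reference, reordered)
```

With the outliers packed at the tail, a kernel can treat the leading block as uniform low-bit data and the trailing block as higher-precision data, with no scattered per-channel branching.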

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Requires custom fused GPU kernels and kernel maintenance across GPU generations.

Group quantization fusion introduces nontrivial kernel overhead that needs hardware support to fully amortize.

When Not To Use

You cannot modify or ship custom GPU kernels in your deployment.

You need exact FP16 numeric fidelity for a downstream task.

Failure Modes

If outliers are misidentified, quantization error can spike and break accuracy (W4A4 RTN showed huge PPL without outlier handling).

If group dequantization is not fused into the kernel, its overhead can keep achieved throughput well below the kernel's theoretical TOPS.
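The outlier failure mode can be reproduced on a toy row. The sketch below compares plain 4-bit round-to-nearest with a single scale against a mixed-precision variant that keeps one known outlier channel at 8 bits. The data, the outlier index, and the helper names are hypothetical; the point is only the direction of the error gap, not its magnitude on real models.

```python
from typing import List


def rtn(values: List[float], bits: int) -> List[float]:
    """Symmetric round-to-nearest with one scale; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def mse(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical activation row: index 6 is an outlier channel.
row = [0.4, -0.7, 0.2, 0.9, -0.5, 0.3, 60.0, -0.8]
outlier_idx = {6}

# Naive W4A4-style RTN: the outlier sets the scale, so small values collapse to zero.
naive = rtn(row, bits=4)

# Mixed precision: outlier channels at 8 bits, the rest at 4 bits with their own scale.
normal_q = iter(rtn([v for i, v in enumerate(row) if i not in outlier_idx], bits=4))
outlier_q = iter(rtn([v for i, v in enumerate(row) if i in outlier_idx], bits=8))
mixed = [next(outlier_q) if i in outlier_idx else next(normal_q)
         for i in range(len(row))]

naive_err = mse(row, naive)
mixed_err = mse(row, mixed)
print(naive_err, mixed_err)
```

If the outlier index were chosen wrongly, the mixed variant would fall back to the naive behavior for the remaining channels, which is why calibration quality matters.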

Core Entities

Models

Llama-7B, Llama-13B, Llama-30B, Llama-65B, Llama-2 (7B/13B/70B), Mixtral (8x7B)

Metrics

tokens per second, average decode latency per token (ms), perplexity, accuracy, TOPS (kernel throughput)

Datasets

WikiText2; PTB; C4; lm-eval (PIQA, ARC, BoolQ, HellaSwag, WinoGrande); ShareGPT (workload traces)

Benchmarks

accuracy, perplexity, throughput (tokens/s), decode latency (ms/token)