Atom: 4-bit weight+activation quantization that boosts LLM serving throughput up to 7.7× with minimal accuracy loss

Overview

Decision SnapshotNeeds Validation

Implemented kernels and end-to-end serving results show practical gains on NVIDIA GPUs; but Atom needs custom fused kernels, calibration, and offline weight processing which increases engineering effort.

Citations23

Evidence Strength0.70

Confidence0.85

Risk Signals10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci

Links

Abstract / PDF

Why It Matters For Business

Atom can multiply token throughput per GPU and shrink KV-cache memory, lowering cloud GPU costs or increasing capacity without major task-accuracy loss.

Who Should Care

CTO ML Engineer Engineering Lead Data Scientist Product Manager

Summary TLDR

Atom is a post-training quantization method and serving workflow that uses mixed-precision, per-group quantization, dynamic activation quantization, channel reordering, and KV-cache quantization. On Llama-family models Atom runs 4-bit weight+activation (W4A4) inference using custom fused GPU kernels and achieves up to 7.73× token throughput vs FP16 and 2.53× vs INT8 while keeping accuracy losses small (≤1.4% average zero-shot drop; ≈0.3 WikiText2 PPL rise on Llama-65B). Atom requires offline calibration and custom kernels but integrates end-to-end into a serving stack.

Problem Statement

LLM serving needs higher throughput and lower memory use. Existing weight-only quantization still forces FP math and can’t use low-bit hardware efficiently. Prior weight-activation methods at 4-bit lose accuracy. The problem: get practical, low-bit (e.g., INT4) weight+activation quantization that (1) runs on modern GPUs, (2) keeps high accuracy, and (3) increases serving throughput.

Main Contribution

Atom: a low-bit weight+activation quantization recipe combining mixed-precision (keep outlier channels at higher bits), fine-grained group quantization, dynamic activation quantization, and channel reordering to enable efficient mixed-precision kernels.

System and kernel co-design: fused GEMM and fused FlashInfer operators plus quantized KV-cache to reduce memory movement and exploit INT4/INT8 Tensor Cores.

Key Findings

Atom increases end-to-end serving throughput up to 7.73× vs FP16 and 2.53× vs INT8 under similar latency targets.

Numbers7.73× vs FP16; 2.53× vs INT8 (paper Figure 10 / §5.3.2)

Practical UseDeploying Atom's W4A4 kernels can multiply token-generation throughput multiple-fold on supported GPUs, letting you serve more requests per GPU or reduce GPU count.

Evidence Ref§5.3.2, Figure 10

Accuracy degradation at W4A4 is small: average zero-shot drop ≤1.4% and WikiText2 perplexity rise ≈0.3 for Llama-65B.

Numbers≤1.4% avg zero-shot drop; ≈0.3 WikiText2 PPL increase on Llama-65B (abstract, §5.2)

Practical UseYou can quantize large Llama models to 4-bit and keep near-baseline task accuracy for many zero-shot tasks.

Evidence RefAbstract; §5.2, Table 1–2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
End-to-end throughput	Up to 7.73× vs FP16; 2.53× vs INT8	FP16 / INT8	7.73× and 2.53×	Serving workloads (ShareGPT traces) (§5.3.2)	Figure 10; §5.3.2	Figure 10
Accuracy	≤1.4% average drop	FP16	≤1.4% absolute	lm-eval tasks (PIQA, ARC, BoolQ, HellaSwag, WinoGrande) (§5.2, Table 1)	Table 1; §5.2	Table 1

What To Try In 7 Days

Run Atom-style W4A4 post-training quantization on a small Llama model (7B) with 128-sample calibration to measure latency and accuracy changes.

Profile dense GEMM and attention kernels to compare FP16 vs INT8 vs W4A4 and confirm kernel-level speedups.

Quantize KV-cache and test larger batch sizes to see if you can serve more concurrent sessions within existing GPU RAM limits.

Optimization Features

Token Efficiency

larger batch sizes enabled by KV quantizationreduced memory movement for KV-cache

Infra Optimization

use INT4/INT8 Tensor Cores on modern NVIDIA GPUstested on RTX 4090 and RTX Ada 6000

Model Optimization

mixed-precision for outliers (keep 128 channels at INT8/FP16)fine-grained per-group quantization (group size 128)GPTQ for offline weight quantization

System Optimization

kernel fusion to hide quantization overheadreorder weights offline to maintain regular memory access

Inference Optimization

dynamic activation quantization (runtime stats)channel reordering to pack outliersfused GEMM and fused FlashInfer kernelsquantized KV-cache (asymmetric for KV)

Reproducibility

Code AvailableNo

Data AvailableNo

Open Source StatusUnknown

LicenseUnknown

Risks & Boundaries

Limitations

Requires custom fused GPU kernels and kernel maintenance across GPU generations.

Group quantization fusion introduces nontrivial kernel overhead that needs hardware support to fully amortize.

When Not To Use

You cannot modify or ship custom GPU kernels in your deployment.

You need exact FP16 numeric fidelity for a downstream task.

Failure Modes

If outliers are misidentified, quantization error can spike and break accuracy (W4A4 RTN showed huge PPL without outlier handling).

Group dequantization overhead can reduce theoretical kernel TOPS if not fused.

Core Entities

Models

Llama-7BLlama-13BLlama-30BLlama-65BLlama-2 (7B/13B/70B)Mixtral (8x7B)

Metrics

tokens per secondaverage decode latency per token (ms)perplexityAccuracyTOPS (kernel throughput)

Datasets

WikiText2PTBC4lm-eval (PIQA, ARC, BoolQ, HellaSwag, WinoGrande)ShareGPT (workload traces)

Benchmarks

Accuracyperplexitythroughput (tokens/s)decode latency (ms/token)

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Atom increases end-to-end serving throughput up to 7.73× vs FP16 and 2.53× vs INT8 under similar latency targets.

Accuracy degradation at W4A4 is small: average zero-shot drop ≤1.4% and WikiText2 perplexity rise ≈0.3 for Llama-65B.

Results

What To Try In 7 Days

Optimization Features

Reproducibility

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Quantizing large multilingual LLMs often hides big drops for non‑Latin languages and hard tasks

Key finding

Systematic benchmark shows small models can reason if trained and compressed carefully

Key finding

Pipeline that combines synthetic-data distillation, LoRA, Muon and GPTQ to make task-specialized LLMs fit on edge devices

Key finding

A practical survey of compression and speed tricks to run large language models on limited hardware

Key finding

Large-scale empirical benchmark showing how attention variants, PEFT, MoE, and int4 quantization trade performance for memory, latency, and

Key finding