Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
23
Why It Matters For Business
Atom can multiply token throughput per GPU and shrink KV-cache memory, lowering cloud GPU costs or increasing capacity without major task-accuracy loss.
Summary TLDR
Atom is a post-training quantization method and serving workflow that uses mixed-precision, per-group quantization, dynamic activation quantization, channel reordering, and KV-cache quantization. On Llama-family models Atom runs 4-bit weight+activation (W4A4) inference using custom fused GPU kernels and achieves up to 7.73× token throughput vs FP16 and 2.53× vs INT8 while keeping accuracy losses small (≤1.4% average zero-shot drop; ≈0.3 WikiText2 PPL rise on Llama-65B). Atom requires offline calibration and custom kernels but integrates end-to-end into a serving stack.
Problem Statement
LLM serving needs higher throughput and lower memory use. Existing weight-only quantization still forces FP math and can’t use low-bit hardware efficiently. Prior weight-activation methods at 4-bit lose accuracy. The problem: get practical, low-bit (e.g., INT4) weight+activation quantization that (1) runs on modern GPUs, (2) keeps high accuracy, and (3) increases serving throughput.
Main Contribution
Atom: a low-bit weight+activation quantization recipe combining mixed-precision (keep outlier channels at higher bits), fine-grained group quantization, dynamic activation quantization, and channel reordering to enable efficient mixed-precision kernels.
System and kernel co-design: fused GEMM and fused FlashInfer operators plus quantized KV-cache to reduce memory movement and exploit INT4/INT8 Tensor Cores.
Comprehensive evaluation on Llama-family models showing large throughput gains with small accuracy degradation, plus ablations quantifying each ingredient's impact.
Key Findings
Atom increases end-to-end serving throughput up to 7.73× vs FP16 and 2.53× vs INT8 under similar latency targets.
Accuracy degradation at W4A4 is small: average zero-shot drop ≤1.4% and WikiText2 perplexity rise ≈0.3 for Llama-65B.
Kernel-level speedups: fused GEMM at batch 512 gives 3.4× over FP16 and 1.9× over INT8; self-attention at batch 128 gives 3.5× over FP16 and 1.8× over INT8.
Outlier handling and group quantization are the main accuracy drivers: keeping 128 outlier channels in high precision reduces W4A4 PPL from thousands to ≈11 on Llama-7B; group quantization then reduces it to ≈6.
Results
End-to-end throughput
Accuracy
Perplexity change (Llama-65B, W4A4)
Kernel GEMM throughput
Self-attention speedup
Offline quantization time
Who Should Care
What To Try In 7 Days
Run Atom-style W4A4 post-training quantization on a small Llama model (7B) with 128-sample calibration to measure latency and accuracy changes.
Profile dense GEMM and attention kernels to compare FP16 vs INT8 vs W4A4 and confirm kernel-level speedups.
Quantize KV-cache and test larger batch sizes to see if you can serve more concurrent sessions within existing GPU RAM limits.
Optimization Features
Token Efficiency
- larger batch sizes enabled by KV quantization
- reduced memory movement for KV-cache
Infra Optimization
- use INT4/INT8 Tensor Cores on modern NVIDIA GPUs
- tested on RTX 4090 and RTX Ada 6000
Model Optimization
- mixed-precision for outliers (keep 128 channels at INT8/FP16)
- fine-grained per-group quantization (group size 128)
- GPTQ for offline weight quantization
System Optimization
- kernel fusion to hide quantization overhead
- reorder weights offline to maintain regular memory access
Inference Optimization
- dynamic activation quantization (runtime stats)
- channel reordering to pack outliers
- fused GEMM and fused FlashInfer kernels
- quantized KV-cache (asymmetric for KV)
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Requires custom fused GPU kernels and kernel maintenance across GPU generations.
- Group quantization fusion introduces nontrivial kernel overhead that needs hardware support to fully amortize.
- Evaluation focused on NVIDIA hardware (RTX 4090, Ada); other accelerators untested.
- Offline preprocessing can be slow for very large models (≈4 hours for Llama-65B).
When Not To Use
- You cannot modify or ship custom GPU kernels in your deployment.
- You need exact FP16 numeric fidelity for a downstream task.
- Target hardware lacks INT4/INT8 tensor core support or equivalent primitives.
Failure Modes
- If outliers are misidentified, quantization error can spike and break accuracy (W4A4 RTN showed huge PPL without outlier handling).
- Group dequantization overhead can reduce theoretical kernel TOPS if not fused.
- Dynamic quantization without fusion may increase latency.
Core Entities
Models
- Llama-7B
- Llama-13B
- Llama-30B
- Llama-65B
- Llama-2 (7B/13B/70B)
- Mixtral (8x7B)
Metrics
- tokens per second
- average decode latency per token (ms)
- perplexity
- Accuracy
- TOPS (kernel throughput)
Datasets
- WikiText2
- PTB
- C4
- lm-eval (PIQA, ARC, BoolQ, HellaSwag, WinoGrande)
- ShareGPT (workload traces)
Benchmarks
- Accuracy
- perplexity
- throughput (tokens/s)
- decode latency (ms/token)

