W4A8KV4 (4-bit weight, 8-bit activation, 4-bit KV) plus system kernels to double LLM serving throughput on common GPUs

Overview

Decision Snapshot: Ready For Pilot

The paper demonstrates open-source kernels and end-to-end benchmarks on A100/L40S with realistic setups and Dockerized artifacts; results are reproducible but need GPU-specific kernel work and validation on your workloads.

Citations: 6

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Yes

License: Apache-2.0

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han

Links

Abstract / PDF / Code

Why It Matters For Business

QServe turns 4-bit quantization into real GPU speedups and memory savings, cutting serving cost per token (authors report ~3× dollar cost reduction by using L40S+QServe versus A100+TensorRT-LLM).

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager

Summary TLDR

QServe introduces QoQ, a quantization algorithm, together with a CUDA-level runtime, both targeting W4A8KV4 precision (4-bit weights, 8-bit activations, 4-bit KV cache). The quantization method (progressive group quantization + SmoothAttention) keeps accuracy close to FP16. The runtime (compute-aware weight reorder, register-level parallelism, subtraction-after-multiplication) reduces dequantization overhead so GEMMs run on INT8 tensor cores. Measured gains: ~1.2–2.4× throughput on A100 and ~1.5–3.5× on L40S versus TensorRT-LLM. The code is released in the omniserve repository.

Problem Statement

Lower-bit quantization (e.g., 4-bit) should speed up LLM serving but existing 4-bit methods slow down in cloud/batched GPU serving because dequantization and partial-sum conversions run on slow CUDA cores. The paper addresses how to quantize and implement kernels so low-bit models actually run faster on real GPUs.

Main Contribution

QoQ quantization algorithm: progressive group quantization that maps W4A8 GEMM to INT8 tensor cores and SmoothAttention to preserve accuracy when KV is 4-bit.

QServe runtime and CUDA/PTX kernels: compute-aware weight reordering, register-level parallel unpacking, and subtraction-after-multiplication to shrink dequantization overhead.
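The progressive group quantization named in the first contribution above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy group size of 4 (the paper tests g128), extending each group's range to include zero, and rounding the integer scale up are all choices of this sketch, and the [-119, 119] protective range should be checked against the paper and the released code.

```python
def quantize_progressive(row, group_size=4):
    """Two-level quantization of one weight row (a list of floats)."""
    # Level 1: symmetric per-channel INT8, limited to the protective range [-119, 119].
    absmax = max(abs(w) for w in row) or 1.0
    s1 = absmax / 119.0
    q8 = [max(-119, min(119, round(w / s1))) for w in row]

    groups = []
    for start in range(0, len(q8), group_size):
        g = q8[start:start + group_size]
        # The group range is extended to include zero (assumption of this sketch).
        lo, hi = min(min(g), 0), max(max(g), 0)
        # Level 2: asymmetric UINT4 with an integer scale (rounded up) and zero point.
        s2 = max(1, (hi - lo + 14) // 15)
        z = round(-lo / s2)
        q4 = [max(0, min(15, round(v / s2) + z)) for v in g]
        groups.append((q4, s2, z))
    return s1, groups


def dequantize_progressive(s1, groups):
    """Return (INT8 intermediates, reconstructed floats)."""
    # INT4 -> INT8 uses integer arithmetic only; the float scale is applied last.
    int8_vals = [(q - z) * s2 for q4, s2, z in groups for q in q4]
    return int8_vals, [v * s1 for v in int8_vals]


row = [0.9, -0.35, 0.02, 0.4, -1.19, 0.77, 0.05, -0.6]
s1, groups = quantize_progressive(row)
int8_vals, restored = dequantize_progressive(s1, groups)
```

The point of the two levels is that the INT4 to INT8 step needs only integer operations, so the GEMM itself can run on INT8 tensor cores and the floating-point scale `s1` is applied once at the end.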

Key Findings

Prior 4-bit methods incur large runtime dequantization overhead on GPUs.

Numbers: 20–90% runtime overhead reported for dequantization

Practical Use: Do not expect raw 4-bit quantization to speed up cloud LLM serving unless you eliminate dequantization on slow CUDA cores.

Evidence Ref: Section 3, Fig. 18

QServe increases throughput versus TensorRT-LLM across tested models.

Numbers: average speedup of 2.36× on L40S and 1.68× on A100 (over the best TensorRT-LLM configuration)

Practical Use: Switching to QServe can more than double throughput on L40S and give a solid uplift on A100 under the same memory budget.

Evidence Ref: Fig. 15, Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
throughput (tokens/sec) on L40S	QServe 3656 (Llama-3-8B)	TensorRT-LLM W8A8 2634	1.39×	seq in=1024 out=512, same memory budget	Table 4 (L40S)	Table 4
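throughput (tokens/sec) on A100	QServe 3005 (Llama-3-8B)	TensorRT-LLM W8A8 2396	1.20× (reported; 3005/2396 ≈ 1.25×, so the reported ratio is presumably against the best TensorRT-LLM configuration rather than W8A8)	seq in=1024 out=512, same memory budget	Table 4 (A100)	Table 4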

What To Try In 7 Days

Clone omniserve/QServe and run provided Docker benchmark on a spare L40S or A100 to reproduce throughput numbers.

Quantize a 7B model with QoQ W4A8KV4 g128 and compare throughput/perplexity vs your current FP16 or W8A8 baseline.

Enable per-head dynamic KV4 quantization and validate long-context (LongBench) performance before production rollout.

Optimization Features

Token Efficiency

KV4 halves KV memory traffic vs KV8, which doubles the theoretical peak attention throughput
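The halving is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a Llama-3-8B-like shape (32 layers, 8 KV heads, head dimension 128); these numbers and the helper name are assumptions of the example, and the per-head scales and zero points that a real KV4 cache also stores are ignored.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bits):
    """Bytes of KV cache written per generated token (quantization metadata excluded)."""
    # Keys and values are both cached, hence the factor of 2.
    return 2 * layers * kv_heads * head_dim * bits // 8


# Assumed Llama-3-8B-like shape; adjust for your model.
shape = dict(layers=32, kv_heads=8, head_dim=128)

fp16_bytes = kv_bytes_per_token(bits=16, **shape)
kv8_bytes = kv_bytes_per_token(bits=8, **shape)
kv4_bytes = kv_bytes_per_token(bits=4, **shape)

print(fp16_bytes // 1024, kv8_bytes // 1024, kv4_bytes // 1024)  # KiB per token
```

Under these assumptions the cache shrinks from 128 KiB per token in FP16 to 64 KiB in KV8 and 32 KiB in KV4, so the same memory budget holds roughly twice the batch or context of KV8.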

Infra Optimization

custom CUDA/PTX kernels and assembly for GEMM and attention

optimizations targeted to A100 and L40S roofline properties

Model Optimization

progressive group quantization (INT8 intermediate then INT4 groups)

SmoothAttention (scale down key outliers)

activation-aware channel reordering

block input rotation and block output smoothing

weight clipping tuned to layer output MSE
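The SmoothAttention item above rests on a simple identity: dividing each key channel by a factor and multiplying the matching query channel by the same factor leaves the attention scores unchanged while shrinking key outliers. The sketch below illustrates that identity only. The function names, the exponent alpha = 0.5, and the toy data are assumptions of the example; how the factors are folded into the projection weights in practice is not shown.

```python
def smoothing_factors(keys, alpha=0.5):
    """Per-channel factor: (largest key magnitude in that channel) ** alpha."""
    dim = len(keys[0])
    return [(max(abs(k[i]) for k in keys) ** alpha) or 1.0 for i in range(dim)]


def apply_smoothing(queries, keys, lam):
    """Scale queries up and keys down by the same per-channel factors."""
    q_s = [[x * l for x, l in zip(q, lam)] for q in queries]
    k_s = [[x / l for x, l in zip(k, lam)] for k in keys]
    return q_s, k_s


def scores(queries, keys):
    """Plain dot-product attention logits Q K^T."""
    return [[sum(a * b for a, b in zip(q, k)) for k in keys] for q in queries]


# Channel 1 of the keys carries an outlier.
keys = [[0.5, 16.0, -0.3], [-0.2, -9.0, 0.4]]
queries = [[1.0, 0.1, -1.0], [0.3, -0.2, 0.8]]

lam = smoothing_factors(keys)
q_s, k_s = apply_smoothing(queries, keys, lam)

before, after = scores(queries, keys), scores(q_s, k_s)
largest_key_before = max(abs(x) for k in keys for x in k)
largest_key_after = max(abs(x) for k in k_s for x in k)
```

In this toy case the largest key magnitude drops from 16 to 4 while the scores match up to floating-point error, which is what makes the keys easier to store in 4 bits.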

System Optimization

compute-aware weight reorder for 128-bit packed loads

paged per-head dynamic KV quantization with in-page FP16 scales

FP16 QK and SV conversions to delay roofline turn point

asynchronous prefetch of dequant params
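The per-head dynamic KV quantization listed above can be pictured as computing a fresh scale and zero point for every head of every new token. The following is a minimal sketch under that reading; the function names are hypothetical, paging is omitted, and the scale and offset are kept as Python floats rather than the FP16 values that would sit in the page next to the 4-bit data.

```python
def quantize_head_kv4(vec):
    """Asymmetric 4-bit quantization of one head's key or value vector for one token."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 15 or 1.0  # a constant vector gets a dummy scale
    codes = [max(0, min(15, round((x - lo) / scale))) for x in vec]
    return codes, scale, lo


def dequantize_head_kv4(codes, scale, offset):
    """Recover approximate values from 4-bit codes and per-head metadata."""
    return [c * scale + offset for c in codes]


head_vec = [0.12, -1.4, 0.85, 2.3, -0.07, 0.0, 1.1, -0.9]
codes, scale, offset = quantize_head_kv4(head_vec)
approx = dequantize_head_kv4(codes, scale, offset)
max_err = max(abs(a - b) for a, b in zip(head_vec, approx))
```

Because the scale is recomputed per head and per token, a single outlier head does not inflate the quantization step for the others; the rounding error stays within half a step of that head's own range.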

Inference Optimization

W4A8KV4 precision (4-bit weight, 8-bit activation, 4-bit KV)

per-channel and per-group quantization variants (g128 tested)

subtraction-after-multiplication to move zero-point cost to epilogue

register-level parallel unpacking of UINT4→INT8
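The last two items above can be illustrated with ordinary integer arithmetic. The sketch below is an assumption-laden illustration rather than the released kernel: the interleaved packing layout and the function names are hypothetical, and Python integers stand in for a 32-bit GPU register.

```python
def pack_uint4(vals):
    """Pack eight UINT4 values into one 32-bit word: byte i holds vals[i]
    in its low nibble and vals[i + 4] in its high nibble."""
    assert len(vals) == 8 and all(0 <= v <= 15 for v in vals)
    word = 0
    for i in range(4):
        word |= (vals[i] | (vals[i + 4] << 4)) << (8 * i)
    return word


def split_bytes(word):
    """Read the four bytes of a 32-bit word (for inspection only)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def unpack_uint4(word):
    """Unpack all eight values with three logical operations on the whole word."""
    low = word & 0x0F0F0F0F           # AND: four values, one per byte
    high = (word >> 4) & 0x0F0F0F0F   # shift + AND: the other four values
    return split_bytes(low) + split_bytes(high)


def dequant_sub_after_mul(codes, scale, zero):
    """Compute (q - zero) * scale as q * scale - zero * scale."""
    zero_times_scale = zero * scale   # precomputed once per group
    return [q * scale - zero_times_scale for q in codes]


codes = [1, 2, 3, 4, 5, 6, 7, 8]
word = pack_uint4(codes)
unpacked = unpack_uint4(word)
int8_vals = dequant_sub_after_mul(unpacked, scale=9, zero=4)
```

Multiplying first keeps each product non-negative and within one byte (at most 15 × 16 = 240 when the scale is capped at 16, as in the sketches here), so four values can share a register without borrowing across byte boundaries; the precomputed `zero * scale` term is then subtracted in a second pass.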

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Yes

License: Apache-2.0

Code URLs

https://github.com/mit-han-lab/omniserve

Risks & Boundaries

Limitations

Kernel-level optimizations require CUDA/PTX coding and target A100/Hopper/L40S characteristics; porting to other hardware may need rework.

Some accuracy loss persists versus FP16; not suitable when exact FP16 parity is required.

When Not To Use

When bit-exact FP16 outputs are required for downstream systems.

On hardware without comparable INT8/INT4 tensor core support or without ability to run custom CUDA kernels.

Failure Modes

Incorrect protective range handling can produce overflow during dequantization and corrupt results.

If dequantization or attention prefetching is not applied, kernels can become compute-bound and be slower than baselines.
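The protective-range failure mode above can be reproduced with a small exhaustive check. The sketch assumes one particular integer quantizer (group range extended to include zero, scale rounded up, UINT4 zero point) and the [-119, 119] protective range described in the paper; the function name and these details are assumptions of the example, not the released kernel's exact arithmetic.

```python
def dequantized_extremes(limit_lo, limit_hi):
    """Smallest and largest INT8 intermediate produced when the level-1
    values of a group are confined to [limit_lo, limit_hi]."""
    worst_lo, worst_hi = 0, 0
    for lo in range(limit_lo, 1):
        for hi in range(0, limit_hi + 1):
            scale = max(1, (hi - lo + 14) // 15)  # integer scale, rounded up
            zero = round(-lo / scale)             # UINT4 zero point
            # Dequantization is monotonic in its input, so the group's
            # endpoints give its extreme outputs.
            for v in (lo, hi):
                q = max(0, min(15, round(v / scale) + zero))
                out = (q - zero) * scale
                worst_lo, worst_hi = min(worst_lo, out), max(worst_hi, out)
    return worst_lo, worst_hi


full_range = dequantized_extremes(-128, 127)
protected = dequantized_extremes(-119, 119)
print("full INT8 range:", full_range)
print("protective range:", protected)
```

With the full INT8 range, a group spanning -128 to 127 gets scale 17 and zero point 8, so -128 dequantizes to -136 and no longer fits in a signed byte. Restricting level-1 values to [-119, 119] caps the scale at 16 and the rounding error at 8, which keeps every intermediate inside [-128, 127].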

Core Entities

Models

Llama-3-8B, Llama-2-7B, Llama-2-13B, Llama-2-70B, Mistral-7B, Mixtral-8x7B, Yi-34B, Qwen1.5-72B

Metrics

throughput (tokens/sec), perplexity (lower is better), accuracy, latency (ms)

Datasets

WikiText2, PIQA, ARC, HellaSwag, WinoGrande, LongBench

Benchmarks

perplexity, accuracy, tokens/second throughput, long-context metrics (LongBench)
