W4A8KV4 (4-bit weight, 8-bit activation, 4-bit KV) plus system kernels to double LLM serving throughput on common GPUs

May 7, 2024 · 9 min

Overview

Decision Snapshot: Ready For Pilot

The paper demonstrates open-source kernels and end-to-end benchmarks on A100/L40S with realistic setups and Dockerized artifacts; results are reproducible but need GPU-specific kernel work and validation on your workloads.

Citations: 6

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Yes

License: Apache-2.0

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han

Links

Abstract / PDF / Code

Why It Matters For Business

QServe turns 4-bit quantization into real GPU speedups and memory savings, cutting serving cost per token (authors report ~3× dollar cost reduction by using L40S+QServe versus A100+TensorRT-LLM).

Who Should Care

Summary TLDR

QServe introduces QoQ, a quantization algorithm, together with a CUDA-level runtime; both target W4A8KV4 precision (4-bit weights, 8-bit activations, 4-bit KV cache). The quantization method (progressive group quantization + SmoothAttention) keeps accuracy near FP16. The runtime (compute-aware weight reorder, register-level parallelism, subtraction-after-multiplication) reduces dequantization overhead so GEMMs run on INT8 tensor cores. Measured gains: ~1.2–2.4× throughput on A100 and ~1.5–3.5× on L40S versus TensorRT-LLM. The code is released in the omniserve repository.
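The subtraction-after-multiplication idea named in the summary can be illustrated with a small identity check. This is a minimal sketch in plain Python, not the QServe kernel: the function names are hypothetical, and it assumes a single per-channel scale `s` and zero point `z`. It shows that the zero-point term can be applied once after the integer dot product, using the sum of the activations, instead of being subtracted from every weight inside the main loop.

```python
def dot_subtract_first(x, q, z, s):
    """Reference path: dequantize every weight, then accumulate."""
    return sum(xi * ((qi - z) * s) for xi, qi in zip(x, q))


def dot_subtract_in_epilogue(x, q, z, s):
    """Main loop is a pure integer dot product; the zero point is applied once."""
    acc = sum(xi * qi for xi, qi in zip(x, q))  # integer multiply-accumulate only
    x_sum = sum(x)                              # per-token activation sum
    return s * acc - (z * s) * x_sum            # scale and zero-point correction


# Hypothetical INT8 activations and UINT4 weight codes for illustration.
x = [3, -7, 12, 5]
q = [0, 15, 8, 3]
z, s = 8, 0.5

reference = dot_subtract_first(x, q, z, s)
reordered = dot_subtract_in_epilogue(x, q, z, s)
```

Both paths return the same value because the sum of `x_k * (q_k - z) * s` equals `s * sum(x_k * q_k) - (z * s) * sum(x_k)`; only the second form keeps the inner loop free of per-weight subtraction.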

Problem Statement

Lower-bit quantization (e.g., 4-bit) should speed up LLM serving but existing 4-bit methods slow down in cloud/batched GPU serving because dequantization and partial-sum conversions run on slow CUDA cores. The paper addresses how to quantize and implement kernels so low-bit models actually run faster on real GPUs.

Main Contribution

QoQ quantization algorithm: progressive group quantization that maps W4A8 GEMM to INT8 tensor cores and SmoothAttention to preserve accuracy when KV is 4-bit.

QServe runtime and CUDA/PTX kernels: compute-aware weight reordering, register-level parallel unpacking, and subtraction-after-multiplication to shrink dequantization overhead.
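The progressive group quantization listed above can be sketched as a two-level scheme: floats are first mapped to an INT8 intermediate per channel, then each group of INT8 values is mapped to UINT4 with an integer scale and zero point. This is an illustrative sketch under stated assumptions, not the authors' implementation: the protective clamp value of 119, the ceiling-based integer scale, the group size of 4, and all function names are assumptions chosen for the example.

```python
import math

PROTECTIVE_MAX = 119  # assumption: INT8 intermediates are clamped to [-119, 119]


def quantize_progressive(weights, group_size):
    """Quantize one output channel (a list of floats) in two levels."""
    # Level 1: symmetric per-channel INT8 with a floating-point scale s0.
    max_abs = max(abs(w) for w in weights)
    s0 = max_abs / PROTECTIVE_MAX if max_abs > 0 else 1.0
    q0 = [max(-PROTECTIVE_MAX, min(PROTECTIVE_MAX, round(w / s0))) for w in weights]

    # Level 2: asymmetric per-group UINT4 with integer scale s1 and zero point z.
    groups = []
    for start in range(0, len(q0), group_size):
        g = q0[start:start + group_size]
        lo, hi = min(min(g), 0), max(max(g), 0)  # keep zero representable
        s1 = max(1, math.ceil((hi - lo) / 15))   # integer scale (assumed rounding rule)
        z = round(-lo / s1)                      # zero point, lands in 0..15
        codes = [max(0, min(15, round(v / s1) + z)) for v in g]
        groups.append((codes, s1, z))
    return s0, groups


def dequantize_progressive(s0, groups):
    """Return the INT8 intermediates and the reconstructed float weights."""
    int8_vals = [(c - z) * s1 for codes, s1, z in groups for c in codes]
    return int8_vals, [v * s0 for v in int8_vals]


# Hypothetical weights for one channel, split into two groups of four.
weights = [0.9, -0.35, 0.12, -1.19, 0.02, 0.4, -0.05, 0.77]
s0, groups = quantize_progressive(weights, group_size=4)
int8_vals, restored = dequantize_progressive(s0, groups)
```

In this sketch the level-2 dequantization `(code - z) * s1` uses only integers and stays inside the INT8 range, which is the property that would let the result feed INT8 tensor cores directly; the floating-point scale `s0` is applied once per channel afterwards.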

Key Findings

Prior 4-bit methods incur large runtime dequantization overhead on GPUs.

Numbers: 20–90% runtime overhead reported for dequantization

Practical Use: Do not expect raw 4-bit quantization to speed up cloud LLM serving unless you eliminate dequantization on slow CUDA cores.

Evidence Ref: Section 3, Fig. 18

QServe increases throughput versus TensorRT-LLM across tested models.

Numbers: average speedup of 2.36× on L40S and 1.68× on A100 (over the best TensorRT-LLM configuration)

Practical Use: Switching to QServe can more than double throughput on L40S and give a solid uplift on A100 under the same memory budget.

Evidence Ref: Fig. 15, Table 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
throughput (tokens/sec) on L40S | QServe 3656 (Llama-3-8B) | TensorRT-LLM W8A8 2634 | 1.39× | seq in=1024 out=512, same memory budget | Table 4 (L40S) | Table 4
throughput (tokens/sec) on A100 | QServe 3005 (Llama-3-8B) | TensorRT-LLM W8A8 2396 | 1.20× | seq in=1024 out=512, same memory budget | Table 4 (A100) | Table 4

What To Try In 7 Days

Clone omniserve/QServe and run provided Docker benchmark on a spare L40S or A100 to reproduce throughput numbers.

Quantize a 7B model with QoQ W4A8KV4 g128 and compare throughput/perplexity vs your current FP16 or W8A8 baseline.

Enable per-head dynamic KV4 quantization and validate long-context (LongBench) performance before production rollout.
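Per-head dynamic KV4 quantization, as mentioned in the last step above, can be sketched as follows. This is a minimal illustration under assumptions, not the QServe code path: each head of each token gets its own scale and offset computed at runtime, the offset is stored as the head's minimum value, Python floats stand in for FP16, and all function names are hypothetical.

```python
def quantize_kv_head(values):
    """Asymmetric 4-bit quantization of one head's key (or value) vector."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 if hi > lo else 1.0  # 16 levels span [lo, hi]
    codes = [max(0, min(15, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_kv_head(codes, scale, lo):
    """Map 4-bit codes back to approximate float values."""
    return [c * scale + lo for c in codes]


def quantize_token(heads):
    """Quantize every head of one token independently (the 'per-head' part)."""
    return [quantize_kv_head(h) for h in heads]


# Hypothetical keys for one token with two heads.
token_keys = [
    [0.10, -0.20, 0.05, 0.30],   # head 0: small dynamic range
    [8.00, -6.50, 0.40, 12.00],  # head 1: contains outliers
]
quantized = quantize_token(token_keys)
restored = [dequantize_kv_head(*q) for q in quantized]
```

Because each head carries its own scale, the outliers in head 1 do not coarsen the quantization grid of head 0. When validating on long-context tasks, the per-head reconstruction error is a useful quantity to log alongside task scores.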

Optimization Features

Token Efficiency: KV4 halves KV memory traffic vs KV8 (2× theoretical attention peak)
Infra Optimization: custom CUDA/PTX kernels and assembly for GEMM and attention; optimizations targeted to A100 and L40S roofline properties
Model Optimization: progressive group quantization (INT8 intermediate then INT4 groups); SmoothAttention (scale down key outliers); activation-aware channel reordering; block input rotation and block output smoothing; weight clipping tuned to layer output MSE
System Optimization: compute-aware weight reorder for 128-bit packed loads; paged per-head dynamic KV quantization with in-page FP16 scales; FP16 QK and SV conversions to delay roofline turn point; asynchronous prefetch of dequant params
Inference Optimization: W4A8KV4 precision (4-bit weight, 8-bit activation, 4-bit KV); per-channel and per-group quantization variants (g128 tested); subtraction-after-multiplication to move zero-point cost to epilogue; register-level parallel unpacking of UINT4→INT8
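The register-level parallel unpacking of UINT4→INT8 listed above can be emulated in plain Python on a 32-bit word. This is a sketch under assumptions rather than the actual CUDA/PTX kernel: the interleaved packing order and the function names are hypothetical, chosen so that two mask operations recover eight values in their original order.

```python
MASK = 0x0F0F0F0F  # selects the low nibble of each of the four bytes


def pack_interleaved(vals):
    """Pack eight UINT4 values into one 32-bit word.

    Byte i holds vals[i] in its low nibble and vals[i + 4] in its high
    nibble (assumed layout, prepared offline when weights are reordered).
    """
    assert len(vals) == 8 and all(0 <= v <= 15 for v in vals)
    word = 0
    for i in range(4):
        word |= (vals[i] | (vals[i + 4] << 4)) << (8 * i)
    return word


def unpack_parallel(word):
    """Unpack eight UINT4 values with three logic operations in total."""
    low = word & MASK          # four byte lanes holding vals[0..3]
    high = (word >> 4) & MASK  # four byte lanes holding vals[4..7]
    return low, high


def lanes(register):
    """Split a 32-bit register into its four byte lanes (for inspection only)."""
    return [(register >> (8 * i)) & 0xFF for i in range(4)]


vals = [1, 2, 3, 4, 5, 6, 7, 15]
low, high = unpack_parallel(pack_interleaved(vals))
```

Each output register then holds four byte-sized values, so one mask and one shift-plus-mask handle eight weights at once instead of eight separate extract operations. On a GPU the same pattern would operate on actual 32-bit registers.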

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Yes
License: Apache-2.0

Risks & Boundaries

Limitations

Kernel-level optimizations require CUDA/PTX coding and target A100/Hopper/L40S characteristics; porting to other hardware may need rework.

Some accuracy loss persists versus FP16; not suitable when exact FP16 parity is required.

When Not To Use

When bit-exact FP16 outputs are required for downstream systems.

On hardware without comparable INT8/INT4 tensor core support or without ability to run custom CUDA kernels.

Failure Modes

Incorrect protective range handling can produce overflow during dequantization and corrupt results.

If dequantization or attention prefetching is not applied, kernels can become compute-bound and be slower than baselines.
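The overflow failure mode above can be reproduced with a toy example. This is a hedged sketch, not the paper's kernel: it assumes an integer per-group scale chosen by ceiling division, a protective clamp of 119 on the INT8 intermediate, and hypothetical function names. It compares a group whose INT8 values reach 127 against one limited to 119.

```python
import math


def wrap_int8(x):
    """Emulate two's-complement wraparound of an 8-bit signed register."""
    return (x + 128) % 256 - 128


def int4_roundtrip(group):
    """Quantize INT8 values to UINT4 per group, then dequantize in integers."""
    lo, hi = min(min(group), 0), max(max(group), 0)
    s1 = max(1, math.ceil((hi - lo) / 15))  # integer group scale (assumed rule)
    z = round(-lo / s1)                     # zero point in 0..15
    codes = [max(0, min(15, round(v / s1) + z)) for v in group]
    exact = [(c - z) * s1 for c in codes]   # mathematically correct result
    return exact, [wrap_int8(v) for v in exact]


# Without a protective range: the INT8 intermediate uses the full +127.
exact_unsafe, stored_unsafe = int4_roundtrip([127, -113])

# With the assumed protective range: values are limited to [-119, 119].
exact_safe, stored_safe = int4_roundtrip([119, -113])
```

In the first case rounding pushes the dequantized value to 128, which wraps to -128 in an 8-bit register and flips the sign of a large weight. In the second case the headroom left by the clamp absorbs the rounding error, so the stored values match the exact ones.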

Core Entities

Models

Llama-3-8B, Llama-2-7B, Llama-2-13B, Llama-2-70B, Mistral-7B, Mixtral-8x7B, Yi-34B, Qwen1.5-72B

Metrics

throughput (tokens/sec), perplexity (lower is better), accuracy, latency (ms)

Datasets

WikiText2, PIQA, ARC, HellaSwag, WinoGrande, LongBench

Benchmarks

perplexity, accuracy, tokens/second throughput, long-context metrics (LongBench)