W4A8KV4 (4-bit weight, 8-bit activation, 4-bit KV) plus system kernels to double LLM serving throughput on common GPUs

May 7, 2024 · 9 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

6

Authors

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han

Links

Abstract / PDF

Why It Matters For Business

QServe turns 4-bit quantization into real GPU speedups and memory savings, cutting serving cost per token (authors report ~3× dollar cost reduction by using L40S+QServe versus A100+TensorRT-LLM).

Summary TLDR

QServe pairs QoQ, a quantization algorithm, with a CUDA-level serving runtime; both target W4A8KV4 precision (4-bit weights, 8-bit activations, 4-bit KV cache). QoQ (progressive group quantization plus SmoothAttention) keeps accuracy close to FP16. The runtime (compute-aware weight reordering, register-level parallelism, subtraction after multiplication) cuts dequantization overhead so that GEMMs run on INT8 tensor cores. Measured gains are ~1.2–2.4× throughput on A100 and ~1.4–3.5× on L40S versus TensorRT-LLM. Code is released in the OmniServe repository.

Problem Statement

Lower-bit quantization (e.g., 4-bit) should speed up LLM serving, but existing 4-bit methods lose their advantage in batched cloud GPU serving because dequantization of weights and partial sums runs on slow CUDA cores. The paper asks how to quantize models and implement kernels so that low-bit models actually run faster on real GPUs.

Main Contribution

QoQ quantization algorithm: progressive group quantization that maps W4A8 GEMM to INT8 tensor cores and SmoothAttention to preserve accuracy when KV is 4-bit.

QServe runtime and CUDA/PTX kernels: compute-aware weight reordering, register-level parallel unpacking, and subtraction-after-multiplication to shrink dequantization overhead.

KV4 attention optimizations: per-head dynamic KV quantization, FP16 conversion, bit tricks and prefetching to move attention kernels out of the CUDA-core compute-bound region.

Extensive measurements across 7+ LLMs on A100 and L40S showing consistent throughput gains and modest accuracy loss compared to FP16 and other quantization methods.

Open-source release (OmniServe / QServe) with reproducible benchmarking scripts and Docker image.
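The progressive group quantization named in the first contribution can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the protective range of 119, the ceiling-based integer group scale, the group size of 4, and the function names are all assumptions made for readability.

```python
PROTECT = 119  # assumed protective INT8 range, leaving headroom for level-2 rounding
GROUP = 4      # tiny group for readability; the summary's tested setting is g128

def quantize_channel(weights):
    """Two-level quantization of one output channel; returns (s1, groups)."""
    # Level 1: symmetric per-channel INT8 with a floating-point scale s1.
    s1 = max(abs(w) for w in weights) / PROTECT or 1.0
    q8 = [max(-PROTECT, min(PROTECT, round(w / s1))) for w in weights]
    groups = []
    for i in range(0, len(q8), GROUP):
        g = q8[i:i + GROUP]
        lo, hi = min(min(g), 0), max(max(g), 0)  # keep zero representable
        # Level 2: asymmetric per-group UINT4 with integer scale s2 and zero point z.
        s2 = max(1, -(-(hi - lo) // 15))         # ceil((hi - lo) / 15)
        z = round(-lo / s2)
        q4 = [max(0, min(15, round(x / s2) + z)) for x in g]
        groups.append((s2, z, q4))
    return s1, groups

def dequantize_to_int8(groups):
    """Integer-only step: UINT4 codes back to INT8 values for an INT8 GEMM."""
    return [(q - z) * s2 for s2, z, q4 in groups for q in q4]

weights = [0.10, -0.25, 0.03, 0.40, -0.90, 0.55, 0.20, -0.05]
s1, groups = quantize_channel(weights)
int8_vals = dequantize_to_int8(groups)
approx = [v * s1 for v in int8_vals]  # the float scale is applied once, after the GEMM
```

The point of the two levels is that the per-group step uses only integer arithmetic and lands back in INT8, so the floating-point scale `s1` can be applied once per output channel rather than inside the inner loop.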

Key Findings

Prior 4-bit methods incur large runtime dequantization overhead on GPUs.

Numbers: 20–90% runtime overhead reported for dequantization

QServe increases throughput versus TensorRT-LLM across tested models.

Numbers: average speedup of 2.36× on L40S and 1.68× on A100 (over the best TensorRT-LLM configuration)

Largest models benefit most on L40S with QServe.

Numbers: Qwen1.5-72B: 3.5× on L40S, 2.4× on A100 versus TensorRT-LLM

QoQ keeps accuracy close to FP16 on standard NLP tests.

Numbers: Llama-2 zero-shot average loss vs FP16: 1.03% (7B), 0.89% (13B), 0.40% (70B)

Progressive group quantization lets GEMMs run on INT8 tensor cores, cutting GEMM cost.

Numbers: W4A8 per-group GEMM achieves 1.5× speedup over W8A8 cuBLAS GEMM in their kernels

Optimized KV4 attention kernel reduces latency significantly.

Numbers: fused KV4 kernel latency reduced from 0.48 ms to 0.28 ms (1.7×) on A100

Results

throughput (tokens/sec) on L40S

Value: QServe 3656 (Llama-3-8B)

Baseline: TensorRT-LLM W8A8 2634

throughput (tokens/sec) on A100

Value: QServe 3005 (Llama-3-8B)

Baseline: TensorRT-LLM W8A8 2396

throughput speedup (aggregate)

Value: average 2.36× (L40S) / 1.68× (A100)

Baseline: TensorRT-LLM best config

Accuracy

Value: ≈1.03% (7B), 0.89% (13B), 0.40% (70B) average zero-shot accuracy loss on Llama-2

Baseline: FP16

KV4 fused attention latency (A100)

Value: 0.28 ms (optimized QServe)

Baseline: 0.48 ms (naive KV4)
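As a quick sanity check, the per-model ratios implied by the figures above can be recomputed directly. The dictionary keys are illustrative names; the numbers are the ones reported in this section.

```python
# Reported Llama-3-8B throughput in tokens/sec.
reported = {
    "L40S": {"qserve": 3656, "trt_llm_w8a8": 2634},
    "A100": {"qserve": 3005, "trt_llm_w8a8": 2396},
}
speedups = {gpu: r["qserve"] / r["trt_llm_w8a8"] for gpu, r in reported.items()}
kv4_latency_ratio = 0.48 / 0.28  # naive KV4 latency over optimized KV4 latency

for gpu, s in speedups.items():
    print(f"{gpu}: {s:.2f}x")                       # L40S: 1.39x, A100: 1.25x
print(f"KV4 attention: {kv4_latency_ratio:.2f}x")   # 1.71x
```

The Llama-3-8B ratios sit below the 2.36× and 1.68× averages, which is consistent with the finding that the largest models benefit most.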

Who Should Care

What To Try In 7 Days

Clone omniserve/QServe and run the provided Docker benchmark on a spare L40S or A100 to reproduce the throughput numbers.

Quantize a 7B model with QoQ W4A8KV4 g128 and compare throughput/perplexity vs your current FP16 or W8A8 baseline.

Enable per-head dynamic KV4 quantization and validate long-context (LongBench) performance before production rollout.

Optimization Features

Token Efficiency

  • KV4 halves KV cache memory traffic versus KV8, doubling the theoretical peak attention throughput
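The halving can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes the Llama-2-7B shape (32 layers, 32 KV heads, head dimension 128), which comes from the published model configuration rather than from this summary, and it ignores the small overhead of per-head scales and zero points.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bits):
    """Bytes of KV cache written per generated token (keys plus values)."""
    return 2 * layers * kv_heads * head_dim * bits // 8

shape = dict(layers=32, kv_heads=32, head_dim=128)  # assumed Llama-2-7B configuration
fp16 = kv_bytes_per_token(**shape, bits=16)
kv8 = kv_bytes_per_token(**shape, bits=8)
kv4 = kv_bytes_per_token(**shape, bits=4)

for name, size in [("FP16", fp16), ("KV8", kv8), ("KV4", kv4)]:
    print(f"{name}: {size // 1024} KiB per token")  # 512, 256, 128
```

Because decoding attention is typically bound by how fast the cache can be read, halving the bytes per token is what gives the 2× theoretical ceiling.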

Infra Optimization

  • custom CUDA/PTX kernels and assembly for GEMM and attention
  • optimizations targeted to A100 and L40S roofline properties

Model Optimization

  • progressive group quantization (INT8 intermediate then INT4 groups)
  • SmoothAttention (scale down key outliers)
  • activation-aware channel reordering
  • block input rotation and block output smoothing
  • weight clipping tuned to layer output MSE
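The SmoothAttention idea of scaling down key outliers can be illustrated as follows. This is a sketch under stated assumptions: each key channel is divided by a factor derived from its largest magnitude, the matching query channel is multiplied by the same factor so attention scores are unchanged, and the exponent `alpha=0.5` is an assumed default. Any constraint imposed by rotary position embeddings is left out.

```python
def smooth(queries, keys, alpha=0.5):
    """Scale key channels down and query channels up by the same per-channel factor."""
    dim = len(keys[0])
    lam = [max(abs(k[c]) for k in keys) ** alpha or 1.0 for c in range(dim)]
    q_s = [[q[c] * lam[c] for c in range(dim)] for q in queries]
    k_s = [[k[c] / lam[c] for c in range(dim)] for k in keys]
    return q_s, k_s

def scores(queries, keys):
    """Plain dot-product attention logits."""
    return [[sum(a * b for a, b in zip(q, k)) for k in keys] for q in queries]

queries = [[1.0, 0.5, -2.0]]
keys = [[0.5, 16.0, -0.3], [-0.4, -9.0, 0.2]]  # channel 1 carries the outliers
q_s, k_s = smooth(queries, keys)

before = scores(queries, keys)
after = scores(q_s, k_s)
key_max_before = max(abs(v) for k in keys for v in k)  # 16.0
key_max_after = max(abs(v) for k in k_s for v in k)    # 4.0
```

The logits are mathematically identical before and after, while the key range shrinks, which is what makes the keys easier to store in 4 bits.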

System Optimization

  • compute-aware weight reorder for 128-bit packed loads
  • paged per-head dynamic KV quantization with in-page FP16 scales
  • converting the QK and SV computations to FP16 to delay the roofline turning point
  • asynchronous prefetch of dequant params
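Per-head dynamic KV quantization, as listed above, means each head's vector gets its own scale and zero point computed at runtime when the token is written to the cache. A minimal sketch, assuming plain min/max asymmetric quantization and hypothetical function names (the real kernels store the scales in FP16 inside each page):

```python
def quantize_head(vec):
    """Asymmetric UINT4 quantization of one head's key or value vector for one token."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 15 or 1.0  # 15 steps cover the 16 UINT4 codes
    codes = [max(0, min(15, round((x - lo) / scale))) for x in vec]
    return codes, scale, lo

def dequantize_head(codes, scale, zero):
    return [c * scale + zero for c in codes]

head0 = [0.1, -0.2, 0.05, 0.3]   # narrow range
head1 = [4.0, -6.0, 1.5, 2.5]    # wide range, would dominate a shared scale
results = []
for head in (head0, head1):
    codes, scale, zero = quantize_head(head)
    results.append((head, dequantize_head(codes, scale, zero), scale))
```

With a shared scale, the narrow head would collapse onto one or two codes; separate per-head parameters keep the rounding error of each head within half of its own step size.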

Inference Optimization

  • W4A8KV4 precision (4-bit weight, 8-bit activation, 4-bit KV)
  • per-channel and per-group quantization variants (g128 tested)
  • subtraction-after-multiplication to move zero-point cost to epilogue
  • register-level parallel unpacking of UINT4→INT8
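The last two items can be shown together on a toy dot product. The sketch below simulates a 32-bit register with a Python integer. The interleaved nibble layout, the function names, and the per-row zero point are assumptions for illustration, not the layout used by the released kernels.

```python
MASK = 0x0F0F0F0F  # low nibble of every byte in a 32-bit word

def pack_interleaved(codes):
    """Pack 8 UINT4 codes: codes[0..3] in the low nibbles, codes[4..7] in the high nibbles."""
    assert len(codes) == 8 and all(0 <= c <= 15 for c in codes)
    word = 0
    for i in range(4):
        word |= (codes[i] | (codes[i + 4] << 4)) << (8 * i)
    return word

def unpack_parallel(word):
    """AND, shift, AND: three logical ops produce two words of four byte lanes each."""
    return word & MASK, (word >> 4) & MASK

def lanes(word):
    """Read the four byte lanes of a 32-bit word (stands in for packed INT8 math)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def dot_w4a8(x, word, zero, scale):
    """Dot product of 8 INT8 activations with 8 packed UINT4 weights."""
    lo, hi = unpack_parallel(word)
    codes = lanes(lo) + lanes(hi)
    acc = sum(xi * qi for xi, qi in zip(x, codes))  # main loop: raw codes, no zero point
    return scale * (acc - zero * sum(x))            # epilogue: one correction term

codes = [0, 15, 7, 3, 9, 12, 1, 8]
x = [12, -7, 33, -128, 5, 90, -1, 64]
zero, scale = 8, 0.02
word = pack_interleaved(codes)
reference = scale * sum(xi * (qi - zero) for xi, qi in zip(x, codes))
result = dot_w4a8(x, word, zero, scale)
```

Subtracting the zero point inside the loop costs one operation per weight, whereas the rewritten form needs only `zero * sum(x)` once per output, and both give the same integer accumulator.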

Reproducibility

License

  • Apache-2.0

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Kernel-level optimizations require CUDA/PTX coding and target A100/Hopper/L40S characteristics; porting to other hardware may need rework.
  • Some accuracy loss persists versus FP16; not suitable when exact FP16 parity is required.
  • Per-head dynamic KV quantization and paged KV design add runtime bookkeeping and potential calibration costs.

When Not To Use

  • When bit-exact FP16 outputs are required for downstream systems.
  • On hardware without comparable INT8/INT4 tensor core support or without ability to run custom CUDA kernels.
  • For tiny models or edge devices where other quantization strategies (weight-only) are already optimal.

Failure Modes

  • Incorrect protective range handling can produce overflow during dequantization and corrupt results.
  • If the dequantization optimizations or attention prefetching are not applied, kernels can become compute-bound and run slower than baselines.
  • Progressive group quantization parameters (group size, clipping) may need per-model tuning; wrong settings harm accuracy.
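The first failure mode is easy to reproduce on paper. The sketch below assumes an integer group scale computed by ceiling division and a protective range of [-119, 119]; both are illustrative assumptions rather than a statement of the released code. It shows that rounding in the 4-bit step can push a dequantized value past the INT8 limit when the first-level values use the full range.

```python
def level2_roundtrip(int8_group):
    """Quantize INT8 values to UINT4 with an integer scale, then dequantize again."""
    lo, hi = min(min(int8_group), 0), max(max(int8_group), 0)
    s = max(1, -(-(hi - lo) // 15))  # integer scale, ceil((hi - lo) / 15)
    z = round(-lo / s)               # integer zero point
    q = [max(0, min(15, round(x / s) + z)) for x in int8_group]
    return [(c - z) * s for c in q]

unprotected = level2_roundtrip([127, -113, 0, 5])  # level 1 used the full INT8 range
protected = level2_roundtrip([119, -113, 0, 5])    # level 1 clamped to [-119, 119]

print(unprotected)  # [128, -112, 0, 0]: 128 does not fit in INT8
print(protected)    # [112, -112, 0, 0]
```

In a real kernel the value 128 would wrap to -128 in an INT8 register, silently flipping the sign of a large weight.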

Core Entities

Models

  • Llama-3-8B
  • Llama-2-7B
  • Llama-2-13B
  • Llama-2-70B
  • Mistral-7B
  • Mixtral-8x7B
  • Yi-34B
  • Qwen1.5-72B

Metrics

  • throughput (tokens/sec)
  • perplexity (lower better)
  • Accuracy
  • latency (ms)

Datasets

  • WikiText2
  • PIQA
  • ARC
  • HellaSwag
  • WinoGrande
  • LongBench

Benchmarks

  • perplexity
  • Accuracy
  • tokens/second throughput
  • Long-context metrics (LongBench)