Reorder quantized weights to avoid shared-memory bank conflicts and speed up LLM inference up to ~1.9×

February 15, 2024 · 6 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

2

Authors

Taesu Kim, Jongho Lee, Daehyun Ahn, Sarang Kim, Jiwoong Choi, Minkyu Kim, Hyungjun Kim

Links

Abstract / PDF

Why It Matters For Business

QUICK raises batched LLM inference throughput by up to 1.94× versus AutoAWQ kernels (27–29% in the reported vLLM integration) by eliminating shared-memory write stalls. This lowers GPU cost per token and makes larger-batch inference with quantized models more practical.

Summary TLDR

QUICK is a set of CUDA kernels that operate on 4-bit quantized weight matrices reordered (interleaved) offline, so that weights can be loaded from DRAM straight into registers and dequantized there in the layout Tensor Cores expect. This removes the shared-memory write-back of dequantized weights, which causes bank conflicts and stalls in mixed-precision GEMM on NVIDIA GPUs. On common LLM workloads QUICK shows up to 1.91× matrix-multiply speedup versus AutoAWQ kernels and up to 1.94× end-to-end token throughput on the evaluated models and GPUs. Code is available.

Problem Statement

Weight-only quantization reduces model memory but requires dequantization before GEMM. Existing mixed-precision kernels write dequantized weights to shared memory, causing bank conflicts that hurt throughput at larger batch sizes. The paper targets this dequantization write-back bottleneck to speed up inference.
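
To make the bank-conflict bottleneck concrete, here is a minimal sketch of how one warp's shared-memory accesses map onto banks. It assumes the common model of 32 banks, each 4 bytes wide, and it uses simple strided addresses for illustration; these are not the access patterns measured in the paper, and the function names are hypothetical.

```python
from collections import Counter

NUM_BANKS = 32         # assumption: 32 shared-memory banks per SM
BANK_WIDTH_BYTES = 4   # assumption: each bank serves one 4-byte word per cycle


def conflict_degree(byte_addresses):
    """Return the largest number of distinct words that land in one bank.

    1 means the access is conflict-free; N means it is serialized into N passes.
    Lanes that touch the same word are broadcast, so duplicates are ignored.
    """
    words = {addr // BANK_WIDTH_BYTES for addr in byte_addresses}
    per_bank = Counter(word % NUM_BANKS for word in words)
    return max(per_bank.values())


def warp_addresses(stride_bytes, warp_size=32):
    """Byte address touched by each lane of a warp for a fixed stride."""
    return [lane * stride_bytes for lane in range(warp_size)]


contiguous = conflict_degree(warp_addresses(4))    # one word per lane
strided = conflict_degree(warp_addresses(8))       # every other word
worst = conflict_degree(warp_addresses(128))       # every lane hits the same bank

print(contiguous, strided, worst)
```

Under this model, a write pattern whose lanes are spaced by a multiple of 128 bytes is fully serialized, which is the kind of stall that skipping the shared-memory write-back is meant to avoid.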

Main Contribution

Introduce QUICK: offline interleaving of quantized weight matrices to match Tensor Core load patterns and skip shared-memory write-back.

Modify the parallel dequantization kernel and combine two reordering patterns so that dequantized weights stay sequential and bank conflicts are reduced.

Tune tile sizes and avoid storing weights in shared memory to trade shared-memory pressure for register usage and improve throughput for larger batches.

Provide CUDA kernel implementation and integration with vLLM; release code on GitHub.
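
The idea of offline interleaving can be illustrated with a toy example: if the runtime unpack step emits 4-bit values in a fixed, non-sequential order, the weights can be permuted once at export time so that the runtime output comes out in the order the consumer wants. The extraction order `UNPACK_ORDER` below is hypothetical and chosen only for illustration; the actual QUICK patterns are defined by the Tensor Core fragment layout and are not reproduced here.

```python
# Hypothetical order in which the runtime unpack reads nibble positions of a 32-bit word.
UNPACK_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]


def pack_nibbles(values):
    """Pack eight 4-bit values into one 32-bit word, value i at nibble position i."""
    word = 0
    for pos, value in enumerate(values):
        if not 0 <= value < 16:
            raise ValueError("each value must fit in 4 bits")
        word |= value << (4 * pos)
    return word


def runtime_unpack(word):
    """Simulate the fixed runtime extraction order (an assumption, see above)."""
    return [(word >> (4 * pos)) & 0xF for pos in UNPACK_ORDER]


def interleave_offline(values):
    """Place values so that runtime_unpack returns them in their original order."""
    slots = [0] * 8
    for out_index, pos in enumerate(UNPACK_ORDER):
        slots[pos] = values[out_index]
    return pack_nibbles(slots)


weights = [3, 14, 7, 0, 9, 1, 12, 5]
naive = runtime_unpack(pack_nibbles(weights))            # scrambled at runtime
reordered = runtime_unpack(interleave_offline(weights))  # sequential at runtime

print(naive)
print(reordered)
```

Because the permutation is applied once offline, the runtime kernel does no extra shuffling; the cost is that the packed weights are only valid for kernels that assume the same order.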

Key Findings

QUICK reduces shared-memory bank conflicts that bottleneck mixed-precision GEMM during dequantization.

Matrix-multiply throughput vs AutoAWQ-Kernel improved by 1.33–1.91× at batch 256 on evaluated GPUs.

Numbers: 1.33–1.91× speedup at batch=256

End-to-end token throughput across tested LLMs reached up to 1.94× versus AutoAWQ-Kernel.

Numbers: up to 1.94× throughput

vLLM integration: Vicuna-13B throughput rose 27% versus AWQ and 33% versus FP16; Llama-2-70B rose 29% versus AWQ.

Numbers: Vicuna-13B: +27% vs AWQ, +33% vs FP16; Llama-2-70B: +29% vs AWQ

QUICK can sometimes outperform fp16 kernels at medium-large batch sizes (example: batch 128).

Numbers: faster than fp16 at batch=128 in some tests

Performance still lags fp16 at very large batches (>512) and further dequantization optimization is needed.

Numbers: batches >512 exhibit lower efficiency than fp16

Results

Matrix multiply speedup vs AutoAWQ-Kernel

Value: 1.33–1.91×

Baseline: AutoAWQ-Kernel

End-to-end token throughput vs AutoAWQ-Kernel

Value: up to 1.94×

Baseline: AutoAWQ-Kernel

vLLM throughput (Vicuna-13B)

Value: 1308.6 tokens/s

Baseline: AWQ: 1030.4 tokens/s

vLLM throughput (Llama-2-70B)

Value: 290.2 tokens/s

Baseline: AWQ: 224.3 tokens/s; FP16: OOM
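
As a quick arithmetic check, the percentage gains quoted in Key Findings can be recomputed from the vLLM tokens/s figures above. The helper name `percent_gain` is just for this sketch.

```python
def percent_gain(new_tokens_per_s, baseline_tokens_per_s):
    """Relative throughput gain of `new` over `baseline`, in percent."""
    return (new_tokens_per_s / baseline_tokens_per_s - 1.0) * 100.0


vicuna_gain = percent_gain(1308.6, 1030.4)    # Vicuna-13B, QUICK vs AWQ
llama70_gain = percent_gain(290.2, 224.3)     # Llama-2-70B, QUICK vs AWQ

print(round(vicuna_gain), round(llama70_gain))
```

Both values round to the 27% and 29% gains reported for the vLLM integration.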

Who Should Care

What To Try In 7 Days

Grab QUICK from GitHub and run provided vLLM integration on a representative model and GPU.

Measure tokens/s vs your current AWQ or fp16 setup at batch sizes 32–256.

If beneficial, add weight-interleaving to your model export pipeline to reuse across deployments.
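
A measurement along these lines could be structured as below. This is only a sketch: `generate_fn` is a hypothetical callable that you would wrap around your own serving stack (for example a vLLM run with a given kernel), and it is assumed to run one batch and return the number of tokens it produced. No QUICK or vLLM API is used here.

```python
import time


def tokens_per_second(generate_fn, batch_size, repeats=3):
    """Best observed tokens/s over `repeats` runs of a hypothetical generate_fn."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        tokens = generate_fn(batch_size)  # must return the number of tokens produced
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
        best = max(best, tokens / elapsed)
    return best


def speedup_table(candidate_fn, baseline_fn, batch_sizes=(32, 64, 128, 256)):
    """Map each batch size to candidate tokens/s divided by baseline tokens/s."""
    return {
        batch: tokens_per_second(candidate_fn, batch) / tokens_per_second(baseline_fn, batch)
        for batch in batch_sizes
    }
```

Ratios above 1.0 at the batch sizes you actually serve suggest the interleaved layout is worth adding to the export pipeline; ratios near or below 1.0 suggest staying with the current kernels.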

Optimization Features

Infra Optimization

  • better utilization of NVIDIA Tensor Cores
  • reduced effective DRAM accesses per tile

Model Optimization

  • weight-only quantization (4-bit)

System Optimization

  • shift storage from shared memory to registers
  • reduce shared-memory bank conflicts
  • improve data locality for dequantization

Inference Optimization

  • interleaved weight layout to match Tensor Core loads
  • skip shared-memory write-back of dequantized weights
  • tile size and warp occupancy tuning

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Gains are strongest for medium-large batches (≥32) and may not match fp16 at very large batches (>512).
  • Requires offline weight reordering; not ideal if weights change frequently at runtime.
  • Shifts pressure from shared memory to registers which can limit occupancy on some GPUs.

When Not To Use

  • For single-request or very small-batch inference where dequantization overhead is minor.
  • When model weights are updated frequently and offline reordering is impractical.
  • On hardware that lacks the targeted Tensor Core load/store patterns.

Failure Modes

  • Increased register use can reduce active warps and negate throughput gains on some GPUs.
  • Incorrect interleaving or mismatched kernel assumptions could yield wrong computations.
  • Performance gains shrink or reverse at extreme batch sizes (>512) per reported results.
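
The register-pressure failure mode can be estimated with a simplified occupancy calculation. The constants below are assumptions that hold for many recent NVIDIA GPUs but vary by architecture, and the model ignores register allocation granularity and shared-memory limits, so treat the output as a rough bound rather than what a profiler would report.

```python
REGISTERS_PER_SM = 65536   # assumption: 64K 32-bit registers per SM
MAX_WARPS_PER_SM = 64      # assumption: architecture-dependent (some parts allow 48)
WARP_SIZE = 32


def register_limited_warps(regs_per_thread, threads_per_block):
    """Upper bound on active warps per SM when only registers limit occupancy."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)   # ceiling division
    regs_per_block = regs_per_thread * WARP_SIZE * warps_per_block
    resident_blocks = REGISTERS_PER_SM // regs_per_block
    return min(MAX_WARPS_PER_SM, resident_blocks * warps_per_block)


for regs in (32, 64, 128):
    print(regs, register_limited_warps(regs, threads_per_block=256))
```

In this model, doubling registers per thread from 64 to 128 halves the number of resident warps, which is how holding weights in registers instead of shared memory can cost throughput on some GPUs.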

Core Entities

Models

  • Mistral-7B
  • Vicuna-13B
  • Llama-2-13B
  • LLaMA-33B
  • Llama-2-70B

Metrics

  • TOPS (tera-ops/sec)
  • tokens/s
  • speedup ratio
  • shared memory bank conflicts (counts)

Benchmarks

  • matrix multiplication batch×8192×8192
  • vLLM throughput benchmark (recommended dataset)
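
For the batch×8192×8192 microbenchmark, the TOPS metric follows from the usual GEMM operation count of 2·M·N·K (one multiply and one add per accumulated product). The sketch below shows the conversion; the 1 ms kernel time is a made-up placeholder, not a measured result.

```python
def gemm_ops(m, n, k):
    """Operation count for an (m x k) by (k x n) matrix multiply."""
    return 2 * m * n * k


def tops(m, n, k, seconds):
    """Tera-operations per second for one GEMM that took `seconds`."""
    return gemm_ops(m, n, k) / seconds / 1e12


ops_at_batch_256 = gemm_ops(256, 8192, 8192)
example_tops = tops(256, 8192, 8192, seconds=1e-3)   # hypothetical 1 ms kernel time

print(ops_at_batch_256, example_tops)
```

Since the operation count scales linearly with the batch dimension, comparing TOPS across batch sizes shows directly where a kernel stops scaling.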

Context Entities

Models

  • fp16 baselines (fp16 GEMM)
  • AutoAWQ mixed-precision kernels

Metrics

  • tokens/s
  • throughput speedup
  • active warps per multiprocessor

Benchmarks

  • matrix multiplication microbenchmarks
  • end-to-end token generation on multiple GPUs