Reorder quantized weights to avoid shared-memory bank conflicts and speed up LLM inference up to ~1.9×

February 15, 2024 · 6 min

Overview

Decision Snapshot: Ready For Pilot

The approach is practical and implemented; evidence shows consistent throughput gains on multiple GPUs and real serving code, but gains depend on batch size and GPU characteristics.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Taesu Kim, Jongho Lee, Daehyun Ahn, Sarang Kim, Jiwoong Choi, Minkyu Kim, Hyungjun Kim

Links

Abstract / PDF / Code

Why It Matters For Business

QUICK delivers 20–90%+ throughput improvements for batched LLM inference by eliminating shared-memory write stalls, lowering GPU cost per token and allowing larger batch inference using quantized models.

Summary TLDR

QUICK is a set of CUDA kernels that operate on 4-bit quantized weight matrices reordered (interleaved) offline, so that weights loaded from DRAM can be dequantized and consumed in registers. This removes the shared-memory write-back of dequantized weights, which causes bank conflicts and stalls in mixed-precision GEMM on NVIDIA GPUs. On common LLM workloads QUICK shows up to 1.91× matrix-multiply speedup over AutoAWQ kernels and up to 1.94× end-to-end token throughput on the evaluated models and GPUs. Code is available.

Problem Statement

Weight-only quantization reduces model memory but requires dequantization before GEMM. Existing mixed-precision kernels write dequantized weights to shared memory, causing bank conflicts that hurt throughput at larger batch sizes. The paper targets this dequantization write-back bottleneck to speed up inference.
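To make the dequantization step concrete, here is a minimal pure-Python sketch of packing 4-bit weights into 32-bit words and recovering floating-point values before a GEMM. The linear packing order, the single scalar `scale`, and the single `zero_point` are simplifying assumptions for illustration; real kernels typically use per-group parameters and a different element order inside each word.

```python
def pack_int4(values):
    """Pack unsigned 4-bit ints (0..15) into 32-bit words, 8 values per word."""
    assert len(values) % 8 == 0
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            assert 0 <= v <= 15
            word |= v << (4 * j)  # element j occupies bits [4j, 4j + 4)
        words.append(word)
    return words


def dequantize(words, scale, zero_point):
    """Unpack each word and map q -> (q - zero_point) * scale."""
    out = []
    for word in words:
        for j in range(8):
            q = (word >> (4 * j)) & 0xF
            out.append((q - zero_point) * scale)
    return out


q = [0, 1, 2, 3, 12, 13, 14, 15]
packed = pack_int4(q)                                  # one 32-bit word
weights = dequantize(packed, scale=0.5, zero_point=8)  # values fed to the GEMM
```

In a GPU kernel the unpacked values live in registers; the bottleneck described above arises when they are then written back to shared memory before the matrix multiply.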

Main Contribution

Introduce QUICK: offline interleaving of quantized weight matrices to match Tensor Core load patterns and skip shared-memory write-back.

Modify parallel dequantization kernel and combine two reordering patterns to keep dequantized weights sequential and reduce bank conflicts.
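The idea of offline interleaving can be illustrated with a toy reordering. The sketch assumes a hypothetical access pattern in which thread `t` of `T` threads consumes elements `t, t + T, t + 2T, ...`; it is not the actual QUICK layout, which follows the Tensor Core fragment layout and is not reproduced here. The function names are illustrative only.

```python
def interleave_offline(weights, num_threads):
    """Reorder weights so the elements each thread needs become contiguous."""
    assert len(weights) % num_threads == 0
    per_thread = len(weights) // num_threads
    return [weights[k * num_threads + t]
            for t in range(num_threads)
            for k in range(per_thread)]


def strided_fragment(weights, t, num_threads):
    """Elements thread t would gather from the original layout."""
    return weights[t::num_threads]


def contiguous_fragment(reordered, t, num_threads):
    """Elements thread t reads sequentially from the reordered layout."""
    per_thread = len(reordered) // num_threads
    return reordered[t * per_thread:(t + 1) * per_thread]


weights = list(range(16))
T = 4
reordered = interleave_offline(weights, T)

# Sanity check: both layouts must hand every thread the same values.
fragments_match = all(
    strided_fragment(weights, t, T) == contiguous_fragment(reordered, t, T)
    for t in range(T)
)
```

A check like `fragments_match` is worth keeping in an export pipeline, since a layout that does not match the kernel's assumptions silently produces wrong results.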

Key Findings

QUICK reduces shared-memory bank conflicts that bottleneck mixed-precision GEMM during dequantization.

Practical Use: Reordering weights offline removes many write-back stalls, letting mixed-precision kernels scale better with batch size.

Evidence Ref: Figure 3, Section 2.3
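For readers unfamiliar with bank conflicts, the following sketch models how a warp's shared-memory accesses serialize. It assumes the usual NVIDIA organization of 32 banks that are each 4 bytes wide, and it is a simplified model: it ignores hardware details such as the handling of wider (64- or 128-bit) accesses.

```python
from collections import defaultdict

NUM_BANKS = 32        # assumed bank count
BANK_WIDTH_BYTES = 4  # assumed bank width


def conflict_degree(byte_addresses):
    """Worst-case number of serialized accesses for one warp.

    Lanes touching the same 32-bit word are served together,
    so only distinct words within a bank count as conflicts.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


unit_stride = [4 * lane for lane in range(32)]         # consecutive words
stride_2_words = [4 * 2 * lane for lane in range(32)]  # every other word
stride_32_words = [4 * 32 * lane for lane in range(32)]  # all lanes in one bank

degrees = (
    conflict_degree(unit_stride),
    conflict_degree(stride_2_words),
    conflict_degree(stride_32_words),
)
```

A degree of 1 means the access is conflict-free; higher values mean the access is replayed that many times, which is the kind of stall that skipping the write-back avoids.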

Matrix-multiply throughput vs AutoAWQ-Kernel improved by 1.33–1.91× at batch 256 on evaluated GPUs.

Numbers: 1.33–1.91× speedup at batch=256

Practical Use: If you use AutoAWQ kernels, replacing them with QUICK can cut GEMM time by roughly 25–48% at large batch sizes (a 1.33–1.91× speedup).

Evidence Ref: Section 4.1, Figure 7

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Matrix multiply speedup vs AutoAWQ-Kernel | 1.33–1.91× | AutoAWQ-Kernel | - | batch=256, matrices batch×8192×8192 | - | Section 4.1, Figure 7
End-to-end token throughput vs AutoAWQ-Kernel | up to 1.94× | AutoAWQ-Kernel | - | various LLMs and GPUs (Section 4.2) | - | Section 4.2, Figure 8
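A speedup ratio is easy to misread as a percentage saving. The short sketch below converts the reported ratios into the fraction of time (or, at fixed hardware cost, cost per token) saved, using only the identity saving = 1 − 1/speedup; the variable names are illustrative.

```python
def time_reduction(speedup):
    """Fraction of the baseline time saved at a given speedup ratio."""
    return 1.0 - 1.0 / speedup


gemm_low = time_reduction(1.33)    # lower end of the GEMM range
gemm_high = time_reduction(1.91)   # upper end of the GEMM range
end_to_end = time_reduction(1.94)  # best end-to-end throughput result

summary = {
    "gemm_low_pct": round(100 * gemm_low, 1),
    "gemm_high_pct": round(100 * gemm_high, 1),
    "end_to_end_pct": round(100 * end_to_end, 1),
}
```

Under this reading, the reported range corresponds to roughly a quarter to just under half of the baseline time being saved, with the caveat that the upper figures are best-case results.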

What To Try In 7 Days

Grab QUICK from GitHub and run provided vLLM integration on a representative model and GPU.

Measure tokens/s vs your current AWQ or fp16 setup at batch sizes 32–256.

If beneficial, add weight-interleaving to your model export pipeline to reuse across deployments.
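The measurement step above could be organized around a small harness like the following sketch. `generate` is a hypothetical callable standing in for whatever backend you test (it is not an API from QUICK or vLLM) and is assumed to return the number of tokens it produced; `fake_generate` exists only so the sketch runs without a GPU.

```python
import time


def measure_tokens_per_second(generate, batch_size, new_tokens,
                              warmup=1, repeats=3):
    """Time a generation callable and return tokens/s from the fastest run."""
    for _ in range(warmup):
        generate(batch_size, new_tokens)  # warm caches before timing
    best = float("inf")
    produced = 0
    for _ in range(repeats):
        start = time.perf_counter()
        produced = generate(batch_size, new_tokens)
        best = min(best, time.perf_counter() - start)
    return produced / best


def fake_generate(batch_size, new_tokens):
    """Stand-in backend: sleeps briefly and reports tokens produced."""
    time.sleep(0.001)
    return batch_size * new_tokens


results = {
    bs: measure_tokens_per_second(fake_generate, bs, new_tokens=16)
    for bs in (32, 64, 128, 256)
}
```

When timing a real GPU backend, make sure the callable blocks until the work has finished (for example by synchronizing the device), otherwise the timer measures only kernel launch overhead. Run the same harness against both the current setup and QUICK so the comparison uses identical prompts and batch sizes.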

Optimization Features

Infra Optimization: better utilization of NVIDIA Tensor Cores; reduced effective DRAM accesses per tile
Model Optimization: weight-only quantization (4-bit)
System Optimization: shift storage from shared memory to registers; reduce shared-memory bank conflicts; improve data locality for dequantization
Inference Optimization: interleaved weight layout to match Tensor Core loads; skip shared-memory write-back of dequantized weights; tile size and warp occupancy tuning

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Gains are strongest for medium-large batches (≥32) and may not match fp16 at very large batches (>512).

Requires offline weight reordering; not ideal if weights change frequently at runtime.

When Not To Use

For single-request or very small-batch inference where dequantization overhead is minor.

When model weights are updated frequently and offline reordering is impractical.

Failure Modes

Increased register use can reduce active warps and negate throughput gains on some GPUs.

Incorrect interleaving or mismatched kernel assumptions could yield wrong computations.
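The register-pressure failure mode can be estimated with a back-of-the-envelope occupancy calculation. The defaults below (65,536 registers per SM, at most 64 resident warps) are assumptions that hold for some data-center GPUs but vary by architecture, and the sketch ignores register allocation granularity as well as shared-memory and block-count limits.

```python
def warps_limited_by_registers(regs_per_thread, regs_per_sm=65536,
                               max_warps_per_sm=64, warp_size=32):
    """Upper bound on resident warps per SM imposed by register usage."""
    by_registers = regs_per_sm // (regs_per_thread * warp_size)
    return min(by_registers, max_warps_per_sm)


occupancy_by_regs = {
    regs: warps_limited_by_registers(regs) for regs in (32, 64, 128)
}
```

Under these assumptions, doubling per-thread register use from 64 to 128 halves the number of warps available to hide latency, which is why a kernel that keeps dequantized weights in registers may need smaller tiles on some GPUs.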

Core Entities

Models

Mistral-7B, Vicuna-13B, LLaMA-2-13B, LLaMA-33B, Llama-2-70B

Metrics

TOPS (tera-ops/sec), tokens/s, speedup ratio, shared memory bank conflicts (counts)

Benchmarks

matrix multiplication batch×8192×8192, vLLM throughput benchmark (recommended dataset)

Context Entities

Models

fp16 baselines (fp16 GEMM), AutoAWQ mixed-precision kernels

Metrics

tokens/s, throughput speedup, active warps per multiprocessor

Benchmarks

matrix multiplication microbenchmarks, end-to-end token generation on multiple GPUs