Reorder quantized weights to avoid shared-memory bank conflicts and speed up LLM inference by up to ~1.9×

Overview

Decision Snapshot: Ready For Pilot

The approach is practical and implemented; evidence shows consistent throughput gains on multiple GPUs and real serving code, but gains depend on batch size and GPU characteristics.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Taesu Kim, Jongho Lee, Daehyun Ahn, Sarang Kim, Jiwoong Choi, Minkyu Kim, Hyungjun Kim

Links

Abstract / PDF / Code

Why It Matters For Business

QUICK delivers 20–90%+ throughput improvements for batched LLM inference by eliminating shared-memory write stalls, lowering GPU cost per token and allowing larger batch inference using quantized models.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager

Summary TLDR

QUICK is a set of CUDA kernels for 4-bit weight-quantized LLMs. Weight matrices are reordered (interleaved) offline so that quantized weights can be loaded from DRAM straight into registers and dequantized there. This removes the shared-memory write-back of dequantized weights, which causes bank conflicts and stalls in mixed-precision GEMM on NVIDIA GPUs. On common LLM workloads QUICK shows up to 1.91× matrix-multiply speedup over AutoAWQ kernels and up to 1.94× end-to-end token throughput on the evaluated models and GPUs. Code is available.

Problem Statement

Weight-only quantization reduces model memory but requires dequantization before GEMM. Existing mixed-precision kernels write dequantized weights to shared memory, causing bank conflicts that hurt throughput at larger batch sizes. The paper targets this dequantization write-back bottleneck to speed up inference.
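The conflict mechanism can be illustrated with a toy model. The sketch below assumes the common NVIDIA shared-memory layout of 32 banks, each 4 bytes wide, and counts how many distinct words one warp's accesses place in the busiest bank. The three access patterns are illustrative only; they are not taken from the AutoAWQ or QUICK kernels.

```python
from collections import defaultdict

NUM_BANKS = 32         # assumed: 32 shared-memory banks
BANK_WIDTH_BYTES = 4   # assumed: each bank serves one 4-byte word per cycle


def conflict_degree(byte_addresses):
    """Return the worst-case number of distinct words mapped to a single bank.

    1 means the warp's access is conflict-free; N means the access is
    serialized into N passes in this simplified model.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


# One address per thread of a 32-thread warp.
unit_stride = [t * 4 for t in range(32)]        # consecutive words
two_word_stride = [t * 8 for t in range(32)]    # every second word
bank_stride = [t * 128 for t in range(32)]      # every 32nd word, same bank

print(conflict_degree(unit_stride))      # 1
print(conflict_degree(two_word_stride))  # 2
print(conflict_degree(bank_stride))      # 32
```

In this model, any write pattern that does not spread a warp's words evenly over the banks costs extra passes, which is the kind of stall that skipping the write-back avoids.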

Main Contribution

Introduce QUICK: offline interleaving of quantized weight matrices to match Tensor Core load patterns and skip shared-memory write-back.

Modify parallel dequantization kernel and combine two reordering patterns to keep dequantized weights sequential and reduce bank conflicts.
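The idea of offline interleaving can be sketched in a few lines. The example below packs eight 4-bit weights into one 32-bit word and reorders them ahead of time, so that a consumer reading nibbles sequentially sees the same values a runtime shuffle would have produced. `ACCESS_ORDER` is a hypothetical permutation chosen for illustration; the actual QUICK pattern depends on the Tensor Core load layout and is not reproduced here.

```python
def pack_int4(values):
    """Pack eight unsigned 4-bit values into one 32-bit word, lowest nibble first."""
    if len(values) != 8 or any(not 0 <= v < 16 for v in values):
        raise ValueError("expected eight values in the range 0..15")
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word


def unpack_int4(word):
    """Inverse of pack_int4: return the eight nibbles, lowest first."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


# Hypothetical order in which the consumer needs the eight weights.
ACCESS_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]


def interleave_offline(values):
    """Reorder weights once, offline, so nibble k is the k-th value consumed."""
    return [values[i] for i in ACCESS_ORDER]


weights = [3, 15, 0, 7, 9, 1, 12, 5]

plain_word = pack_int4(weights)
interleaved_word = pack_int4(interleave_offline(weights))

# Without reordering, the consumer must gather values at runtime.
gathered = [unpack_int4(plain_word)[i] for i in ACCESS_ORDER]
# With offline reordering, a plain sequential read yields the same sequence.
sequential = unpack_int4(interleaved_word)

assert gathered == sequential
print(sequential)  # [3, 0, 9, 12, 15, 7, 1, 5]
```

A check like the final assertion is also a cheap guard against a layout mismatch between the export step and the kernel.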

Key Findings

QUICK reduces shared-memory bank conflicts that bottleneck mixed-precision GEMM during dequantization.

Practical Use: Reordering weights offline removes many write-back stalls, letting mixed-precision kernels scale better with batch size.

Evidence Ref: Figure 3, Section 2.3

Matrix-multiply throughput vs AutoAWQ-Kernel improved by 1.33–1.91× at batch 256 on evaluated GPUs.

Numbers: 1.33–1.91× speedup at batch=256

Practical Use: If you use AutoAWQ kernels, replacing them with QUICK can cut GEMM time by roughly 25–48% at batch 256 (a 1.33–1.91× speedup).

Evidence Ref: Section 4.1, Figure 7

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Matrix multiply speedup vs AutoAWQ-Kernel	1.33–1.91×	AutoAWQ-Kernel	—	batch=256, matrices batch×8192×8192	Section 4.1, Figure 7	—
End-to-end token throughput vs AutoAWQ-Kernel	up to 1.94×	AutoAWQ-Kernel	—	various LLMs and GPUs (Section 4.2)	Section 4.2, Figure 8	—

What To Try In 7 Days

Grab QUICK from GitHub and run provided vLLM integration on a representative model and GPU.

Measure tokens/s vs your current AWQ or fp16 setup at batch sizes 32–256.

If beneficial, add weight-interleaving to your model export pipeline to reuse across deployments.
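A minimal harness for the measurement step might look like the sketch below. It is independent of vLLM and QUICK: `generate` is a hypothetical callable you supply that runs one batched generation, and the throughput numbers in the usage example are placeholders, not measurements.

```python
import time


def measure_tokens_per_second(generate, batch_size, new_tokens, repeats=3):
    """Time generate(batch_size, new_tokens) and return the best tokens/s seen.

    `generate` is a user-supplied callable (hypothetical here) that produces
    `new_tokens` tokens for each of `batch_size` requests.
    """
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        generate(batch_size, new_tokens)
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
        best = max(best, batch_size * new_tokens / elapsed)
    return best


def speedup_table(baseline, candidate):
    """Return candidate/baseline throughput ratio for each shared batch size."""
    return {
        batch: candidate[batch] / baseline[batch]
        for batch in sorted(baseline)
        if batch in candidate
    }


# Placeholder tokens/s values keyed by batch size, for illustration only.
baseline_tps = {32: 100.0, 128: 300.0, 256: 400.0}
candidate_tps = {32: 120.0, 128: 450.0, 256: 760.0}

for batch, ratio in speedup_table(baseline_tps, candidate_tps).items():
    print(f"batch={batch}: {ratio:.2f}x")
```

Sweeping the batch sizes matters because the expected gain depends on batch size; a single data point can hide both the best and the worst case.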

Optimization Features

Infra Optimization

better utilization of NVIDIA Tensor Cores

reduced effective DRAM accesses per tile

Model Optimization

weight-only quantization (4-bit)

System Optimization

shift storage from shared memory to registers

reduce shared-memory bank conflicts

improve data locality for dequantization

Inference Optimization

interleaved weight layout to match Tensor Core loads

skip shared-memory write-back of dequantized weights

tile size and warp occupancy tuning

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/SqueezeBits/QUICK

Risks & Boundaries

Limitations

Gains are strongest for medium-large batches (≥32) and may not match fp16 at very large batches (>512).

Requires offline weight reordering; not ideal if weights change frequently at runtime.

When Not To Use

For single-request or very small-batch inference where dequantization overhead is minor.

When model weights are updated frequently and offline reordering is impractical.

Failure Modes

Increased register use can reduce active warps and negate throughput gains on some GPUs.

Incorrect interleaving or mismatched kernel assumptions could yield wrong computations.
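The register-pressure failure mode can be estimated with simple arithmetic. The sketch below computes how many warps can be resident on one multiprocessor when registers are the only limit. It is a simplified model: the defaults of 65,536 registers and 64 resident warps per multiprocessor are assumptions that vary by GPU architecture, and real allocation granularity, shared-memory use, and block-size limits are ignored.

```python
def register_limited_warps(regs_per_thread, regs_per_sm=65536,
                           max_warps_per_sm=64, warp_size=32):
    """Warps that fit on one multiprocessor when only registers constrain them.

    All defaults are assumed values; check them against the target GPU.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    by_registers = regs_per_sm // (regs_per_thread * warp_size)
    return min(by_registers, max_warps_per_sm)


def register_limited_occupancy(regs_per_thread, regs_per_sm=65536,
                               max_warps_per_sm=64, warp_size=32):
    """Fraction of the maximum resident warps that registers allow."""
    warps = register_limited_warps(regs_per_thread, regs_per_sm,
                                   max_warps_per_sm, warp_size)
    return warps / max_warps_per_sm


for regs in (32, 64, 128):
    print(regs, register_limited_warps(regs), register_limited_occupancy(regs))
# 32 64 1.0
# 64 32 0.5
# 128 16 0.25
```

Under these assumptions, doubling per-thread register use halves the resident warps once the register file becomes the limit, which is why a kernel that keeps dequantized weights in registers should be profiled on each target GPU.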

Core Entities

Models

Mistral-7B

Vicuna-13B

LLaMA-2-13B

LLaMA-33B

Llama-2-70B

Metrics

TOPS (tera-ops/sec)

tokens/s

speedup ratio

shared memory bank conflicts (counts)

Benchmarks

matrix multiplication batch×8192×8192

vLLM throughput benchmark (recommended dataset)

Context Entities

Models

fp16 baselines (fp16 GEMM)

AutoAWQ mixed-precision kernels

Metrics

tokens/s

throughput speedup

active warps per multiprocessor

Benchmarks

matrix multiplication microbenchmarks

end-to-end token generation on multiple GPUs

