Overview
The approach is practical and implemented; evidence shows consistent throughput gains on multiple GPUs and real serving code, but gains depend on batch size and GPU characteristics.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/6
Findings with evidence refs: 6/6
Results with explicit delta: 2/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
QUICK improves throughput for batched LLM inference by roughly 20% to over 90% by removing the stalls caused by shared-memory write-back. This lowers GPU cost per token and makes larger-batch serving of quantized models practical.
Summary TLDR
QUICK is a set of CUDA kernels that work on 4-bit quantized weight matrices that have been reordered (interleaved) offline. With that layout, quantized weights can be loaded from DRAM into registers, dequantized there, and consumed by Tensor Cores without first being written back to shared memory. This removes the shared-memory write-back that causes bank conflicts and stalls in mixed-precision GEMM on NVIDIA GPUs. On common LLM workloads QUICK shows up to ~1.9× matrix-multiply speedups versus AutoAWQ kernels and up to ~1.94× end-to-end token throughput on the evaluated models and GPUs. Code is available.
Problem Statement
Weight-only quantization reduces model memory but requires dequantization before GEMM. Existing mixed-precision kernels write dequantized weights to shared memory, causing bank conflicts that hurt throughput at larger batch sizes. The paper targets this dequantization write-back bottleneck to speed up inference.
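To make the dequantization step concrete, here is a minimal Python sketch of 4-bit weight packing and unpacking. It assumes eight codes per 32-bit word stored in simple sequential nibble order, and a single scale and zero point; the function names are illustrative, and real AWQ-style kernels use their own packing order and per-group parameters.

```python
def pack_int4(values):
    """Pack eight unsigned 4-bit integers (0..15) into one 32-bit word."""
    if len(values) != 8 or any(not 0 <= v <= 15 for v in values):
        raise ValueError("expected eight values in the range 0..15")
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)  # value i occupies bits 4i .. 4i+3 (assumed order)
    return word


def dequantize_word(word, scale, zero_point):
    """Unpack a 32-bit word and map each 4-bit code q to scale * (q - zero_point)."""
    return [scale * (((word >> (4 * i)) & 0xF) - zero_point) for i in range(8)]


codes = [0, 1, 2, 3, 12, 13, 14, 15]
word = pack_int4(codes)
weights = dequantize_word(word, scale=0.5, zero_point=8)
print(weights)  # [-4.0, -3.5, -3.0, -2.5, 2.0, 2.5, 3.0, 3.5]
```

On a GPU this unpacking runs for every weight tile before the multiply, which is why the destination of the dequantized values (shared memory or registers) matters for throughput.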
Main Contribution
Introduces QUICK: offline interleaving of quantized weight matrices to match Tensor Core load patterns and skip the shared-memory write-back.
Modifies the parallel dequantization kernel and combines two reordering patterns so that dequantized weights stay sequential and bank conflicts are reduced.
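The idea of offline interleaving can be illustrated with a toy Python sketch. It assumes a hypothetical strided consumption pattern (thread t needs elements t, t + T, t + 2T, ...), which is only a stand-in for the actual Tensor Core fragment layout that QUICK targets; all names here are illustrative.

```python
def consumption_order(n, num_threads):
    """Hypothetical order in which a kernel consumes n weights:
    thread t needs indices t, t + num_threads, t + 2 * num_threads, ..."""
    return [i for t in range(num_threads) for i in range(t, n, num_threads)]


def interleave_offline(weights, order):
    """Done once at export time: store weights in the order they are consumed."""
    return [weights[i] for i in order]


def thread_slice(interleaved, t, per_thread):
    """At run time each thread reads one contiguous chunk, with no reshuffle."""
    return interleaved[t * per_thread:(t + 1) * per_thread]


weights = list(range(16))
num_threads = 4
order = consumption_order(len(weights), num_threads)
stored = interleave_offline(weights, order)
print(thread_slice(stored, 1, len(weights) // num_threads))  # [1, 5, 9, 13]
```

Because the permutation is applied once when the model is exported, the run-time kernel pays nothing for it; the cost is that the stored layout is tied to one kernel's access pattern.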
Key Findings
QUICK reduces shared-memory bank conflicts that bottleneck mixed-precision GEMM during dequantization.
Matrix-multiply throughput vs AutoAWQ-Kernel improved by 1.33–1.91× at batch 256 on evaluated GPUs.
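For readers unfamiliar with bank conflicts, the following Python sketch models them in a simplified way. It assumes 32 banks that are each 4 bytes wide, treats lanes reading the same word as a broadcast, and ignores the multi-transaction handling of wider accesses; it is not a model of the QUICK or AutoAWQ kernels themselves.

```python
from collections import Counter

NUM_BANKS = 32   # assumed number of shared-memory banks
BANK_WIDTH = 4   # assumed bytes per bank word


def conflict_degree(byte_addresses):
    """Worst-case number of distinct words that map to a single bank.
    1 means conflict-free; k means the access is serialized into k passes."""
    words = {addr // BANK_WIDTH for addr in byte_addresses}  # same word: broadcast
    per_bank = Counter(word % NUM_BANKS for word in words)
    return max(per_bank.values())


def warp_addresses(stride_bytes, lanes=32):
    """Byte addresses touched by one warp when lane i accesses i * stride_bytes."""
    return [lane * stride_bytes for lane in range(lanes)]


print(conflict_degree(warp_addresses(4)))    # 1: one word per bank
print(conflict_degree(warp_addresses(8)))    # 2: two words share each used bank
print(conflict_degree(warp_addresses(128)))  # 32: every lane hits bank 0
```

A write pattern that lands several lanes on the same bank multiplies the cost of that store, which is the kind of serialization the findings above attribute to the dequantization write-back.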
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Matrix multiply speedup vs AutoAWQ-Kernel | 1.33–1.91× | AutoAWQ-Kernel | — | batch=256, matrices batch×8192×8192 | Section 4.1, Figure 7 | — |
| End-to-end token throughput vs AutoAWQ-Kernel | up to 1.94× | AutoAWQ-Kernel | — | various LLMs and GPUs (Section 4.2) | Section 4.2, Figure 8 | — |
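A throughput multiplier can be turned into a cost-per-token estimate with simple arithmetic. The sketch below assumes the hourly GPU price is unchanged and that the end-to-end speedup applies to your workload, which it may not; the function name is illustrative.

```python
def cost_per_token_reduction(speedup):
    """Fractional drop in GPU cost per token when throughput rises by
    `speedup` at an unchanged hourly GPU price: 1 - 1 / speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


best_case = cost_per_token_reduction(1.94)  # best reported end-to-end speedup
print(f"{best_case:.1%} lower cost per token")  # 48.5% lower cost per token
```

Note that 1.94× is the best reported case, so this is an upper bound on the saving rather than an expected value.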
What To Try In 7 Days
Grab QUICK from GitHub and run provided vLLM integration on a representative model and GPU.
Measure tokens/s vs your current AWQ or fp16 setup at batch sizes 32–256.
If beneficial, add weight-interleaving to your model export pipeline to reuse across deployments.
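The measurement step above can be sketched as a small Python harness. `generate` is a hypothetical callable that you would wrap around your own serving stack (it is not a vLLM or QUICK API), and `fake_generate` is a stand-in so the sketch runs without a GPU.

```python
import time


def measure_throughput(generate, batch_sizes, tokens_per_request, repeats=3):
    """Return {batch_size: tokens/s}, keeping the best of `repeats` runs.

    `generate(batch_size, tokens_per_request)` is a hypothetical callable that
    runs one batched generation and returns the number of tokens produced."""
    results = {}
    for bs in batch_sizes:
        best = 0.0
        for _ in range(repeats):
            start = time.perf_counter()
            produced = generate(bs, tokens_per_request)
            elapsed = time.perf_counter() - start
            best = max(best, produced / elapsed)
        results[bs] = best
    return results


def fake_generate(batch_size, tokens_per_request):
    """Stand-in backend: sleeps briefly and reports the requested token count."""
    time.sleep(0.001)
    return batch_size * tokens_per_request


throughput = measure_throughput(fake_generate, [32, 64, 128, 256],
                                tokens_per_request=16)
for bs, tps in throughput.items():
    print(f"batch={bs:4d}  {tps:,.0f} tokens/s")
```

Running the same harness against the current setup and the QUICK build, with identical prompts and output lengths, gives a like-for-like comparison at each batch size.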
Risks & Boundaries
Limitations
Gains are strongest for medium-large batches (≥32) and may not match fp16 at very large batches (>512).
Requires offline weight reordering; not ideal if weights change frequently at runtime.
When Not To Use
For single-request or very small-batch inference where dequantization overhead is minor.
When model weights are updated frequently and offline reordering is impractical.
Failure Modes
Increased register use can reduce active warps and negate throughput gains on some GPUs.
Incorrect interleaving or mismatched kernel assumptions could yield wrong computations.
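The register-pressure failure mode can be estimated with back-of-the-envelope arithmetic. The Python sketch below assumes 65,536 32-bit registers and at most 64 resident warps per SM (both vary by GPU generation, so check your device), and it ignores register allocation granularity, shared memory, and block-size limits.

```python
def warps_limited_by_registers(regs_per_thread, regs_per_sm=65536,
                               max_warps_per_sm=64, warp_size=32):
    """Upper bound on resident warps per SM from register pressure alone.
    Default hardware limits are assumptions; adjust them for your GPU."""
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    by_registers = regs_per_sm // (regs_per_thread * warp_size)
    return min(max_warps_per_sm, by_registers)


for regs in (32, 64, 128):
    print(regs, "registers/thread ->", warps_limited_by_registers(regs), "warps")
```

Under these assumptions, doubling registers per thread from 64 to 128 halves the resident warps from 32 to 16, which leaves the scheduler fewer warps to hide memory latency with.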

