Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
QUICK raises batched LLM inference throughput by roughly 27% to 94% over AutoAWQ kernels in the reported tests by eliminating shared-memory write-back stalls. This lowers GPU cost per token and makes larger-batch inference with quantized models more practical.
Summary TLDR
QUICK is a set of CUDA kernels that operate on 4-bit quantized weight matrices reordered (interleaved) offline, so that weights loaded from DRAM can be dequantized in registers and used directly, without a round trip through shared memory. This removes the shared-memory write-back of dequantized weights, which causes bank conflicts and stalls in mixed-precision GEMM on NVIDIA GPUs. On common LLM workloads, QUICK shows up to ~1.9× matrix-multiply speedup over AutoAWQ kernels and up to ~1.94× end-to-end token throughput on the evaluated models and GPUs. Code is available.
Problem Statement
Weight-only quantization reduces model memory but requires dequantization before GEMM. Existing mixed-precision kernels write dequantized weights to shared memory, causing bank conflicts that hurt throughput at larger batch sizes. The paper targets this dequantization write-back bottleneck to speed up inference.
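To make the dequantization step concrete, here is a minimal sketch of unpacking 4-bit weights and restoring floating-point values. The nibble order, the `scale` and `zero` parameters, and the function names are assumptions chosen for illustration; they are not the packing format used by AutoAWQ or QUICK.

```python
def pack_int4(values):
    """Pack eight unsigned 4-bit values into one 32-bit word, lowest nibble first."""
    assert len(values) == 8 and all(0 <= v < 16 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word


def unpack_int4(word):
    """Recover the eight 4-bit values from a packed 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequantize(word, scale, zero):
    """Map quantized integers back to real-valued weights: w = (q - zero) * scale."""
    return [(q - zero) * scale for q in unpack_int4(word)]


packed = pack_int4([0, 1, 2, 3, 12, 13, 14, 15])
weights = dequantize(packed, scale=0.5, zero=8)
```

In a mixed-precision GEMM kernel this unpack-and-scale step runs on the GPU for every weight tile, which is why where the dequantized values end up (shared memory or registers) matters for throughput.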
Main Contribution
Introduce QUICK: offline interleaving of quantized weight matrices to match Tensor Core load patterns and skip the shared-memory write-back.
Modify the parallel dequantization kernel and combine two reordering patterns so that dequantized weights stay sequential and bank conflicts are reduced.
Tune tile sizes and avoid storing weights in shared memory, trading shared-memory pressure for register usage to improve throughput at larger batch sizes.
Provide a CUDA kernel implementation, integrate it with vLLM, and release the code on GitHub.
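The offline interleaving idea can be sketched as a one-time permutation of stored weights so that each thread's elements become contiguous. The access pattern below (`fragment_indices`, 4 threads, a 16-element tile, two adjacent elements from each half of the tile) is a hypothetical toy layout, not the actual Tensor Core fragment mapping or QUICK's reordering.

```python
def fragment_indices(thread, k_tile=16):
    """Hypothetical per-thread access pattern over one tile of k_tile elements."""
    half = k_tile // 2
    base = thread * 2
    return [base, base + 1, base + half, base + half + 1]


def interleave_permutation(k_tile=16, threads=4):
    """Build the storage order: thread 0's elements first, then thread 1's, and so on."""
    perm = []
    for t in range(threads):
        perm.extend(fragment_indices(t, k_tile))
    return perm


def reorder_offline(row, perm):
    """Apply the permutation once, ahead of time, to a row of weights."""
    return [row[i] for i in perm]


row = list(range(16))            # logical order of weights in one tile
perm = interleave_permutation()
stored = reorder_offline(row, perm)
# After reordering, thread t reads stored[4 * t : 4 * t + 4] as one contiguous chunk.
```

Because the permutation is fixed by the kernel's access pattern, it can be applied when the model is exported rather than at inference time.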
Key Findings
QUICK reduces shared-memory bank conflicts that bottleneck mixed-precision GEMM during dequantization.
Matrix-multiply throughput vs AutoAWQ-Kernel improved by 1.33–1.91× at batch 256 on evaluated GPUs.
End-to-end token throughput across tested LLMs reached up to 1.94× versus AutoAWQ-Kernel.
vLLM integration: Vicuna-13B throughput was 27% higher than AWQ and 33% higher than fp16; Llama-2-70B throughput was 29% higher than AWQ.
QUICK can sometimes outperform fp16 kernels at medium-large batch sizes (example: batch 128).
Performance still lags fp16 at very large batches (>512) and further dequantization optimization is needed.
Results
Matrix multiply speedup vs AutoAWQ-Kernel: 1.33–1.91× at batch 256
End-to-end token throughput vs AutoAWQ-Kernel: up to 1.94×
vLLM throughput (Vicuna-13B): +27% vs AWQ, +33% vs fp16
vLLM throughput (Llama-2-70B): +29% vs AWQ
Who Should Care
What To Try In 7 Days
Grab QUICK from GitHub and run the provided vLLM integration on a representative model and GPU.
Measure tokens/s against your current AWQ or fp16 setup at batch sizes 32–256.
If the gains hold, add weight interleaving to your model export pipeline so the reordered weights can be reused across deployments.
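For the measurement step, a small timing harness along these lines may be enough to compare setups. The `generate` callable is a hypothetical stand-in for whatever inference entry point you use (it is not a vLLM or QUICK API); it only needs to return the number of tokens it produced.

```python
import time


def tokens_per_second(generate, prompts, repeats=3):
    """Return the best observed tokens/s over several timed runs.

    `generate(prompts)` is a hypothetical callable that runs one batch
    and returns the total number of generated tokens.
    """
    generate(prompts)  # warm-up run, excluded from timing
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        n_tokens = generate(prompts)
        elapsed = time.perf_counter() - start
        best = max(best, n_tokens / elapsed)
    return best


def speedup(candidate_tps, baseline_tps):
    """Throughput ratio of the candidate setup over the baseline."""
    return candidate_tps / baseline_tps
```

Run the same prompts and batch size for each backend, then compare with `speedup(quick_tps, awq_tps)`; a value of 1.27 would correspond to a 27% gain.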
Optimization Features
Infra Optimization
- better utilization of NVIDIA Tensor Cores
- reduced effective DRAM accesses per tile
Model Optimization
- weight-only quantization (4-bit)
System Optimization
- shift storage from shared memory to registers
- reduce shared-memory bank conflicts
- improve data locality for dequantization
Inference Optimization
- interleaved weight layout to match Tensor Core loads
- skip shared-memory write-back of dequantized weights
- tile size and warp occupancy tuning
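The bank-conflict item above can be illustrated with a small model of shared memory. The sketch assumes 32 banks with one 4-byte word per bank, which matches current NVIDIA GPUs; the helper name and the example access patterns are illustrative and do not reproduce the access pattern of any specific kernel.

```python
from collections import defaultdict


def bank_conflict_degree(word_addresses, banks=32):
    """Largest number of distinct words that map to the same bank in one warp access.

    A degree of 1 means conflict-free; a degree of n means the access is
    serialized into roughly n passes.
    """
    per_bank = defaultdict(set)
    for addr in word_addresses:
        per_bank[addr % banks].add(addr)
    return max(len(words) for words in per_bank.values())


unit_stride = [lane for lane in range(32)]   # each lane hits a different bank
stride_8 = [lane * 8 for lane in range(32)]  # lanes collide on 4 banks

conflict_free = bank_conflict_degree(unit_stride)
conflicted = bank_conflict_degree(stride_8)
```

Skipping the shared-memory write-back removes this class of serialized accesses for the dequantized weights altogether, rather than trying to pad or swizzle around them.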
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Gains are strongest for medium-large batches (≥32) and may not match fp16 at very large batches (>512).
- Requires offline weight reordering; not ideal if weights change frequently at runtime.
- Shifts pressure from shared memory to registers which can limit occupancy on some GPUs.
When Not To Use
- For single-request or very small-batch inference where dequantization overhead is minor.
- When model weights are updated frequently and offline reordering is impractical.
- On hardware that lacks the targeted Tensor Core load/store patterns.
Failure Modes
- Increased register use can reduce active warps and negate throughput gains on some GPUs.
- Incorrect interleaving or mismatched kernel assumptions could yield wrong computations.
- Performance gains shrink or reverse at extreme batch sizes (>512) per reported results.
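The register-pressure failure mode can be estimated with simple arithmetic. This sketch assumes 65,536 32-bit registers per multiprocessor and a cap of 64 resident warps; both are assumptions that vary by GPU architecture, and the calculation ignores register allocation granularity and other occupancy limits.

```python
def warps_limited_by_registers(regs_per_thread, regs_per_sm=65536,
                               max_warps=64, warp_size=32):
    """Upper bound on resident warps per multiprocessor imposed by register use."""
    return min(max_warps, regs_per_sm // (regs_per_thread * warp_size))


light = warps_limited_by_registers(32)    # register file is not the limit
medium = warps_limited_by_registers(64)   # half of the warp cap
heavy = warps_limited_by_registers(128)   # a quarter of the warp cap
```

If a kernel's register count per thread doubles when weights move from shared memory to registers, the number of active warps can halve, which is the occupancy trade-off worth checking with a profiler on your target GPU.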
Core Entities
Models
- Mistral-7B
- Vicuna-13B
- LLaMA-2-13B
- LLaMA-33B
- Llama-2-70B
Metrics
- TOPS (tera-ops/sec)
- tokens/s
- speedup ratio
- shared memory bank conflicts (counts)
Benchmarks
- matrix multiplication batch×8192×8192
- vLLM throughput benchmark (recommended dataset)
Context Entities
Models
- fp16 baselines (fp16 GEMM)
- AutoAWQ mixed-precision kernels
Metrics
- tokens/s
- throughput speedup
- active warps per multiprocessor
Benchmarks
- matrix multiplication microbenchmarks
- end-to-end token generation on multiple GPUs

