Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
A low-complexity, offline reorder can cut inter-GPU communication and speed up quantized LLM inference, lowering latency and increasing throughput for multi-GPU serving without changing model weights.
Summary TLDR
TP-Aware Dequantization rearranges quantized weight storage so that each GPU can reuse its local metadata and the costly AllGather between column- and row-sharded layers is avoided. Applied to GPTQ-style 4-bit weights, the method targets MLP layers in Transformer blocks and yields up to ~1.8x latency speedups at Llama-70B and Granite-20B MLP layer sizes on A100/H100 multi-GPU nodes. The change is an offline reorder (permutation) of weight columns that keeps dequantization and GEMM local, reducing global communication.
Problem Statement
GPTQ-style quantization stores per-group scales/zeros and an index mapping rows to groups. When GPTQ's activation-order reordering is used, metadata lookups become scattered and, under Tensor Parallel (TP), force extra AllGather communication between column- and row-sharded layers. This increases inference latency and reduces throughput for large LLMs in multi-GPU deployments.
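To make the scattered-lookup problem concrete, here is a minimal pure-Python sketch. The sizes, the `quant_order` list, and the `group_switches` helper are illustrative assumptions, not taken from the paper or from any GPTQ library; the sketch only counts how often consecutive rows need a different group's scales/zeros.

```python
# Toy GPTQ-style layout: K input rows, groups of G rows share one scale/zero.
K, G = 8, 4

# Without activation-order reordering, row i belongs to group i // G,
# so metadata lookups walk through the groups contiguously.
g_idx_plain = [i // G for i in range(K)]

# With activation-order reordering, rows are quantized in a data-dependent
# order (the order below is made up), so the stored row -> group index
# ends up interleaved.
quant_order = [5, 0, 7, 2, 1, 6, 3, 4]  # hypothetical processing order
g_idx_act = [0] * K
for position, row in enumerate(quant_order):
    g_idx_act[row] = position // G

def group_switches(g_idx):
    """Count how often consecutive rows need a different group's metadata."""
    return sum(1 for a, b in zip(g_idx, g_idx[1:]) if a != b)

print(g_idx_plain, group_switches(g_idx_plain))
print(g_idx_act, group_switches(g_idx_act))
```

In this toy case the contiguous layout changes group once, while the interleaved layout changes group on almost every row, which is the locality loss described above.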
Main Contribution
Identify that GPTQ act_order reordering breaks GPU data locality and causes extra AllGather in TP setups.
Propose an offline permutation-based reorder (argsort) to make group metadata consecutive and cache-friendly.
Introduce a TP-aware trick: permute columns of the first (column-TP) weight shard with the second layer's permutation so the inter-layer AllGather is not needed for MLP layers.
Measure MLP-layer latency at Llama-70B and Granite-20B problem sizes on A100 and H100 DGX nodes, with speedups of up to ~1.8x.
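The argsort-based reorder in the second contribution can be sketched for a single layer as follows. This is an illustration under stated assumptions: the group index values, the integer matrices standing in for dequantized weights, and the `matmul` helper are all made up for the example and do not reproduce the authors' kernels.

```python
# Toy sizes: K input features (rows of W), N outputs.
K, N = 8, 3

# Scattered row -> group index, as activation-order reordering might
# leave it (made-up values).
g_idx = [0, 1, 0, 1, 1, 0, 1, 0]

# Offline step: a stable argsort of g_idx yields a permutation that places
# all rows of group 0 first, then all rows of group 1.
perm = sorted(range(K), key=lambda i: g_idx[i])
g_idx_sorted = [g_idx[i] for i in perm]

def matmul(A, B):
    """Plain-Python matrix product for small integer matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Integer stand-ins for already-dequantized weights and activations.
W = [[(3 * r + c) % 7 - 3 for c in range(N)] for r in range(K)]
X = [[(r + 2 * c) % 5 - 2 for c in range(K)] for r in range(2)]

# Reorder W's rows offline; the activations' columns must follow the same order.
W_sorted = [W[i] for i in perm]
X_sorted = [[row[i] for i in perm] for row in X]

# Groups are now contiguous, and the product is unchanged.
assert g_idx_sorted == [0, 0, 0, 0, 1, 1, 1, 1]
assert matmul(X_sorted, W_sorted) == matmul(X, W)
```

The product is preserved because the same permutation is applied to both operands along the shared dimension; the need to reorder the activations at runtime is exactly what becomes expensive once that dimension is sharded across ranks.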
Key Findings
TP-Aware Dequantization speeds up MLP-layer inference in distributed LLMs.
Speedup grows with more tensor-parallel ranks.
For single-GPU (TP=1) the benefit is negligible.
The method applies only to MLP (feed-forward) layers in Transformer blocks as presented.
Results
[Figures: latency comparisons for Llama-70B on A100 at TP=4, M=8 and at TP=8, M=4; Granite-20B on A100 at TP=4, M=16; average speedup summary for Llama-70B on A100]
What To Try In 7 Days
If you use GPTQ quantized models with TP, test an offline argsort-based reorder of group indices and store the permutation.
For MLP layers, permute W1 columns by the downstream permutation and benchmark end-to-end latency to check AllGather removal.
Run microbenchmarks on your cluster at TP=2,4,8 to measure real speedups and pick where to deploy the change.
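A small timing harness is enough to start the benchmarking step. The sketch below uses only the Python standard library; `baseline_mlp` and `tp_aware_mlp` are hypothetical placeholders for your own forward passes, and the warmup and iteration counts are arbitrary defaults.

```python
import statistics
import time

def median_latency_ms(fn, warmup=3, iters=20):
    """Median wall-clock time of fn() in milliseconds, after warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

def speedup(baseline_fn, candidate_fn, **kwargs):
    """Ratio of baseline latency to candidate latency (>1 means faster)."""
    return median_latency_ms(baseline_fn, **kwargs) / median_latency_ms(candidate_fn, **kwargs)

# Placeholder workloads; swap in the real MLP forward passes.
def baseline_mlp():
    sum(i * i for i in range(20000))

def tp_aware_mlp():
    sum(i * i for i in range(10000))

print(f"speedup: {speedup(baseline_mlp, tp_aware_mlp):.2f}x")
```

On a GPU, each timed callable should block until its kernels have finished (for example by calling `torch.cuda.synchronize()` if you use PyTorch), otherwise the timer only measures launch overhead. Repeat the measurement once per TP degree, since each degree usually needs its own launch.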
Optimization Features
Infra Optimization
- works on multi-GPU A100/H100 nodes
- offline permutation reduces runtime CPU/GPU ops
Model Optimization
- quantized weights reorder (group-index argsort)
- metadata locality via contiguous groups
System Optimization
- reduce global communication across TP ranks
- improve GPU memory throughput
Inference Optimization
- avoid AllGather between column-TP and row-TP layers
- permute W1 columns with downstream permutation to align shards
- apply GPU-local dequantization to reuse metadata
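The column-permutation trick can be simulated without GPUs. The sketch below assumes a two-layer MLP with a ReLU in between, uses slices of plain Python lists as stand-ins for tensor-parallel ranks, and picks `P2` arbitrarily; none of the names come from the paper's code. The running sum stands in for the final AllReduce.

```python
TP = 2        # number of simulated tensor-parallel ranks
D, H = 3, 4   # model width and MLP hidden width (toy sizes)

def matmul(A, B):
    """Plain-Python matrix product for small integer matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(A):
    return [[max(v, 0) for v in row] for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

X = [[1, -2, 3], [0, 4, -1]]                                      # (2, D)
W1 = [[(2 * r + c) % 5 - 2 for c in range(H)] for r in range(D)]  # (D, H), column-TP
W2 = [[(r + 3 * c) % 7 - 3 for c in range(D)] for r in range(H)]  # (H, D), row-TP

P2 = [2, 0, 3, 1]  # assumed argsort permutation for W2's rows (made up)

# Offline: reorder W2's rows by P2 and W1's columns by the same P2.
W2_p = [W2[i] for i in P2]
W1_p = [[row[i] for i in P2] for row in W1]

# Runtime: each rank holds a column slice of W1_p and the matching row slice of W2_p.
shard = H // TP
out = [[0] * D for _ in X]
for rank in range(TP):
    lo, hi = rank * shard, (rank + 1) * shard
    W1_local = [row[lo:hi] for row in W1_p]
    W2_local = W2_p[lo:hi]
    Y1_local = relu(matmul(X, W1_local))        # already in P2 order, stays on this rank
    out = add(out, matmul(Y1_local, W2_local))  # partial sums, combined as in an AllReduce

reference = matmul(relu(matmul(X, W1)), W2)
assert out == reference
```

Because the first layer's output columns already arrive in the order the second layer's rows expect, no rank needs another rank's slice of the intermediate activations. The elementwise activation does not disturb this, since it commutes with a column permutation.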
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Method targets MLP layers only; attention layers need extra handling.
- Gains are small for TP=1 and grow with number of TP ranks.
- Paper reports microbenchmarks on MLP sizes, not end-to-end full-model latency.
When Not To Use
- Single-GPU inference (TP=1) where communication is absent.
- When the model's sharding or attention pattern does not match the Column-TP → Row-TP sequence.
- If you cannot offline-modify stored weight/permutation artifacts.
Failure Modes
- Incorrectly applied permutations produce wrong alignment and wrong outputs.
- Attention-layer sharding differences may reintroduce communication or require separate fixes.
- Assumes GPTQ-style group metadata; other quantization formats may not benefit.
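A cheap guard against the first failure mode is to validate every stored permutation and compare the reordered product with an unpermuted reference before deploying. The following sketch uses made-up matrices and hypothetical helper names (`is_permutation`, `invert`, `take_rows`, `take_cols`); it shows a correct application next to a typical mistake, applying the inverse permutation on one side.

```python
def is_permutation(perm, n):
    """True if perm contains each index 0..n-1 exactly once."""
    return len(perm) == n and sorted(perm) == list(range(n))

def invert(perm):
    inv = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inv[old_pos] = new_pos
    return inv

def matmul(A, B):
    """Plain-Python matrix product for small integer matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def take_rows(M, idx):
    return [M[i] for i in idx]

def take_cols(M, idx):
    return [[row[i] for i in idx] for row in M]

perm = [2, 0, 3, 1]                     # made-up permutation
W = [[1, 0], [0, 2], [3, 1], [-1, 4]]   # stand-in for dequantized weights
X = [[1, 2, 3, 4]]                      # stand-in for activations

reference = matmul(X, W)
good = matmul(take_cols(X, perm), take_rows(W, perm))
bad = matmul(take_cols(X, invert(perm)), take_rows(W, perm))  # mismatched sides

assert is_permutation(perm, len(W))
assert good == reference   # consistent permutation keeps the output
assert bad != reference    # mismatched permutation silently changes it
```

With real floating-point weights the comparison would need a tolerance rather than exact equality.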
Core Entities
Models
- Llama-70B
- Granite-20B
Metrics
- inference latency (ms)
- speedup (×)

