Reorder quantized weights to avoid inter-GPU communication and speed up LLM inference by up to ~1.8x

January 15, 2024 · 6 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Adnan Hoque, Mudhakar Srivatsa, Chih-Chieh Yang, Raghu Ganti

Why It Matters For Business

A low-complexity, offline reorder can cut inter-GPU communication and speed up quantized LLM inference, lowering latency and increasing throughput for multi-GPU serving without changing the model's weight values.

Summary TLDR

TP-Aware Dequantization rearranges quantized weight storage so GPU-local metadata can be reused and a costly AllGather between column- and row-sharded layers is avoided. Applied to GPTQ-style 4-bit weights, the method targets MLP layers in Transformer blocks and yields up to ~1.8x latency speedups on Llama-70B and Granite-20B across A100/H100 multi-GPU nodes. The change is an offline reorder (permutation) of weight columns so dequantization and GEMM stay local, reducing global communication.

Problem Statement

GPTQ-style quantization stores per-group scales/zeros and an index mapping rows to groups. When GPTQ's activation-order reordering is used, metadata lookups become scattered and, under Tensor Parallel (TP), force extra AllGather communication between column- and row-sharded layers. This increases inference latency and reduces throughput for large LLMs in multi-GPU deployments.
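To make the locality problem concrete, here is a minimal sketch in plain Python. It assumes a simplified model of activation-order quantization: rows are processed in order of decreasing importance, and the j-th processed row joins group `j // group_size`. The helper names (`naive_g_idx`, `act_order_g_idx`, `metadata_switches`) and the random importance scores are hypothetical and only illustrate how the group index becomes scattered.

```python
import random

def naive_g_idx(k, group_size):
    """Group index without reordering: consecutive rows share a group."""
    return [i // group_size for i in range(k)]

def act_order_g_idx(k, group_size, importance):
    """Simplified act_order model (assumption): rows are quantized in order
    of decreasing importance, and the j-th processed row joins group j // group_size."""
    order = sorted(range(k), key=lambda i: -importance[i])
    g_idx = [0] * k
    for j, row in enumerate(order):
        g_idx[row] = j // group_size
    return g_idx

def metadata_switches(g_idx):
    """Count how often consecutive rows need a different (scale, zero) pair."""
    return sum(1 for a, b in zip(g_idx, g_idx[1:]) if a != b)

random.seed(0)
K, G = 16, 4  # toy sizes: 16 rows, 4 rows per group
importance = [random.random() for _ in range(K)]

g_naive = naive_g_idx(K, G)
g_act = act_order_g_idx(K, G, importance)

print("naive g_idx:    ", g_naive, "switches:", metadata_switches(g_naive))
print("act_order g_idx:", g_act, "switches:", metadata_switches(g_act))
```

In the naive layout each scale/zero pair is loaded once and reused for a whole run of rows; with the scattered index, neighbouring rows tend to need different metadata, which is the access pattern the reorder is meant to remove.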

Main Contribution

Identify that GPTQ act_order reordering breaks GPU data locality and causes extra AllGather in TP setups.

Propose an offline permutation-based reorder (argsort) to make group metadata consecutive and cache-friendly.

Introduce a TP-aware trick: permute columns of the first (column-TP) weight shard with the second layer's permutation so the inter-layer AllGather is not needed for MLP layers.

Measure latency gains on MLP-layer sizes taken from Llama-70B and Granite-20B across A100 and H100 DGX nodes, with up to ~1.8x speedups.
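The contributions above can be sketched with a toy two-layer MLP in plain Python. This is an illustration under stated assumptions, not the paper's kernels: weights are small integer matrices standing in for already-dequantized values, ReLU stands in for the activation, the AllReduce is simulated by summing per-rank partial results in a loop, and all names (`tp_aware_mlp`, `g_idx2`, `p2`) are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def take_cols(m, idx):
    return [[row[j] for j in idx] for row in m]

def take_rows(m, idx):
    return [m[i] for i in idx]

def relu(m):
    return [[max(0, v) for v in row] for row in m]

def argsort(values):
    # Stable argsort: indices that would sort `values`.
    return sorted(range(len(values)), key=values.__getitem__)

def tp_aware_mlp(x, w1_perm, w2_perm, tp):
    """Each simulated rank uses only its own column shard of W1 and the
    matching row shard of W2; partial outputs are summed (AllReduce stand-in)."""
    shard = len(w2_perm) // tp
    total = None
    for rank in range(tp):
        idx = list(range(rank * shard, (rank + 1) * shard))
        y1_local = relu(matmul(x, take_cols(w1_perm, idx)))
        partial = matmul(y1_local, take_rows(w2_perm, idx))
        if total is None:
            total = partial
        else:
            total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, partial)]
    return total

x = [[1, -2, 3], [0, 4, -1]]                       # activations, 2 x 3
w1 = [[1, 0, 2, -1], [3, 1, 0, 2], [-2, 1, 1, 0]]  # first layer, 3 x 4 (column-TP)
w2 = [[1, 2], [0, -1], [3, 1], [-2, 0]]            # second layer, 4 x 2 (row-TP)
g_idx2 = [1, 0, 1, 0]                              # scattered group index of W2 rows

p2 = argsort(g_idx2)                  # offline permutation
sorted_groups = [g_idx2[i] for i in p2]
w2_perm = take_rows(w2, p2)           # W2 rows reordered so groups are contiguous
w1_perm = take_cols(w1, p2)           # W1 columns permuted with the same p2

reference = matmul(relu(matmul(x, w1)), w2)
tp_result = tp_aware_mlp(x, w1_perm, w2_perm, tp=2)
print(p2, sorted_groups, tp_result == reference)
```

Because the first layer's columns already follow the second layer's permutation, each rank's local output lines up with its local rows of the second layer, so no gather of the intermediate activations is needed before the second matrix multiply.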

Key Findings

TP-Aware Dequantization speeds up MLP-layer inference in distributed LLMs.

Numbers: up to 1.81x (Llama-70B, A100) and up to 1.83x (Granite-20B, A100)

Speedup grows with more tensor-parallel ranks.

Numbers: average speedup ~1.22x (TP=2) → ~1.81x (TP=8) for Llama-70B on A100

For single-GPU (TP=1) the benefit is negligible.

Numbers: latency 0.696 → 0.688 ms (A100) and 0.489 → 0.481 ms (H100)

The method applies only to MLP (feed-forward) layers in Transformer blocks as presented.

Results

Latency (Llama-70B, A100, TP=4, M=8)

Value: Naive 0.518 ms → TP-Aware 0.285 ms

Baseline: Naive Algorithm 0.518 ms

Latency (Llama-70B, A100, TP=8, M=4)

Value: Naive 0.539 ms → TP-Aware 0.291 ms

Baseline: Naive Algorithm 0.539 ms

Latency (Granite-20B, A100, TP=4, M=16)

Value: Naive 0.53 ms → TP-Aware 0.29 ms

Baseline: Naive Algorithm 0.53 ms

Average speedup summary (Llama-70B, A100)

Value: TP=2: 1.22x, TP=4: 1.78x, TP=8: 1.81x (averages reported)

Baseline: Naive Algorithm
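Speedup here is the naive latency divided by the TP-Aware latency. The short sketch below recomputes it for the individual data points listed in this section; the dictionary keys and the `speedup` helper are illustrative names, and the latencies are copied from the entries above.

```python
def speedup(naive_ms, tp_aware_ms):
    """Speedup factor: how many times faster TP-Aware is than the naive baseline."""
    return naive_ms / tp_aware_ms

# (naive ms, TP-Aware ms), taken from the results listed above.
latencies_ms = {
    "Llama-70B, A100, TP=4, M=8": (0.518, 0.285),
    "Llama-70B, A100, TP=8, M=4": (0.539, 0.291),
    "Granite-20B, A100, TP=4, M=16": (0.53, 0.29),
}

for name, (naive, aware) in latencies_ms.items():
    print(f"{name}: {speedup(naive, aware):.2f}x")
```

Individual data points can sit slightly above or below the averaged figures, since the averages are taken over several configurations per TP setting.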

Who Should Care

What To Try In 7 Days

If you serve GPTQ-quantized models with TP, test an offline argsort-based reorder of the group indices and store the resulting permutation alongside the weights.

For MLP layers, permute W1 columns by the downstream permutation and benchmark end-to-end latency to check AllGather removal.

Run microbenchmarks on your cluster at TP=2,4,8 to measure real speedups and pick where to deploy the change.
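For the benchmarking step, a small timing harness is often enough to compare the two code paths. The sketch below uses only the Python standard library; `benchmark_ms`, `naive_step`, and `tp_aware_step` are hypothetical names, and the two step functions are CPU stand-ins for your real naive and TP-Aware MLP calls. When timing GPU work, the callable must block until the device has finished (for example by synchronizing inside it), otherwise only the launch overhead is measured.

```python
import statistics
import time

def benchmark_ms(fn, warmup=10, iters=100):
    """Median wall-clock latency of fn() in milliseconds, after warm-up runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

# Stand-in workloads (assumption): replace with the real MLP forward passes.
def naive_step():
    sum(i * i for i in range(20_000))

def tp_aware_step():
    sum(i * i for i in range(10_000))

naive_ms = benchmark_ms(naive_step, warmup=3, iters=20)
tp_aware_ms = benchmark_ms(tp_aware_step, warmup=3, iters=20)
print(f"naive {naive_ms:.3f} ms, tp-aware {tp_aware_ms:.3f} ms, "
      f"speedup {naive_ms / tp_aware_ms:.2f}x")
```

Repeating the measurement at TP=2, 4, and 8 with your own batch sizes gives a per-configuration speedup to compare against the reported averages.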

Optimization Features

Infra Optimization

  • works on multi-GPU A100/H100 nodes
  • offline permutation reduces runtime CPU/GPU ops

Model Optimization

  • quantized weights reorder (group-index argsort)
  • metadata locality via contiguous groups

System Optimization

  • reduce global communication across TP ranks
  • improve GPU memory throughput

Inference Optimization

  • avoid AllGather between column-TP and row-TP layers
  • permute W1 columns with downstream permutation to align shards
  • apply GPU-local dequantization to reuse metadata

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Method targets MLP layers only; attention layers need extra handling.
  • Gains are negligible at TP=1 and grow with the number of TP ranks.
  • Paper reports microbenchmarks on MLP sizes, not end-to-end full-model latency.

When Not To Use

  • Single-GPU inference (TP=1) where communication is absent.
  • When the model uses sharding or attention patterns that do not match the Column-TP → Row-TP sequence.
  • If you cannot offline-modify stored weight/permutation artifacts.

Failure Modes

  • Incorrectly applied permutations produce wrong alignment and wrong outputs.
  • Attention-layer sharding differences may reintroduce communication or require separate fixes.
  • Assumes GPTQ-style group metadata; other quantization formats may not benefit.
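The first failure mode can be guarded against with cheap offline checks before the reordered weights are deployed. The following is a minimal sketch in plain Python; the helper names (`is_permutation`, `invert`, `apply_perm`, `check_pair`) are hypothetical and not part of any library.

```python
def is_permutation(perm):
    """True if perm contains each index 0..n-1 exactly once."""
    return sorted(perm) == list(range(len(perm)))

def invert(perm):
    """Inverse permutation: undoes a reorder done with `perm`."""
    inv = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inv[old_pos] = new_pos
    return inv

def apply_perm(perm, items):
    return [items[i] for i in perm]

def check_pair(perm_w1_cols, perm_w2_rows):
    """The permutation applied to the first layer's columns must be a valid
    permutation and identical to the one applied to the second layer's rows."""
    return is_permutation(perm_w2_rows) and perm_w1_cols == perm_w2_rows

perm = [1, 3, 0, 2]
rows = ["r0", "r1", "r2", "r3"]
reordered = apply_perm(perm, rows)
restored = apply_perm(invert(perm), reordered)

print(is_permutation(perm), restored == rows, check_pair(perm, perm))
```

A stronger check is to compare the outputs of the reordered layers against the original layers on a few random inputs before switching traffic.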

Core Entities

Models

  • Llama-70B
  • Granite-20B

Metrics

  • inference latency (ms)
  • speedup (×)