Serve large LLMs on mixed-GPU clusters with phase-aware partitioning and adaptive mixed-precision quantization

March 2, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

3

Authors

Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu

Links

Abstract / PDF

Why It Matters For Business

If you run batched LLM workloads, LLM-PQ lets you serve them on a mix of low- and high-end GPUs, raising throughput (up to 2.88× in the paper's experiments) and lowering cost while preserving model quality.

Summary TLDR

LLM-PQ is a system for running large decoder-only language models on heterogeneous GPU clusters. It jointly picks per-layer quantization bitwidths, pipeline layer splits, and micro-batch sizes using a latency and memory cost model and a variance-based sensitivity indicator. Targeting offline batch workloads with known prompt and generation lengths, LLM-PQ reports up to 2.88× higher throughput (2.26× on average) than strong baselines while keeping or improving model quality. The authors release code and show that their latency model predicts within ~6% on held-out workloads.

Problem Statement

Existing LLM serving systems assume uniform high-end GPUs or uniform quantization. In mixed GPU pools, an even layer partition combined with uniform compression either wastes memory on large GPUs or causes OOM on small ones. It also ignores the two-phase (prefill/decode) nature of generative LLM inference. The paper asks: how can layer placement, per-layer bitwidths, and micro-batch sizes be chosen jointly to maximize throughput while meeting a user quality target on heterogeneous clusters?
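To make the memory mismatch concrete, here is a rough back-of-the-envelope sketch. It assumes about 12 × hidden² weights per decoder layer (attention plus a 4×-wide MLP), an OPT-30b-like shape (48 layers, hidden size 7168), and a hypothetical pair of 16 GiB and 40 GiB GPUs. It ignores KV cache and activations, and the function names are illustrative, not part of LLM-PQ.

```python
# Illustrative arithmetic only: sizes and GPU capacities are assumptions, not paper measurements.

def layer_weight_gib(hidden: int, bits: int) -> float:
    """Approximate weight memory of one decoder layer (about 12 * hidden^2 parameters)."""
    params = 12 * hidden * hidden
    return params * bits / 8 / 2**30

def even_split_report(num_layers, hidden, bits, gpu_mem_gib):
    """Give every GPU the same number of layers; return (capacity, needed GiB, fits?)."""
    layers_per_gpu = num_layers // len(gpu_mem_gib)
    need = layers_per_gpu * layer_weight_gib(hidden, bits)
    return [(cap, round(need, 1), need <= cap) for cap in gpu_mem_gib]

# Uniform FP16: the 16 GiB GPU runs out of memory.
print(even_split_report(48, 7168, 16, [16, 40]))  # [(16, 27.6, False), (40, 27.6, True)]
# Uniform 4-bit: everything fits, but the 40 GiB GPU leaves most of its memory unused.
print(even_split_report(48, 7168, 4, [16, 40]))   # [(16, 6.9, True), (40, 6.9, True)]
```

Neither uniform setting suits both GPUs at once, which is the gap a joint choice of split and per-layer bitwidth is meant to close.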

Main Contribution

A memory and latency cost model for mixed-precision, phase-aware LLM serving that predicts memory usage and per-shard latency.

Adaptive mixed-precision added to the search space plus a lightweight variance-based indicator to rank layer sensitivity to quantization.

An optimizer that enumerates device orderings and micro-batches, then solves an ILP (with a practical heuristic) to assign layer partitions and bitwidths.

A prototype runtime with on-the-fly quantized loading, a thread-safe micro-batch scheduler, and experiments on 11 real clusters; code published on GitHub.
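The search described in these contributions can be pictured with a toy planner. The sketch below is not the paper's ILP: it replaces the solver with brute-force enumeration, uses one bitwidth per pipeline stage instead of per layer, and relies on made-up latency numbers, a made-up quality penalty table, and a simple pipeline estimate (sum of stage times plus the slowest stage repeated for each extra micro-batch). Every name and constant is hypothetical.

```python
from itertools import combinations, permutations, product

BITS = (16, 8, 4)
PENALTY = {16: 0.0, 8: 0.1, 4: 1.0}  # hypothetical per-layer quality cost per bitwidth

def plan(devices, num_layers, layer_gib_fp16, batch, mb_choices, budget):
    """Enumerate device orderings, micro-batch sizes, layer cuts and stage bitwidths."""
    best = None
    k = len(devices)
    for order in permutations(devices):
        for mb in mb_choices:
            n_mb = -(-batch // mb)  # number of micro-batches (ceiling division)
            for cuts in combinations(range(1, num_layers), k - 1):
                bounds = (0, *cuts, num_layers)
                sizes = [b - a for a, b in zip(bounds, bounds[1:])]
                for bits in product(BITS, repeat=k):
                    # Quality constraint: total penalty must stay within the budget.
                    if sum(n * PENALTY[b] for n, b in zip(sizes, bits)) > budget:
                        continue
                    # Memory constraint: quantized weights must fit on each device.
                    mems = [n * layer_gib_fp16 * b / 16 for n, b in zip(sizes, bits)]
                    if any(m > d["mem_gib"] for m, d in zip(mems, order)):
                        continue
                    # Toy latency: per-layer time grows sublinearly with micro-batch size.
                    lats = [n * d["ms"][b] * (1 + 0.25 * (mb - 1))
                            for n, b, d in zip(sizes, bits, order)]
                    total = sum(lats) + (n_mb - 1) * max(lats)
                    if best is None or total < best[0]:
                        best = (total, [d["name"] for d in order], mb, sizes, bits)
    return best

small = {"name": "small", "mem_gib": 8, "ms": {16: 2.0, 8: 1.6, 4: 1.4}}
big = {"name": "big", "mem_gib": 40, "ms": {16: 1.0, 8: 0.9, 4: 0.8}}
print(plan([small, big], 24, 1.0, batch=16, mb_choices=(1, 2, 4), budget=2.0))
```

Exhaustive enumeration only works at this toy scale. An ILP with a heuristic fallback, as the paper describes, is what keeps the search tractable for per-layer bitwidths and larger clusters.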

Key Findings

LLM-PQ consistently increases token-generation throughput versus state-of-the-art baselines by selecting mixed precisions and phase-aware partitions.

Numbers: Up to 2.88× speed-up; 2.26× average speed-up (Table 4, multiple clusters).

Their latency cost model predicts mixed-precision per-shard latency accurately on unseen workloads.

Numbers: Average latency prediction error < 6% on 50 unseen workloads (Sec. 6.2).

A lightweight variance indicator captures layer sensitivity to weight-only quantization and helps keep model quality high.

Numbers: Perplexity matched or slightly improved versus baselines (e.g., PPL change −0.02 on OPT-66b, Table 6).
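This summary does not give the indicator's exact formula, so the sketch below uses a stand-in: the textbook rounding-noise variance of a uniform quantizer, step² / 12, computed from each layer's weight range. It is meant only to illustrate how a cheap variance-style score can rank layers. The function names and the toy weights are assumptions.

```python
import random

def quant_noise_variance(weights, bits):
    """Rounding-error variance of a uniform quantizer over the weight range: step^2 / 12."""
    step = (max(weights) - min(weights)) / (2**bits - 1)
    return step * step / 12

def rank_layers(layers, bits):
    """Return layer names from most to least sensitive under the proxy, plus raw scores."""
    scores = {name: quant_noise_variance(w, bits) for name, w in layers.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

random.seed(0)
# Toy layers: the wider the weight distribution, the larger the rounding noise.
layers = {
    "layer0": [random.gauss(0.0, 0.02) for _ in range(1000)],
    "layer1": [random.gauss(0.0, 0.10) for _ in range(1000)],
}
order, scores = rank_layers(layers, bits=4)
print(order)
```

Layers near the top of such a ranking would be the ones to keep at higher precision when the quality budget is tight.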

Solver and search overhead is moderate and manageable for offline planning.

Numbers: Average optimization time 18.38s; worst-case 115.98s (Table 10).

Results

Peak throughput improvement vs baselines

Value: Up to 2.88× (2.26× average across evaluated clusters)

Baseline: PipeEdge / Uniform / FlexGen family

Accuracy (latency cost model)

Value: <6% average error

Baseline: profiled runtimes

Accuracy (memory cost model)

Value: Negligible error (near-accurate)

Baseline: measured system memory

Model quality (perplexity)

Value: Maintained or slightly improved

Baseline: FP16 / uniform quantization baselines

Optimizer solve time (assigner)

Value: Average 18.38s; worst 115.98s

Baseline: Gurobi ILP runs and heuristics

Who Should Care

What To Try In 7 Days

Audit your cluster GPU mix and pick representative workloads (prompt length, gen length).

Clone the LLM-PQ repo and run the assigner profiler on one decoder layer per GPU type.

Use the cost model to generate a plan and run a short A/B test against your current serving setup (measure tokens/s and PPL).
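For the A/B test in the last step, the two numbers to compare can be computed as in the minimal sketch below. It assumes you can log the number of generated tokens, the wall-clock time, and per-token log-probabilities on a held-out text for each setup. The helper names and the sample figures are hypothetical and not part of the LLM-PQ repo.

```python
import math

def throughput_tokens_per_s(generated_tokens: int, wall_seconds: float) -> float:
    """Generated tokens divided by end-to-end wall-clock time."""
    return generated_tokens / wall_seconds

def perplexity(token_logprobs) -> float:
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def compare(baseline: dict, candidate: dict) -> dict:
    """Summarize an A/B run as a speed-up factor and a perplexity delta."""
    base_tp = throughput_tokens_per_s(baseline["tokens"], baseline["seconds"])
    cand_tp = throughput_tokens_per_s(candidate["tokens"], candidate["seconds"])
    return {
        "speedup": cand_tp / base_tp,
        "ppl_delta": perplexity(candidate["logprobs"]) - perplexity(baseline["logprobs"]),
    }

# Made-up measurements purely to show the call shape.
baseline = {"tokens": 10_000, "seconds": 100.0, "logprobs": [math.log(0.5)] * 8}
candidate = {"tokens": 10_000, "seconds": 50.0, "logprobs": [math.log(0.5)] * 8}
print(compare(baseline, candidate))
```

Run both setups on the same prompts and the same evaluation text so that the speed-up and the perplexity delta are comparable.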

Optimization Features

Token Efficiency

  • micro-batch sizing per phase

Infra Optimization

  • heterogeneous GPU utilization
  • device ordering enumeration

Model Optimization

  • mixed-precision quantization
  • weight-only quantization support

System Optimization

  • latency and memory cost models
  • variance-based layer sensitivity indicator
  • ILP + heuristic optimizer

Inference Optimization

  • phase-aware partition (prefill vs decode)
  • micro-batch scheduling
  • pipeline-parallel placement
  • on-the-fly quantized weight loading

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Designed for offline batched workloads with known prompt and generation lengths, not for open-ended online serving.
  • Requires per-GPU kernels for multiple precisions (INT3/4/8) and their performance characteristics.
  • ILP search can take minutes on large clusters; heuristics reduce but do not eliminate the overhead.
  • Does not include tensor-parallel search in the implemented prototype (discussion only).

When Not To Use

  • For unpredictable online workloads where prompt lengths vary widely and KV paging matters (vLLM-style online serving).
  • When all requests fit easily on a single high-memory GPU (no partition benefits).
  • If your cluster lacks mixed-precision kernel support or uses GPU types not covered in the study.

Failure Modes

  • OOM if KV cache dominates memory and quantization choices cannot free enough memory.
  • Heuristic may converge to suboptimal plans if its starting point is poor.
  • Uniform low-bit quantization may slow inference on GPUs where low-bit kernels are slower than FP16.
  • High solver overhead on very large device counts without grouping/heuristics.
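The first failure mode above can be checked with simple arithmetic before deploying a plan. The sketch below assumes standard multi-head attention with an FP16 KV cache (two tensors per layer, each hidden-size wide per token) and an OPT-30b-like shape (48 layers, hidden size 7168). The batch size and sequence lengths are hypothetical, and the helper names are illustrative.

```python
def kv_cache_gib(num_layers, hidden, batch, seq_len, bytes_per_value=2):
    """KV cache size: 2 (K and V) * layers * hidden * tokens * batch * bytes per value."""
    return 2 * num_layers * hidden * seq_len * batch * bytes_per_value / 2**30

def fits(weights_gib, kv_gib, capacity_gib):
    """True if weights plus KV cache stay within the given total GPU capacity."""
    return weights_gib + kv_gib <= capacity_gib

# Hypothetical workload: batch of 32, 512 prompt tokens + 100 generated tokens.
kv = kv_cache_gib(num_layers=48, hidden=7168, batch=32, seq_len=512 + 100)
print(round(kv, 1))  # 25.1

# Weight-only quantization shrinks the weights term but leaves kv untouched,
# so lowering bitwidths alone may not be enough to avoid OOM.
print(fits(weights_gib=14.0, kv_gib=kv, capacity_gib=32.0))  # False
```

When the cache term dominates, a smaller batch or shorter generation length is the lever that helps, since the weight-only bitwidth choices do not touch it.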

Core Entities

Models

  • OPT-13b
  • OPT-30b (also used in per-layer timing)
  • OPT-66b
  • OPT-175b
  • BLOOM-3b
  • BLOOM-176b

Metrics

  • Throughput (tokens/s)
  • Latency (s per batch)
  • Perplexity (PPL)
  • Memory usage (GPU bytes)
  • Latency prediction error (%)

Datasets

  • WikiText2
  • Penn Treebank (PTB)
  • C4
  • ShareGPT (prompt length distribution)

Benchmarks

  • Perplexity (PPL) on WikiText2/PTB/C4
  • Token generation throughput (tokens/s)
  • End-to-end serving latency