Serve large LLMs on mixed-GPU clusters with phase-aware partitioning and adaptive mixed-precision quantization

March 2, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The system is implemented and tested on 11 real clusters with code released; it targets offline batched workloads and requires offline planning time (seconds–minutes) and quantization kernels available on target GPUs.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu

Why It Matters For Business

If you run batched LLM workloads, LLM-PQ lets you serve them on a mix of low- and high-end GPUs, raising throughput by up to 2.88× (2.26× on average) over the paper's baselines and lowering cost while keeping model quality.

Summary TLDR

LLM-PQ is a system for running large decoder-only language models on heterogeneous GPU clusters. It jointly picks per-layer quantization bits, pipeline layer splits, and micro-batch sizes using a latency+memory cost model and a variance-based sensitivity indicator. Targeting offline batch workloads with known prompt/generation lengths, LLM-PQ reports up to 2.88× throughput (2.26× average) vs strong baselines while keeping or improving model quality. The authors release code and show their latency model predicts within ~6% on held-out workloads.

Problem Statement

Existing LLM serving systems assume uniform high-end GPUs or uniform quantization. In mixed GPU pools, splitting layers evenly and compressing every layer to the same bitwidth either wastes memory on large GPUs or causes out-of-memory (OOM) errors on small ones. Both choices also ignore the two-phase (prefill/decode) nature of generative LLM inference. The paper asks how to jointly choose layer placement, per-layer bitwidths, and micro-batch sizes so that throughput is maximized while a user-specified quality target is met on heterogeneous clusters.
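To make the problem concrete, here is a minimal sketch of why an even layer split breaks down on mixed GPUs. It is not LLM-PQ's planner: the GPU sizes, layer count, and per-layer size are made-up numbers, only weight memory is counted, and the memory-proportional split is a naive alternative shown purely for contrast.

```python
# Illustrative only: all sizes below are hypothetical, and only weights are counted.

def even_split(num_layers, num_gpus):
    """Assign layers as evenly as possible, ignoring GPU memory."""
    base, extra = divmod(num_layers, num_gpus)
    return [base + (1 if i < extra else 0) for i in range(num_gpus)]

def memory_proportional_split(num_layers, gpu_mem_gb):
    """Assign layers in proportion to GPU memory (largest-remainder rounding)."""
    total = sum(gpu_mem_gb)
    shares = [num_layers * m / total for m in gpu_mem_gb]
    counts = [int(s) for s in shares]
    # Hand leftover layers to the GPUs with the largest fractional share.
    leftover = num_layers - sum(counts)
    order = sorted(range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

def fits(counts, gpu_mem_gb, layer_gb):
    """Per GPU: do the assigned layers' weights fit in memory?"""
    return [c * layer_gb <= m for c, m in zip(counts, gpu_mem_gb)]

gpu_mem_gb = [16, 16, 40, 40]   # hypothetical: two small GPUs, two large GPUs
num_layers, layer_gb = 48, 1.5  # hypothetical layer count and per-layer weight size

even = even_split(num_layers, len(gpu_mem_gb))
prop = memory_proportional_split(num_layers, gpu_mem_gb)
print("even split:        ", even, fits(even, gpu_mem_gb, layer_gb))
print("proportional split:", prop, fits(prop, gpu_mem_gb, layer_gb))
```

With these numbers the even split overflows both small GPUs while leaving the large ones mostly empty. The proportional split fits, but it still treats every layer as the same size and ignores latency, which is the gap that per-layer bitwidths and phase-aware costs are meant to close.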

Main Contribution

A memory and latency cost model for mixed-precision, phase-aware LLM serving that predicts memory usage and per-shard latency.

Adaptive mixed-precision added to the search space plus a lightweight variance-based indicator to rank layer sensitivity to quantization.
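The summary does not give the formula for the variance-based indicator, so the sketch below is only a stand-in for the general idea. It assumes plain uniform round-to-nearest quantization and uses the textbook approximation that rounding to a grid of step `delta` adds noise with variance `delta**2 / 12`. The function names and the toy weights are hypothetical.

```python
# A stand-in sensitivity score, NOT the paper's indicator: it ranks layers by the
# approximate variance of the rounding noise that uniform quantization would add.

def quant_noise_variance(weights, bits):
    """Approximate rounding-noise variance for uniform quantization at `bits`."""
    levels = 2 ** bits - 1
    delta = (max(weights) - min(weights)) / levels  # grid step over the weight range
    return delta ** 2 / 12

def rank_layers(layers, bits):
    """Return layer names from most to least sensitive, plus the raw scores."""
    scores = {name: quant_noise_variance(w, bits) for name, w in layers.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Toy weights: a wider value range means a coarser grid at the same bitwidth.
layers = {
    "layer0": [-0.5, -0.1, 0.0, 0.2, 0.5],
    "layer1": [-2.0, -0.3, 0.1, 0.4, 2.0],
    "layer2": [-1.0, -0.2, 0.0, 0.3, 1.0],
}

order, scores = rank_layers(layers, bits=4)
print("most to least sensitive at 4 bits:", order)
```

A planner could use such a ranking to keep the most sensitive layers at higher precision and push the rest to lower bitwidths when memory is tight.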

Key Findings

LLM-PQ consistently increases token-generation throughput versus state-of-the-art baselines by selecting mixed precisions and phase-aware partitions.

Numbers: Up to 2.88× speed-up; 2.26× average speed-up (Table 4, multiple clusters).

Practical Use: If you run offline batched LLM workloads on a mixed-GPU pool, adopt LLM-PQ-style adaptive precision + phase-aware partitioning to roughly double throughput without sacrificing model quality.

Evidence Ref: Table 4; Sec. 6.3

Their latency cost model predicts mixed-precision per-shard latency accurately on unseen workloads.

Numbers: Average latency prediction error < 6% on 50 unseen workloads (Sec. 6.2).

Practical Use: Use the paper's profiling + linear-regression approach to estimate per-layer phase costs instead of exhaustively profiling every placement.

Evidence Ref: Sec. 6.2; Fig. 7
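The profiling + regression idea can be sketched as follows. This is a simplified illustration, not the released code: it assumes latency is linear in a single feature (tokens processed in the prefill phase), and every measurement below is a made-up number. The error metric is a mean absolute percentage error, which is one plausible reading of "average latency prediction error".

```python
# Fit latency = slope * tokens + intercept on a few profiled points, then check
# the prediction error on held-out points. All measurements are hypothetical.

def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_pct_error(predicted, actual):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)

# Profiled once per (GPU type, bitwidth, phase): tokens processed -> latency in ms.
train_tokens = [512, 1024, 2048, 4096]
train_ms = [11.0, 21.5, 41.0, 82.5]
slope, intercept = fit_line(train_tokens, train_ms)

# Held-out workloads that were not used for fitting.
test_tokens = [1536, 3072]
test_ms = [31.0, 62.0]
predicted = [slope * t + intercept for t in test_tokens]
error_pct = mean_abs_pct_error(predicted, test_ms)
print(f"slope={slope:.5f} ms/token, intercept={intercept:.3f} ms, error={error_pct:.2f}%")
```

One fitted line per GPU type, bitwidth, and phase keeps profiling cost proportional to the number of such combinations rather than to the number of candidate placements.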

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Peak throughput improvement vs baselines | Up to 2.88× (2.26× average across evaluated clusters) | PipeEdge / Uniform / FlexGen family | Up to +188% (peak) | 11 heterogeneous clusters (Table 4) | Table 4: LLM-PQ achieves up to 2.88× throughput improvement over state-of-the-art baselines. | Table 4; Sec. 6.3
Latency model accuracy | <6% average error | Profiled runtimes | - | 50 unseen workloads across BLOOM & OPT sizes | Sec. 6.2: average latency model error < 6%; Fig. 7 shows fidelity. | Sec. 6.2; Fig. 7
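The Delta figure for throughput is the speed-up factor restated as a percentage gain. A small check of that arithmetic (the helper name is hypothetical):

```python
def speedup_to_percent_gain(speedup):
    """Convert a speed-up factor (e.g. 2.88x) into a percentage gain over baseline."""
    return (speedup - 1.0) * 100.0

peak_gain = round(speedup_to_percent_gain(2.88))     # peak speed-up from Table 4
average_gain = round(speedup_to_percent_gain(2.26))  # average speed-up from Table 4
print(f"peak: +{peak_gain}%, average: +{average_gain}%")
```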

What To Try In 7 Days

Audit your cluster GPU mix and pick representative workloads (prompt length, gen length).

Clone the LLM-PQ repo and run the assigner profiler on one decoder layer per GPU type.

Use the cost model to generate a plan and run a short A/B test against your current serving setup, measuring throughput (tokens/s) and perplexity (PPL) for both.
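For the A/B test, the two metrics can be computed as in the sketch below. It assumes you can log the number of generated tokens, the wall-clock time, and per-token natural-log probabilities on an evaluation set for each setup. The run names and all numbers are placeholders.

```python
import math

def tokens_per_second(generated_tokens, wall_seconds):
    """Throughput: generated tokens divided by wall-clock time."""
    return generated_tokens / wall_seconds

def perplexity(token_log_probs):
    """PPL = exp(mean negative natural-log probability per token)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical measurements for the two arms of the A/B test.
runs = {
    "current": {"tokens": 32_000, "seconds": 80.0, "log_probs": [-2.30, -2.10, -2.50, -2.20]},
    "planned": {"tokens": 32_000, "seconds": 40.0, "log_probs": [-2.31, -2.12, -2.49, -2.21]},
}

report = {}
for name, run in runs.items():
    report[name] = {
        "tokens_per_s": tokens_per_second(run["tokens"], run["seconds"]),
        "ppl": perplexity(run["log_probs"]),
    }
    print(name, report[name])

speedup = report["planned"]["tokens_per_s"] / report["current"]["tokens_per_s"]
ppl_change = report["planned"]["ppl"] - report["current"]["ppl"]
print(f"speedup: {speedup:.2f}x, PPL change: {ppl_change:+.3f}")
```

Compare both arms on the same prompts and the same evaluation text, since throughput and perplexity both depend on prompt and generation lengths.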

Optimization Features

Token Efficiency: micro-batch sizing per phase

Infra Optimization: heterogeneous GPU utilization; device ordering enumeration

Model Optimization: mixed-precision quantization; weight-only quantization support

System Optimization: latency and memory cost models; variance-based layer sensitivity indicator; ILP + heuristic optimizer

Inference Optimization: phase-aware partition (prefill vs decode); micro-batch scheduling; pipeline-parallel placement; on-the-fly quantized weight loading

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Designed for offline batched workloads with known prompt and generation lengths, not for open-ended online serving.

Requires quantization kernels for multiple precisions (INT3/4/8) on every GPU type in the cluster, along with profiled performance characteristics for each.

When Not To Use

For unpredictable online workloads where prompt lengths vary widely and KV-cache paging matters (vLLM-style online serving).

When all requests fit easily on a single high-memory GPU (no partition benefits).

Failure Modes

OOM if KV cache dominates memory and quantization choices cannot free enough memory.

Heuristic may converge to suboptimal plans if its starting point is poor.
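The OOM failure mode can be checked with back-of-the-envelope arithmetic before any planning. The sketch below assumes a standard decoder-only KV cache (one key and one value vector per layer per token, stored in FP16) and pools all memory into a single budget. The model shape, parameter count, workload, and budget are hypothetical.

```python
# Rough memory check: KV cache plus quantized weights against a memory budget.
# All configuration values are hypothetical.

GIB = 1024 ** 3

def kv_cache_bytes(num_layers, hidden_size, batch_size, seq_len, bytes_per_elem=2):
    """Keys and values (factor 2) for every layer, sequence, and token position."""
    return 2 * num_layers * hidden_size * batch_size * seq_len * bytes_per_elem

def weight_bytes(num_params, bits):
    """Weight memory when every parameter is stored at `bits` bits."""
    return num_params * bits // 8

num_layers, hidden_size = 48, 7168   # hypothetical model shape
num_params = 30_000_000_000          # hypothetical parameter count
seq_len = 512 + 100                  # prompt length + generation length
budget = 32 * GIB                    # hypothetical pooled GPU memory

for batch_size in (32, 16):
    kv = kv_cache_bytes(num_layers, hidden_size, batch_size, seq_len)
    for bits in (8, 4, 3):
        total = kv + weight_bytes(num_params, bits)
        status = "fits" if total <= budget else "OOM"
        print(f"batch={batch_size} bits={bits}: KV={kv / GIB:.1f} GiB, "
              f"total={total / GIB:.1f} GiB -> {status}")
```

With these numbers, a batch of 32 runs out of memory even at 3 bits because the KV cache alone takes most of the budget, and halving the batch is what makes 4-bit and 3-bit weights fit. This is the situation in which lowering bitwidths further cannot help.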

Core Entities

Models

OPT-13b; OPT-30b (also used in per-layer timing); OPT-66b; OPT-175b; BLOOM-3b; BLOOM-176b

Metrics

Throughput (tokens/s); Latency (s per batch); Perplexity (PPL); Memory usage (GPU bytes); Latency prediction error (%)

Datasets

WikiText2; Penn Treebank (PTB); C4; ShareGPT (prompt length distribution)

Benchmarks

Perplexity (PPL) on WikiText2/PTB/C4; Token generation throughput (tokens/s); End-to-end serving latency