Serve large LLMs on mixed-GPU clusters with phase-aware partitioning and adaptive mixed-precision quantization

Overview

Decision Snapshot: Needs Validation

The system is implemented and tested on 11 real clusters with code released; it targets offline batched workloads and requires offline planning time (seconds–minutes) and quantization kernels available on target GPUs.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu

Links

Abstract / PDF / Code

Why It Matters For Business

If you run batched LLM workloads, LLM-PQ lets you use mixed low- and high-end GPUs together to significantly raise throughput and lower cost while keeping model quality.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager

Summary TLDR

LLM-PQ is a system for running large decoder-only language models on heterogeneous GPU clusters. It jointly picks per-layer quantization bits, pipeline layer splits, and micro-batch sizes using a latency+memory cost model and a variance-based sensitivity indicator. Targeting offline batch workloads with known prompt/generation lengths, LLM-PQ reports up to 2.88× throughput (2.26× average) vs strong baselines while keeping or improving model quality. The authors release code and show their latency model predicts within ~6% on held-out workloads.

Problem Statement

Existing LLM serving systems assume uniform high-end GPUs or uniform quantization. In a mixed GPU pool, an even layer partition with uniform compression either wastes memory on large GPUs or causes out-of-memory (OOM) errors on small ones, and it ignores the two-phase (prefill/decode) nature of generative LLM inference. The paper asks: how can layer placement, per-layer bitwidths, and micro-batch sizes be chosen jointly to maximize throughput while meeting a user quality target on heterogeneous clusters?

Main Contribution

A memory and latency cost model for mixed-precision, phase-aware LLM serving that predicts memory usage and per-shard latency.

Adaptive mixed-precision added to the search space plus a lightweight variance-based indicator to rank layer sensitivity to quantization.
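To make the idea of a variance-based sensitivity ranking concrete, here is a minimal sketch. It is not the paper's indicator: it assumes symmetric uniform round-to-nearest quantization and uses the textbook approximation that rounding noise has variance step²/12 for a quantizer step size `step`. The function names and the toy weights are hypothetical.

```python
import random

def quant_step(weights, bits):
    """Step size of a symmetric uniform quantizer that covers the weight range."""
    max_abs = max(abs(w) for w in weights)
    levels = 2 ** (bits - 1) - 1  # e.g. 7 positive levels for 4-bit
    return max_abs / levels

def rounding_variance(weights, bits):
    """Textbook approximation: uniform rounding noise has variance step^2 / 12."""
    step = quant_step(weights, bits)
    return step * step / 12.0

def rank_layers(layers, bits):
    """Order layer names from largest to smallest estimated quantization noise."""
    scores = {name: rounding_variance(w, bits) for name, w in layers.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy layers (assumption): Gaussian weights with different spreads.
random.seed(0)
toy_layers = {
    "layer_0": [random.gauss(0.0, 0.02) for _ in range(1000)],
    "layer_1": [random.gauss(0.0, 0.10) for _ in range(1000)],  # widest range
    "layer_2": [random.gauss(0.0, 0.05) for _ in range(1000)],
}
ranking = rank_layers(toy_layers, bits=4)
print(ranking)
```

A planner could use such a ranking to keep the noisiest layers at higher precision and push the rest to fewer bits. A real indicator would also need to account for how the noise propagates to the model output, which this sketch ignores.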

Key Findings

LLM-PQ consistently increases token-generation throughput versus state-of-the-art baselines by selecting mixed precisions and phase-aware partitions.

Numbers: Up to 2.88× speed-up; 2.26× average speed-up (Table 4, multiple clusters).

Practical Use: If you run offline batched LLM workloads on a mixed-GPU pool, adopt LLM-PQ-style adaptive precision + phase-aware partitioning to roughly double throughput without sacrificing model quality.

Evidence Ref: Table 4; Sec. 6.3

Their latency cost model predicts mixed-precision per-shard latency accurately on unseen workloads.

Numbers: Average latency prediction error < 6% on 50 unseen workloads (Sec. 6.2).

Practical Use: Use the paper's profiling + linear-regression approach to estimate per-layer phase costs instead of exhaustively profiling every placement.

Evidence Ref: Sec. 6.2; Fig. 7
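The practical advice in the second finding (profile a few points, then fit a linear model) can be sketched as follows. This is an illustration under stated assumptions, not the released profiler: the single feature (batch size × prompt length), the sample timings, and all function names are hypothetical, and the fit is a plain one-variable least squares.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

def mean_abs_pct_error(model, xs, ys):
    """Average relative error of the fitted line, in percent."""
    errors = [abs(predict(model, x) - y) / y for x, y in zip(xs, ys)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical profiled samples for one layer, one bitwidth, one GPU type:
# x = batch_size * prompt_len (tokens), y = measured prefill latency in ms.
train_tokens = [512, 1024, 2048, 4096]
train_ms = [1.1, 2.0, 4.1, 8.0]
model = fit_line(train_tokens, train_ms)

# Hypothetical held-out measurements used to check the fit.
held_out_tokens = [1536, 3072]
held_out_ms = [3.0, 6.1]
error_pct = mean_abs_pct_error(model, held_out_tokens, held_out_ms)
print(f"slope={model[0]:.5f} ms/token intercept={model[1]:.3f} ms error={error_pct:.1f}%")
```

One model per (GPU type, bitwidth, phase) keeps profiling cost proportional to the number of device and precision combinations rather than the number of candidate placements.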

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Peak throughput improvement vs baselines	Up to 2.88× (2.26× average across evaluated clusters)	PipeEdge / Uniform / FlexGen family	Up to +188% (peak)	11 heterogeneous clusters (Table 4)	Table 4: LLM-PQ achieves up to 2.88× throughput improvement over state-of-the-art baselines.	Table 4; Sec. 6.3
Latency model prediction error	<6% average error	profiled runtimes	n/a	50 unseen workloads across BLOOM & OPT sizes	Sec. 6.2: average latency model error < 6%; Fig. 7 shows fidelity.	Sec. 6.2; Fig. 7

What To Try In 7 Days

Audit your cluster GPU mix and pick representative workloads (prompt length, gen length).

Clone the LLM-PQ repo and run the assigner profiler on one decoder layer per GPU type.

Use the cost model to generate a plan and run a short A/B test vs your current serving setup (measure token/s and PPL).
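For the A/B test in the last step, the two metrics can be computed as in the sketch below. The measurements are made up for illustration, and the helper names are hypothetical; perplexity is computed from natural-log token probabilities as exp of the mean negative log-likelihood.

```python
import math

def tokens_per_second(generated_tokens, wall_seconds):
    """Throughput over a whole run."""
    return generated_tokens / wall_seconds

def perplexity(token_log_probs):
    """PPL = exp(-mean log-likelihood), using natural-log token probabilities."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

def compare(baseline, candidate):
    """Return (throughput speed-up, perplexity change) of candidate vs baseline."""
    speedup = candidate["tok_s"] / baseline["tok_s"]
    ppl_delta = candidate["ppl"] - baseline["ppl"]
    return speedup, ppl_delta

# Hypothetical measurements from two serving setups on the same workload.
baseline = {
    "tok_s": tokens_per_second(12_000, 60.0),
    "ppl": perplexity([-2.30, -2.35, -2.25]),
}
candidate = {
    "tok_s": tokens_per_second(24_000, 60.0),
    "ppl": perplexity([-2.31, -2.36, -2.26]),
}
speedup, ppl_delta = compare(baseline, candidate)
print(f"speed-up: {speedup:.2f}x, PPL change: {ppl_delta:+.3f}")
```

Running both setups on the same prompts and generation lengths keeps the comparison fair, since throughput in this setting depends strongly on both.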

Optimization Features

Token Efficiency

micro-batch sizing per phase

Infra Optimization

heterogeneous GPU utilization, device ordering enumeration

Model Optimization

mixed-precision quantization, weight-only quantization support

System Optimization

latency and memory cost models, variance-based layer sensitivity indicator, ILP + heuristic optimizer

Inference Optimization

phase-aware partition (prefill vs decode), micro-batch scheduling, pipeline-parallel placement, on-the-fly quantized weight loading
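How these features interact can be illustrated with a toy planner. This is a brute-force sketch, not the paper's ILP or heuristic: it assumes two pipeline stages, one bitwidth per stage rather than per layer, a simplified latency model in which the slowest stage bounds each phase, and it ignores KV cache memory and micro-batch sizing. Every number and name below is hypothetical.

```python
from itertools import product

NUM_LAYERS = 8
GEN_LEN = 100  # decode steps per batch (assumed known, as in offline workloads)
BITS = (16, 8, 4)

# Hypothetical per-layer costs: gpu -> bits -> (prefill_us, decode_us, mem_mb).
# Fewer bits: slightly slower prefill (dequantization), faster decode, less memory.
COSTS = {
    "big": {16: (4000, 400, 3000), 8: (4400, 300, 1500), 4: (4800, 250, 750)},
    "small": {16: (9000, 900, 3000), 8: (9900, 700, 1500), 4: (10800, 550, 750)},
}
CAPACITY_MB = {"big": 16000, "small": 6000}
PENALTY = {16: 0, 8: 1, 4: 10}  # hypothetical per-layer quality penalty
QUALITY_BUDGET = 40

def plan_cost(split, bits_big, bits_small):
    """Return (latency_us, penalty) for one batch, or None if infeasible."""
    stages = (("big", split, bits_big), ("small", NUM_LAYERS - split, bits_small))
    prefill, decode, penalty = [], [], 0
    for gpu, n_layers, bits in stages:
        prefill_us, decode_us, mem_mb = COSTS[gpu][bits]
        if n_layers * mem_mb > CAPACITY_MB[gpu]:
            return None  # weights alone would not fit on this GPU
        prefill.append(n_layers * prefill_us)
        decode.append(n_layers * decode_us)
        penalty += n_layers * PENALTY[bits]
    if penalty > QUALITY_BUDGET:
        return None  # violates the quality target
    # Phase-aware objective: one prefill pass plus GEN_LEN decode steps.
    return max(prefill) + GEN_LEN * max(decode), penalty

def best_plan(uniform_only=False):
    """Exhaustively search splits and bitwidths; ties break toward lower penalty."""
    candidates = []
    for split, b_big, b_small in product(range(1, NUM_LAYERS), BITS, BITS):
        if uniform_only and b_big != b_small:
            continue
        result = plan_cost(split, b_big, b_small)
        if result is not None:
            latency_us, penalty = result
            candidates.append((latency_us, penalty, split, b_big, b_small))
    return min(candidates)

print("mixed precision:", best_plan())
print("uniform only:   ", best_plan(uniform_only=True))
```

With these toy numbers, an even 4/4 split at 16-bit does not fit on the small GPU, and allowing different bitwidths per stage finds a faster feasible plan than any uniform setting. Integer microseconds and megabytes are used so that ties compare exactly.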

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/tonyzhao-jt/LLM-PQ

Risks & Boundaries

Limitations

Designed for offline batched workloads with known prompt and generation lengths, not for open-ended online serving.

Requires per-GPU kernels for multiple precisions (INT3/4/8) and their performance characteristics.

When Not To Use

For unpredictable online workloads where prompt lengths vary widely and KV paging matters (vLLM-style online serving).

When all requests fit easily on a single high-memory GPU (no partition benefits).

Failure Modes

OOM if KV cache dominates memory and quantization choices cannot free enough memory.

Heuristic may converge to suboptimal plans if its starting point is poor.
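The first failure mode above can be checked before deployment with a back-of-the-envelope memory estimate. The sketch below assumes a standard multi-head-attention decoder with a 16-bit KV cache, approximates each layer as 12 × hidden² parameters, and ignores activations and workspace buffers. The shard shape, capacity, and function names are hypothetical.

```python
GIB = 2 ** 30

def weight_bytes(params_per_layer, bits_per_layer):
    """Weight storage for a shard whose layers may use different bitwidths."""
    return sum(params_per_layer * bits // 8 for bits in bits_per_layer)

def kv_cache_bytes(n_layers, batch, seq_len, hidden, bytes_per_elem=2):
    """K and V tensors: 2 * layers * batch * seq_len * hidden elements."""
    return 2 * n_layers * batch * seq_len * hidden * bytes_per_elem

def fits(capacity_bytes, params_per_layer, bits_per_layer, batch, seq_len, hidden):
    """Return (fits?, bytes needed) for weights plus KV cache on one GPU."""
    n_layers = len(bits_per_layer)
    needed = weight_bytes(params_per_layer, bits_per_layer)
    needed += kv_cache_bytes(n_layers, batch, seq_len, hidden)
    return needed <= capacity_bytes, needed

# Hypothetical shard: 16 layers, hidden size 4096, on a 16 GiB GPU.
HIDDEN = 4096
PARAMS_PER_LAYER = 12 * HIDDEN * HIDDEN  # rough estimate (assumption)

# seq_len is prompt length plus generation length.
ok_4bit, need_4bit = fits(16 * GIB, PARAMS_PER_LAYER, [4] * 16,
                          batch=32, seq_len=2048, hidden=HIDDEN)
ok_small_batch, need_small = fits(16 * GIB, PARAMS_PER_LAYER, [16] * 16,
                                  batch=16, seq_len=2048, hidden=HIDDEN)
print(ok_4bit, need_4bit / GIB)
print(ok_small_batch, need_small / GIB)
```

In this example the KV cache alone fills the GPU at batch 32, so even 4-bit weights do not fit, while halving the batch fits at 16-bit. When the cache dominates, reducing the micro-batch size or moving layers to another device helps more than lowering the weight precision.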

Core Entities

Models

OPT-13b, OPT-30b (also used in per-layer timing), OPT-66b, OPT-175b, BLOOM-3b, BLOOM-176b

Metrics

Throughput (token/s), Latency (s per batch), Perplexity (PPL), Memory usage (GPU bytes), Latency prediction error (%)

Datasets

WikiText2, Penn Treebank (PTB), C4, ShareGPT (prompt length distribution)

Benchmarks

Perplexity (PPL) on WikiText2/PTB/C4, Token generation throughput (tokens/s), End-to-end serving latency
