Overview
The system is implemented and evaluated on 11 real heterogeneous clusters, and the code is released. It targets offline batched workloads, needs offline planning time (seconds to minutes), and requires quantization kernels to be available on the target GPUs.
Citations: 3
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
If you run batched LLM workloads, LLM-PQ lets you combine low- and high-end GPUs in one cluster, raising throughput (up to 2.88×, 2.26× on average in the reported evaluation) and lowering cost while maintaining model quality.
Summary TLDR
LLM-PQ is a system for running large decoder-only language models on heterogeneous GPU clusters. It jointly picks per-layer quantization bits, pipeline layer splits, and micro-batch sizes using a latency+memory cost model and a variance-based sensitivity indicator. Targeting offline batch workloads with known prompt/generation lengths, LLM-PQ reports up to 2.88× throughput (2.26× average) vs strong baselines while keeping or improving model quality. The authors release code and show their latency model predicts within ~6% on held-out workloads.
Problem Statement
Existing LLM serving systems assume uniform high-end GPUs or uniform quantization. In mixed GPU pools, an even layer partition combined with uniform compression either wastes memory on large GPUs or causes out-of-memory (OOM) errors on small ones. These approaches also ignore the two-phase (prefill/decode) nature of generative LLM inference. The paper asks: how can layer placement, per-layer bitwidths, and micro-batch sizes be chosen jointly to maximize throughput on heterogeneous clusters while meeting a user-specified quality target?
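To make the partitioning problem concrete, here is a minimal sketch of why an even layer split hurts on mixed hardware: a pipeline's steady-state rate is bounded by its slowest stage. The per-layer latencies and the helper names `stage_times` and `bottleneck_ms` are hypothetical and are not taken from LLM-PQ. The sketch also ignores memory limits, communication, and the prefill/decode distinction.

```python
def stage_times(layers_per_stage, per_layer_ms):
    """Per-stage latency: layer count times that GPU's per-layer latency."""
    return [n * t for n, t in zip(layers_per_stage, per_layer_ms)]


def bottleneck_ms(layers_per_stage, per_layer_ms):
    """The slowest stage bounds the pipeline's steady-state rate."""
    return max(stage_times(layers_per_stage, per_layer_ms))


# Hypothetical per-layer latencies (ms): one slow GPU, one fast GPU.
per_layer_ms = [4.0, 1.0]

even_split = [20, 20]    # 40 layers divided evenly
skewed_split = [8, 32]   # more layers placed on the fast GPU

even_bottleneck = bottleneck_ms(even_split, per_layer_ms)
skewed_bottleneck = bottleneck_ms(skewed_split, per_layer_ms)

print(f"even split bottleneck:   {even_bottleneck} ms")
print(f"skewed split bottleneck: {skewed_bottleneck} ms")
```

In this toy setting the skewed split balances the two stages, but it only works if the fast GPU can hold 32 layers in memory. That memory constraint is what makes per-layer bitwidth part of the same decision.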
Main Contribution
A memory and latency cost model for mixed-precision, phase-aware LLM serving that predicts memory usage and per-shard latency.
Adaptive mixed-precision added to the search space plus a lightweight variance-based indicator to rank layer sensitivity to quantization.
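As a rough illustration of what a mixed-precision memory model has to account for, the sketch below estimates weight and KV-cache memory for one pipeline shard. It is not the paper's cost model. It assumes standard multi-head attention with an FP16 KV cache, about 12·hidden² parameters per decoder layer, and no activation memory or quantization metadata. All function names and numbers are hypothetical.

```python
def weight_bytes(params_per_layer, bits_per_layer):
    """Weight memory for a shard whose layers may use different bitwidths."""
    return sum(params_per_layer * bits / 8 for bits in bits_per_layer)


def kv_cache_bytes(num_layers, batch, prompt_len, gen_len, hidden, kv_bytes=2):
    """K and V tensors per layer, each of shape batch x seq_len x hidden."""
    return 2 * num_layers * batch * (prompt_len + gen_len) * hidden * kv_bytes


def fits(bits_per_layer, params_per_layer, batch, prompt_len, gen_len,
         hidden, gpu_bytes):
    """True if the estimated shard memory is within the GPU budget."""
    total = weight_bytes(params_per_layer, bits_per_layer) + kv_cache_bytes(
        len(bits_per_layer), batch, prompt_len, gen_len, hidden)
    return total <= gpu_bytes


hidden = 4096
params_per_layer = 12 * hidden * hidden   # rough decoder-layer estimate
budget = 4 * 1024**3                      # hypothetical 4 GiB budget

fp16_plan = [16] * 10                     # ten layers, all FP16
mixed_plan = [8] * 5 + [4] * 5            # five INT8 layers, five INT4 layers

workload = dict(batch=8, prompt_len=512, gen_len=100, hidden=hidden)
fp16_fits = fits(fp16_plan, params_per_layer, gpu_bytes=budget, **workload)
mixed_fits = fits(mixed_plan, params_per_layer, gpu_bytes=budget, **workload)

print("FP16 plan fits: ", fp16_fits)
print("Mixed plan fits:", mixed_fits)
```

Because prompt and generation lengths appear directly in the KV-cache term, this kind of estimate depends on knowing them in advance, which matches the offline-workload assumption stated in the summary.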
Key Findings
LLM-PQ consistently increases token-generation throughput versus state-of-the-art baselines by selecting mixed precisions and phase-aware partitions.
Their latency cost model predicts mixed-precision per-shard latency accurately on unseen workloads.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Peak throughput improvement vs baselines | Up to 2.88× (2.26× average across evaluated clusters) | PipeEdge / Uniform / FlexGen family | Up to +188% (peak) | 11 heterogeneous clusters (Table 4) | Table 4: LLM-PQ achieves up to 2.88× throughput improvement over state-of-the-art baselines. | Table 4; Sec. 6.3 |
| Latency cost-model prediction error | <6% average error | Profiled runtimes (ground truth) | n/a | 50 unseen workloads across BLOOM & OPT sizes | Sec. 6.2: average latency model error < 6%; Fig. 7 shows fidelity. | Sec. 6.2; Fig. 7 |
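
The Delta column expresses a speedup factor as a percentage gain. A small sketch of that conversion, using the two throughput figures from the table (the helper name `speedup_to_percent` is ours):

```python
def speedup_to_percent(speedup):
    """Convert a speedup factor (e.g. 2.88x) to a percentage improvement."""
    return (speedup - 1.0) * 100.0


peak_gain = speedup_to_percent(2.88)   # reported peak speedup
avg_gain = speedup_to_percent(2.26)    # reported average speedup

print(f"peak:    +{peak_gain:.0f}%")
print(f"average: +{avg_gain:.0f}%")
```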
What To Try In 7 Days
Audit your cluster GPU mix and pick representative workloads (prompt length, gen length).
Clone the LLM-PQ repo and run the assigner profiler on one decoder layer per GPU type.
Use the cost model to generate a plan, then run a short A/B test against your current serving setup, measuring throughput (tokens/s) and perplexity (PPL).
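For the A/B test, the two metrics can be computed from quantities most serving stacks already log. The sketch below assumes you can record generated-token counts, wall-clock time, and per-token natural-log probabilities on a held-out text. The function names and the example numbers are hypothetical and are not part of the LLM-PQ repo.

```python
import math


def tokens_per_second(num_generated_tokens, wall_seconds):
    """Throughput over one batch run."""
    return num_generated_tokens / wall_seconds


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural-log probabilities)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical measurements for the same workload on both setups.
baseline_tps = tokens_per_second(12000, 60.0)
candidate_tps = tokens_per_second(12000, 40.0)
speedup = candidate_tps / baseline_tps

# Toy log-probabilities: every token assigned probability 0.25.
toy_ppl = perplexity([math.log(0.25)] * 4)

print(f"baseline:  {baseline_tps:.1f} tok/s")
print(f"candidate: {candidate_tps:.1f} tok/s ({speedup:.2f}x)")
print(f"toy PPL:   {toy_ppl:.2f}")
```

Run both setups on identical prompts and generation lengths, and compare PPL on the same evaluation text, so that a throughput gain is not bought with an unnoticed quality drop.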
Risks & Boundaries
Limitations
Designed for offline batched workloads with known prompt and generation lengths, not for open-ended online serving.
Requires kernels for multiple precisions (INT3/4/8) on each GPU type, along with profiles of their performance characteristics.
When Not To Use
For unpredictable online workloads where prompt lengths vary widely and KV-cache paging matters (vLLM-style online serving).
When all requests fit easily on a single high-memory GPU (no partition benefits).
Failure Modes
OOM if KV cache dominates memory and quantization choices cannot free enough memory.
The search heuristic may converge to a suboptimal plan if its starting point is poor.

