Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
3
Why It Matters For Business
If you run batched LLM workloads, LLM-PQ lets you combine low-end and high-end GPUs in one serving cluster, raising throughput (the paper reports up to 2.88×, 2.26× on average, over baselines) and lowering cost while preserving model quality.
Summary TLDR
LLM-PQ is a system for running large decoder-only language models on heterogeneous GPU clusters. It jointly picks per-layer quantization bits, pipeline layer splits, and micro-batch sizes using a latency and memory cost model and a variance-based sensitivity indicator. Targeting offline batch workloads with known prompt and generation lengths, LLM-PQ reports up to 2.88× throughput (2.26× on average) over strong baselines while keeping or improving model quality. The authors release code and show that their latency model predicts latency within ~6% on held-out workloads.
Problem Statement
Existing LLM serving systems assume homogeneous high-end GPUs or uniform quantization. In mixed GPU pools, splitting layers evenly and compressing them uniformly either wastes memory on large GPUs or causes out-of-memory (OOM) errors on small ones, and it ignores the two-phase (prefill/decode) nature of generative LLM inference. The paper asks how to jointly choose layer placement, per-layer bitwidths, and micro-batch sizes to maximize throughput while meeting a user quality target on heterogeneous clusters.
Main Contribution
A memory and latency cost model for mixed-precision, phase-aware LLM serving that predicts memory usage and per-shard latency.
Adaptive mixed-precision added to the search space plus a lightweight variance-based indicator to rank layer sensitivity to quantization.
An optimizer that enumerates device orderings and micro-batches, then solves an ILP (with a practical heuristic) to assign layer partitions and bitwidths.
A prototype runtime with on-the-fly quantized loading, a thread-safe micro-batch scheduler, and experiments on 11 real clusters; code published on GitHub.
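The joint search described above can be illustrated with a toy planner. This is a minimal sketch, not the paper's optimizer: it swaps the ILP for exhaustive enumeration, assigns one bitwidth per pipeline stage instead of per layer, and omits micro-batch sizing. The device table, penalty values, layer size, and latency formula are all hypothetical.

```python
from itertools import combinations, permutations, product

# Hypothetical device table: memory in GB and relative speed (higher is faster).
DEVICES = {"big": {"mem_gb": 40.0, "speed": 2.0},
           "small": {"mem_gb": 16.0, "speed": 1.0}}
BITS = (4, 8, 16)
# Hypothetical per-layer quality penalty per bitwidth (stand-in for a sensitivity score).
PENALTY = {16: 0, 8: 1, 4: 5}
LAYER_PARAMS = 0.6e9  # assumed parameter count per decoder layer


def layer_mem_gb(bits):
    """Weight memory of one layer at the given bitwidth, in GB."""
    return LAYER_PARAMS * bits / 8 / 1e9


def stage_latency(n_layers, bits, speed):
    """Toy model: time scales with weight bytes read, divided by device speed."""
    return n_layers * (bits / 16) / speed


def contiguous_splits(num_layers, num_stages):
    """Yield the layer count per stage for every contiguous partition."""
    for cuts in combinations(range(1, num_layers), num_stages - 1):
        bounds = (0, *cuts, num_layers)
        yield [bounds[i + 1] - bounds[i] for i in range(num_stages)]


def plan(num_layers, device_names, penalty_budget):
    """Return the feasible plan with the smallest bottleneck stage latency."""
    best = None
    for order in permutations(device_names):
        for counts in contiguous_splits(num_layers, len(order)):
            for bits in product(BITS, repeat=len(order)):
                stages = list(zip(order, counts, bits))
                # Skip plans whose weights do not fit on some device.
                if any(c * layer_mem_gb(b) > DEVICES[d]["mem_gb"] for d, c, b in stages):
                    continue
                penalty = sum(c * PENALTY[b] for _, c, b in stages)
                if penalty > penalty_budget:
                    continue
                bottleneck = max(stage_latency(c, b, DEVICES[d]["speed"])
                                 for d, c, b in stages)
                key = (bottleneck, penalty)
                if best is None or key < (best["bottleneck"], best["penalty"]):
                    best = {"bottleneck": bottleneck, "penalty": penalty, "stages": stages}
    return best


if __name__ == "__main__":
    print(plan(8, ["big", "small"], penalty_budget=0))
    print(plan(8, ["big", "small"], penalty_budget=20))
```

With a zero penalty budget the toy planner must keep every layer at 16 bits; loosening the budget lets it trade precision on the slower device for a shorter bottleneck stage, which is the kind of trade-off the joint search is meant to expose.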
Key Findings
LLM-PQ consistently increases token-generation throughput versus state-of-the-art baselines by selecting mixed precisions and phase-aware partitions.
The latency cost model predicts mixed-precision per-shard latency within about 6% on unseen workloads.
A lightweight variance indicator captures layer sensitivity to weight-only quantization and helps keep model quality high.
Solver and search overhead is manageable for offline planning: the ILP search can take up to minutes on large clusters, and a heuristic reduces this further.
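This summary does not give the formula for the variance-based indicator, so the following is only one plausible proxy, stated as an assumption: score each layer by the variance of its round-to-nearest quantization error at a given bitwidth, and treat higher scores as more sensitive. The function names and the synthetic weights are hypothetical.

```python
import random
import statistics


def quantize(weights, bits):
    """Symmetric uniform round-to-nearest quantization, returned as floats."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / levels
    if scale == 0:
        return list(weights)
    return [round(w / scale) * scale for w in weights]


def error_variance(weights, bits):
    """Variance of the quantization error (assumed sensitivity proxy)."""
    quantized = quantize(weights, bits)
    return statistics.pvariance([w - q for w, q in zip(weights, quantized)])


def rank_layers(layers, bits):
    """Return layer names ordered from most to least sensitive under the proxy."""
    scores = {name: error_variance(w, bits) for name, w in layers.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins for layer weights with different spreads.
    layers = {
        "narrow": [rng.gauss(0.0, 0.01) for _ in range(1000)],
        "wide": [rng.gauss(0.0, 1.0) for _ in range(1000)],
    }
    print(rank_layers(layers, bits=4))
```

A ranking like this could feed a planner's quality budget: layers at the top keep higher precision, layers at the bottom are candidates for lower bitwidths.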
Results
Peak throughput improvement vs baselines
Accuracy
Model quality (perplexity)
Optimizer solve time (assigner)
Who Should Care
What To Try In 7 Days
Audit your cluster GPU mix and pick representative workloads (prompt length, gen length).
Clone the LLM-PQ repo and run the assigner profiler on one decoder layer per GPU type.
Use the cost model to generate a plan, then run a short A/B test against your current serving setup, measuring throughput (tokens/s) and perplexity (PPL).
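For the A/B test, the two metrics can be computed from quantities most serving stacks already log. A minimal sketch, assuming you have the generated-token count, wall-clock time, and per-token natural-log probabilities for each run; the helper names are hypothetical and not part of the LLM-PQ repo.

```python
import math


def throughput_tokens_per_s(generated_tokens, wall_seconds):
    """Token-generation throughput for one run."""
    return generated_tokens / wall_seconds


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def speedup(candidate_tps, baseline_tps):
    """Ratio of candidate throughput to baseline throughput."""
    return candidate_tps / baseline_tps


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measured results.
    baseline = throughput_tokens_per_s(generated_tokens=1000, wall_seconds=4.0)
    candidate = throughput_tokens_per_s(generated_tokens=1000, wall_seconds=2.0)
    print(speedup(candidate, baseline))
    print(perplexity([math.log(0.25)] * 4))
```

Compare perplexity on the same evaluation text for both setups, so any difference reflects the quantization plan and not the data.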
Optimization Features
Token Efficiency
- micro-batch sizing per phase
Infra Optimization
- heterogeneous GPU utilization
- device ordering enumeration
Model Optimization
- mixed-precision quantization
- weight-only quantization support
System Optimization
- latency and memory cost models
- variance-based layer sensitivity indicator
- ILP + heuristic optimizer
Inference Optimization
- phase-aware partition (prefill vs decode)
- micro-batch scheduling
- pipeline-parallel placement
- on-the-fly quantized weight loading
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Designed for offline batched workloads with known prompt and generation lengths, not for open-ended online serving.
- Requires per-GPU kernels for multiple precisions (INT3/4/8) and their performance characteristics.
- ILP search can take up to minutes on large clusters; heuristics reduce but do not eliminate overhead.
- The implemented prototype does not search over tensor parallelism; the paper only discusses it.
When Not To Use
- For unpredictable online workloads where prompt lengths vary widely and KV paging matters (vLLM-style online serving).
- When all requests fit easily on a single high-memory GPU (no partition benefits).
- If your cluster lacks mixed-precision kernel support or uses GPU types not covered in the study.
Failure Modes
- OOM if KV cache dominates memory and quantization choices cannot free enough memory.
- Heuristic may converge to suboptimal plans if its starting point is poor.
- Uniform low-bit quantization may slow inference on GPUs where low-bit kernels are slower than FP16.
- High solver overhead on very large device counts without grouping/heuristics.
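The first failure mode can be checked with back-of-the-envelope arithmetic before running anything. The sketch below assumes standard multi-head attention (one K and one V tensor of width `hidden_size` per layer), FP16 KV entries, and roughly 12·h² parameters per decoder layer; it ignores activations and workspace memory. The shard shape and GPU size are illustrative, and the function names are hypothetical.

```python
def weight_bytes(num_params, bits):
    """Bytes needed to store the weights at the given bitwidth."""
    return num_params * bits // 8


def kv_cache_bytes(num_layers, hidden_size, batch_size, seq_len, bytes_per_elem=2):
    """K and V tensors per layer, each of shape batch x seq_len x hidden."""
    return 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_elem


def fits(gpu_bytes, num_layers, hidden_size, bits, batch_size, seq_len):
    """True if weights plus KV cache fit in GPU memory (activations ignored)."""
    # Assumed approximation: about 12 * h^2 parameters per decoder layer.
    num_params = 12 * num_layers * hidden_size ** 2
    total = weight_bytes(num_params, bits) + kv_cache_bytes(
        num_layers, hidden_size, batch_size, seq_len)
    return total <= gpu_bytes


if __name__ == "__main__":
    gpu = 24 * 2 ** 30  # illustrative 24 GiB device
    for bits in (16, 8, 4):
        for batch in (32, 16):
            ok = fits(gpu, num_layers=12, hidden_size=7168,
                      bits=bits, batch_size=batch, seq_len=2048)
            print(f"bits={bits} batch={batch} fits={ok}")
```

In this illustrative configuration the batch-32 KV cache alone is about 21 GiB, so even 4-bit weights do not fit on the 24 GiB device; halving the batch is what restores feasibility, which is why micro-batch size belongs in the same search as bitwidth.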
Core Entities
Models
- OPT-13b
- OPT-30b (also used in per-layer timing)
- OPT-66b
- OPT-175b
- BLOOM-3b
- BLOOM-176b
Metrics
- Throughput (tokens/s)
- Latency (s per batch)
- Perplexity (PPL)
- Memory usage (GPU bytes)
- Latency prediction error (%)
Datasets
- WikiText2
- Penn Treebank (PTB)
- C4
- ShareGPT (prompt length distribution)
Benchmarks
- Perplexity (PPL) on WikiText2/PTB/C4
- Token generation throughput (tokens/s)
- End-to-end serving latency

