Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
Mixing GPU types and jointly optimizing how models are deployed and how requests are routed lets you serve more requests or cut tail latency for the same hourly cloud spend, making LLM products cheaper and more scalable.
Summary TLDR
Homogeneous GPU fleets waste money because LLM requests differ widely in how much compute versus memory they need. The authors benchmark many GPU types and workloads, then build a scheduler (MILP + heuristics + binary search) that picks which GPUs to rent, how to deploy replicas (DP/TP/PP), and how to route requests. On real traces with Llama3 models, the resulting plan raises throughput by up to 41% (average ~20–25%) and cuts tail latency by up to 54% (average ~20%) under the same hourly budget.
Problem Statement
Cloud users usually rent homogeneous GPUs to serve LLMs. But requests vary widely in input and output length, and therefore in how much compute versus memory they need, while cloud GPU availability and budgets fluctuate. The paper asks: can you save money and/or improve throughput by renting a mix of GPU types and jointly optimizing GPU composition, deployment configurations, and request routing under budget and availability limits?
Main Contribution
A comprehensive benchmark of LLM inference (Llama3-8B and -70B) across six common GPU types and nine workload types, measuring throughput-per-cost and latency percentiles.
A mixed-integer linear program (MILP) that jointly chooses GPU composition, per-replica deployment (DP/TP/PP), and fractional workload assignment under budget and real-time availability constraints.
Practical speedups (heuristics, connectivity/memory pruning, and a binary search on makespan), plus a multi-model extension and an empirical evaluation on real cloud availabilities and traces.
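The binary search on makespan can be illustrated with a toy model. The sketch below is not the paper's MILP: it swaps in brute-force enumeration for the solver, dedicates each GPU wholly to one workload type instead of using fractional assignment, and ignores DP/TP/PP choices. The GPU names, prices, availabilities, rates, demands, and budget are all made-up assumptions.

```python
from itertools import product

# Hypothetical catalog: price in $/h, GPUs available, and throughput
# (requests/s per GPU) for two workload types. All numbers are invented.
GPUS = {
    "A": {"price": 4.0, "avail": 2, "rate": {"compute": 10.0, "memory": 4.0}},
    "B": {"price": 2.0, "avail": 3, "rate": {"compute": 3.0, "memory": 5.0}},
}
DEMAND = {"compute": 600.0, "memory": 600.0}  # requests to serve per type
BUDGET = 10.0  # $/h


def splits(avail):
    """All (n_compute, n_memory) pairs with n_compute + n_memory <= avail."""
    return [(c, m) for c in range(avail + 1) for m in range(avail + 1 - c)]


def feasible(makespan):
    """Return a plan that serves all demand within `makespan` seconds, or None.

    Brute-force stand-in for the feasibility check a solver would perform.
    """
    names = list(GPUS)
    for combo in product(*(splits(GPUS[g]["avail"]) for g in names)):
        pairs = list(zip(names, combo))
        cost = sum(GPUS[g]["price"] * (c + m) for g, (c, m) in pairs)
        if cost > BUDGET:
            continue
        cap_c = sum(GPUS[g]["rate"]["compute"] * c for g, (c, _) in pairs)
        cap_m = sum(GPUS[g]["rate"]["memory"] * m for g, (_, m) in pairs)
        if (cap_c * makespan >= DEMAND["compute"]
                and cap_m * makespan >= DEMAND["memory"]):
            return dict(pairs)
    return None


def min_makespan(lo=0.0, hi=1e4, tol=1e-3):
    """Binary search for the smallest makespan that has a feasible plan."""
    if feasible(hi) is None:
        raise ValueError("no feasible plan even at the upper bound")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid) is not None:
            hi = mid  # feasible: try a tighter deadline
        else:
            lo = mid  # infeasible: relax the deadline
    return hi, feasible(hi)
```

Calling `min_makespan()` returns the tightest deadline found and a plan mapping each GPU type to its `(n_compute, n_memory)` split. The search works because feasibility is monotone in the makespan: any plan that meets a deadline also meets every looser one.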
Key Findings
Picking the right mix of GPU types improves cost-efficiency versus a homogeneous fleet.
Jointly tuning GPU composition, deployment config, and workload routing yields large gains.
The end-to-end scheduler outperforms common baselines on real traces and budgets.
Algorithmic speed/scalability trade-off: a binary-search feasibility check speeds up the plan search.
Results
End-to-end throughput improvement (best case)
Latency reduction (best case)
HexGen comparison
Helix comparison (Azure-Trace, $15/h)
Algorithm search speed
Who Should Care
What To Try In 7 Days
Profile your real request mix for input/output token lengths and classify into compute- vs memory-bound types.
Run one-time per-GPU profiling (prefill vs decode) to estimate per-config throughput as the paper does.
Simulate a small MILP or the paper's binary-search feasibility check over your budget to test a mixed-GPU plan before changing rentals.
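As a starting point for the first step, the sketch below buckets requests by their input/output token lengths. It relies on the general observation that prefill over long prompts tends to be compute-bound while long decoding tends to be memory-bound. The `ratio` threshold and the three bucket names are illustrative assumptions, not the paper's workload taxonomy.

```python
from collections import Counter


def classify(input_tokens, output_tokens, ratio=4.0):
    """Bucket one request; `ratio` is a hypothetical threshold to tune."""
    if input_tokens >= ratio * output_tokens:
        return "compute-bound"  # prefill-heavy: long prompt, short answer
    if output_tokens >= ratio * input_tokens:
        return "memory-bound"   # decode-heavy: short prompt, long answer
    return "balanced"


def workload_mix(requests):
    """Fraction of requests per bucket, given (input, output) token pairs."""
    counts = Counter(classify(i, o) for i, o in requests)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}


# Example with made-up token counts taken from a request log.
sample = [(2000, 100), (50, 800), (300, 300), (4000, 200)]
mix = workload_mix(sample)
```

The resulting fractions can serve as the per-type demand that a planner takes as input.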
Optimization Features
Infra Optimization
- Budget-aware provisioning
- Availability-aware renting
System Optimization
- GPU Utilization
- Model Routing
Inference Optimization
- Efficient Inference
- Distributed Inference
- Latency Optimization
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- MILP search space grows combinatorially; worst-case solver time can be very large for many GPU/config combinations.
- One-time profiling estimates have 4–7% error; misestimates can reduce optimality.
- Experiments assume access to cloud availability snapshots; rapid availability changes require replanning.
- Connectivity and intra-machine bandwidth assumptions limit some cross-machine parallelism.
When Not To Use
- You have an ample budget and can afford top-tier homogeneous GPUs; a simple homogeneous deployment may suffice.
- You cannot profile per-GPU/per-config latency (e.g., no access to representative GPUs or workloads).
- Workloads change so fast that replanning overhead outweighs scheduling gains without an online replanning layer.
Failure Modes
- Profiling errors lead to systematically wrong config choices and degraded throughput.
- Cloud GPU availability changes or rentals are preempted before replanning, causing allocation gaps.
- The MILP solver times out, or the heuristics are too coarse, yielding suboptimal plans.
- Routing too many requests to a locally optimal replica causes queuing and higher tail latency.
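The last failure mode can be illustrated with a queue-aware routing rule. This is a generic mitigation sketched under simplifying assumptions, not the paper's router: queues never drain, and each replica has a single fixed service rate. The replica names and rates are hypothetical.

```python
def pick_replica(queue_len, rate):
    """Pick the replica with the lowest estimated wait, (queued + 1) / rate."""
    return min(rate, key=lambda r: (queue_len[r] + 1) / rate[r])


def route(n_requests, rate):
    """Assign requests one by one; return how many each replica received."""
    queue_len = {r: 0 for r in rate}
    for _ in range(n_requests):
        queue_len[pick_replica(queue_len, rate)] += 1
    return queue_len


# A replica that is 3x faster ends up with about 3x the load, not all of it.
loads = route(8, {"fast": 3.0, "slow": 1.0})
```

Routing every request to the fastest replica would leave the slow one idle while the fast queue grows. Weighting by queue depth spreads load roughly in proportion to service rate.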
Core Entities
Models
- Llama3-8B
- Llama3-70B
Metrics
- requests/sec
- throughput per $/h
- p90/p99/p100 latency
- total cost = latency × GPU price
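A minimal sketch of how the cost and latency metrics above might be computed. It assumes latency is measured in seconds and GPU price in $/h, so the cost formula includes a unit conversion, and it uses the nearest-rank definition of a percentile. The source does not state which units or percentile method the paper uses.

```python
import math


def total_cost(latency_s, price_per_hour):
    """Cost in dollars: latency (s) times GPU price ($/h), converted to hours."""
    return latency_s / 3600.0 * price_per_hour


def percentile(latencies, p):
    """Nearest-rank percentile for 0 < p <= 100; p=100 is the maximum."""
    ordered = sorted(latencies)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Example with made-up latencies in seconds.
lat = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
tail = {p: percentile(lat, p) for p in (90, 99, 100)}
```

With only ten samples, p99 and p100 coincide. Tail percentiles become meaningful only on much larger traces.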
Datasets
- ShareGPT
- WildGPT / WildChat
- Azure-Trace
- Swiss AI Center traces (private)
Benchmarks
- Throughput-per-cost (throughput / GPU price)
- Latency percentiles (P5–P100)
- Makespan on workload batches
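The throughput-per-cost benchmark can be turned into a per-workload ranking of GPU types. The sketch below uses invented prices and throughput numbers purely for illustration; real values would come from your own profiling runs.

```python
# Hypothetical prices ($/h) and profiled throughput (requests/s per GPU).
PRICE = {"A": 4.0, "B": 2.0}
RATE = {
    ("A", "compute"): 10.0,
    ("A", "memory"): 4.0,
    ("B", "compute"): 3.0,
    ("B", "memory"): 5.0,
}


def best_gpu_per_workload(rate, price):
    """For each workload type, return (gpu, throughput per $/h) of the winner."""
    best = {}
    for (gpu, workload), rps in rate.items():
        score = rps / price[gpu]
        if workload not in best or score > best[workload][1]:
            best[workload] = (gpu, score)
    return best


ranking = best_gpu_per_workload(RATE, PRICE)
```

In this made-up example the pricier GPU wins on the compute-bound workload and the cheaper one wins on the memory-bound workload, which is the kind of split that motivates a mixed fleet.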
Context Entities
Models
- GPT-4, Gemini, Claude (mentioned as examples)
Datasets
- BurstGPT / production traces referenced in related work

