Overview
Evaluated on real traces and multiple GPU types. The main risks are solver time and profiling error, but binary search and heuristics keep the approach practical for production experiments.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
Mixing GPU types and jointly optimizing how models are deployed and routed can process more requests or cut tail latency for the same hourly cloud spend, making LLM products cheaper and more scalable.
Who Should Care
Summary TLDR
Homogeneous GPU fleets waste money because LLM requests differ widely in how much compute versus memory they need. The authors benchmark many GPU types and workloads, then build a scheduler (MILP + heuristics + binary search) that picks which GPUs to rent, how to deploy replicas (DP/TP/PP), and how to route requests. On real traces with Llama3 models, the resulting plan raises throughput by up to 41% (avg ~20–25%) and cuts tail latency by up to 54% (avg ~20%) relative to homogeneous fleets under the same hourly budget.
Problem Statement
Cloud users usually rent homogeneous GPUs to serve LLMs. But requests vary widely in input and output length, and therefore in how compute-bound or memory-bound they are, while cloud GPU availability and budgets fluctuate. The paper asks: can you save money and/or improve throughput by renting a mix of GPU types and jointly optimizing fleet composition, deployment configs, and request routing under budget and availability limits?
Main Contribution
A comprehensive benchmark of LLM inference (Llama3-8B and -70B) across six common GPU types and nine workload types, measuring throughput-per-cost and latency percentiles.
A mixed-integer linear program (MILP) that jointly chooses GPU composition, per-replica deployment (DP/TP/PP), and fractional workload assignment under budget and real-time availability constraints.
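To make the joint decision concrete, here is a minimal sketch of a much smaller version of the problem. It is not the paper's MILP: it replaces the solver with exhaustive enumeration, dedicates whole GPUs to one workload type instead of routing fractionally, and ignores DP/TP/PP choices. The GPU names, prices, availability, throughputs, and traffic mix are all hypothetical.

```python
from itertools import product

# Hypothetical inputs for illustration; none of these numbers come from the paper.
PRICE = {"A": 4.0, "B": 1.5}   # $/hour per GPU
AVAIL = {"A": 4, "B": 8}       # GPUs currently rentable
THR = {                        # requests/s one GPU sustains on each workload type
    "A": {"compute": 10.0, "memory": 6.0},
    "B": {"compute": 3.0, "memory": 4.0},
}
MIX = {"compute": 0.5, "memory": 0.5}  # share of traffic per workload type


def served_rate(plan):
    """Total request rate the plan sustains while keeping the traffic mix.

    plan[(gpu, workload)] is the number of GPUs of that type dedicated to
    that workload; the scarcest workload type limits the whole service.
    """
    return min(
        sum(n * THR[g][w] for (g, w2), n in plan.items() if w2 == w) / MIX[w]
        for w in MIX
    )


def best_plan(budget, gpu_types=None):
    """Exhaustively search integer plans within budget and availability."""
    gpu_types = list(gpu_types or PRICE)
    keys = [(g, w) for g in gpu_types for w in MIX]
    best, best_rate = None, -1.0
    for counts in product(*(range(AVAIL[g] + 1) for g, _ in keys)):
        plan = dict(zip(keys, counts))
        rented = {g: sum(plan[(g, w)] for w in MIX) for g in gpu_types}
        if any(rented[g] > AVAIL[g] for g in gpu_types):
            continue  # availability constraint
        if sum(rented[g] * PRICE[g] for g in gpu_types) > budget:
            continue  # budget constraint
        rate = served_rate(plan)
        if rate > best_rate:
            best, best_rate = plan, rate
    return best, best_rate


if __name__ == "__main__":
    for types in (["A"], ["B"], ["A", "B"]):
        plan, rate = best_plan(16.0, types)
        print(types, rate, {k: v for k, v in plan.items() if v})
```

With these made-up numbers and a $16/hour budget, either homogeneous fleet tops out at 24 requests/s, while the mixed plan (two A GPUs on compute-bound traffic, five B GPUs on memory-bound traffic) reaches 40. The size of that gap is an artifact of the toy inputs and says nothing about the paper's measured gains. Enumeration is only viable at this scale, which is why a real instance needs a solver.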
Key Findings
Picking the right mix of GPU types improves cost-efficiency versus a homogeneous fleet.
Jointly tuning GPU composition, deployment config, and workload routing yields up to 41% higher throughput and up to 54% lower tail latency than homogeneous fleets on the evaluated traces.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| End-to-end throughput improvement (best case) | up to 41% higher throughput vs homogeneous baselines | homogeneous GPU fleets | up to +41% | evaluated traces (Swiss AI Center, WildChat, Azure-Trace) | §5.2, Fig.5 | Fig.5 |
| Latency reduction (best case) | up to 54% lower tail latency | homogeneous GPU fleets | up to −54% | evaluated traces (see §5.2) | §5.2, Fig.6 | Fig.6 |
What To Try In 7 Days
Profile your real request mix for input/output token lengths and classify into compute- vs memory-bound types.
Run one-time per-GPU profiling (prefill vs decode) to estimate per-config throughput as the paper does.
Simulate a small MILP or the paper's binary-search feasibility check over your budget to test a mixed-GPU plan before changing rentals.
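The last step can be prototyped before touching any solver. The sketch below shows the general pattern of binary search over a feasibility check: pick a target request rate, ask whether any rental plan within budget meets it, and narrow the range. The details of the paper's own check are not given here, so the single-workload model, the `GPUS` table, and the enumeration inside `feasible` are assumptions for illustration.

```python
from itertools import product

# Hypothetical GPU table: name -> (price $/hour, available units, requests/s per GPU).
GPUS = {"A": (4.0, 4, 10.0), "B": (1.5, 8, 3.5)}


def feasible(target, budget):
    """Return True if some rental plan within budget sustains `target` requests/s.

    Stand-in for a solver call; here it simply enumerates every count combination.
    """
    specs = list(GPUS.values())
    for counts in product(*(range(avail + 1) for _, avail, _ in specs)):
        cost = sum(n * price for n, (price, _, _) in zip(counts, specs))
        rate = sum(n * thr for n, (_, _, thr) in zip(counts, specs))
        if cost <= budget and rate >= target:
            return True
    return False


def max_rate(budget, tol=1e-3):
    """Binary-search the largest feasible target rate, to within `tol`."""
    lo = 0.0  # always feasible: rent nothing, serve nothing
    # Strictly above what the whole available fleet could serve, so infeasible.
    hi = sum(avail * thr for _, avail, thr in GPUS.values()) + 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid, budget):
            lo = mid   # mid is achievable, search higher
        else:
            hi = mid   # mid is not achievable, search lower
    return lo


if __name__ == "__main__":
    print(round(max_rate(16.0), 2))
```

The search is valid because feasibility is monotone: any plan that sustains a given rate also sustains every lower rate. In this toy the enumeration could return the optimum directly; the pattern pays off when each feasibility check is a solver call that is cheaper than a full optimization.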
Optimization Features
Infra Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
MILP search space grows combinatorially; worst-case solver time can be very large for many GPU/config combinations.
One-time profiling estimates have 4–7% error; misestimates can reduce optimality.
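One way to see when a profiling error of this size matters is to check whether it could flip the throughput-per-dollar ranking of two GPU types. The sketch below treats the error as a symmetric relative bound on the profiled throughput, which is an assumption; the function names and the sample numbers are hypothetical.

```python
def per_dollar_interval(throughput, price, rel_err):
    """Range of throughput-per-dollar consistent with a relative profiling error."""
    center = throughput / price
    return center * (1 - rel_err), center * (1 + rel_err)


def ranking_is_robust(gpu_a, gpu_b, rel_err):
    """True if the two intervals do not overlap, so the ranking cannot flip.

    gpu_a and gpu_b are (profiled throughput, hourly price) pairs.
    """
    lo_a, hi_a = per_dollar_interval(*gpu_a, rel_err)
    lo_b, hi_b = per_dollar_interval(*gpu_b, rel_err)
    return hi_a < lo_b or hi_b < lo_a


if __name__ == "__main__":
    # 2.5 vs 2.4 requests/s per dollar: too close to call at 7% error.
    print(ranking_is_robust((10.0, 4.0), (3.6, 1.5), 0.07))
    # 2.5 vs 2.0 requests/s per dollar: the gap exceeds the error band.
    print(ranking_is_robust((10.0, 4.0), (3.0, 1.5), 0.07))
```

When the intervals overlap, the plan's choice between those two GPU types rests on noise, and re-profiling that pair is likely worth more than re-solving.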
When Not To Use
You have an ample budget and can afford top-tier homogeneous GPUs; a simple homogeneous deployment may suffice.
You cannot profile per-GPU/per-config latency (e.g., no access to representative GPUs or workloads).
Failure Modes
Profiling errors lead to systematically wrong config choices and degraded throughput.
Cloud GPU availability changes or rentals are preempted before replanning, causing allocation gaps.

