Mix GPU types, tune deployments, and route workloads to cut LLM serving cost and boost throughput

February 2, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

0

Authors

Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, Eiko Yoneki

Why It Matters For Business

Mixing GPU types and jointly optimizing how models are deployed and routed can process more requests or cut tail latency for the same hourly cloud spend, making LLM products cheaper and more scalable.

Summary TLDR

Homogeneous GPU fleets waste money because LLM requests differ widely in how much compute versus memory they need. The authors benchmark many GPU types and workloads, then build a scheduler (MILP + heuristics + binary search) that picks which GPUs to rent, how to deploy replicas (data, tensor, or pipeline parallelism: DP/TP/PP), and how to route requests. On real traces with Llama3 models, their plan raises throughput by up to 41% (avg ~20–25%) and cuts tail latency by up to 54% (avg ~20%) under the same hourly budget.

Problem Statement

Cloud users usually rent homogeneous GPUs to serve LLMs. But requests vary widely in input and output length, and therefore in their compute versus memory needs, while cloud GPU availability and budgets fluctuate. The paper asks: can you save money, improve throughput, or both by renting a mix of GPU types and jointly optimizing GPU composition, deployment configs, and request routing under budget and availability limits?

Main Contribution

A comprehensive benchmark of LLM inference (Llama3-8B and -70B) across six common GPU types and nine workload types, measuring throughput-per-cost and latency percentiles.

A mixed-integer linear program (MILP) that jointly chooses GPU composition, per-replica deployment (DP/TP/PP), and fractional workload assignment under budget and real-time availability constraints.
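To make the shape of such a program concrete, here is one plausible way to write a makespan-minimizing MILP of this kind. It is a sketch under stated assumptions, not the paper's exact formulation. Every symbol is introduced here for illustration: $c$ indexes candidate replicas (a GPU type plus a DP/TP/PP layout), $w$ indexes workload types, $g$ indexes GPU types, $y_c$ says whether replica $c$ is rented, $a_{c,w}$ is the fraction of workload $w$ routed to $c$, $n_{c,g}$ is the number of type-$g$ GPUs replica $c$ uses, $p_g$ and $A_g$ are price and availability, $B$ is the hourly budget, $\lambda_w$ is the number of requests of type $w$, and $h_{c,w}$ is the profiled throughput of $c$ on $w$.

```latex
\begin{aligned}
\min_{y,\,a,\,T}\quad & T \\
\text{s.t.}\quad
& \sum_{w} \frac{a_{c,w}\,\lambda_w}{h_{c,w}} \le T && \forall c
  && \text{(each replica finishes within the makespan)} \\
& a_{c,w} \le y_c && \forall c, w
  && \text{(route only to rented replicas)} \\
& \sum_{c} a_{c,w} = 1 && \forall w
  && \text{(every workload is fully assigned)} \\
& \sum_{c} \Big(\sum_{g} p_g\, n_{c,g}\Big)\, y_c \le B
  && && \text{(hourly budget)} \\
& \sum_{c} n_{c,g}\, y_c \le A_g && \forall g
  && \text{(availability per GPU type)} \\
& y_c \in \{0,1\},\quad 0 \le a_{c,w} \le 1,\quad T \ge 0 .
\end{aligned}
```

Under these assumptions all constraints are linear in $(y, a, T)$, so a standard MILP solver applies. The integer variables $y_c$ are what make the search space grow combinatorially as more GPU types and layouts are added.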

Practical speedups (heuristics, connectivity- and memory-based pruning, and a binary search on makespan), plus a multi-model extension and an empirical evaluation on real cloud availability data and traces.

Key Findings

Picking the right mix of GPU types improves cost-efficiency versus a homogeneous fleet.

Numbers: up to 2.27× improvement in throughput-per-cost (benchmarking)
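The throughput-per-cost metric behind this finding is a simple ratio, and a small sketch shows how it ranks GPU types. The GPU names, request rates, and prices below are hypothetical placeholders, not measurements from the paper.

```python
# Hypothetical numbers for illustration only; not measurements from the paper.
gpus = {
    "gpu_a": {"req_per_s": 4.0, "usd_per_hour": 2.0},
    "gpu_b": {"req_per_s": 1.5, "usd_per_hour": 0.5},
}


def throughput_per_cost(req_per_s, usd_per_hour):
    """Requests per second delivered per dollar-per-hour of rental."""
    return req_per_s / usd_per_hour


# Score every GPU type, then compare the best against the worst.
scores = {name: throughput_per_cost(**spec) for name, spec in gpus.items()}
best = max(scores, key=scores.get)
ratio = scores[best] / min(scores.values())
```

With these made-up inputs the slower, cheaper `gpu_b` wins by a factor of 1.5, which is the kind of gap the benchmark is designed to expose per workload type.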

Jointly tuning GPU composition, deployment config, and workload routing yields large gains.

Numbers: toy example: composition +20% / config +14% / assignment +8%

End-to-end scheduler outperforms common baselines on real traces and budgets.

Numbers: up to 41% throughput ↑, avg ~20–25%; latency reduction up to 54%, avg ~20%

Speed versus optimality trade-off: replacing the direct MILP solve with a binary search over makespan feasibility checks speeds up planning at a small cost in plan quality.

Numbers: ≈4× reduction in search time with <1% performance degradation
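The idea of a binary search on makespan can be sketched as follows: instead of minimizing the makespan directly, guess a target, ask "is there any plan within budget that meets it?", and halve the interval. This is a toy illustration, not the paper's implementation. `CATALOG`, `max_throughput`, `feasible`, and `min_makespan` are hypothetical names, the numbers are made up, and the brute-force feasibility check stands in for the solver-based check the paper uses.

```python
from itertools import product

# Hypothetical GPU catalog: name -> (price $/h, throughput req/s, available count)
CATALOG = {"a": (2.0, 4.0, 3), "b": (0.5, 1.5, 4)}


def max_throughput(budget):
    """Best total req/s over all GPU counts that fit the budget (brute force)."""
    names = list(CATALOG)
    ranges = [range(CATALOG[n][2] + 1) for n in names]
    best = 0.0
    for counts in product(*ranges):
        cost = sum(k * CATALOG[n][0] for k, n in zip(counts, names))
        if cost <= budget:
            rate = sum(k * CATALOG[n][1] for k, n in zip(counts, names))
            best = max(best, rate)
    return best


def feasible(makespan_s, num_requests, budget):
    """Can some in-budget plan serve all requests within makespan_s seconds?"""
    return max_throughput(budget) * makespan_s >= num_requests


def min_makespan(num_requests, budget, hi=1e6, tol=1e-3):
    """Binary search for the smallest feasible makespan, to within tol seconds."""
    if not feasible(hi, num_requests, budget):
        raise ValueError("no feasible plan even at the upper bound")
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid, num_requests, budget):
            hi = mid  # target is achievable, try a tighter one
        else:
            lo = mid  # target is too aggressive, relax it
    return hi


result = min_makespan(num_requests=1000, budget=4.0)
```

In this toy the feasibility check is cheap, so the search buys nothing. The benefit appears when each check is a feasibility problem that a solver can answer faster than the full optimization.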

Results

End-to-end throughput improvement (best case)

Value: up to 41% higher throughput vs homogeneous baselines

Baseline: homogeneous GPU fleets

Latency reduction (best case)

Value: up to 54% lower tail latency

Baseline: homogeneous GPU fleets

HexGen comparison

Value: the paper's method outperforms HexGen by up to 18%, avg ~14%

Baseline: HexGen (heterogeneous baseline)

Helix comparison (Azure-Trace, $15/h)

Value: Helix 5.72 req/s vs this work 7.13 req/s

Baseline: Helix

Algorithm search speed

Value: binary search ≈4× faster than direct MILP

Baseline: MILP branch-and-bound

Who Should Care

What To Try In 7 Days

Profile your real request mix for input/output token lengths and classify into compute- vs memory-bound types.
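A first pass at this profiling step can be a few lines of Python. The sketch below assumes a simple ratio heuristic: prompts much longer than their outputs are dominated by prefill (compute-heavy), and outputs much longer than their prompts are dominated by decode (memory-heavy). The `classify` function, the threshold of 4, and the sample requests are all hypothetical; the paper's own workload taxonomy may differ.

```python
from collections import Counter


def classify(input_tokens, output_tokens, ratio_threshold=4.0):
    """Hypothetical heuristic: label a request by its input/output length ratio."""
    if input_tokens >= ratio_threshold * output_tokens:
        return "compute-bound"  # long prompt, short answer: prefill dominates
    if output_tokens >= ratio_threshold * input_tokens:
        return "memory-bound"   # short prompt, long answer: decode dominates
    return "mixed"


# Made-up (input_tokens, output_tokens) pairs standing in for a real trace.
requests = [(2048, 64), (128, 1024), (512, 512), (4096, 128)]
mix = Counter(classify(i, o) for i, o in requests)
```

The resulting counts give the per-type request volumes that a planner needs as input. Tune the threshold against your own latency measurements rather than trusting the default.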

Run one-time per-GPU profiling (prefill vs decode) to estimate per-config throughput as the paper does.

Simulate a small MILP or the paper's binary-search feasibility check over your budget to test a mixed-GPU plan before changing rentals.
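Before reaching for a solver, a brute-force simulation over a tiny catalog can show whether a mixed plan is worth pursuing. The sketch below enumerates GPU counts under a budget and, for each, tries a coarse grid of routing fractions to minimize makespan. It is a stand-in for a real MILP, not the paper's method, and every name and number (`PRICE`, `AVAILABLE`, `RATE`, `DEMAND`, `busy_time`, `best_plan`) is a hypothetical placeholder.

```python
from itertools import product

# All numbers are hypothetical placeholders, not profiled values.
PRICE = {"a": 2.0, "b": 0.5}               # $/h per GPU
AVAILABLE = {"a": 3, "b": 4}               # GPUs currently on offer
# Requests/s that one GPU sustains for each workload type.
RATE = {"a": {"compute": 4.0, "memory": 2.0},
        "b": {"compute": 0.5, "memory": 1.5}}
DEMAND = {"compute": 600, "memory": 600}   # requests to serve per type


def busy_time(gpu, count, share):
    """Seconds the `gpu` pool needs to serve its share of each workload."""
    total = 0.0
    for workload, frac in share.items():
        if frac == 0:
            continue
        if count == 0:
            return float("inf")  # work routed to a pool with no GPUs
        total += frac * DEMAND[workload] / (count * RATE[gpu][workload])
    return total


def best_plan(budget, steps=10):
    """Return (makespan_s, gpu_counts, share_routed_to_a) by exhaustive search."""
    best = (float("inf"), None, None)
    fracs = [i / steps for i in range(steps + 1)]
    for na, nb in product(range(AVAILABLE["a"] + 1), range(AVAILABLE["b"] + 1)):
        if na * PRICE["a"] + nb * PRICE["b"] > budget:
            continue
        for fc, fm in product(fracs, fracs):
            t = max(busy_time("a", na, {"compute": fc, "memory": fm}),
                    busy_time("b", nb, {"compute": 1 - fc, "memory": 1 - fm}))
            if t < best[0]:
                best = (t, {"a": na, "b": nb}, {"compute": fc, "memory": fm})
    return best


makespan_s, gpu_counts, share_to_a = best_plan(budget=4.0)
```

With these made-up inputs the search prefers a mix of both GPU types over spending the whole budget on type `a`. Swap in your own profiled rates and prices to see whether the same holds for your workload.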

Optimization Features

Infra Optimization

  • Budget-aware provisioning
  • Availability-aware renting

System Optimization

  • GPU Utilization
  • Model Routing

Inference Optimization

  • Efficient Inference
  • Distributed Inference
  • Latency Optimization

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • MILP search space grows combinatorially; worst-case solver time can be very large for many GPU/config combinations.
  • One-time profiling estimates have 4–7% error; misestimates can reduce optimality.
  • Experiments assume access to cloud availability snapshots; rapid availability changes require replanning.
  • Connectivity and intra-machine bandwidth assumptions limit some cross-machine parallelism.

When Not To Use

  • You have an ample budget and can afford top-tier homogeneous GPUs; a simple homogeneous deployment may suffice.
  • You cannot profile per-GPU/per-config latency (e.g., no access to representative GPUs or workloads).
  • Workloads change so fast that replanning overhead outweighs scheduling gains without an online replanning layer.

Failure Modes

  • Profiling errors lead to systematically wrong config choices and degraded throughput.
  • Cloud GPU availability changes or rentals are preempted before replanning, causing allocation gaps.
  • MILP solver times out or relies on heuristics that are too coarse, yielding suboptimal plans.
  • Routing too many requests to a locally optimal replica causes queuing and higher tail latency.

Core Entities

Models

  • Llama3-8B
  • Llama3-70B

Metrics

  • requests/sec
  • throughput per $/h
  • p90/p99/p100 latency
  • total cost = latency × GPU price
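The last metric is easy to get wrong on units, so a worked example may help. The sketch assumes latency is measured in seconds, price is quoted in dollars per GPU-hour, and cost scales linearly with the number of GPUs held for the run. The function name and the inputs are hypothetical.

```python
def total_cost_usd(latency_s, price_usd_per_hour, num_gpus=1):
    """Dollar cost of holding num_gpus GPUs for latency_s seconds."""
    hours = latency_s / 3600.0
    return hours * price_usd_per_hour * num_gpus


# Example: a batch that takes 30 minutes on 4 GPUs rented at $2 per GPU-hour.
cost = total_cost_usd(latency_s=1800, price_usd_per_hour=2.0, num_gpus=4)
```

Here half an hour times $2 times 4 GPUs comes to $4.00.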

Datasets

  • ShareGPT
  • WildGPT / WildChat
  • Azure-Trace
  • Swiss AI Center traces (private)

Benchmarks

  • Throughput-per-cost (throughput / GPU price)
  • Latency percentiles (P5–P100)
  • Makespan on workload batches
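Latency percentiles like these can be computed from raw per-request latencies with the standard library alone. The sketch below uses the nearest-rank definition, which is one of several common conventions and an assumption here, since this summary does not say which one the paper uses. The sample latencies are made up.

```python
import math


def percentile(latencies, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up per-request latencies in seconds.
latencies_s = [0.8, 1.1, 0.9, 4.2, 1.0, 1.3, 0.7, 2.5, 1.2, 1.0]
p50 = percentile(latencies_s, 50)
p90 = percentile(latencies_s, 90)
p100 = percentile(latencies_s, 100)
```

Under this definition P100 is simply the slowest request, which is why it is so sensitive to queuing on a single overloaded replica.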

Context Entities

Models

  • GPT-4, Gemini, Claude (mentioned as examples)

Datasets

  • BurstGPT / production traces referenced in related work