Mix GPU types, tune deployments, and route workloads to cut LLM serving cost and boost throughput

February 2, 2025 | 7 min

Overview

Decision Snapshot: Needs Validation

Well tested on real traces and multiple GPU types. The main risks are solver time and profiling error, but binary search and heuristics make the approach practical for production experiments.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, Eiko Yoneki

Why It Matters For Business

Mixing GPU types and jointly optimizing how models are deployed and routed can process more requests or cut tail latency for the same hourly cloud spend, making LLM products cheaper and more scalable.

Who Should Care

Summary TLDR

Homogeneous GPU fleets waste money because LLM requests have very different compute vs. memory needs. The authors benchmark many GPU types and workloads, then build a scheduler (MILP + heuristics + binary search) that picks which GPUs to rent, how to deploy replicas (DP/TP/PP), and how to route requests. On real traces with Llama3 models, the resulting plan raises throughput by up to 41% (avg ~20–25%) and cuts tail latency by up to 54% (avg ~20%) under the same hourly budget.

Problem Statement

Cloud users usually rent homogeneous GPUs to serve LLMs. But requests vary widely in input/output length, and therefore in compute vs. memory needs, while cloud GPU availability and budgets fluctuate. The paper asks: can you save money and/or improve throughput by renting a mix of GPU types and jointly optimizing composition, deployment configs, and request routing under budget and availability limits?
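To make the "compute vs. memory needs" distinction concrete, here is a minimal sketch that labels requests by token lengths. It assumes the common rule of thumb that long prompts stress prefill (compute) and long generations stress decode (memory bandwidth). The thresholds `LONG_INPUT` and `LONG_OUTPUT`, the label names, and the sample data are hypothetical and not taken from the paper.

```python
from collections import Counter

# Hypothetical thresholds in tokens; tune them against your own traces.
LONG_INPUT = 1024
LONG_OUTPUT = 256

def classify(input_tokens: int, output_tokens: int) -> str:
    """Label a request by which inference phase is likely to dominate."""
    long_in = input_tokens >= LONG_INPUT
    long_out = output_tokens >= LONG_OUTPUT
    if long_in and not long_out:
        return "compute-bound"   # prefill-heavy: long prompt, short answer
    if long_out and not long_in:
        return "memory-bound"    # decode-heavy: short prompt, long answer
    return "mixed"

def workload_mix(requests):
    """Return the fraction of requests falling into each label."""
    counts = Counter(classify(i, o) for i, o in requests)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Made-up (input_tokens, output_tokens) pairs standing in for a real trace.
sample = [(2048, 64), (128, 512), (4096, 32), (256, 1024), (512, 128)]
mix = workload_mix(sample)
print(mix)
```

The resulting mix is the kind of input a planner would need before deciding which GPU types suit which share of the traffic.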

Main Contribution

A comprehensive benchmark of LLM inference (Llama3-8B and -70B) across six common GPU types and nine workload types, measuring throughput-per-cost and latency percentiles.

A mixed-integer linear program (MILP) that jointly chooses GPU composition, per-replica deployment (DP/TP/PP), and fractional workload assignment under budget and real-time availability constraints.
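The joint choice can be illustrated with a brute-force stand-in for the MILP. This is only a sketch under strong simplifications: it ignores DP/TP/PP deployment options, dedicates each GPU instance to a single workload type instead of using fractional assignment, and enumerates every plan rather than calling a solver. All GPU names, prices, availabilities, throughputs, and demands are invented for illustration.

```python
from itertools import product

# name: (price $/h, available count, req/s on prefill-heavy, req/s on decode-heavy)
# All values are hypothetical, not the paper's measurements.
GPUS = {
    "datacenter": (4.0, 2, 10.0, 6.0),
    "workstation": (2.0, 3, 4.0, 4.0),
    "consumer": (1.0, 4, 1.5, 2.5),
}
DEMAND = {"prefill-heavy": 12.0, "decode-heavy": 8.0}  # req/s to serve
BUDGET = 10.0  # $/h

def splits(avail):
    """All (instances on prefill-heavy, instances on decode-heavy) with a + b <= avail."""
    return [(a, b) for a in range(avail + 1) for b in range(avail + 1 - a)]

def best_plan(gpus, demand, budget):
    """Maximize the fraction of demand served for the worst-off workload type."""
    names = list(gpus)
    best_score, best = -1.0, None
    for combo in product(*(splits(gpus[n][1]) for n in names)):
        cost = sum((a + b) * gpus[n][0] for n, (a, b) in zip(names, combo))
        if cost > budget:
            continue  # violates the hourly budget
        cap_p = sum(a * gpus[n][2] for n, (a, b) in zip(names, combo))
        cap_d = sum(b * gpus[n][3] for n, (a, b) in zip(names, combo))
        score = min(cap_p / demand["prefill-heavy"], cap_d / demand["decode-heavy"])
        if score > best_score:
            best_score, best = score, dict(zip(names, combo))
    return best_score, best

score, plan = best_plan(GPUS, DEMAND, BUDGET)
print(score, plan)
```

Enumeration is only viable at toy scale, which is consistent with the combinatorial growth that motivates the solver time concern listed under Limitations.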

Key Findings

Picking the right mix of GPU types improves cost-efficiency versus a homogeneous fleet.

Numbers: up to 2.27× improvement in throughput-per-cost (benchmarking)

Practical Use: Rent a mix of data-center, workstation, and consumer GPUs and match each workload type to the GPU that fits its compute/memory profile to increase throughput per dollar.

Evidence Ref: Observation-1; Fig. 3, Fig. 11
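As a rough illustration of the throughput-per-cost metric (throughput divided by GPU price), the sketch below picks the most cost-efficient GPU type for each workload type. The `PRICE` and `THROUGHPUT` tables are hypothetical placeholders; in practice they would come from your own profiling runs.

```python
# Hypothetical prices in $/h and profiled throughput in requests/sec.
PRICE = {"datacenter": 4.0, "workstation": 2.0, "consumer": 1.0}
THROUGHPUT = {
    "prefill-heavy": {"datacenter": 10.0, "workstation": 4.0, "consumer": 1.5},
    "decode-heavy": {"datacenter": 6.0, "workstation": 4.0, "consumer": 2.5},
}

def throughput_per_cost(workload):
    """Requests/sec per $/h for every GPU type on one workload type."""
    return {g: THROUGHPUT[workload][g] / PRICE[g] for g in PRICE}

def best_gpu(workload):
    """GPU type with the highest throughput per dollar for this workload."""
    scores = throughput_per_cost(workload)
    return max(scores, key=scores.get)

for w in THROUGHPUT:
    print(w, best_gpu(w), throughput_per_cost(w))
```

With these invented numbers the two workload types prefer different GPU types, which is the situation in which a mixed fleet can beat a homogeneous one.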

Jointly tuning GPU composition, deployment config, and workload routing yields large gains.

Numbers: toy example: composition +20% / config +14% / assignment +8%

Practical Use: Don’t adjust only one lever. Co-optimize which GPUs you rent, how replicas are parallelized, and how requests are routed for best end-to-end speed.

Evidence Ref: §4.2 simple example

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
End-to-end throughput improvement (best case) | up to 41% higher throughput vs homogeneous baselines | homogeneous GPU fleets | up to +41% | evaluated traces (Swiss AI Center, WildChat, Azure-Trace) | §5.2, Fig. 5 | Fig. 5
Latency reduction (best case) | up to 54% lower tail latency | homogeneous GPU fleets | up to −54% | evaluated traces (see §5.2) | §5.2, Fig. 6 | Fig. 6

What To Try In 7 Days

Profile your real request mix for input/output token lengths and classify into compute- vs memory-bound types.

Run one-time per-GPU profiling (prefill vs decode) to estimate per-config throughput as the paper does.

Simulate a small MILP or the paper's binary-search feasibility check over your budget to test a mixed-GPU plan before changing rentals.
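The third step can be prototyped without a solver. The sketch below assumes the binary search runs over the makespan T (time to finish a batch) and that each replica is dedicated to one workload type; the feasibility check is a brute-force stand-in for the paper's check, not a reproduction of it. `REPLICAS`, `BATCH`, and the search bounds are hypothetical.

```python
from itertools import product

# Hypothetical replicas: req/s each sustains on (prefill-heavy, decode-heavy) requests.
REPLICAS = [(10.0, 6.0), (4.0, 4.0), (4.0, 4.0), (1.5, 2.5)]
BATCH = (600, 400)  # number of requests of each type to finish

def feasible(T, replicas=REPLICAS, batch=BATCH):
    """True if some replica-to-type assignment finishes the batch within T seconds."""
    for assign in product((0, 1), repeat=len(replicas)):
        done = [0.0, 0.0]
        for rates, w in zip(replicas, assign):
            done[w] += rates[w] * T  # requests of type w this replica completes
        if done[0] >= batch[0] and done[1] >= batch[1]:
            return True
    return False

def min_makespan(lo=0.0, hi=3600.0, tol=0.01):
    """Binary-search the smallest feasible makespan, accurate to within tol seconds."""
    if not feasible(hi):
        raise ValueError("batch cannot finish within the upper bound")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid):
            hi = mid   # mid works, so try a tighter deadline
        else:
            lo = mid   # mid is too tight, so relax
    return hi

print(min_makespan())
```

The search works because feasibility is monotone in T: any plan that meets a deadline also meets every looser one. Swapping `feasible` for a real MILP or LP feasibility call keeps the outer loop unchanged.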

Optimization Features

Infra Optimization
Budget-aware provisioning, Availability-aware renting
System Optimization
GPU Utilization, Model Routing
Inference Optimization
Efficient Inference, Distributed Inference, Latency Optimization

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

MILP search space grows combinatorially; worst-case solver time can be very large for many GPU/config combinations.

One-time profiling estimates have 4–7% error; misestimates can reduce optimality.
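One way to gauge whether a 4–7% profiling error matters for a given decision is a worst-case margin check: the top choice is safe if it still wins after its own score is reduced and the runner-up's is inflated by the error bound. This is a simple sketch of that idea, not a procedure from the paper; the function name `robust_choice` and the example scores are hypothetical.

```python
def robust_choice(scores, rel_error=0.07):
    """Return (best option, whether it still wins under worst-case relative error).

    `scores` maps option name to an estimated score such as throughput per dollar.
    Requires at least two options.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, s1), (_, s2) = ranked[0], ranked[1]
    # Worst case: the winner is overestimated and the runner-up underestimated.
    return best, s1 * (1 - rel_error) > s2 * (1 + rel_error)

# Made-up throughput-per-dollar estimates.
print(robust_choice({"datacenter": 2.5, "workstation": 2.0, "consumer": 1.5}))
print(robust_choice({"gpu_a": 2.1, "gpu_b": 2.0}))
```

When the check fails, the two options are within measurement noise of each other, and re-profiling or picking on other grounds (availability, operational simplicity) may be more sensible than trusting the ranking.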

When Not To Use

You have an ample budget and can afford top-tier homogeneous GPUs; in that case a simple homogeneous deployment may suffice.

You cannot profile per-GPU/per-config latency (e.g., no access to representative GPUs or workloads).

Failure Modes

Profiling errors lead to systematically wrong config choices and degraded throughput.

Cloud GPU availability changes or rentals are preempted before replanning, causing allocation gaps.

Core Entities

Models

Llama3-8B, Llama3-70B

Metrics

requests/sec, throughput per $/h, p90/p99/p100 latency, total cost = latency × GPU price

Datasets

ShareGPT, WildGPT / WildChat, Azure-Trace, Swiss AI Center traces (private)

Benchmarks

Throughput-per-cost (throughput / GPU price), Latency percentiles (P5–P100), Makespan on workload batches

Context Entities

Models

GPT-4, Gemini, Claude (mentioned as examples)

Datasets

BurstGPT / production traces referenced in related work