Mix GPU types, tune deployments, and route workloads to cut LLM serving cost and boost throughput

Overview

Decision Snapshot: Needs Validation

Well tested on real traces and multiple GPU types. The main risks are solver time and profiling error, but binary search and heuristics make the approach practical for production experiments.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, Eiko Yoneki

Why It Matters For Business

Mixing GPU types and jointly optimizing how models are deployed and routed can process more requests or cut tail latency for the same hourly cloud spend, making LLM products cheaper and more scalable.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

Homogeneous GPU fleets waste money because LLM requests differ widely in their compute versus memory needs. The authors benchmark many GPU types and workloads, then build a scheduler (MILP + heuristics + binary search) that picks which GPUs to rent, how to deploy replicas (DP/TP/PP), and how to route requests. On real traces with Llama3 models, their plan raises throughput by up to 41% (about 20–25% on average) and cuts tail latency by up to 54% (about 20% on average) under the same hourly budget.

Problem Statement

Cloud users usually rent homogeneous GPUs to serve LLMs. But requests vary widely in input/output length, and therefore in compute versus memory needs, while cloud GPU availability and budgets fluctuate. The paper asks: can you save money and/or improve throughput by renting a mix of GPU types and jointly optimizing GPU composition, deployment configs, and request routing under budget and availability limits?

Main Contribution

A comprehensive benchmark of LLM inference (Llama3-8B and -70B) across six common GPU types and nine workload types, measuring throughput-per-cost and latency percentiles.

A mixed-integer linear program (MILP) that jointly chooses GPU composition, per-replica deployment (DP/TP/PP), and fractional workload assignment under budget and real-time availability constraints.

Key Findings

Picking the right mix of GPU types improves cost-efficiency versus a homogeneous fleet.

Numbers: up to 2.27× improvement in throughput-per-cost (benchmarking)

Practical Use: Rent a mix of data-center, workstation, and consumer GPUs and match each workload type to the GPU that fits its compute/memory profile to increase throughput per dollar.

Evidence Ref: Observation-1; Fig.3, Fig.11

Jointly tuning GPU composition, deployment config, and workload routing yields large gains.

Numbers: toy example: composition +20% / config +14% / assignment +8%

Practical Use: Don’t adjust only one lever. Co-optimize which GPUs you rent, how replicas are parallelized, and how requests are routed for best end-to-end speed.

Evidence Ref: §4.2 simple example

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
End-to-end throughput improvement (best case)	up to 41% higher throughput vs homogeneous baselines	homogeneous GPU fleets	up to +41%	evaluated traces (Swiss AI Center, WildChat, Azure-Trace)	§5.2, Fig.5	Fig.5
Latency reduction (best case)	up to 54% lower tail latency	homogeneous GPU fleets	up to −54%	evaluated traces (see §5.2)	§5.2, Fig.6	Fig.6

What To Try In 7 Days

Profile your real request mix for input/output token lengths and classify into compute- vs memory-bound types.

Run one-time per-GPU profiling (prefill vs decode) to estimate per-config throughput as the paper does.

Simulate a small MILP or the paper's binary-search feasibility check over your budget to test a mixed-GPU plan before changing rentals.
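The first step above, profiling the request mix, can be sketched in a few lines. This is a minimal sketch under stated assumptions: it uses the common rule of thumb that prefill-heavy requests (long input, short output) lean compute-bound and decode-heavy requests (short input, long output) lean memory-bound. The `RATIO` threshold and the three labels are hypothetical and are not the paper's nine workload types.

```python
from collections import Counter

# Hypothetical threshold: one side must be at least RATIO times the other
# before a request is labeled as leaning one way.
RATIO = 4.0


def classify(input_tokens, output_tokens, ratio=RATIO):
    """Label one request by its input/output token lengths."""
    if input_tokens >= ratio * max(output_tokens, 1):
        return "compute-bound"  # prefill dominates
    if output_tokens >= ratio * max(input_tokens, 1):
        return "memory-bound"  # decode dominates
    return "mixed"


def workload_mix(requests):
    """Return the fraction of requests per label.

    requests: iterable of (input_tokens, output_tokens) pairs.
    """
    counts = Counter(classify(i, o) for i, o in requests)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

For example, `workload_mix([(2000, 100), (50, 800), (300, 300), (4000, 50)])` labels half of the requests compute-bound and a quarter each memory-bound and mixed. The resulting fractions are the kind of workload demand a budget-constrained planner would take as input.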

Optimization Features

Infra Optimization

Budget-aware provisioning, Availability-aware renting

System Optimization

GPU Utilization, Model Routing

Inference Optimization

Efficient Inference, Distributed Inference, Latency Optimization

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

MILP search space grows combinatorially; worst-case solver time can be very large for many GPU/config combinations.

One-time profiling estimates have 4–7% error; misestimates can reduce optimality.

When Not To Use

You have an ample budget and can afford top-tier homogeneous GPUs; in that case a simple homogeneous deployment may suffice.

You cannot profile per-GPU/per-config latency (e.g., no access to representative GPUs or workloads).

Failure Modes

Profiling errors lead to systematically wrong config choices and degraded throughput.

Cloud GPU availability changes or rentals are preempted before replanning, causing allocation gaps.

Core Entities

Models

Llama3-8B, Llama3-70B

Metrics

requests/sec, throughput per $/h, p90/p99/p100 latency, total cost = latency × GPU price

Datasets

ShareGPT, WildGPT / WildChat, Azure-Trace, Swiss AI Center traces (private)

Benchmarks

Throughput-per-cost (throughput / GPU price), Latency percentiles (P5–P100), Makespan on workload batches
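Two of these benchmark quantities are simple to compute from your own measurements. The sketch below assumes throughput is measured in requests/sec and price in $/h, and uses the nearest-rank definition of a percentile; the paper may use a different percentile convention. The function names are illustrative.

```python
import math


def throughput_per_cost(requests_per_sec, price_per_hour):
    """Throughput divided by GPU price, as in the benchmark definition."""
    return requests_per_sec / price_per_hour


def latency_percentile(latencies, p):
    """Nearest-rank percentile of a list of latencies, with 0 < p <= 100."""
    if not latencies or not 0 < p <= 100:
        raise ValueError("need at least one latency and 0 < p <= 100")
    ordered = sorted(latencies)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

Comparing `throughput_per_cost` across GPU types for each workload type is the kind of measurement behind the throughput-per-dollar comparison, and `latency_percentile(latencies, 100)` is simply the slowest request.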
