Automatically pick the cheapest mix of GPU types for an LLM service using profiling and an ILP bin-packing solver

April 22, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical: profiling is fast, the ILP solves in ~1s, and experiments show large cost savings. However, it assumes a fixed workload distribution and does not handle autoscaling or instance unavailability.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 65%

Novelty: 50%

Authors

Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica

Why It Matters For Business

Picking the right mix of GPU types can cut cloud GPU costs by up to ~77% for conversational LLMs while still meeting latency targets, lowering monthly infrastructure bills without modifying models or inference logic.

Who Should Care

Summary TLDR

Mélange profiles available GPU types over request sizes and SLOs, then formulates GPU selection as a cost-aware bin-packing integer linear program to find the lowest-cost mix of GPUs that meets a service's latency SLO. Evaluation on Llama2 models and four NVIDIA GPUs (L4, A10G, A100, H100) shows Mélange reduces deployment cost vs single-GPU strategies by up to 77% (short-context), 33% (long-context), and 51% (mixed), while meeting TPOT SLOs for ≥99.5% of requests. Mélange assumes a fixed workload distribution and requires a one-time offline profile per GPU.

Problem Statement

Deploying LLMs is expensive and picking the wrong GPU type wastes money. GPU cost efficiency changes with request size, request rate, and latency SLO. Teams need an automated, simple way to pick a cost-minimal mix of GPU types for a given workload and SLO.

Main Contribution

Analysis showing GPU tokens-per-dollar (T/$) depends strongly on request size, request rate, and latency SLO.

Mélange: a practical framework that profiles GPU performance per request-size bucket and solves a cost-aware bin-packing ILP to pick a minimal-cost heterogeneous GPU allocation.
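One way to write the cost-aware bin-packing problem described above, as a sketch rather than the paper's exact notation. All symbols are assumptions introduced here for illustration: the workload is split into slices $i$ (a request-size bucket at a given rate), $c_j$ is the hourly price of GPU type $j$, and $L_{ij}$ is the fraction of one type-$j$ GPU that slice $i$ consumes while meeting the SLO, taken from the offline profile.

```latex
\begin{aligned}
\min_{A,\,B}\quad & \sum_{j} c_j B_j \\
\text{s.t.}\quad  & \sum_{j} A_{ij} = 1 && \forall i \\
                  & \sum_{i} A_{ij} L_{ij} \le B_j && \forall j \\
                  & A_{ij} \in \{0,1\}, \quad B_j \in \mathbb{Z}_{\ge 0}
\end{aligned}
```

Here $A_{ij} = 1$ assigns slice $i$ to GPU type $j$, and $B_j$ is the number of type-$j$ GPUs to rent. The first constraint serves every slice exactly once; the second keeps the total load placed on each GPU type within the capacity of the GPUs rented.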

Key Findings

GPU cost efficiency (tokens per dollar) varies with request size; no single GPU is best for all sizes.

Numbers: A10G up to 2.6× T/$ over A100 for small requests; A100 up to 1.5× for large requests

Practical Use: Mix cheap and expensive GPUs: use lower-end GPUs for many short requests and higher-end GPUs for large requests to lower cost.

Evidence Ref: §4.2, Fig. 3
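The tokens-per-dollar comparison behind this finding can be sketched in a few lines of Python. The throughput and price figures below are hypothetical placeholders chosen only to show the crossover, not measurements from the paper; `PROFILE`, `PRICE_PER_HOUR`, and both function names are assumptions for illustration.

```python
def tokens_per_dollar(tokens_per_second: float, price_per_hour: float) -> float:
    """Tokens generated per dollar of GPU rental time."""
    if price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return tokens_per_second * 3600.0 / price_per_hour


# HYPOTHETICAL saturated throughput (tokens/s) per request-size bucket.
PROFILE = {
    "small": {"A10G": 1000.0, "A100": 2500.0},
    "large": {"A10G": 100.0, "A100": 600.0},
}
# HYPOTHETICAL on-demand prices in $/hour.
PRICE_PER_HOUR = {"A10G": 1.0, "A100": 4.0}


def best_gpu_per_bucket(profile, prices):
    """Return the GPU type with the highest T/$ for each request-size bucket."""
    return {
        bucket: max(by_gpu, key=lambda g: tokens_per_dollar(by_gpu[g], prices[g]))
        for bucket, by_gpu in profile.items()
    }


if __name__ == "__main__":
    print(best_gpu_per_bucket(PROFILE, PRICE_PER_HOUR))
```

With these placeholder numbers the cheaper GPU wins the small bucket and the larger GPU wins the large bucket, which is the pattern the finding describes. Real inputs would come from your own profiling runs and cloud price list.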

Latency SLO shifts which GPU is most cost-efficient.

Numbers: Under tight TPOT (<60ms), A100 ≈ 2× T/$ vs A10G; loosening SLO to 80–160ms lets A10G exceed A100 by >40%

Practical Use: Choose faster GPUs when you need tight latency; if you can loosen SLOs, prefer cheaper GPUs.

Evidence Ref: §4.3, Fig. 6-7
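A rough sketch of why the SLO changes the ranking: a looser TPOT target lets a GPU run larger batches, which raises its throughput and therefore its T/$. The code below picks the highest-throughput profile point that still meets the SLO. The profile points, prices, and names (`ProfilePoint`, `CHEAP_GPU`, `FAST_GPU`) are hypothetical and stand in for real profiling output.

```python
from typing import NamedTuple, Optional, Sequence


class ProfilePoint(NamedTuple):
    batch_size: int
    tpot_ms: float       # time per output token at this batch size
    tokens_per_s: float  # aggregate throughput at this batch size


def best_point_under_slo(
    points: Sequence[ProfilePoint], tpot_slo_ms: float
) -> Optional[ProfilePoint]:
    """Highest-throughput operating point whose TPOT meets the SLO, if any."""
    feasible = [p for p in points if p.tpot_ms <= tpot_slo_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.tokens_per_s)


def tokens_per_dollar_under_slo(points, tpot_slo_ms, price_per_hour) -> float:
    """T/$ at the best SLO-compliant operating point (0.0 if none qualifies)."""
    point = best_point_under_slo(points, tpot_slo_ms)
    if point is None:
        return 0.0
    return point.tokens_per_s * 3600.0 / price_per_hour


# HYPOTHETICAL profiles: (batch size, TPOT in ms, tokens/s).
CHEAP_GPU = [ProfilePoint(1, 50.0, 20.0), ProfilePoint(8, 100.0, 80.0),
             ProfilePoint(32, 160.0, 200.0)]
FAST_GPU = [ProfilePoint(1, 20.0, 50.0), ProfilePoint(8, 40.0, 200.0),
            ProfilePoint(32, 60.0, 533.0)]

if __name__ == "__main__":
    for slo in (60.0, 160.0):
        cheap = tokens_per_dollar_under_slo(CHEAP_GPU, slo, price_per_hour=1.0)
        fast = tokens_per_dollar_under_slo(FAST_GPU, slo, price_per_hour=4.0)
        print(f"SLO {slo} ms: cheap={cheap:.0f} T/$, fast={fast:.0f} T/$")
```

In this toy data the fast GPU is more cost-efficient at a 60 ms SLO and the cheap GPU overtakes it at 160 ms, mirroring the direction of the finding without reproducing its numbers.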

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Cost reduction (short-context dataset) | 9–77% vs best single-GPU | single-GPU-type allocations | | Chatbot Arena | Fig. 11a,d; §6.2 | Fig. 11 |
| Cost reduction (long-context dataset) | 2–33% vs best single-GPU | single-GPU-type allocations | | PubMed | Fig. 11b,e; §6.2 | Fig. 11 |

What To Try In 7 Days

Profile your model on candidate GPU instance types across typical short and long requests (under 1 hour per GPU type, as in the paper).

Implement Mélange's ILP formulation (for example with PuLP) to compute a minimal-cost mix given your request-size histogram, request rate, and TPOT SLO.

Deploy a small mixed-GPU prototype with a simple load balancer and measure TPOT; add modest overprovisioning (e.g., +10%) to absorb bursts.
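The allocation step above can be prototyped without a solver while the problem is small. The sketch below is not Mélange's implementation: it replaces the ILP with an exhaustive search over bucket-to-GPU assignments, assigns whole buckets rather than finer slices, and applies the +10% headroom as a simple multiplier on request rates. The function name, its arguments, and all numbers are hypothetical. For realistic bucket counts, an ILP solver is the better tool.

```python
import itertools
import math


def cheapest_mix(rates, max_rate, price, headroom=0.10):
    """Exhaustively search bucket -> GPU-type assignments for the cheapest mix.

    rates:    {bucket: requests/s to serve}
    max_rate: {bucket: {gpu: requests/s one GPU sustains within the SLO}}
              (a missing or zero entry means that GPU cannot serve the bucket)
    price:    {gpu: $/hour}
    headroom: fractional overprovisioning applied to every bucket's rate
    Returns (hourly cost, {gpu: count}), or (inf, None) if nothing is feasible.
    """
    buckets = list(rates)
    gpus = list(price)
    best_cost, best_counts = math.inf, None
    for choice in itertools.product(gpus, repeat=len(buckets)):
        load = dict.fromkeys(gpus, 0.0)  # fractional GPUs needed per type
        feasible = True
        for bucket, gpu in zip(buckets, choice):
            cap = max_rate[bucket].get(gpu, 0.0)
            if cap <= 0:
                feasible = False
                break
            load[gpu] += rates[bucket] * (1.0 + headroom) / cap
        if not feasible:
            continue
        # Round each type's fractional load up to whole GPUs.
        counts = {g: math.ceil(load[g] - 1e-9) for g in gpus}
        cost = sum(counts[g] * price[g] for g in gpus)
        if cost < best_cost:
            best_cost, best_counts = cost, counts
    return best_cost, best_counts


if __name__ == "__main__":
    # HYPOTHETICAL workload, profile, and prices.
    rates = {"short": 8.0, "long": 1.6}
    max_rate = {"short": {"A10G": 2.0, "A100": 5.0},
                "long": {"A10G": 0.1, "A100": 1.0}}
    price = {"A10G": 1.0, "A100": 4.0}
    print(cheapest_mix(rates, max_rate, price))
```

The search visits every assignment, so its cost grows exponentially with the number of buckets. It is only meant to sanity-check a handful of buckets before moving to a real ILP model.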

Optimization Features

Token Efficiency: Tokens per dollar (T/$)
Infra Optimization: Instance mix and right-sizing
System Optimization: Heterogeneous GPU allocation, Cost-aware resource packing
Inference Optimization: GPU Utilization, Efficient Inference, Latency Optimization

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Assumes a fixed workload distribution and steady request rate; not an autoscaler.

Does not handle GPU unavailability or spot/preemptible instances directly.

When Not To Use

Traffic is highly bursty and you cannot provision buffer capacity or autoscale externally.

You rely on preemptible/spot instances with frequent interruption and no fallback.

Failure Modes

Short bursts or back-to-back large requests temporarily overload capacity and cause SLO violations (observed source of violations).

Load balancer misestimates output length and routes requests to under-provisioned GPUs.

Core Entities

Models

Llama2-7b, Llama2-70b

Metrics

T/$ (tokens per dollar), TPOT (time per output token), TTFT (time to first token), Solver time (s)

Datasets

Chatbot Arena, PubMed, Synthetic (80% Arena + 20% PubMed)