Automatically pick the cheapest mix of GPU types for an LLM service using profiling and an ILP bin-packing solver

Overview

Decision Snapshot: Ready For Pilot

The method is practical: profiling is fast, the ILP solves in about 1 s, and experiments show cost reductions of up to 77% against single-GPU-type deployments. It assumes a fixed workload distribution, however, and does not handle autoscaling or instance unavailability.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 65%

Novelty: 50%

Authors

Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica

Why It Matters For Business

Picking the right mix of GPU types can cut cloud GPU costs up to ~77% for conversational LLMs while keeping latency targets, lowering monthly infrastructure bills without modifying models or inference logic.

Who Should Care

CTO, Engineering Lead, ML Engineer, Product Manager

Summary TLDR

Mélange profiles available GPU types over request sizes and SLOs, then formulates GPU selection as a cost-aware bin-packing integer linear program to find the lowest-cost mix of GPUs that meets a service's latency SLO. Evaluation on Llama2 models and four NVIDIA GPUs (L4, A10G, A100, H100) shows Mélange reduces deployment cost vs single-GPU strategies by up to 77% (short-context), 33% (long-context), and 51% (mixed), while meeting TPOT SLOs for ≥99.5% of requests. Mélange assumes a fixed workload distribution and requires a one-time offline profile per GPU.

Problem Statement

Deploying LLMs is expensive and picking the wrong GPU type wastes money. GPU cost efficiency changes with request size, request rate, and latency SLO. Teams need an automated, simple way to pick a cost-minimal mix of GPU types for a given workload and SLO.

Main Contribution

Analysis showing GPU tokens-per-dollar (T/$) depends strongly on request size, request rate, and latency SLO.

Mélange: a practical framework that profiles GPU performance per request-size bucket and solves a cost-aware bin-packing ILP to pick a minimal-cost heterogeneous GPU allocation.
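One way to write a cost-aware bin-packing program of this kind is sketched below. The notation is an assumption introduced here for illustration and may not match the paper's: each bucket's request rate is split into workload slices i, GPU type j costs c_j per hour, L_ij is the fraction of one type-j GPU that slice i consumes while meeting the SLO (read from the profile), A_ij is 1 when slice i is assigned to GPU type j, and B_j is the number of type-j GPUs provisioned.

```latex
\begin{aligned}
\min_{A,\,B}\quad & \sum_{j} c_j B_j \\
\text{s.t.}\quad & \sum_{j} A_{ij} = 1 && \forall i \\
& \sum_{i} A_{ij} L_{ij} \le B_j && \forall j \\
& A_{ij} \in \{0,1\},\quad B_j \in \mathbb{Z}_{\ge 0}
\end{aligned}
```

The first constraint places every slice on exactly one GPU type; the second requires enough GPUs of each type to carry the load assigned to it.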

Key Findings

GPU cost efficiency (tokens per dollar) varies with request size; no single GPU is best for all sizes.

Numbers: A10G delivers up to 2.6× the T/$ of A100 for small requests; A100 delivers up to 1.5× the T/$ of A10G for large requests.

Practical Use: Mix cheap and expensive GPUs. Route the many short requests to lower-end GPUs and the large requests to higher-end GPUs to lower cost.

Evidence Ref: §4.2, Fig. 3
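The tokens-per-dollar comparison behind this finding is simple arithmetic, sketched below. All throughput and price figures are hypothetical placeholders rather than measurements from the paper, and the names `tokens_per_dollar`, `best_gpu_per_bucket`, `cheap_gpu`, and `big_gpu` are made up for illustration.

```python
def tokens_per_dollar(tokens_per_second: float, price_per_hour: float) -> float:
    """Tokens generated per dollar spent, given sustained throughput and hourly price."""
    if price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return tokens_per_second * 3600.0 / price_per_hour


# Hypothetical profile: sustained tokens/s within the SLO, per request-size bucket.
PROFILE = {
    "small": {"cheap_gpu": 900.0, "big_gpu": 1500.0},
    "large": {"cheap_gpu": 150.0, "big_gpu": 900.0},
}

# Hypothetical on-demand prices in USD per hour.
PRICE = {"cheap_gpu": 1.0, "big_gpu": 4.0}


def best_gpu_per_bucket(profile, price):
    """Return, for each bucket, the GPU type with the highest tokens per dollar."""
    return {
        bucket: max(tputs, key=lambda gpu: tokens_per_dollar(tputs[gpu], price[gpu]))
        for bucket, tputs in profile.items()
    }


if __name__ == "__main__":
    for bucket, gpu in best_gpu_per_bucket(PROFILE, PRICE).items():
        print(bucket, "->", gpu)
```

With these placeholder numbers the cheaper GPU wins the small bucket and the larger GPU wins the large bucket, which is the shape of result the finding describes.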

Latency SLO shifts which GPU is most cost-efficient.

Numbers: Under a tight TPOT SLO (<60 ms), A100 achieves ≈ 2× the T/$ of A10G; loosening the SLO to 80–160 ms lets A10G exceed A100 by >40%.

Practical Use: Choose faster GPUs when you need tight latency; if you can loosen SLOs, prefer cheaper GPUs.

Evidence Ref: §4.3, Fig. 6-7
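To see which side of a TPOT target a deployment falls on, SLO attainment can be computed from request logs. The sketch below assumes one common definition of TPOT (time from first to last output token divided by the number of output tokens after the first); the `RequestLog` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    first_token_s: float  # timestamp of the first output token, in seconds
    last_token_s: float   # timestamp of the last output token, in seconds
    output_tokens: int    # number of output tokens generated


def tpot_ms(r: RequestLog) -> float:
    """Average time per output token after the first, in milliseconds."""
    if r.output_tokens < 2:
        return 0.0  # a single-token response has no inter-token gap
    return (r.last_token_s - r.first_token_s) / (r.output_tokens - 1) * 1000.0


def slo_attainment(logs, slo_ms: float) -> float:
    """Fraction of requests whose TPOT is at or below the SLO."""
    if not logs:
        raise ValueError("no requests logged")
    met = sum(1 for r in logs if tpot_ms(r) <= slo_ms)
    return met / len(logs)
```

Running `slo_attainment` at several candidate SLO values shows how much headroom a given GPU type has before a looser or tighter target changes the picture.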

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Cost reduction (short-context dataset)	9–77% vs best single-GPU	single-GPU-type allocations	—	Chatbot Arena	Fig.11a,d; §6.2	Fig.11
Cost reduction (long-context dataset)	2–33% vs best single-GPU	single-GPU-type allocations	—	PubMed	Fig.11b,e; §6.2	Fig.11

What To Try In 7 Days

Profile your model on candidate GPU instance types across typical short/long requests (<1 hr per GPU as in paper).

Formulate Mélange's ILP yourself (for example with a solver library such as PuLP) to compute a minimal-cost mix given your request-size histogram, request rate, and TPOT SLO.

Deploy a small mixed-GPU prototype with a simple load balancer and measure TPOT; add modest overprovisioning (e.g., +10%) to absorb bursts.
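The allocation step can be prototyped before committing to a solver. The sketch below is not Mélange's implementation: it replaces the ILP solver with exhaustive search, which finds the same optimum for a handful of workload slices but does not scale. Every price, capacity, and rate is a hypothetical placeholder, and `make_slices`, `cheapest_mix`, and the `headroom` parameter are names made up for illustration.

```python
import itertools
import math

# Hypothetical hourly prices in USD per GPU type.
PRICE = {"A10G": 1.0, "A100": 3.5}

# Hypothetical profile: max requests/s one GPU sustains within the TPOT SLO,
# per request-size bucket. 0 means that GPU type cannot serve the bucket.
MAX_RPS = {
    "A10G": {"short": 8.0, "medium": 2.0, "long": 0.0},
    "A100": {"short": 12.0, "medium": 6.0, "long": 2.0},
}

# Hypothetical workload: request rate (requests/s) per bucket.
WORKLOAD = {"short": 10.0, "medium": 3.0, "long": 1.0}


def make_slices(workload, slices_per_bucket, headroom):
    """Split each bucket's rate (scaled by 1 + headroom) into equal slices."""
    return [
        (bucket, rate * (1.0 + headroom) / slices_per_bucket)
        for bucket, rate in workload.items()
        for _ in range(slices_per_bucket)
    ]


def cheapest_mix(workload, max_rps, price, slices_per_bucket=2, headroom=0.0):
    """Try every slice-to-GPU-type assignment; return (hourly cost, GPU counts)."""
    slices = make_slices(workload, slices_per_bucket, headroom)
    gpu_types = list(price)
    best_cost, best_counts = math.inf, None
    for assignment in itertools.product(gpu_types, repeat=len(slices)):
        load = dict.fromkeys(gpu_types, 0.0)  # fractional GPUs used per type
        feasible = True
        for (bucket, rate), gpu in zip(slices, assignment):
            capacity = max_rps[gpu][bucket]
            if capacity <= 0:
                feasible = False
                break
            load[gpu] += rate / capacity
        if not feasible:
            continue
        # Round up to whole GPUs; the epsilon absorbs floating-point noise.
        counts = {g: math.ceil(load[g] - 1e-9) for g in gpu_types}
        cost = sum(counts[g] * price[g] for g in gpu_types)
        if cost < best_cost:
            best_cost, best_counts = cost, counts
    return best_cost, best_counts


if __name__ == "__main__":
    print(cheapest_mix(WORKLOAD, MAX_RPS, PRICE))
    print(cheapest_mix(WORKLOAD, MAX_RPS, PRICE, headroom=0.10))
```

The search space grows as (number of GPU types) to the power of (number of slices), so this is only useful for sanity-checking a profile; a real deployment would hand the same objective and constraints to an ILP solver.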

Optimization Features

Token Efficiency

Tokens per dollar (T/$)

Infra Optimization

Instance mix and right-sizing

System Optimization

Heterogeneous GPU allocation, Cost-aware resource packing

Inference Optimization

GPU Utilization, Efficient Inference, Latency Optimization

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Assumes a fixed workload distribution and steady request rate; not an autoscaler.

Does not handle GPU unavailability or spot/preemptible instances directly.

When Not To Use

Traffic is highly bursty and you cannot provision buffer capacity or autoscale externally.

You rely on preemptible/spot instances with frequent interruption and no fallback.

Failure Modes

Short bursts or back-to-back large requests temporarily overload capacity and cause SLO violations (observed source of violations).

Load balancer misestimates output length and routes requests to under-provisioned GPUs.

Core Entities

Models

Llama2-7b, Llama2-70b

Metrics

T/$ (tokens per dollar), TPOT (time per output token), TTFT (time to first token), Solver time (s)

Datasets

Chatbot Arena, PubMed, Synthetic (80% Arena + 20% PubMed)
