Overview
The method is practical: profiling is fast, the ILP solves in about 1 s, and experiments show large cost savings. However, it assumes a fixed workload distribution and does not handle autoscaling or instance unavailability.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 65%
Novelty: 50%
Why It Matters For Business
Picking the right mix of GPU types can cut cloud GPU costs for conversational LLMs by up to 77% while still meeting latency targets. This lowers monthly infrastructure bills without any change to models or inference logic.
Summary TLDR
Mélange profiles available GPU types over request sizes and SLOs, then formulates GPU selection as a cost-aware bin-packing integer linear program to find the lowest-cost mix of GPUs that meets a service's latency SLO. Evaluation on Llama2 models and four NVIDIA GPUs (L4, A10G, A100, H100) shows Mélange reduces deployment cost vs single-GPU strategies by up to 77% (short-context), 33% (long-context), and 51% (mixed), while meeting TPOT SLOs for ≥99.5% of requests. Mélange assumes a fixed workload distribution and requires a one-time offline profile per GPU.
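The SLO attainment figure above can be checked on your own traces with a few lines of code. The sketch below assumes the common definition of TPOT (decode time averaged over the output tokens after the first); the function names, the tuple layout, and the sample numbers are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Optional, Tuple

# Each request is (total_latency_s, time_to_first_token_s, output_tokens).
Request = Tuple[float, float, int]


def tpot_seconds(total_latency_s: float, ttft_s: float, output_tokens: int) -> Optional[float]:
    """Time per output token, averaged over tokens after the first.

    Returns None when fewer than two tokens were generated, because
    TPOT is undefined in that case under this definition.
    """
    if output_tokens < 2:
        return None
    return (total_latency_s - ttft_s) / (output_tokens - 1)


def slo_attainment(requests: Iterable[Request], tpot_slo_s: float) -> float:
    """Fraction of requests (with a defined TPOT) that meet the TPOT SLO."""
    tpots = [tpot_seconds(*r) for r in requests]
    defined = [t for t in tpots if t is not None]
    if not defined:
        raise ValueError("no request has a defined TPOT")
    return sum(1 for t in defined if t <= tpot_slo_s) / len(defined)


# Illustrative placeholder measurements, not data from the paper.
SAMPLE_REQUESTS = [
    (2.5, 0.5, 41),  # TPOT = 2.0 / 40 = 0.05 s
    (6.5, 0.5, 51),  # TPOT = 6.0 / 50 = 0.12 s
    (1.5, 0.5, 11),  # TPOT = 1.0 / 10 = 0.10 s
    (3.5, 0.5, 31),  # TPOT = 3.0 / 30 = 0.10 s
]

attainment = slo_attainment(SAMPLE_REQUESTS, tpot_slo_s=0.11)
```

With these placeholder values, three of the four requests fall under the 0.11 s threshold. A real check would use logged latencies from the serving system and compare the result against the 99.5% target.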
Problem Statement
Deploying LLMs is expensive and picking the wrong GPU type wastes money. GPU cost efficiency changes with request size, request rate, and latency SLO. Teams need an automated, simple way to pick a cost-minimal mix of GPU types for a given workload and SLO.
Main Contribution
Analysis showing GPU tokens-per-dollar (T/$) depends strongly on request size, request rate, and latency SLO.
Mélange: a practical framework that profiles GPU performance per request-size bucket and solves a cost-aware bin-packing ILP to pick a minimal-cost heterogeneous GPU allocation.
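A cost-aware bin-packing ILP of the kind described above can be written as follows. This is a sketch under stated assumptions rather than a verbatim copy of the paper's formulation. The symbols are chosen here for illustration: $N$ workload slices (request-size buckets carrying a share of the request rate), $M$ GPU types, $c_j$ the hourly price of GPU type $j$, $L_{ij}$ the fraction of one type-$j$ GPU needed to serve slice $i$ within the SLO (obtained from profiling), $A_{ij}$ a binary assignment of slice $i$ to type $j$, and $B_j$ the number of type-$j$ GPUs to rent.

```latex
\begin{aligned}
\min_{A,\,B}\quad & \sum_{j=1}^{M} c_j B_j \\
\text{s.t.}\quad & \sum_{j=1}^{M} A_{ij} = 1 && \forall i \in \{1,\dots,N\} \\
& \sum_{i=1}^{N} A_{ij}\, L_{ij} \le B_j && \forall j \in \{1,\dots,M\} \\
& A_{ij} \in \{0,1\}, \quad B_j \in \mathbb{Z}_{\ge 0}
\end{aligned}
```

The first constraint assigns every slice to exactly one GPU type, and the second ensures that each type has enough instances to carry the load assigned to it.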
Key Findings
GPU cost efficiency (tokens per dollar) varies with request size; no single GPU is best for all sizes.
Latency SLO shifts which GPU is most cost-efficient.
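To make the tokens-per-dollar comparison concrete, here is a minimal sketch that computes T/$ per request-size bucket and picks the most cost-efficient GPU for each. The GPU names, throughputs, and prices are hypothetical placeholders chosen to illustrate the effect; they are not measurements from the paper.

```python
def tokens_per_dollar(tokens_per_second: float, price_per_hour: float) -> float:
    """Tokens generated per dollar of rental at a sustained throughput."""
    if price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return tokens_per_second * 3600.0 / price_per_hour


# Hypothetical maximum throughput (tokens/s) that still meets the latency SLO,
# per request-size bucket. Placeholder values, not profiling results.
THROUGHPUT = {
    "small_gpu": {"short": 900.0, "long": 150.0},
    "large_gpu": {"short": 2400.0, "long": 1600.0},
}
PRICE_PER_HOUR = {"small_gpu": 1.0, "large_gpu": 4.0}  # hypothetical USD/hour


def best_gpu_per_bucket(throughput: dict, price: dict) -> dict:
    """Return the GPU with the highest T/$ for every request-size bucket."""
    buckets = next(iter(throughput.values())).keys()
    return {
        bucket: max(
            throughput,
            key=lambda gpu: tokens_per_dollar(throughput[gpu][bucket], price[gpu]),
        )
        for bucket in buckets
    }


best = best_gpu_per_bucket(THROUGHPUT, PRICE_PER_HOUR)
```

In this toy setup the cheaper GPU wins for short requests and the larger GPU wins for long ones. Tightening the SLO would lower the usable throughput values, which can change the winner for a bucket.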
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Cost reduction (short-context dataset) | 9–77% vs best single-GPU | single-GPU-type allocations | — | Chatbot Arena | Fig.11a,d; §6.2 | Fig.11 |
| Cost reduction (long-context dataset) | 2–33% vs best single-GPU | single-GPU-type allocations | — | PubMed | Fig.11b,e; §6.2 | Fig.11 |
What To Try In 7 Days
Profile your model on candidate GPU instance types across typical short and long requests (the paper reports under 1 hour of profiling per GPU).
Use Mélange's ILP formulation (for example, modeled with PuLP) to compute a minimal-cost mix given your request-size histogram, request rate, and TPOT SLO.
Deploy a small mixed-GPU prototype with a simple load balancer and measure TPOT; add modest overprovisioning (e.g., +10%) to absorb bursts.
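The allocation and overprovisioning steps can be prototyped without a solver when there are only a few buckets and GPU types. The sketch below swaps the ILP solver for plain exhaustive enumeration, which explores the same search space (each bucket assigned to one GPU type) but only scales to toy sizes. All names and numbers are hypothetical; `capacity[gpu][bucket]` stands for the profiled request rate one GPU can sustain within the SLO.

```python
import itertools
import math


def cheapest_mix(rates, capacity, price, headroom=0.0):
    """Exhaustively search bucket-to-GPU-type assignments for the cheapest mix.

    rates:    {bucket: requests per second to serve}
    capacity: {gpu: {bucket: requests per second one GPU sustains within the SLO}}
    price:    {gpu: hourly price}
    headroom: fractional overprovisioning applied to each GPU type's load
    Returns (hourly_cost, {gpu: count}, {bucket: gpu}).
    """
    buckets, gpus = list(rates), list(price)
    best = (math.inf, None, None)
    for choice in itertools.product(gpus, repeat=len(buckets)):
        load = {g: 0.0 for g in gpus}  # load in units of "whole GPUs"
        feasible = True
        for bucket, gpu in zip(buckets, choice):
            cap = capacity[gpu].get(bucket, 0.0)
            if cap <= 0:  # this GPU type cannot serve the bucket within the SLO
                feasible = False
                break
            load[gpu] += rates[bucket] / cap
        if not feasible:
            continue
        # Round up to whole GPUs; the small epsilon guards against float noise.
        counts = {g: math.ceil(load[g] * (1 + headroom) - 1e-9) for g in gpus}
        cost = sum(counts[g] * price[g] for g in gpus)
        if cost < best[0]:
            best = (cost, counts, dict(zip(buckets, choice)))
    return best


# Hypothetical inputs, not values from the paper.
RATES = {"short": 8.0, "long": 2.0}
CAPACITY = {
    "small_gpu": {"short": 4.0, "long": 0.25},
    "large_gpu": {"short": 10.0, "long": 2.0},
}
PRICE = {"small_gpu": 1.0, "large_gpu": 4.0}

cost, counts, assignment = cheapest_mix(RATES, CAPACITY, PRICE)
cost_buf, counts_buf, assignment_buf = cheapest_mix(RATES, CAPACITY, PRICE, headroom=0.10)
```

With these placeholder inputs, the mix without headroom uses two small GPUs and one large GPU, while adding 10% headroom makes two large GPUs the cheapest option. The buffer can therefore change the mix itself, not just the instance counts, so it is worth re-running the search whenever the headroom changes.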
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Assumes a fixed workload distribution and steady request rate; not an autoscaler.
Does not handle GPU unavailability or spot/preemptible instances directly.
When Not To Use
Traffic is highly bursty and you cannot provision buffer capacity or autoscale externally.
You rely on preemptible/spot instances with frequent interruption and no fallback.
Failure Modes
Short bursts or back-to-back large requests temporarily overload capacity and cause SLO violations (observed source of violations).
Load balancer misestimates output length and routes requests to under-provisioned GPUs.

