Overview
Production Readiness
0.65
Novelty Score
0.5
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
Picking the right mix of GPU types can cut cloud GPU costs by up to 77% for conversational LLM workloads while still meeting latency targets. This lowers monthly infrastructure bills without modifying models or inference logic.
Summary TLDR
Mélange profiles available GPU types over request sizes and SLOs, then formulates GPU selection as a cost-aware bin-packing integer linear program to find the lowest-cost mix of GPUs that meets a service's latency SLO. Evaluation on Llama2 models and four NVIDIA GPUs (L4, A10G, A100, H100) shows Mélange reduces deployment cost vs single-GPU strategies by up to 77% (short-context), 33% (long-context), and 51% (mixed), while meeting TPOT SLOs for ≥99.5% of requests. Mélange assumes a fixed workload distribution and requires a one-time offline profile per GPU.
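One way to write the cost-aware bin-packing ILP described above is sketched below. The notation is an assumption made for illustration: the workload is split into N slices, A_{ij} is 1 when slice i is assigned to GPU type j, B_j is the number of type-j GPUs provisioned, c_j is the hourly price of one type-j GPU, and L_{ij} is the fraction of one type-j GPU that slice i consumes while meeting the SLO (taken from the offline profile).

```latex
\begin{aligned}
\min_{A,\,B}\quad & \sum_{j=1}^{M} c_j B_j \\
\text{s.t.}\quad & \sum_{j=1}^{M} A_{ij} = 1 \quad \forall i \in \{1,\dots,N\}, \\
& \sum_{i=1}^{N} A_{ij} L_{ij} \le B_j \quad \forall j \in \{1,\dots,M\}, \\
& A_{ij} \in \{0,1\}, \quad B_j \in \mathbb{Z}_{\ge 0}.
\end{aligned}
```

The first constraint places every slice on exactly one GPU type; the second ensures the provisioned count of each type covers the load assigned to it.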
Problem Statement
Deploying LLMs is expensive, and picking the wrong GPU type wastes money. GPU cost efficiency changes with request size, request rate, and latency SLO. Teams need a simple, automated way to pick a cost-minimal mix of GPU types for a given workload and SLO.
Main Contribution
Analysis showing GPU tokens-per-dollar (T/$) depends strongly on request size, request rate, and latency SLO.
Mélange: a practical framework that profiles GPU performance per request-size bucket and solves a cost-aware bin-packing ILP to pick a minimal-cost heterogeneous GPU allocation.
Empirical evaluation across Llama2 models and four NVIDIA GPUs showing large cost reductions (up to 77%) and high SLO adherence (TPOT SLO met for ≥99.5% of requests).
Key Findings
GPU cost efficiency (tokens per dollar) varies with request size; no single GPU is best for all sizes.
Latency SLO shifts which GPU is most cost-efficient.
Mixing GPU types reduces cost by enabling finer-grained scaling at low request rates.
Mélange produces large end-to-end cost savings while meeting SLOs in experiments.
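The first finding can be made concrete with a small tokens-per-dollar calculation. This is a minimal sketch: the GPU names, prices, and throughput figures are hypothetical placeholders, not measurements from the paper, and `tokens_per_dollar` and `best_gpu` are illustrative helper names.

```python
def tokens_per_dollar(tokens_per_second: float, price_per_hour: float) -> float:
    """Tokens generated per dollar of GPU rental at a sustained throughput."""
    if price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return tokens_per_second * 3600.0 / price_per_hour


# Hypothetical profile: hourly price and the throughput (tokens/s) each GPU
# sustains while still meeting the latency SLO, per request-size bucket.
PROFILE = {
    "gpu_small": {"price": 1.0, "short": 500.0, "long": 100.0},
    "gpu_large": {"price": 4.0, "short": 1500.0, "long": 800.0},
}


def best_gpu(bucket: str) -> str:
    """Return the GPU type with the highest T/$ for one request-size bucket."""
    return max(
        PROFILE,
        key=lambda gpu: tokens_per_dollar(PROFILE[gpu][bucket], PROFILE[gpu]["price"]),
    )


if __name__ == "__main__":
    for bucket in ("short", "long"):
        print(bucket, "->", best_gpu(bucket))
```

With these placeholder numbers the cheaper GPU wins on short requests and the larger GPU wins on long ones, which is the pattern the finding describes. Tightening the SLO would lower the usable throughput figures and can flip the ranking.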
Results
Cost reduction (short-context dataset): up to 77%
Cost reduction (long-context dataset): up to 33%
Cost reduction (mixed dataset): up to 51%
SLO adherence (TPOT): ≥99.5% of requests
Solver runtime
Who Should Care
What To Try In 7 Days
Profile your model on candidate GPU instance types across typical short and long requests (the paper reports under 1 hour per GPU).
Formulate Mélange's ILP with an off-the-shelf solver library (for example, PuLP) to compute a minimal-cost mix from your request-size histogram, request rate, and TPOT SLO.
Deploy a small mixed-GPU prototype with a simple load balancer and measure TPOT; add modest overprovisioning (e.g., +10%) to absorb bursts.
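The second and third steps above can be sketched without any solver dependency. The code below is not Mélange's implementation: it swaps the ILP solver for exhaustive enumeration, which only works for a handful of slices and GPU types. All names and numbers are hypothetical, and it assumes load adds linearly (a slice's request rate divided by the GPU's profiled maximum rate for that bucket gives the fraction of a GPU it uses).

```python
import itertools
import math


def cheapest_mix(slices, gpus, headroom=0.0):
    """Brute-force the lowest-cost GPU counts for a tiny workload.

    slices:   list of (bucket, requests_per_second) pairs
    gpus:     {name: {"price": $/hour, "max_rate": {bucket: req/s under SLO}}}
    headroom: fractional overprovisioning, e.g. 0.1 for +10%
    """
    names = list(gpus)
    best_cost, best_counts = math.inf, None
    # Try every assignment of slices to GPU types.
    for assignment in itertools.product(names, repeat=len(slices)):
        load = {name: 0.0 for name in names}
        feasible = True
        for (bucket, rate), name in zip(slices, assignment):
            max_rate = gpus[name]["max_rate"][bucket]
            if max_rate <= 0:  # this GPU cannot serve the bucket within the SLO
                feasible = False
                break
            load[name] += rate * (1.0 + headroom) / max_rate
        if not feasible:
            continue
        # Round each type's load up to whole GPUs (small epsilon for float noise).
        counts = {name: math.ceil(load[name] - 1e-9) for name in names}
        cost = sum(counts[name] * gpus[name]["price"] for name in names)
        if cost < best_cost:
            best_cost, best_counts = cost, counts
    return best_cost, best_counts


# Hypothetical profile and workload: 6 req/s short and 2 req/s long,
# each bucket split into two equal slices.
GPUS = {
    "gpu_small": {"price": 1.0, "max_rate": {"short": 4.0, "long": 0.5}},
    "gpu_large": {"price": 4.0, "max_rate": {"short": 10.0, "long": 4.0}},
}
SLICES = [("short", 3.0), ("short", 3.0), ("long", 1.0), ("long", 1.0)]

if __name__ == "__main__":
    print(cheapest_mix(SLICES, GPUS))
    print(cheapest_mix(SLICES, GPUS, headroom=0.1))
```

In this toy example the best single-type deployment costs 6 per hour (six small GPUs), while the mixed allocation of one small and one large GPU costs 5. For realistic slice counts the enumeration grows exponentially, so a real ILP solver is the practical choice.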
Optimization Features
Token Efficiency
- Tokens per dollar (T/$)
Infra Optimization
- Instance mix and right-sizing
System Optimization
- Heterogeneous GPU allocation
- Cost-aware resource packing
Inference Optimization
- GPU Utilization
- Efficient Inference
- Latency Optimization
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Assumes a fixed workload distribution and steady request rate; not an autoscaler.
- Does not handle GPU unavailability or spot/preemptible instances directly.
- Evaluation limited to four NVIDIA GPU types and vLLM; results may differ with other engines or compression techniques.
- One-time profiling must be recomputed if model, container, or cloud pricing changes.
When Not To Use
- Traffic is highly bursty and you cannot provision buffer capacity or autoscale externally.
- You rely on preemptible/spot instances with frequent interruption and no fallback.
- You need per-request, real-time decisions that cannot tolerate the latency of re-running the ILP (though the solver is fast).
Failure Modes
- Short bursts or back-to-back large requests temporarily overload capacity and cause SLO violations (observed source of violations).
- Load balancer misestimates output length and routes requests to under-provisioned GPUs.
- Cloud price changes or new GPU types invalidate offline profiling and allocation choices.
Core Entities
Models
- Llama2-7b
- Llama2-70b
Metrics
- T/$ (tokens per dollar)
- TPOT (time per output token)
- TTFT (time to first token)
- Solver time (s)
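The latency metrics above can be computed from per-request timestamps. The sketch below assumes a common convention in which TPOT excludes the first token; this summary does not state the paper's exact definition, so treat the formulas, the `RequestTrace` fields, and the function names as assumptions.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    arrival: float       # time the request was received (seconds)
    first_token: float   # time the first output token was emitted
    last_token: float    # time the final output token was emitted
    output_tokens: int   # number of output tokens generated


def ttft(trace: RequestTrace) -> float:
    """Time to first token: queueing plus prefill latency."""
    return trace.first_token - trace.arrival


def tpot(trace: RequestTrace) -> float:
    """Average time per output token after the first one."""
    if trace.output_tokens <= 1:
        return 0.0  # no decode steps after the first token
    return (trace.last_token - trace.first_token) / (trace.output_tokens - 1)


def slo_attainment(traces, tpot_slo: float) -> float:
    """Fraction of requests whose TPOT is within the SLO."""
    if not traces:
        raise ValueError("need at least one trace")
    met = sum(1 for trace in traces if tpot(trace) <= tpot_slo)
    return met / len(traces)


if __name__ == "__main__":
    traces = [
        RequestTrace(arrival=0.0, first_token=0.5, last_token=2.5, output_tokens=21),
        RequestTrace(arrival=0.0, first_token=0.5, last_token=10.5, output_tokens=51),
    ]
    print(slo_attainment(traces, tpot_slo=0.12))
```

Comparing `slo_attainment` against a target such as 0.995 gives a direct check on whether a prototype deployment matches the adherence level reported in the paper.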
Datasets
- Chatbot Arena
- PubMed
- Synthetic (80% Arena + 20% PubMed)

