Automatically pick the cheapest mix of GPU types for an LLM service using profiling and an ILP bin-packing solver

April 22, 2024 · 7 min

Overview

Production Readiness

0.65

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

1

Authors

Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica

Why It Matters For Business

Picking the right mix of GPU types can cut cloud GPU costs by up to 77% for conversational LLMs while still meeting latency targets, lowering monthly infrastructure bills without modifying models or inference logic.

Summary TLDR

Mélange profiles available GPU types over request sizes and SLOs, then formulates GPU selection as a cost-aware bin-packing integer linear program to find the lowest-cost mix of GPUs that meets a service's latency SLO. Evaluation on Llama2 models and four NVIDIA GPUs (L4, A10G, A100, H100) shows Mélange reduces deployment cost vs single-GPU strategies by up to 77% (short-context), 33% (long-context), and 51% (mixed), while meeting TPOT SLOs for ≥99.5% of requests. Mélange assumes a fixed workload distribution and requires a one-time offline profile per GPU.

Problem Statement

Deploying LLMs is expensive, and picking the wrong GPU type wastes money. GPU cost efficiency changes with request size, request rate, and latency SLO. Teams need a simple, automated way to pick a cost-minimal mix of GPU types for a given workload and SLO.

Main Contribution

Analysis showing GPU tokens-per-dollar (T/$) depends strongly on request size, request rate, and latency SLO.

Mélange: a practical framework that profiles GPU performance per request-size bucket and solves a cost-aware bin-packing ILP to pick a minimal-cost heterogeneous GPU allocation.
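One way to write such a cost-aware bin-packing ILP is sketched below. The symbols are assumptions introduced here for illustration, not notation confirmed by this summary: $c_j$ is the hourly price of GPU type $j$, $B_j$ is the number of instances of that type, $A_{ij}$ indicates whether workload slice $i$ (a request-size bucket carrying some share of the request rate) is assigned to type $j$, and $L_{ij}$ is the fraction of one type-$j$ GPU that slice $i$ consumes under the SLO, as estimated from the profile.

```latex
\begin{aligned}
\min_{A,\,B}\quad & \sum_{j} c_j B_j \\
\text{s.t.}\quad & \sum_{j} A_{ij} = 1 && \forall i \\
& \sum_{i} A_{ij} L_{ij} \le B_j && \forall j \\
& A_{ij} \in \{0,1\}, \quad B_j \in \mathbb{Z}_{\ge 0}
\end{aligned}
```

The first constraint places every slice on exactly one GPU type; the second ensures each type has enough instances to carry the load assigned to it.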

Empirical evaluation across Llama2 models and four NVIDIA GPUs showing large cost reductions (up to 77%) and high SLO adherence (≥99.5%).

Key Findings

GPU cost efficiency (tokens per dollar) varies with request size; no single GPU is best for all sizes.

Numbers: A10G up to 2.6× T/$ over A100 for small requests; A100 up to 1.5× for large requests
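To make the T/$ comparison concrete, here is a minimal sketch of how tokens per dollar can be computed from a throughput profile. The GPU names, prices, and throughputs are hypothetical placeholders chosen for illustration, not measurements from the paper.

```python
def tokens_per_dollar(tokens_per_second, price_per_hour):
    """Tokens generated per dollar at a sustained throughput and hourly price."""
    if price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return tokens_per_second * 3600.0 / price_per_hour


# Hypothetical profile: hourly price plus SLO-compliant throughput (tokens/s)
# for each request-size bucket. All numbers are made up.
PROFILE = {
    "small_gpu": {"price": 1.0, "short": 900.0, "long": 150.0},
    "large_gpu": {"price": 4.0, "short": 2000.0, "long": 1000.0},
}


def most_cost_efficient(bucket, profile=PROFILE):
    """Return the GPU type with the highest T/$ for one request-size bucket."""
    return max(
        profile,
        key=lambda gpu: tokens_per_dollar(profile[gpu][bucket], profile[gpu]["price"]),
    )


for bucket in ("short", "long"):
    print(bucket, "->", most_cost_efficient(bucket))
```

With these placeholder numbers the cheaper GPU wins on short requests and the larger GPU wins on long ones, which is the pattern the finding describes.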

Latency SLO shifts which GPU is most cost-efficient.

Numbers: Under a tight TPOT SLO (<60ms), A100 delivers ≈2× the T/$ of A10G; loosening the SLO to 80–160ms lets A10G exceed A100 by >40%

Mixing GPU types reduces cost by enabling finer-grained scaling at low request rates.

Numbers: In the §4.4 example, 2 A100 + 1 A10G cost 24% less than an A100-only allocation

Mélange produces large end-to-end cost savings while meeting SLOs in experiments.

Numbers: Up to 77% cost reduction (short-context); SLO adherence ≥99.5% (40ms) and ≥99.95% (120ms)

Results

Cost reduction (short-context dataset)

Value: 9–77% vs best single-GPU

Baseline: single-GPU-type allocations

Cost reduction (long-context dataset)

Value: 2–33% vs best single-GPU

Baseline: single-GPU-type allocations

Cost reduction (mixed dataset)

Value: 4–51% vs best single-GPU

Baseline: single-GPU-type allocations

SLO adherence (TPOT)

Value: ≥99.95% at 120ms; ≥99.5% at 40ms

Solver runtime

Value: ≤1.2 s across experiments

Who Should Care

What To Try In 7 Days

Profile your model on candidate GPU instance types across typical short/long requests (<1 hr per GPU as in paper).

Use Mélange's ILP formulation (for example, implemented with a solver library such as PuLP) to compute a minimal-cost mix given your request-size histogram, request rate, and TPOT SLO.

Deploy a small mixed-GPU prototype with a simple load balancer and measure TPOT; add modest overprovisioning (e.g., +10%) to absorb bursts.
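The allocation and overprovisioning steps above can be prototyped before wiring up a real solver. The sketch below is not Mélange's ILP: it swaps in a brute-force search that assigns each request-size bucket wholly to one GPU type, which is only practical for a handful of buckets and types. The function name, the `headroom` parameter, and all rates, capacities, and prices are hypothetical.

```python
import itertools
import math


def cheapest_mix(rates, capacity, price, headroom=0.10):
    """Brute-force the cheapest assignment of buckets to GPU types.

    rates:    {bucket: requests/s to serve}
    capacity: {gpu: {bucket: max requests/s one GPU sustains under the SLO}}
    price:    {gpu: hourly price}
    headroom: fractional overprovisioning to absorb bursts (0.10 = +10%)
    """
    gpus = list(price)
    buckets = list(rates)
    best = None
    for choice in itertools.product(gpus, repeat=len(buckets)):
        # Fraction of a GPU of each type consumed by the buckets assigned to it.
        load = {g: 0.0 for g in gpus}
        for bucket, gpu in zip(buckets, choice):
            load[gpu] += rates[bucket] / capacity[gpu][bucket]
        # Round up to whole GPUs after adding headroom (epsilon guards float noise).
        counts = {g: math.ceil(load[g] * (1 + headroom) - 1e-9) for g in gpus}
        cost = sum(counts[g] * price[g] for g in gpus)
        if best is None or cost < best[0]:
            best = (cost, counts, dict(zip(buckets, choice)))
    return best


# Hypothetical workload and profile; every number here is made up.
rates = {"short": 8.0, "long": 1.0}
capacity = {
    "small": {"short": 5.0, "long": 0.2},
    "large": {"short": 12.0, "long": 2.0},
}
price = {"small": 1.0, "large": 4.0}

cost, counts, assignment = cheapest_mix(rates, capacity, price)
print(cost, counts, assignment)
```

With these placeholder inputs the mixed allocation is cheaper than either single-type allocation. A real deployment would replace the enumeration with an ILP solver and split buckets into finer slices so one bucket can be spread across GPU types.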

Optimization Features

Token Efficiency

  • Tokens per dollar (T/$)

Infra Optimization

  • Instance mix and right-sizing

System Optimization

  • Heterogeneous GPU allocation
  • Cost-aware resource packing

Inference Optimization

  • GPU Utilization
  • Efficient Inference
  • Latency Optimization

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Assumes a fixed workload distribution and steady request rate; not an autoscaler.
  • Does not handle GPU unavailability or spot/preemptible instances directly.
  • Evaluation limited to four NVIDIA GPU types and vLLM; results may differ with other engines or compression techniques.
  • The offline profile is one-time only for a fixed setup: it must be redone if the model or container changes, and the allocation re-solved if cloud pricing changes.

When Not To Use

  • Traffic is highly bursty and you cannot provision buffer capacity or autoscale externally.
  • You rely on preemptible/spot instances with frequent interruption and no fallback.
  • You need per-request, real-time decisions that cannot tolerate the latency of re-running the ILP (though the solver is fast).

Failure Modes

  • Short bursts or back-to-back large requests temporarily overload capacity and cause SLO violations (observed source of violations).
  • Load balancer misestimates output length and routes requests to under-provisioned GPUs.
  • Cloud price changes or new GPU types invalidate offline profiling and allocation choices.

Core Entities

Models

  • Llama2-7b
  • Llama2-70b

Metrics

  • T/$ (tokens per dollar)
  • TPOT (time per output token)
  • TTFT (time to first token)
  • Solver time (s)
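As a rough illustration of the latency metrics, the sketch below derives TTFT and TPOT from per-token timestamps and then computes SLO adherence. It assumes one common convention (TPOT as the mean gap between output tokens after the first), which this summary does not spell out; the function names and sample values are hypothetical.

```python
def ttft_and_tpot(request_start, token_times):
    """Return (TTFT, TPOT) in seconds from a request start and token timestamps.

    Assumes TPOT is the mean inter-token gap after the first token.
    """
    if not token_times:
        raise ValueError("token_times must contain at least one timestamp")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def slo_adherence(tpots, slo_seconds):
    """Fraction of requests whose TPOT is at or below the SLO."""
    if not tpots:
        raise ValueError("tpots must not be empty")
    return sum(t <= slo_seconds for t in tpots) / len(tpots)


# Hypothetical request: first token after 0.5 s, then one token every 40 ms.
ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.54, 0.58, 0.62])
print(f"TTFT={ttft:.3f}s TPOT={tpot * 1000:.1f}ms")

# Hypothetical per-request TPOTs checked against a 40 ms SLO.
print(slo_adherence([0.03, 0.04, 0.05, 0.02], slo_seconds=0.04))
```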

Datasets

  • Chatbot Arena
  • PubMed
  • Synthetic (80% Arena + 20% PubMed)