Overview
Validated on A100 and V100 hardware and in large-scale simulations. The scope is limited to single-model, homogeneous-GPU clusters, and the design assumes no request migration.
Citations: 1
Evidence Strength: 0.75
Confidence: 0.90
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/6
Reproducibility
Status: Partial assets available
Open source: No
At A Glance
Cost impact: 80%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Aladdin can cut GPU spending by tens of percent while keeping token-level SLOs, turning large inference clusters from a fixed-cost bottleneck into a more efficient, demand-driven service.
Summary TLDR
Aladdin is a cluster-level scheduler that predicts the minimum number of GPUs needed, picks per-worker GPU configurations, and places incoming LLM requests onto workers using a KV-cache-aware, best-fit bin-packing heuristic. It models prefill (prompt processing) and decode (token generation) latencies separately, handles output-length prediction errors with a rebalancing step, and supports both continuous batching (vLLM-style) and split-phase setups. On the authors' testbeds and simulations, Aladdin lowers GPU requirements by up to about 71% for the same SLOs and keeps scheduling latency under roughly 50 ms at 25 requests per second.
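To make the placement idea concrete, here is a minimal sketch of best-fit bin packing over KV-cache capacity. It is not Aladdin's actual algorithm: the `Worker` class, the token-based capacity unit, and the rule "peak demand = input length + predicted output length" are assumptions made for illustration, and the sketch ignores the latency SLO constraints that the real scheduler also checks.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker with a KV-cache budget measured in tokens."""
    kv_capacity_tokens: int
    reserved_tokens: int = 0
    request_ids: list = field(default_factory=list)

    def free_tokens(self) -> int:
        return self.kv_capacity_tokens - self.reserved_tokens


def place_best_fit(workers, request_id, input_len, predicted_output_len):
    """Place a request on the feasible worker that leaves the least free KV cache.

    Returns the index of the chosen worker, or None if no worker can hold
    the request (a caller might then add a worker or queue the request).
    """
    demand = input_len + predicted_output_len  # assumed peak KV tokens
    best_idx, best_leftover = None, None
    for idx, worker in enumerate(workers):
        leftover = worker.free_tokens() - demand
        if leftover < 0:
            continue  # placing here would overflow the KV cache
        if best_leftover is None or leftover < best_leftover:
            best_idx, best_leftover = idx, leftover
    if best_idx is not None:
        workers[best_idx].reserved_tokens += demand
        workers[best_idx].request_ids.append(request_id)
    return best_idx


workers = [
    Worker(kv_capacity_tokens=8000, reserved_tokens=5000),
    Worker(kv_capacity_tokens=8000, reserved_tokens=1000),
]
chosen = place_best_fit(workers, "req-1", input_len=1200, predicted_output_len=800)
print(chosen)  # -> 0 (the fuller worker still fits the request)
```

Best fit deliberately fills the fullest feasible worker first, which tends to leave other workers empty enough to absorb large requests or to be released.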
Problem Statement
Current LLM serving methods tune single workers or use simple queueing rules. They miss cluster-level joint decisions: how many GPUs to run, how to pack GPUs into workers, and how to place requests so per-request SLOs (first token and average per-token time) hold without wasting GPUs or overflowing KV cache (the memory holding past tokens).
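The two per-request SLO metrics can be computed from token emission timestamps. The sketch below assumes that "first token" time is measured from request arrival and that the average per-token time covers only tokens after the first; the function names and the exact definitions are illustrative, not taken from the paper.

```python
def token_slo_metrics(arrival_s, token_times_s):
    """Return (time to first token, average time per subsequent token) in seconds."""
    if not token_times_s:
        raise ValueError("request produced no tokens")
    ttft = token_times_s[0] - arrival_s
    if len(token_times_s) > 1:
        decode_span = token_times_s[-1] - token_times_s[0]
        avg_per_token = decode_span / (len(token_times_s) - 1)
    else:
        avg_per_token = 0.0  # a single token has no decode interval
    return ttft, avg_per_token


def meets_slo(arrival_s, token_times_s, ttft_slo_s, per_token_slo_s):
    """True when both the first-token and the per-token targets hold."""
    ttft, avg_per_token = token_slo_metrics(arrival_s, token_times_s)
    return ttft <= ttft_slo_s and avg_per_token <= per_token_slo_s


ttft, avg_per_token = token_slo_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(ttft, avg_per_token)  # -> 0.5 0.25
```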
Main Contribution
A simple, empirically validated performance model that separates prefill and decode iteration latency and KV-cache growth.
Aladdin: an online scheduler that jointly predicts minimal GPUs, chooses per-worker GPU counts, and places requests via a KV-cache-aware best-fit bin-packing heuristic.
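As an illustration of what a model with separate prefill, decode, and KV-cache terms might look like, the sketch below assumes linear latency in token counts. The linear form, the `LatencyModel` class, and the coefficient values are assumptions for illustration; the summary does not state the paper's fitted equations. The KV-cache size per token uses the standard transformer accounting (one key and one value vector per layer per KV head).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyModel:
    """Hypothetical linear latency model; coefficients come from profiling."""
    prefill_base_ms: float
    prefill_ms_per_token: float
    decode_base_ms: float
    decode_ms_per_ctx_token: float

    def prefill_ms(self, batch_input_tokens: int) -> float:
        # Prefill cost grows with the total number of prompt tokens in the batch.
        return self.prefill_base_ms + self.prefill_ms_per_token * batch_input_tokens

    def decode_iter_ms(self, total_context_tokens: int) -> float:
        # One decode iteration reads the KV cache of every request in the batch.
        return self.decode_base_ms + self.decode_ms_per_ctx_token * total_context_tokens


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache added per token: key + value, per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_bytes(input_len, generated_tokens, per_token_bytes):
    """KV-cache footprint of one request after `generated_tokens` decode steps."""
    return (input_len + generated_tokens) * per_token_bytes


model = LatencyModel(10.0, 0.5, 20.0, 0.01)  # placeholder coefficients
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)  # -> 524288 bytes (0.5 MiB) per token at 2 bytes per value
```

The footprint grows by `per_token` bytes at every decode step, which is why a placement decision has to account for predicted output length and not only for the prompt.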
Key Findings
Aladdin cuts required GPUs by up to 71% vs. default vLLM for the same SLOs in simulated high-demand workloads.
For split-phase decode instances, Aladdin reduces GPUs needed by up to 60% vs. join-shortest-queue (JSQ) and 49% vs. power-of-two-choices.
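For reference, the two baseline dispatch rules named above can be sketched in a few lines. These are the textbook forms of the policies, using queue length as the load signal; how the evaluated baselines measure load in practice is not stated here, so treat that choice as an assumption.

```python
import random


def pick_jsq(queue_lengths):
    """Join-shortest-queue: inspect every worker and pick the least loaded."""
    return min(range(len(queue_lengths)), key=queue_lengths.__getitem__)


def pick_power_of_two(queue_lengths, rng=random):
    """Power-of-two-choices: sample two distinct workers, pick the less loaded.

    Requires at least two workers.
    """
    a, b = rng.sample(range(len(queue_lengths)), 2)
    return a if queue_lengths[a] <= queue_lengths[b] else b


print(pick_jsq([3, 1, 2]))  # -> 1
```

Neither rule looks at KV-cache occupancy or request length, which is the gap a KV-cache-aware placement heuristic targets.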
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GPU count required (default continuous batching) | reduced up to 71% | vLLM default (JSQ, 4-GPU worker) | up to -71% | ShareGPT workloads (simulated high-demand) | Section 6.4, Figure 11 | Fig.11 |
| GPU count required (split-phase decode) | reduced up to 60% | JSQ baseline | up to -60% | large-scale simulation | Section 6.4, Figure 12 | Fig.12 |
What To Try In 7 Days
Measure your current per-request input/output length distributions and your SLOs for time to first token (TTFT) and average token generation time (ATGT).
Fit a lightweight per-iteration latency and KV-cache model (prefill vs. decode) for your model and hardware.
Run a best-fit, KV-cache-aware placement simulator on historic traces to estimate potential GPU savings.
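The first two steps can be started with the standard library alone. The sketch below summarizes a length distribution with percentiles and fits a straight line to profiled latencies by ordinary least squares. The helper names and the sample data are made up for illustration, and the assumption that latency is linear in token count should be checked against your own measurements.

```python
import statistics


def length_percentiles(lengths):
    """Return p50/p95/p99 of a list of token lengths (needs at least 2 samples)."""
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up profiling samples: (context tokens in batch, decode iteration ms).
ctx_tokens = [1000, 2000, 4000, 8000]
iter_ms = [30.0, 40.0, 60.0, 100.0]
slope, intercept = fit_line(ctx_tokens, iter_ms)
print(round(slope, 4), round(intercept, 2))  # -> 0.01 20.0
```

The fitted coefficients can then feed a trace-replay simulator that compares placement policies on the same historic requests.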
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Designed for single-model serving; multi-model cold starts are not handled.
Does not migrate requests once scheduled; live migration costs are unaddressed.
When Not To Use
When you must support many different models on the same cluster with frequent cold starts.
When real-time request migration between workers is required.
Failure Modes
Severe output-length prediction bias causing SLO violations and KV-cache overflow.
Underestimated inter-GPU communication overhead on non-tested network topologies.
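One way to watch for the first failure mode is to track the signed error of the output-length predictor on completed requests. This is a monitoring sketch only, not the paper's rebalancing step; the function name and the alert thresholds are hypothetical.

```python
def underprediction_stats(predicted_lens, actual_lens):
    """Return (mean signed error in tokens, fraction of requests underpredicted).

    A positive mean error means outputs run longer than predicted, so
    KV-cache reservations based on the predictions are too small.
    """
    if len(predicted_lens) != len(actual_lens) or not predicted_lens:
        raise ValueError("need two equally sized, non-empty lists")
    errors = [actual - pred for pred, actual in zip(predicted_lens, actual_lens)]
    mean_error = sum(errors) / len(errors)
    under_fraction = sum(1 for e in errors if e > 0) / len(errors)
    return mean_error, under_fraction


mean_error, under_fraction = underprediction_stats([100, 200, 300], [150, 200, 400])
# Hypothetical alert rule: thresholds should be tuned per workload.
biased = mean_error > 20 and under_fraction > 0.5
print(mean_error, biased)  # -> 50.0 True
```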

