Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
Aladdin can cut the number of GPUs needed to meet token-level SLOs by tens of percent (up to 71% in the authors' simulations), turning large inference clusters from a fixed-cost bottleneck into a more efficient, demand-driven service.
Summary TLDR
Aladdin is a cluster-level scheduler that predicts minimal GPU needs, picks per-worker GPU configurations, and places incoming LLM requests into workers using a KV-cache-aware, best-fit bin-packing heuristic. It models prefill (prompt) and decode (token generation) latencies separately, handles output-length prediction errors with a rebalancing step, and supports both continuous batching (vLLM-style) and split-phase setups. On the authors' testbeds and simulations, Aladdin lowers GPU requirements by up to ~71% for the same SLOs and keeps scheduling latency under 50 ms at about 25 requests per second.
Problem Statement
Current LLM serving methods tune single workers or use simple queueing rules. They miss cluster-level joint decisions: how many GPUs to run, how to group GPUs into workers, and how to place requests so that per-request SLOs (time to first token, TTFT, and average token generation time, ATGT) hold without wasting GPUs or overflowing the KV cache (the memory holding keys and values for past tokens).
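To give a sense of why KV-cache overflow is a real constraint, the following sketch estimates KV-cache size per token for a standard transformer: one key and one value tensor per layer. The function names are illustrative, and the Llama2-7b shape (32 layers, 32 KV heads, head dimension 128, 16-bit values) comes from the model's public configuration, not from the summary above.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added by one token: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(num_tokens, **shape):
    """Total KV-cache bytes for a request holding num_tokens tokens."""
    return num_tokens * kv_cache_bytes_per_token(**shape)


# Assumed shape for Llama2-7b with fp16 values (public model config).
llama2_7b = dict(num_layers=32, num_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes_per_token(**llama2_7b)      # 524288 bytes = 0.5 MiB
per_request = kv_cache_bytes(2048, **llama2_7b)        # 1073741824 bytes = 1 GiB
```

Under these assumptions, a single 2048-token request occupies about 1 GiB, so a few dozen long requests can exhaust the memory left over after model weights.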
Main Contribution
A simple, empirically validated performance model that separates prefill and decode iteration latency and KV-cache growth.
Aladdin: an online scheduler that jointly predicts minimal GPUs, chooses per-worker GPU counts, and places requests via a KV-cache-aware best-fit bin-packing heuristic.
A rebalancing algorithm to mitigate output-length prediction errors and an evaluation showing large GPU savings on A100/V100 testbeds and large simulations.
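The placement idea in the second contribution can be illustrated with a minimal best-fit sketch. It is not the paper's algorithm: it assumes the only constraint is a per-worker KV-cache budget measured in tokens and ignores the TTFT and ATGT constraints that Aladdin also enforces. The `Worker` class, `place_best_fit`, and all parameter names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    kv_capacity_tokens: int              # KV-cache budget of this worker, in tokens
    used_tokens: int = 0                 # tokens already reserved by placed requests
    requests: list = field(default_factory=list)

    @property
    def free_tokens(self):
        return self.kv_capacity_tokens - self.used_tokens


def place_best_fit(workers, request_id, input_len, predicted_output_len,
                   kv_capacity_tokens):
    """Place a request on the feasible worker that leaves the least free KV space.

    Opens a new worker (appended to `workers`) when no existing worker fits.
    """
    demand = input_len + predicted_output_len
    if demand > kv_capacity_tokens:
        raise ValueError("request does not fit on an empty worker")

    candidates = [w for w in workers if w.free_tokens >= demand]
    if candidates:
        # Best fit: the tightest worker that can still hold the request.
        target = min(candidates, key=lambda w: w.free_tokens - demand)
    else:
        target = Worker(kv_capacity_tokens)
        workers.append(target)

    target.used_tokens += demand
    target.requests.append(request_id)
    return target


workers = []
place_best_fit(workers, "r1", 300, 300, kv_capacity_tokens=1000)
place_best_fit(workers, "r2", 500, 200, kv_capacity_tokens=1000)
place_best_fit(workers, "r3", 100, 150, kv_capacity_tokens=1000)
```

In this run, `r3` fits on both workers but goes to the second one, because that leaves the smaller gap (50 tokens rather than 150). Packing tightly in this way is what keeps the number of open workers, and therefore GPUs, low.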
Key Findings
Aladdin cuts required GPUs by up to 71% vs. default vLLM for the same SLOs in simulated high-demand workloads.
For split-phase decode instances, Aladdin reduces GPUs needed by up to 60% vs. JSQ and 49% vs. power-of-two.
Performance models predict prefill, decode, and KV usage with small error: prefill <4%, decode <5%, KV-cache <1%; overall max model error <10%.
Centralized best-fit scheduling adds modest latency: under 50 ms for ~25 requests/sec.
Results
GPU count required (default continuous batching)
GPU count required (split-phase decode)
Prefill latency model error
Decode iteration latency model error
KV cache usage prediction error
Scheduler latency (centralized best-fit)
Who Should Care
What To Try In 7 Days
Measure your current per-request input/output length distributions and ATGT/TTFT SLOs.
Fit a lightweight per-iteration latency and KV-cache model (prefill and decode separately) to your model and hardware.
Run a best-fit, KV-cache-aware placement simulator on historic traces to estimate potential GPU savings.
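As a starting point for the second step, the sketch below fits a straight line to measured latencies with ordinary least squares and reports the worst relative error. It assumes latency is roughly linear in a single feature (for example, prompt tokens for prefill, or tokens in the running batch for a decode iteration); the summary does not state the exact model form, so treat linearity as an assumption to check against your own measurements. Function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def max_relative_error(xs, ys, slope, intercept):
    """Largest |predicted - measured| / measured over the samples."""
    return max(abs(slope * x + intercept - y) / y for x, y in zip(xs, ys))
```

Fit one line per phase (prefill and decode) from profiling runs on your hardware, then compare `max_relative_error` with the error levels reported under Key Findings to judge whether a linear model is adequate before feeding it into a placement simulator.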
Optimization Features
Token Efficiency
- ATGT SLO-based limits
Infra Optimization
- Predict minimal GPU count per arrival rate
- Distributed grouped schedulers for high arrival rates
System Optimization
- Best-fit bin-packing request placement
- Per-worker tensor-parallel sizing
- Rebalancing for prediction errors
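The summary does not describe how the rebalancing step works, so the following is only one simple way to react to an under-predicted output length: once a request generates more tokens than predicted, its reservation is revised upward, and later placements see the corrected free capacity. The `growth_factor` heuristic and the function name are hypothetical, not taken from the paper.

```python
import math


def corrected_demand(input_len, predicted_output_len, generated_so_far,
                     growth_factor=1.5):
    """Token budget to reserve for a running request.

    While the request is within its prediction, keep the original reservation.
    Once it reaches or exceeds the prediction, assume (hypothetically) that it
    will end at growth_factor times what it has generated so far.
    """
    if generated_so_far < predicted_output_len:
        output_estimate = predicted_output_len
    else:
        output_estimate = math.ceil(generated_so_far * growth_factor)
    return input_len + output_estimate


within = corrected_demand(100, 50, generated_so_far=10)    # 150
overrun = corrected_demand(100, 50, generated_so_far=60)   # 100 + ceil(90) = 190
```

A scheduler could recompute each worker's used capacity from these corrected reservations before placing new requests, which avoids packing more work onto a worker that is already running over its estimate.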
Inference Optimization
- Dynamic batching (continuous batching)
- KV-cache-aware packing
- Split-phase decode placement
Reproducibility
Data Available
Open Source Status
- no
Risks & Boundaries
Limitations
- Designed for single-model serving; multi-model cold starts are not handled.
- Does not migrate requests once scheduled; live migration costs are unaddressed.
- Assumes homogeneous GPUs and ignores switching/cold-start costs.
- Output-length predictor used is simple; better predictors could improve results.
When Not To Use
- When you must support many different models on the same cluster with frequent cold starts.
- When real-time request migration between workers is required.
- On extremely low-volume workloads where per-request batching gives no benefit.
Failure Modes
- Severe output-length prediction bias causing SLO violations and KV-cache overflow.
- Underestimated inter-GPU communication overhead on non-tested network topologies.
- Centralized scheduler becomes a bottleneck at very high arrival rates without switching to distributed grouping.
Core Entities
Models
- Llama2-7b
- Llama2-13b
- Llama2-70b
Metrics
- ATGT (average token generation time)
- TTFT (time to first token)
- TBT (time between tokens)
- SLO attainment rate
- GPU count required
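These metrics can be computed from per-token timestamps, as in the sketch below. The definitions are assumptions: TTFT is taken as first-token time minus arrival time, ATGT as the decode span divided by the number of tokens after the first, and a request counts as attaining its SLO only if it meets both limits. Function names are illustrative.

```python
def ttft(arrival_time, token_times):
    """Time to first token: delay between arrival and the first emitted token."""
    return token_times[0] - arrival_time


def atgt(token_times):
    """Average token generation time over the decode phase (tokens after the first)."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def slo_attainment_rate(requests, ttft_slo, atgt_slo):
    """Fraction of (arrival_time, token_times) pairs meeting both SLOs."""
    met = sum(
        1
        for arrival_time, token_times in requests
        if ttft(arrival_time, token_times) <= ttft_slo
        and atgt(token_times) <= atgt_slo
    )
    return met / len(requests)


# Synthetic timestamps in seconds, for illustration only.
requests = [
    (0.0, [0.5, 1.0, 1.5]),   # TTFT 0.5 s, ATGT 0.5 s: meets both limits
    (0.0, [2.0, 2.5, 3.0]),   # TTFT 2.0 s: misses the TTFT limit
]
rate = slo_attainment_rate(requests, ttft_slo=1.0, atgt_slo=0.5)
```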
Datasets
- ShareGPT_Vicuna_unfiltered (ShareGPT)
Context Entities
Models
- Llama2-7b
- Llama2-13b
- Llama2-70b
Metrics
- ATGT
- TTFT
Datasets
- ShareGPT_Vicuna_unfiltered (ShareGPT)

