Aladdin schedules LLM requests and scales GPUs together to cut serving cost while meeting token-level SLOs.

May 11, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Well-validated on A100/V100 hardware and large simulations; focused on single-model, homogeneous-GPU clusters and assumes no request migration.

Citations: 1

Evidence Strength: 0.75

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/6

Reproducibility

Status: Partial assets available

Open source: No

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 60%

Authors

Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu

Links

Abstract / PDF / Data

Why It Matters For Business

Aladdin can cut GPU spending by tens of percent while keeping token-level SLOs, turning large inference clusters from a fixed-cost bottleneck into a more efficient, demand-driven service.

Summary TLDR

Aladdin is a cluster-level scheduler that predicts minimal GPU needs, picks per-worker GPU configurations, and places incoming LLM requests into workers using a KV-cache-aware, best-fit bin-packing heuristic. It models prefill (prompt) and decode (token generation) latencies separately, handles output-length prediction errors with a rebalancing step, and supports both continuous batching (vLLM-style) and split-phase setups. On the authors' testbeds and simulations, Aladdin lowers GPU requirements by up to 71% for the same SLOs and keeps scheduling latency below roughly 50 ms at 25 requests per second.
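
A minimal Python sketch of what one KV-cache-aware best-fit placement step could look like. The Worker class, its fields, and place_best_fit are hypothetical names; the KV footprint is approximated as prompt length plus predicted output length, and a fixed max_batch stands in for an SLO-derived batch limit. Aladdin's actual heuristic may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    kv_capacity_tokens: int          # KV-cache size expressed in tokens (assumed unit)
    max_batch: int                   # stand-in for the batch size the SLO allows
    reserved_tokens: int = 0         # KV tokens already promised to placed requests
    requests: list = field(default_factory=list)

    def free_tokens(self) -> int:
        return self.kv_capacity_tokens - self.reserved_tokens


def place_best_fit(workers, req_id, prompt_len, predicted_output_len):
    """Place a request on the feasible worker with the least leftover KV space."""
    need = prompt_len + predicted_output_len  # approximate peak KV footprint
    feasible = [
        w for w in workers
        if w.free_tokens() >= need and len(w.requests) < w.max_batch
    ]
    if not feasible:
        return None  # caller would queue the request or start another worker
    best = min(feasible, key=lambda w: w.free_tokens() - need)
    best.reserved_tokens += need
    best.requests.append(req_id)
    return best


if __name__ == "__main__":
    pool = [Worker(4000, 8), Worker(4000, 8)]
    for rid, (p, o) in enumerate([(2000, 1000), (600, 300), (1500, 500)]):
        chosen = place_best_fit(pool, rid, p, o)
        print(rid, pool.index(chosen) if chosen else "no feasible worker")
```

Best fit prefers the tightest feasible worker, which leaves the larger gaps on other workers free for long requests.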

Problem Statement

Current LLM serving methods tune single workers or use simple queueing rules. They miss cluster-level joint decisions: how many GPUs to run, how to pack GPUs into workers, and how to place requests so per-request SLOs (first token and average per-token time) hold without wasting GPUs or overflowing KV cache (the memory holding past tokens).
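
To make the two per-request SLOs concrete, here is a small Python sketch that checks them from request timestamps. The RequestTiming fields are hypothetical, and the average per-token time is computed over the decode phase only (tokens after the first), which is an assumed definition.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    arrival_s: float       # when the request arrived
    first_token_s: float   # when the first output token was emitted
    finish_s: float        # when the last output token was emitted
    output_tokens: int     # number of generated tokens


def ttft(r: RequestTiming) -> float:
    """Time to first token: queueing delay plus prefill."""
    return r.first_token_s - r.arrival_s


def atgt(r: RequestTiming) -> float:
    """Average token generation time over the decode phase (assumed definition)."""
    decode_tokens = max(r.output_tokens - 1, 1)
    return (r.finish_s - r.first_token_s) / decode_tokens


def meets_slo(r: RequestTiming, ttft_slo_s: float, atgt_slo_s: float) -> bool:
    """A request meets its SLO only if both limits hold."""
    return ttft(r) <= ttft_slo_s and atgt(r) <= atgt_slo_s


if __name__ == "__main__":
    req = RequestTiming(arrival_s=0.0, first_token_s=0.4, finish_s=5.4, output_tokens=101)
    print(meets_slo(req, ttft_slo_s=0.5, atgt_slo_s=0.06))
```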

Main Contribution

A simple, empirically validated performance model that separates prefill and decode iteration latency and KV-cache growth.

Aladdin: an online scheduler that jointly predicts minimal GPUs, chooses per-worker GPU counts, and places requests via a KV-cache-aware best-fit bin-packing heuristic.
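
The performance model in the first contribution can be pictured with a short Python sketch. It assumes both latencies are linear in token counts, which is an illustrative simplification and not a form stated in this summary. PerfModel, its coefficient names, and the example numbers are hypothetical; the per-token KV size uses the standard keys-plus-values formula.

```python
from dataclasses import dataclass


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Keys and values (factor 2) stored for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


@dataclass
class PerfModel:
    # Hypothetical coefficients; in practice they come from profiling runs.
    prefill_base_s: float           # fixed cost of one prefill iteration
    prefill_per_token_s: float      # cost per prompt token in the batch
    decode_base_s: float            # fixed cost of one decode iteration
    decode_per_ctx_token_s: float   # cost per cached context token in the batch
    kv_bytes_per_token: int

    def prefill_latency(self, prompt_lens) -> float:
        return self.prefill_base_s + self.prefill_per_token_s * sum(prompt_lens)

    def decode_iter_latency(self, context_lens) -> float:
        return self.decode_base_s + self.decode_per_ctx_token_s * sum(context_lens)

    def kv_bytes(self, context_lens) -> int:
        # KV cache grows by one token per request per decode iteration.
        return self.kv_bytes_per_token * sum(context_lens)


if __name__ == "__main__":
    # Assumed shape for a 7B-class model: 32 layers, 32 KV heads, head_dim 128, fp16.
    per_tok = kv_bytes_per_token(32, 32, 128)
    model = PerfModel(0.01, 0.0001, 0.02, 0.000001, per_tok)
    print(model.prefill_latency([100, 200]))
    print(model.decode_iter_latency([1000, 3000]))
    print(model.kv_bytes([1000, 3000]) / 2**30, "GiB")
```

Keeping prefill and decode as separate terms lets a scheduler check the first-token limit and the per-token limit independently.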

Key Findings

Aladdin cuts required GPUs by up to 71% vs. default vLLM for the same SLOs in simulated high-demand workloads.

Numbers: up to 71% GPU reduction

Practical Use: Cluster operators can run the same model with far fewer GPUs by adding Aladdin as a scheduling layer; expect cost savings in the tens of percent on heavy workloads.

Evidence Ref: Section 6.4, Figure 11

For split-phase decode instances, Aladdin reduces GPUs needed by up to 60% vs. JSQ and 49% vs. power-of-two.

Numbers: up to 60% reduction (decode phase)

Practical Use: If you already split prefill and decode, replace JSQ or power-of-two with Aladdin's placement to reduce the number of decode GPUs.

Evidence Ref: Section 6.4, Figure 12
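
For reference, the two baseline placement rules named above can be sketched in a few lines of Python. Both look only at queue lengths, which is the textbook definition of these rules; how the evaluated baselines were configured is not stated here, and the function names are illustrative.

```python
import random


def jsq(queue_lens):
    """Join-the-shortest-queue: index of the worker with the fewest requests."""
    return min(range(len(queue_lens)), key=lambda i: queue_lens[i])


def power_of_two(queue_lens, rng):
    """Sample two distinct workers at random and pick the less loaded one."""
    a, b = rng.sample(range(len(queue_lens)), 2)
    return a if queue_lens[a] <= queue_lens[b] else b


if __name__ == "__main__":
    loads = [3, 1, 2, 5]
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print("JSQ picks worker", jsq(loads))
    print("power-of-two picks worker", power_of_two(loads, rng))
```

Neither rule sees prompt length, predicted output length, or KV-cache headroom, which is the information a KV-cache-aware placement adds.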

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GPU count required (default continuous batching) | reduced up to 71% | vLLM default (JSQ, 4-GPU worker) | up to -71% | ShareGPT workloads (simulated high-demand) | Section 6.4, Figure 11 | Fig. 11
GPU count required (split-phase decode) | reduced up to 60% | JSQ baseline | up to -60% | large-scale simulation | Section 6.4, Figure 12 | Fig. 12

What To Try In 7 Days

Measure your current per-request input/output length distributions and ATGT/TTFT SLOs.

Plug a lightweight per-iteration latency and KV-cache model (prefill vs decode) for your model and hardware.

Run a best-fit, KV-cache-aware placement simulator on historic traces to estimate potential GPU savings.
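
The first two steps above can be started with nothing but the Python standard library. This sketch summarizes token-length distributions with percentiles and fits a one-variable linear latency model by ordinary least squares. The function names, the choice of percentiles, and the sample numbers are all hypothetical, and a linear fit is an assumption to validate against your own profiling data.

```python
import statistics


def length_percentiles(lengths, points=(50, 90, 99)):
    """Return selected percentiles of a list of token lengths."""
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")  # 99 cut points
    return {p: cuts[p - 1] for p in points}


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


if __name__ == "__main__":
    # Made-up output lengths from a trace, in tokens.
    output_lens = [12, 40, 85, 120, 200, 256, 310, 512, 700, 1024]
    print(length_percentiles(output_lens))

    # Made-up profiling samples: context tokens in the batch vs. decode
    # iteration latency in milliseconds.
    ctx_tokens = [1000, 2000, 4000, 8000]
    latency_ms = [12.0, 14.0, 18.0, 26.0]
    intercept, slope = fit_line(ctx_tokens, latency_ms)
    print(f"latency_ms ~ {intercept:.3f} + {slope:.6f} * context_tokens")
```

The fitted coefficients and the length percentiles are the inputs a placement simulator needs for the third step.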

Optimization Features

Token Efficiency
ATGT SLO-based limits
Infra Optimization
Predict minimal GPU count per arrival rate; Distributed grouped schedulers for high arrival rates
System Optimization
Best-fit bin-packing request placement; Per-worker tensor-parallel sizing; Rebalancing for prediction errors
Inference Optimization
Dynamic batching (continuous batching); KV-cache-aware packing; Split-phase decode placement

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Designed for single-model serving; multi-model cold starts are not handled.

Does not migrate requests once scheduled; live migration costs are unaddressed.

When Not To Use

When you must support many different models on the same cluster with frequent cold starts.

When real-time request migration between workers is required.

Failure Modes

Severe output-length prediction bias causing SLO violations and KV-cache overflow.

Underestimated inter-GPU communication overhead on non-tested network topologies.
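
The first failure mode can be made visible with a small bookkeeping check. The Python sketch below re-estimates a worker's KV headroom once a request has generated more tokens than predicted, so that later placements can steer away from that worker. This is one plausible reading of a correction step, not the mechanism described by the authors; the Running class, the growth factor, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Running:
    prompt_len: int
    predicted_output_len: int
    generated: int = 0  # tokens produced so far

    def reserved_tokens(self, growth: float = 1.5) -> int:
        # If the request has outrun its prediction, assume (hypothetically)
        # it will reach `growth` times what it has generated so far.
        expected_out = self.predicted_output_len
        if self.generated >= self.predicted_output_len:
            expected_out = int(self.generated * growth) + 1
        return self.prompt_len + expected_out


def worker_headroom(kv_capacity_tokens: int, running, growth: float = 1.5) -> int:
    """Free KV tokens after re-estimating requests that exceeded their prediction."""
    return kv_capacity_tokens - sum(r.reserved_tokens(growth) for r in running)


if __name__ == "__main__":
    batch = [
        Running(prompt_len=500, predicted_output_len=200, generated=100),
        Running(prompt_len=300, predicted_output_len=100, generated=400),  # underpredicted
    ]
    headroom = worker_headroom(2000, batch)
    print("headroom in tokens:", headroom)
    if headroom < 0:
        print("overflow risk: stop placing new requests on this worker")
```

Because requests are not migrated once scheduled, a correction like this can only influence where new requests go; a persistent bias in the length predictor would still exhaust the cache.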

Core Entities

Models

Llama2-7b, Llama2-13b, Llama2-70b

Metrics

ATGT (average token generation time), TTFT (time to first token), TBT (time between tokens), SLO attainment rate, GPU count required

Datasets

ShareGPT_Vicuna_unfiltered (ShareGPT)

Context Entities

Models

Llama2-7b, Llama2-13b, Llama2-70b

Metrics

ATGT, TTFT

Datasets

ShareGPT_Vicuna_unfiltered (ShareGPT)