Aladdin schedules LLM requests and scales GPUs together to cut serving cost while meeting token-level SLOs.

Overview

Decision Snapshot: Ready For Pilot

Well-validated on A100/V100 hardware and large simulations; focused on single-model, homogeneous-GPU clusters and assumes no request migration.

Citations: 1

Evidence Strength: 0.75

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/6

Reproducibility

Status: Partial assets available

Open source: No

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 60%

Authors

Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu

Links

Abstract / PDF / Data

Why It Matters For Business

Aladdin can cut GPU spending by tens of percent while keeping token-level SLOs, turning large inference clusters from a fixed-cost bottleneck into a more efficient, demand-driven service.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager

Summary TLDR

Aladdin is a cluster-level scheduler that predicts minimal GPU needs, picks per-worker GPU configurations, and places incoming LLM requests on workers using a KV-cache-aware, best-fit bin-packing heuristic. It models prefill (prompt processing) and decode (token generation) latencies separately, handles output-length prediction errors with a rebalancing step, and supports both continuous batching (vLLM-style) and split-phase setups. On the authors' testbeds and simulations, Aladdin lowers GPU requirements by up to 71% for the same SLOs and keeps scheduling latency low (under roughly 50 ms at 25 requests per second).

Problem Statement

Current LLM serving methods tune single workers or use simple queueing rules. They miss cluster-level joint decisions: how many GPUs to run, how to pack GPUs into workers, and how to place requests so per-request SLOs (first token and average per-token time) hold without wasting GPUs or overflowing KV cache (the memory holding past tokens).
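The two per-request SLOs named above can be made concrete with a small sketch. It assumes you log an arrival time and a timestamp per emitted token; the `RequestTrace` type and the choice to average ATGT over the gaps after the first token are illustrative assumptions, not definitions taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    # Hypothetical log record: all times are in seconds on the same clock.
    arrival_s: float
    token_times_s: List[float]  # emission time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first emission minus arrival."""
    return trace.token_times_s[0] - trace.arrival_s


def atgt(trace: RequestTrace) -> float:
    """Average token generation time.

    Assumption: averaged over the gaps that follow the first token.
    """
    n = len(trace.token_times_s)
    if n < 2:
        return 0.0
    return (trace.token_times_s[-1] - trace.token_times_s[0]) / (n - 1)


def meets_slo(trace: RequestTrace, ttft_slo_s: float, atgt_slo_s: float) -> bool:
    """A request meets its SLO only if both limits hold."""
    return ttft(trace) <= ttft_slo_s and atgt(trace) <= atgt_slo_s
```

Running `meets_slo` over a day of logged requests gives an SLO attainment rate to compare before and after any scheduling change.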

Main Contribution

A simple, empirically validated performance model that separates prefill and decode iteration latency and KV-cache growth.

Aladdin: an online scheduler that jointly predicts minimal GPUs, chooses per-worker GPU counts, and places requests via a KV-cache-aware best-fit bin-packing heuristic.
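As a rough illustration of KV-cache-aware best-fit placement, the sketch below treats each worker as a bin measured in KV-cache tokens. It is a simplified stand-in, not the paper's algorithm: it charges every request its peak KV footprint (prompt plus predicted output) for its whole lifetime, and it uses a hypothetical `max_batch` cap in place of a latency-derived batch limit. All class and function names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    input_len: int
    predicted_output_len: int

    @property
    def peak_kv_tokens(self) -> int:
        # One KV entry per prompt token plus one per generated token.
        return self.input_len + self.predicted_output_len


@dataclass
class Worker:
    kv_capacity: int  # KV-cache tokens the worker can hold
    max_batch: int    # hypothetical stand-in for an SLO-derived batch limit
    requests: List[Request] = field(default_factory=list)

    @property
    def kv_free(self) -> int:
        return self.kv_capacity - sum(r.peak_kv_tokens for r in self.requests)

    def fits(self, req: Request) -> bool:
        return (len(self.requests) < self.max_batch
                and req.peak_kv_tokens <= self.kv_free)


def place_best_fit(req: Request, workers: List[Worker],
                   kv_capacity: int, max_batch: int) -> int:
    """Put req on the feasible worker with the least free KV cache.

    Opens a new worker when no existing one fits. Returns the worker index.
    """
    best = None
    for i, w in enumerate(workers):
        if w.fits(req) and (best is None or w.kv_free < workers[best].kv_free):
            best = i
    if best is None:
        new_worker = Worker(kv_capacity, max_batch)
        if not new_worker.fits(req):
            raise ValueError("request does not fit on an empty worker")
        workers.append(new_worker)
        best = len(workers) - 1
    workers[best].requests.append(req)
    return best
```

Choosing the tightest feasible worker keeps large gaps open for long requests, which is why best-fit tends to open fewer workers than placement rules that ignore memory.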

Key Findings

Aladdin cuts required GPUs by up to 71% vs. default vLLM for the same SLOs in simulated high-demand workloads.

Numbers: up to 71% GPU reduction

Practical Use: Cluster operators can run the same model with far fewer GPUs by adding Aladdin as a scheduling layer; expect cost savings in the tens of percent on heavy workloads.

Evidence Ref: Section 6.4, Figure 11

For split-phase decode instances, Aladdin reduces GPUs needed by up to 60% vs. JSQ and 49% vs. power-of-two.

Numbers: up to 60% reduction (decode phase)

Practical Use: If you already split prefill and decode, replace JSQ or power-of-two with Aladdin's placement to reduce the number of decode GPUs.

Evidence Ref: Section 6.4, Figure 12
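For readers unfamiliar with the two baselines, here is a minimal sketch of their textbook forms: join-the-shortest-queue (JSQ) and power-of-two choices. The function names and the use of queue length as the load signal are assumptions for illustration; the paper's exact baseline configuration is not reproduced here.

```python
import random
from typing import List, Optional


def join_shortest_queue(queue_lens: List[int]) -> int:
    """JSQ: pick the worker with the fewest queued requests (first on ties)."""
    return min(range(len(queue_lens)), key=queue_lens.__getitem__)


def power_of_two(queue_lens: List[int],
                 rng: Optional[random.Random] = None) -> int:
    """Power-of-two choices: sample two distinct workers, keep the shorter queue."""
    if rng is None:
        rng = random.Random()
    if len(queue_lens) < 2:
        return 0
    a, b = rng.sample(range(len(queue_lens)), 2)
    return a if queue_lens[a] <= queue_lens[b] else b
```

Neither rule looks at prompt length, expected output length, or free KV cache, which is the gap a KV-cache-aware placement policy targets.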

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GPU count required (default continuous batching)	reduced up to 71%	vLLM default (JSQ, 4-GPU worker)	up to -71%	ShareGPT workloads (simulated high-demand)	Section 6.4, Figure 11	Fig. 11
GPU count required (split-phase decode)	reduced up to 60%	JSQ baseline	up to -60%	large-scale simulation	Section 6.4, Figure 12	Fig. 12

What To Try In 7 Days

Measure your current per-request input/output length distributions and ATGT/TTFT SLOs.

Plug in a lightweight per-iteration latency and KV-cache model (prefill vs. decode) for your model and hardware.

Run a best-fit, KV-cache-aware placement simulator on historic traces to estimate potential GPU savings.
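The first two steps above can be started with nothing but the standard library. The sketch below summarizes token-length percentiles from a trace and fits a straight line to measured decode-iteration latency. The linear form (latency as intercept plus slope times KV-cache tokens in the batch) is an assumption made for this example, and the function names are hypothetical; check the fit against your own measurements before relying on it.

```python
import statistics
from typing import Dict, Sequence, Tuple


def length_percentiles(lengths: Sequence[int]) -> Dict[str, float]:
    """Return p50/p90/p99 of a token-length sample (needs at least 2 values)."""
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}


def fit_linear(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y ~ intercept + slope * x.

    Here x would be KV-cache tokens in the batch and y the measured
    latency of one decode iteration (assumed linear relationship).
    """
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def predict_latency(intercept: float, slope: float, kv_tokens: int) -> float:
    """Predicted per-iteration latency for a batch holding kv_tokens."""
    return intercept + slope * kv_tokens
```

Fit prefill and decode separately, since the two phases scale differently with batch contents; the fitted decode model then tells you how many KV tokens a worker can hold before the per-token latency target is at risk.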

Optimization Features

Token Efficiency

ATGT SLO-based limits

Infra Optimization

Predict minimal GPU count per arrival rate

Distributed grouped schedulers for high arrival rates

System Optimization

Best-fit bin-packing request placement

Per-worker tensor-parallel sizing

Rebalancing for prediction errors

Inference Optimization

Dynamic batching (continuous batching)

KV-cache-aware packing

Split-phase decode placement

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: No

License: Unknown

Data URLs

https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered

Risks & Boundaries

Limitations

Designed for single-model serving; multi-model cold starts are not handled.

Does not migrate requests once scheduled; live migration costs are unaddressed.

When Not To Use

When you must support many different models on the same cluster with frequent cold starts.

When real-time request migration between workers is required.

Failure Modes

Severe output-length prediction bias causing SLO violations and KV-cache overflow.

Underestimated inter-GPU communication overhead on non-tested network topologies.
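The first failure mode can be watched for with a simple check on logged predictions. This is a minimal monitoring sketch, not part of Aladdin: the function names and the -0.2 alarm threshold are arbitrary assumptions, and the metric is a plain mean signed relative error.

```python
from typing import Sequence


def length_prediction_bias(predicted: Sequence[int],
                           actual: Sequence[int]) -> float:
    """Mean signed relative error of output-length predictions.

    Negative values mean outputs ran longer than predicted, the
    direction that risks KV-cache overflow.
    """
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((p - a) / max(a, 1) for p, a in zip(predicted, actual))
    return total / len(predicted)


def underprediction_alarm(predicted: Sequence[int], actual: Sequence[int],
                          threshold: float = -0.2) -> bool:
    """True when predictions run short by more than the (assumed) threshold."""
    return length_prediction_bias(predicted, actual) < threshold
```

Tracking this over a sliding window of completed requests gives early warning before systematic under-prediction turns into SLO violations.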

Core Entities

Models

Llama2-7b

Llama2-13b

Llama2-70b

Metrics

ATGT (average token generation time)

TTFT (time to first token)

TBT (time between tokens)

SLO attainment rate

GPU count required

Datasets

ShareGPT_Vicuna_unfiltered (ShareGPT)

Context Entities

Models

Llama2-7b

Llama2-13b

Llama2-70b

Metrics

ATGT

TTFT

Datasets

ShareGPT_Vicuna_unfiltered (ShareGPT)
