Aladdin schedules LLM requests and scales GPUs together to cut serving cost while meeting token-level SLOs.

May 11, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

1

Authors

Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu

Links

Abstract / PDF

Why It Matters For Business

Aladdin can cut GPU spending substantially (up to 71% fewer GPUs in the paper's simulations) while still meeting token-level SLOs, turning large inference clusters from a fixed-cost bottleneck into a more efficient, demand-driven service.

Summary TLDR

Aladdin is a cluster-level scheduler that predicts the minimum number of GPUs needed, picks per-worker GPU configurations, and places incoming LLM requests onto workers using a KV-cache-aware, best-fit bin-packing heuristic. It models prefill (prompt processing) and decode (token generation) latencies separately, handles output-length prediction errors with a rebalancing step, and supports both continuous batching (vLLM-style) and split-phase setups. On the authors' testbeds and in simulations, Aladdin lowers GPU requirements by up to 71% for the same SLOs and keeps scheduling latency under 50 ms at about 25 requests per second.

Problem Statement

Current LLM serving methods tune single workers or use simple queueing rules. They miss cluster-level joint decisions: how many GPUs to run, how to pack GPUs into workers, and how to place requests so per-request SLOs (first token and average per-token time) hold without wasting GPUs or overflowing KV cache (the memory holding past tokens).
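To make the KV-cache constraint concrete, the sketch below estimates how many tokens fit in a worker's KV cache. The per-token formula (one key and one value vector per layer) is the standard transformer accounting; the Llama2-7b shape is the published configuration, while the 20 GiB budget and the function names are assumptions chosen for illustration.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(kv_budget_bytes: int, per_token_bytes: int) -> int:
    """Tokens (prompt plus generated, summed over all requests) that fit in the budget."""
    return kv_budget_bytes // per_token_bytes


# Llama2-7b shape: 32 layers, 32 KV heads, head dimension 128, fp16 values.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)

# Hypothetical budget: 20 GiB of GPU memory left for KV cache after loading weights.
capacity = max_cached_tokens(20 * 1024**3, per_token)

print(per_token, capacity)
```

Every request placed on a worker consumes part of this token capacity as it decodes, which is why placement decisions need to track predicted output lengths and not only the current batch size.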

Main Contribution

A simple, empirically validated performance model that separates prefill and decode iteration latency and KV-cache growth.

Aladdin: an online scheduler that jointly predicts minimal GPUs, chooses per-worker GPU counts, and places requests via a KV-cache-aware best-fit bin-packing heuristic.

A rebalancing algorithm to mitigate output-length prediction errors and an evaluation showing large GPU savings on A100/V100 testbeds and large simulations.
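The placement idea named above can be sketched as a best-fit rule over remaining KV-cache capacity. This is a minimal illustration, not the authors' implementation: the `Worker` class, the `max_batch` cap (a stand-in for the latency SLO check), and the use of input length plus predicted output length as the KV demand are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Worker:
    kv_capacity: int      # tokens of KV cache the worker can hold
    max_batch: int        # hypothetical cap standing in for the latency SLO check
    kv_used: int = 0
    requests: List[int] = field(default_factory=list)

    def fits(self, kv_demand: int) -> bool:
        return (self.kv_used + kv_demand <= self.kv_capacity
                and len(self.requests) < self.max_batch)


def place_best_fit(workers: List[Worker], req_id: int, input_len: int,
                   predicted_output_len: int) -> Optional[int]:
    """Place a request on the feasible worker that would have the least KV slack left."""
    kv_demand = input_len + predicted_output_len  # assumed peak KV tokens for the request
    best_idx, best_slack = None, None
    for idx, worker in enumerate(workers):
        if not worker.fits(kv_demand):
            continue
        slack = worker.kv_capacity - (worker.kv_used + kv_demand)
        if best_slack is None or slack < best_slack:
            best_idx, best_slack = idx, slack
    if best_idx is None:
        return None  # caller would start another worker or queue the request
    workers[best_idx].kv_used += kv_demand
    workers[best_idx].requests.append(req_id)
    return best_idx
```

Choosing the tightest feasible worker keeps lightly loaded workers empty for as long as possible, which is what lets a scheduler of this kind shut down or avoid provisioning GPUs it does not need.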

Key Findings

Aladdin cuts required GPUs by up to 71% vs. default vLLM for the same SLOs in simulated high-demand workloads.

Numbers: up to 71% GPU reduction

For split-phase decode instances, Aladdin reduces GPUs needed by up to 60% vs. JSQ and 49% vs. power-of-two.

Numbers: up to 60% reduction (decode phase)

Performance models predict prefill, decode, and KV usage with small error: prefill <4%, decode <5%, KV-cache <1%; overall max model error <10%.

Numbers: prefill <4% | decode <5% | KV <1% | max <10%

Centralized best-fit scheduling adds modest latency: under 50 ms for ~25 requests/sec.

Numbers: <50 ms at 25 rps
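The model-error figures in these findings compare predicted against measured latency. A sketch of that kind of check follows, assuming for illustration only that prefill latency is linear in the total prompt tokens of a batch; the paper's exact functional form is not reproduced here, and the measurements below are made up.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def max_relative_error(xs: Sequence[float], ys: Sequence[float],
                       slope: float, intercept: float) -> float:
    """Largest |predicted - measured| / measured over all samples."""
    return max(abs(slope * x + intercept - y) / y for x, y in zip(xs, ys))


# Hypothetical measurements: total prompt tokens in a batch vs. prefill latency in ms.
prompt_tokens = [256, 512, 1024, 2048]
prefill_ms = [30.0, 56.0, 108.0, 210.0]

slope, intercept = fit_line(prompt_tokens, prefill_ms)
err = max_relative_error(prompt_tokens, prefill_ms, slope, intercept)
print(f"slope={slope:.4f} ms/token, intercept={intercept:.2f} ms, max error={err:.2%}")
```

The same two functions can be reused for decode iterations by swapping in whichever feature (batch size, total context tokens) best explains the measured per-iteration latency on the target hardware.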

Results

GPU count required (default continuous batching)

Value: reduced up to 71%

Baseline: vLLM default (JSQ, 4-GPU worker)

GPU count required (split-phase decode)

Value: reduced up to 60%

Baseline: JSQ baseline

Prefill latency model error

Value: <4% max prediction error

Baseline: measured latency

Decode iteration latency model error

Value: <5% max prediction error

Baseline: measured latency

KV cache usage prediction error

Value: <1% error

Baseline: measured KV usage

Scheduler latency (centralized best-fit)

Value: <50 ms at 25 req/s

Baseline: scheduling real-time limit

What To Try In 7 Days

Measure your current per-request input/output length distributions and ATGT/TTFT SLOs.

Fit a lightweight per-iteration latency and KV-cache model (prefill vs. decode) to your model and hardware.

Run a best-fit, KV-cache-aware placement simulator on historic traces to estimate potential GPU savings.
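As a starting point for the first step above, the following sketch summarizes input and output length distributions from a historic trace. The trace layout (a list of dicts with `input_len` and `output_len` keys) and the nearest-rank percentile definition are assumptions; adapt both to however your logs are stored.

```python
import math
from typing import Dict, List, Sequence


def percentile(values: Sequence[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_lengths(trace: List[Dict[str, int]]) -> Dict[str, Dict[str, int]]:
    """Return p50/p95/p99 for prompt and generation lengths in the trace."""
    summary = {}
    for key in ("input_len", "output_len"):
        lengths = [row[key] for row in trace]
        summary[key] = {f"p{p}": percentile(lengths, p) for p in (50, 95, 99)}
    return summary


# Hypothetical trace: 100 requests with steadily growing lengths.
trace = [{"input_len": n, "output_len": 2 * n} for n in range(1, 101)]
print(summarize_lengths(trace))
```

The tail percentiles matter more than the mean here, because long outputs are what drive KV-cache growth and therefore how many requests a worker can hold at once.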

Optimization Features

Token Efficiency

  • ATGT SLO-based limits

Infra Optimization

  • Predict minimal GPU count per arrival rate
  • Distributed grouped schedulers for high arrival rates

System Optimization

  • Best-fit bin-packing request placement
  • Per-worker tensor-parallel sizing
  • Rebalancing for prediction errors
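The paper's rebalancing algorithm is not reproduced here. The sketch below shows only one hypothetical way a scheduler could correct its view of a worker when output-length predictions turn out wrong, so that later placements compensate. The `WorkerLoad` class, the `report` callback, and the 1.25 headroom factor are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class WorkerLoad:
    kv_capacity: int
    reserved: Dict[int, int] = field(default_factory=dict)  # req_id -> reserved KV tokens

    @property
    def kv_reserved(self) -> int:
        return sum(self.reserved.values())

    @property
    def headroom(self) -> int:
        """Tokens still available to new arrivals; negative means overcommitted."""
        return self.kv_capacity - self.kv_reserved

    def admit(self, req_id: int, input_len: int, predicted_output_len: int) -> None:
        self.reserved[req_id] = input_len + predicted_output_len

    def report(self, req_id: int, kv_tokens_now: int, finished: bool) -> int:
        """Apply an observed length and return the signed change in reserved tokens."""
        before = self.reserved[req_id]
        if finished:
            del self.reserved[req_id]  # release everything, including unused slack
            return -before
        if kv_tokens_now >= before:
            # Prediction was too short: reserve extra headroom (factor is hypothetical).
            self.reserved[req_id] = int(kv_tokens_now * 1.25)
        return self.reserved[req_id] - before
```

A scheduler using this bookkeeping would place new requests against `headroom` instead of the original predictions, steering arrivals away from workers whose requests are running long and toward workers whose requests finished early.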

Inference Optimization

  • Dynamic batching (continuous batching)
  • KV-cache-aware packing
  • Split-phase decode placement

Reproducibility

Data Available

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Designed for single-model serving; multi-model cold starts are not handled.
  • Does not migrate requests once scheduled; live migration costs are unaddressed.
  • Assumes homogeneous GPUs and ignores switching/cold-start costs.
  • The output-length predictor used is simple; better predictors could improve results.

When Not To Use

  • When you must support many different models on the same cluster with frequent cold starts.
  • When real-time request migration between workers is required.
  • On extremely low-volume workloads where batching requests gives no benefit.

Failure Modes

  • Severe output-length prediction bias causing SLO violations and KV-cache overflow.
  • Underestimated inter-GPU communication overhead on non-tested network topologies.
  • Centralized scheduler becomes a bottleneck at very high arrival rates without switching to distributed grouping.

Core Entities

Models

  • Llama2-7b
  • Llama2-13b
  • Llama2-70b

Metrics

  • ATGT (average token generation time)
  • TTFT (time to first token)
  • TBT (time between tokens)
  • SLO attainment rate
  • GPU count required
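The latency metrics above can be computed from per-token timestamps, as in the sketch below. It assumes ATGT is the decode time divided by the number of tokens generated after the first, and that a request meets its SLO only if both TTFT and ATGT are within their limits; check both assumptions against the definitions you actually use.

```python
from typing import List, Sequence, Tuple


def ttft(arrival: float, token_times: Sequence[float]) -> float:
    """Time to first token: first token timestamp minus arrival time."""
    return token_times[0] - arrival


def atgt(token_times: Sequence[float]) -> float:
    """Assumed definition: decode time divided by tokens generated after the first."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def tbt(token_times: Sequence[float]) -> List[float]:
    """Time between each pair of consecutive tokens."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


def slo_attainment(requests: Sequence[Tuple[float, Sequence[float]]],
                   ttft_slo: float, atgt_slo: float) -> float:
    """Fraction of (arrival, token_times) requests meeting both SLOs."""
    met = sum(1 for arrival, times in requests
              if ttft(arrival, times) <= ttft_slo and atgt(times) <= atgt_slo)
    return met / len(requests)


# Hypothetical timestamps in seconds.
reqs = [(0.0, [0.5, 0.75, 1.0, 1.25]), (0.0, [2.0, 3.0])]
print(slo_attainment(reqs, ttft_slo=1.0, atgt_slo=0.5))
```

ATGT averages over the whole response, so a request can meet its ATGT SLO while individual TBT gaps spike; tracking both shows whether stalls are hidden inside an acceptable average.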

Datasets

  • ShareGPT_Vicuna_unfiltered (ShareGPT)

Context Entities

Models

  • Llama2-7b
  • Llama2-13b
  • Llama2-70b

Metrics

  • ATGT
  • TTFT

Datasets

  • ShareGPT_Vicuna_unfiltered (ShareGPT)