Cut serverless LLM cold-start latency 10–200× with loading-optimized local checkpoints, token-only live migration, and startup-aware scheduling

January 25, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

6

Authors

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai

Why It Matters For Business

Cutting cold-start time from minutes to seconds reduces user-visible latency, lowers GPU idle costs, and increases successful request completion for large models, improving SLA and capacity efficiency.

Summary TLDR

ServerlessLLM reduces serverless LLM startup latency with three techniques. It stores model checkpoints across a GPU server's DRAM/SSD tiers and reads them with a loading-optimized checkpoint format and a pipelined loader. It migrates live inferences by sending small token states instead of large kv-caches. It schedules each request on the server with the lowest estimated startup time. In microbenchmarks it loads checkpoints 3.6–8.2× faster than PyTorch/Safetensors and loads LoRA adapters 4.4× faster; in cluster workloads it cuts end-to-end latency by 10–200× versus Ray Serve / KServe on the evaluated traces and datasets.

Problem Statement

Serverless LLMs face long and unpredictable cold-starts. Checkpoints can reach hundreds of GB, so downloading one can take tens of seconds (e.g., 130GB ≈ 26s at 5GB/s), and loading it into GPU memory can take tens of seconds more (e.g., OPT-30B 34s, LLaMA-2-70B 84s). These delays break interactive SLOs, raise GPU costs, and limit request throughput for serverless LLM services.
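
The download figure quoted above is a plain size-over-bandwidth estimate. A minimal sketch of that arithmetic follows; the function name is illustrative and not taken from the paper.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to move `size_gb` gigabytes at a sustained bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


# The example from the text: a 130 GB checkpoint over a 5 GB/s link.
download_s = transfer_seconds(130, 5)
print(f"download: {download_s:.0f} s")  # download: 26 s
```

The same function can be pointed at a measured link speed to estimate the download share of a cold-start before any optimization work begins.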

Main Contribution

A loading-optimized checkpoint format and a multi-stage, chunked loader that fully uses in-server DRAM/SSD/PCIe bandwidth to speed model loads

A token-only multi-round live migration for LLM inference that transfers small token state and recomputes large kv-caches at destination to avoid heavy network transfers

A startup-time-optimized scheduler that models queueing, per-tier bandwidth, and migration/resume cost to pick servers that minimize time-to-first-token
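
The scheduling idea can be sketched as "estimate the startup time on every candidate server, then pick the minimum". The code below is a simplified illustration, not the paper's scheduler: the `Server` fields, the bandwidth table, and the additive cost model are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    queue_wait_s: float        # estimated wait before loading can begin
    tier: str                  # fastest tier holding the checkpoint
    migration_s: float = 0.0   # extra cost if a running inference must move first


# Hypothetical per-tier bandwidths in GB/s; real values must be measured.
TIER_BANDWIDTH_GB_PER_S = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}


def estimated_startup_s(server: Server, model_size_gb: float) -> float:
    """Queueing delay + load time from the server's tier + migration cost."""
    load_s = model_size_gb / TIER_BANDWIDTH_GB_PER_S[server.tier]
    return server.queue_wait_s + load_s + server.migration_s


def pick_server(servers, model_size_gb: float) -> Server:
    """Choose the server with the lowest estimated startup time."""
    return min(servers, key=lambda s: estimated_startup_s(s, model_size_gb))


servers = [
    Server("A", queue_wait_s=0.0, tier="dram", migration_s=2.0),
    Server("B", queue_wait_s=0.0, tier="ssd"),
    Server("C", queue_wait_s=0.0, tier="remote"),
]
best = pick_server(servers, model_size_gb=60)
print(best.name, estimated_startup_s(best, 60))  # A 5.0
```

In this toy setup, server A wins even though it must first migrate a running inference, because the checkpoint already sits in its DRAM.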

Key Findings

ServerlessLLM's checkpoint loader speeds cold loads 3.6–8.2× over existing loaders

Numbers: 3.6–8.2× faster loading (OPT-2.7B to LLaMA-2-70B)

LoRA adapter loading drops from 370ms to 83.5ms

Numbers: 4.4× faster (1GB LoRA adapter for LLaMA-70B)

End-to-end serverless latency improves dramatically versus Ray Serve / KServe

Numbers: 10–200× lower latency across tested workloads

Concrete model start example: OPT-6.7B

Numbers: 0.8s (ServerlessLLM) vs 12.1s (Ray Serve) average start

Large-model example: OPT-30B startup reduced from minutes to seconds

Numbers: 7.5s (ServerlessLLM) vs 213s (Ray Serve)

Request completion within timeout improves for large models

Numbers: 89% of requests succeed within 300s (ServerlessLLM) vs 26% (Ray Serve w/ Cache) for OPT-30B

Locality-aware scheduling plus token migration reduces P99 latency compared to preemption-based scheduling

Numbers: up to 1.95× lower P99 latency versus a preemption-based scheduler in stressed runs

Results

Checkpoint cold-load speedup

Value: 3.6–8.2× faster

Baseline: PyTorch / Safetensors

LoRA adapter load time

Value: 83.5 ms (ServerlessLLM)

Baseline: 370 ms (Safetensors)

OPT-6.7B average startup

Value: 0.8 s (ServerlessLLM)

Baseline: 12.1 s (Ray Serve)

OPT-30B average startup

Value: 7.5 s (ServerlessLLM)

Baseline: 213 s (Ray Serve)

End-to-end latency improvement vs Ray Serve / KServe

Value: 10–200× lower latency

Baseline: Ray Serve / Ray Serve w/ Cache / KServe
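
The value/baseline pairs above can be turned into speedup ratios directly. The short script below does only that arithmetic; the ratios are derived from the figures listed here and are not additional measurements from the paper.

```python
# (ServerlessLLM value, baseline value) taken from the results listed above.
results = {
    "LoRA adapter load (ms)": (83.5, 370.0),
    "OPT-6.7B startup (s)": (0.8, 12.1),
    "OPT-30B startup (s)": (7.5, 213.0),
}

speedups = {name: baseline / ours for name, (ours, baseline) in results.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
# LoRA adapter load (ms): 4.4x
# OPT-6.7B startup (s): 15.1x
# OPT-30B startup (s): 28.4x
```

Both startup ratios fall inside the reported 10–200× end-to-end range, although startup time and end-to-end latency are different metrics and should not be compared one-to-one.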

Who Should Care

What To Try In 7 Days

Measure current model cold-start time end-to-end (download + load) to get a baseline

Convert one popular model to a sequential chunked, loading-optimized checkpoint and test direct-I/O + pinned memory copy

Prototype token-only migration for long-running inferences to avoid transferring large kv-caches over the network
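
As a starting point for the second step, the sketch below shows the general shape of a chunked, pipelined read: one thread reads fixed-size chunks from disk while another consumes them. It is a deliberate simplification and not the ServerlessLLM loader. It uses unbuffered Python file reads rather than true direct I/O, and a `bytearray` stands in for the pinned-memory and GPU copy stages. The function name and parameters are hypothetical.

```python
import queue
import threading


def pipelined_load(path, chunk_bytes=16 * 1024 * 1024, depth=4):
    """Read `path` in chunks on one thread while another thread consumes them."""
    chunks = queue.Queue(maxsize=depth)  # bounded, so the reader cannot run far ahead
    loaded = bytearray()                 # stand-in for the destination GPU buffer

    def reader():
        try:
            # buffering=0 skips Python's buffer; it is NOT O_DIRECT.
            with open(path, "rb", buffering=0) as f:
                while True:
                    chunk = f.read(chunk_bytes)
                    if not chunk:
                        break
                    chunks.put(chunk)
        finally:
            chunks.put(None)             # sentinel: always unblock the consumer

    def consumer():
        while True:
            chunk = chunks.get()
            if chunk is None:
                break
            # A real loader would copy into pinned memory, then DMA to the GPU.
            loaded.extend(chunk)

    threads = [threading.Thread(target=reader), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(loaded)
```

Wrapping a call to `pipelined_load` in `time.perf_counter()` measurements gives a rough local-load number to set against the baseline from the first step.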

Optimization Features

Token Efficiency

  • migrate tokens instead of kv-cache

Infra Optimization

  • use in-server DRAM/SSD RAID bandwidth
  • startup-time-aware scheduling

Model Optimization

  • loading-optimized checkpoint format
  • chunk-based partitioning per-GPU

System Optimization

  • direct I/O reads
  • pinned memory GPU DMA
  • multi-stage pipelined loader

Inference Optimization

  • token-only live migration
  • kv-cache recomputation at destination
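
The interaction between token-only migration and kv-cache recomputation can be illustrated with a toy simulation. Everything here is an assumption made for illustration: `ToyWorker`, the dummy decoding rule, and the `prefill_speedup` parameter, which models the destination recomputing tokens several times faster than the source decodes new ones. It is not the paper's protocol implementation.

```python
class ToyWorker:
    """Stand-in for an inference server; the 'kv-cache' holds one entry per token."""

    def __init__(self):
        self.tokens = []
        self.kv_cache = []

    def prefill(self, new_tokens):
        # Rebuild cache entries from token ids alone (no cache is transferred).
        for tok in new_tokens:
            self.kv_cache.append((len(self.tokens), tok))
            self.tokens.append(tok)

    def decode_one(self):
        # Deterministic dummy "generation" step.
        self.prefill([(sum(self.tokens) + 1) % 1000])


def migrate(src, dst, max_gap=1, prefill_speedup=4):
    """Multi-round migration: send only tokens, let dst recompute its cache."""
    if prefill_speedup < 2:
        raise ValueError("destination must recompute faster than source decodes")
    sent, rounds = 0, 0
    while True:
        rounds += 1
        pending = src.tokens[sent:]      # tokens dst has not seen yet
        dst.prefill(pending)             # dst recomputes kv-cache for them
        sent += len(pending)
        if len(pending) <= max_gap:      # gap was small: src is paused, dst takes over
            return rounds
        # Otherwise src kept decoding while dst was busy recomputing.
        for _ in range(len(pending) // prefill_speedup):
            src.decode_one()


src, dst = ToyWorker(), ToyWorker()
src.prefill(list(range(20)))             # source has a 20-token context
rounds = migrate(src, dst)
print(rounds, dst.tokens == src.tokens)  # 3 True
```

Each round shrinks the gap between source and destination, so the final hand-off only has to pause the source for a handful of tokens.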

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Performance depends on available host DRAM/SSD and PCIe bandwidth; constrained clusters limit gains
  • Scheduler assumes accurate bandwidth/queue estimates; CUDA driver cleanup can cause occasional underestimates
  • Paper does not optimize global checkpoint placement across the cluster
  • Live migration relies on recomputation cost being cheaper than transfer; cost model may vary by model and hardware

When Not To Use

  • Small single-GPU deployments with no spare host memory or SSD bandwidth
  • Environments where moving tokens and recomputing kv-cache is more expensive than raw transfer (very high-bandwidth networks and very short prompts)
  • Systems that already keep all hot models permanently resident on GPUs (no cold-starts)
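
The second condition above can be checked with a back-of-the-envelope comparison of kv-cache transfer time against recomputation time. The sketch below assumes standard multi-head attention, where each token stores one key and one value vector per layer. The model shape, network bandwidth, and prefill throughput are illustrative placeholders rather than measurements, and the time to send the token ids themselves is ignored because it is tiny.

```python
def kv_cache_bytes(n_tokens, n_layers, hidden_size, bytes_per_value=2):
    """Keys + values (factor 2) for every layer and token, fp16 by default."""
    return 2 * n_layers * hidden_size * bytes_per_value * n_tokens


def prefer_token_migration(n_tokens, n_layers, hidden_size,
                           network_bytes_per_s, prefill_tokens_per_s):
    """True when recomputing the kv-cache beats shipping it over the network."""
    transfer_s = kv_cache_bytes(n_tokens, n_layers, hidden_size) / network_bytes_per_s
    recompute_s = n_tokens / prefill_tokens_per_s
    return recompute_s < transfer_s


# Hypothetical 48-layer model with hidden size 7168.
# Long context on a 10 Gbit/s (1.25 GB/s) network: recomputation wins.
print(prefer_token_migration(1000, 48, 7168, 1.25e9, 5000))  # True
# Very short prompt on a 100 GB/s interconnect: raw transfer wins.
print(prefer_token_migration(10, 48, 7168, 100e9, 5000))     # False
```

Plugging in measured bandwidth and prefill throughput for a specific deployment shows which side of this trade-off it falls on.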

Failure Modes

  • Network or destination failure during migration can lose the resumed state or force extra recomputation
  • Scheduler underestimation of cleanup/migration costs causing unexpected latency spikes
  • Exhaustion of server DRAM/SSD leading to remote downloads and long delays

Core Entities

Models

  • OPT
  • LLaMA-2
  • Falcon

Metrics

  • model startup latency
  • first-token latency
  • per-token latency
  • P95/P99 latency
  • bandwidth utilization

Datasets

  • GSM8K
  • ShareGPT
  • Azure Serverless Trace (workload trace)

Context Entities

Benchmarks

  • Azure Serverless Trace