Cut serverless LLM cold-starts 10–200× with local checkpoint formats, token-only live migration, and startup-aware scheduling

January 25, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The system is implemented and evaluated on real clusters and traces; it shows large, reproducible speedups, but depends on available host storage/bandwidth and scheduler integration.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai

Why It Matters For Business

Cutting cold-start time from minutes to seconds reduces user-visible latency, lowers GPU idle costs, and increases successful request completion for large models, improving SLA and capacity efficiency.

Summary TLDR

ServerlessLLM reduces serverless LLM startup latency in three ways. It stores model checkpoints across a GPU server's DRAM/SSD tiers and loads them with a loading-optimized checkpoint format and a pipelined loader. It migrates live inferences by sending small token states rather than large kv-caches. It places each model on the server with the lowest estimated startup time. In microbenchmarks it loads checkpoints 3.6–8.2× faster than PyTorch/Safetensors and speeds LoRA adapter loads 4.4×; in cluster workloads it cuts end-to-end latency by 10–200× versus Ray Serve / KServe on the evaluated traces and datasets.
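To make the scheduling idea concrete, here is a minimal sketch of picking the server with the lowest estimated startup time. The tier bandwidths, the `Server` fields, and the cost model (queue wait plus checkpoint size divided by tier bandwidth) are illustrative assumptions, not the paper's actual estimator.

```python
from dataclasses import dataclass

# Hypothetical per-tier bandwidths in GB/s; illustrative, not measured values.
TIER_BANDWIDTH_GB_S = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}

@dataclass
class Server:
    name: str
    tier: str            # fastest tier on this server that holds the checkpoint
    queue_wait_s: float  # estimated wait behind loads already queued

def estimated_startup_s(server: Server, checkpoint_gb: float) -> float:
    """Queue wait plus transfer time from the tier holding the checkpoint."""
    return server.queue_wait_s + checkpoint_gb / TIER_BANDWIDTH_GB_S[server.tier]

def pick_server(servers, checkpoint_gb):
    """Choose the server whose estimated startup time is smallest."""
    return min(servers, key=lambda s: estimated_startup_s(s, checkpoint_gb))

servers = [
    Server("a", "remote", 0.0),   # idle, but must download the checkpoint
    Server("b", "ssd", 2.0),      # checkpoint on local SSD, short queue
    Server("c", "dram", 30.0),    # checkpoint in DRAM, but a long queue
]
best = pick_server(servers, checkpoint_gb=130.0)
print(best.name, estimated_startup_s(best, 130.0))
```

In this toy setup the SSD-backed server wins: the DRAM copy loads fastest but sits behind a long queue, which is the kind of trade-off a startup-aware scheduler has to weigh.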

Problem Statement

Serverless LLMs face very long and unpredictable cold-starts. Checkpoints can reach hundreds of GB, downloads can take tens of seconds (e.g., 130 GB ≈ 26 s at 5 GB/s), and loading into GPU memory can take tens of seconds more (e.g., 34 s for OPT-30B, 84 s for LLaMA-2-70B). These delays break interactive SLOs, raise GPU costs, and limit request throughput for serverless LLM services.
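The arithmetic behind these figures can be checked in a few lines. This sketch assumes the 130 GB checkpoint is the LLaMA-2-70B model and that download and GPU load run back to back without overlap; both are assumptions made for illustration.

```python
def download_seconds(checkpoint_gb: float, bandwidth_gb_per_s: float) -> float:
    """Best-case download time at a sustained bandwidth."""
    return checkpoint_gb / bandwidth_gb_per_s

download_s = download_seconds(130, 5)  # 130 GB at 5 GB/s
load_s = 84                            # LLaMA-2-70B GPU load time quoted above
total_s = download_s + load_s          # assumes no overlap between the two stages
print(download_s, total_s)
```

Under these assumptions a single cold start costs well over a minute before the first token is produced.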

Main Contribution

A loading-optimized checkpoint format and a multi-stage, chunked loader that fully uses in-server DRAM/SSD/PCIe bandwidth to speed model loads

A token-only multi-round live migration for LLM inference that transfers small token state and recomputes large kv-caches at destination to avoid heavy network transfers
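The multi-round, token-only migration can be illustrated with a toy simulation. Everything here is a sketch under stated assumptions: `recompute_kv` stands in for a prefill pass, `recompute_speedup` is a hypothetical ratio saying the destination recomputes that many tokens in the time the source decodes one, and `max_gap` is a hypothetical threshold at which the source pauses for the final hand-off.

```python
def recompute_kv(tokens):
    """Stand-in for a prefill pass at the destination: one cache entry per token."""
    return [("kv", t) for t in tokens]

def migrate_tokens(source_tokens, decode_step, recompute_speedup=10, max_gap=2):
    """Send only tokens each round; the destination rebuilds its kv-cache locally."""
    if recompute_speedup < 2:
        raise ValueError("recompute must outpace decoding or the gap never closes")
    dest_tokens, dest_cache, rounds = [], [], 0
    while True:
        # Ship only the tokens the destination has not seen yet.
        delta = source_tokens[len(dest_tokens):]
        final_round = len(delta) <= max_gap  # small gap: source pauses for this round
        dest_tokens.extend(delta)
        dest_cache.extend(recompute_kv(delta))
        rounds += 1
        if final_round:
            return dest_tokens, dest_cache, rounds
        # While the destination was recomputing, the source kept decoding.
        for _ in range(len(delta) // recompute_speedup):
            source_tokens.append(decode_step(source_tokens))

source = list(range(100))  # 100 tokens already processed at the source
tokens, cache, rounds = migrate_tokens(source, decode_step=len)
print(rounds, len(tokens), tokens == source)
```

Each round transfers a shrinking list of token ids instead of the cache itself, so the source only has to pause for the last, very small delta.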

Key Findings

ServerlessLLM's checkpoint loader speeds cold loads 3.6–8.2× over existing loaders

Numbers: 3.6–8.2× faster loading (OPT-2.7B to LLaMA-2-70B)

Practical Use: Replace naive torch/safetensors loading with a chunked direct-I/O loader to cut model startup time severalfold

Evidence Ref: §7.2, Fig. 6a

LoRA adapter loading drops from 370ms to 83.5ms

Numbers: 4.4× faster (1GB LoRA adapter for LLaMA-70B)

Practical Use: Using the loader speeds even small adapter loads, enabling faster dynamic adapter swaps

Evidence Ref: §7.2
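The chunked, pipelined loading idea behind both findings can be sketched with the standard library alone. This is not ServerlessLLM's loader: it uses ordinary unbuffered reads rather than true direct I/O (on Linux, `os.O_DIRECT` additionally requires aligned buffers), a `bytearray` stands in for a pinned host or GPU buffer, and `CHUNK_BYTES` and the queue depth are hypothetical values.

```python
import os
import queue
import tempfile
import threading

CHUNK_BYTES = 1 << 20  # 1 MiB chunks; a hypothetical size, tune for real hardware

def read_chunks(path, out_q):
    """Stage 1: read the file sequentially in fixed-size chunks."""
    # buffering=0 skips Python's own buffer; it is not the same as O_DIRECT.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(CHUNK_BYTES)
            if not chunk:
                break
            out_q.put(chunk)
    out_q.put(None)  # sentinel: no more chunks

def load_pipelined(path):
    """Stage 2 copies chunks into the destination while stage 1 keeps reading."""
    q = queue.Queue(maxsize=4)  # bounded queue caps the number of in-flight chunks
    reader = threading.Thread(target=read_chunks, args=(path, q))
    reader.start()
    dest = bytearray(os.path.getsize(path))  # stand-in for a pinned/GPU buffer
    offset = 0
    while (chunk := q.get()) is not None:
        dest[offset:offset + len(chunk)] = chunk
        offset += len(chunk)
    reader.join()
    return bytes(dest)

# Demo with a small temporary "checkpoint".
payload = os.urandom(3 * CHUNK_BYTES + 123)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
loaded = load_pipelined(tmp.name)
os.remove(tmp.name)
print(loaded == payload)
```

The point of the two stages is overlap: the copy of one chunk proceeds while the next chunk is still being read, so neither the storage device nor the copy path sits idle.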

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Checkpoint cold-load speedup | 3.6–8.2× faster | PyTorch / Safetensors | - | OPT, LLaMA-2, Falcon checkpoints (FP16) on RAID0 NVMe | §7.2, Fig. 6a | §7.2
LoRA adapter load time | 83.5 ms | Safetensors 370 ms | 4.4× faster | 1GB LoRA adapter for LLaMA-70B | §7.2 (LoRA experiment) | §7.2

What To Try In 7 Days

Measure current model cold-start time end-to-end (download + load) to get a baseline

Convert one popular model to a sequential chunked, loading-optimized checkpoint and test direct-I/O + pinned memory copy

Prototype token-only migration for long-running inferences to avoid transferring large kv-caches over the network
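For the first step, a small timing harness is enough to get a per-stage baseline. The two stage functions below are hypothetical placeholders that just sleep; in a real measurement they would be replaced with the deployment's own download and load calls.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record wall-clock seconds spent inside the with-block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

def download_checkpoint():
    time.sleep(0.02)  # placeholder for the real checkpoint download

def load_to_gpu():
    time.sleep(0.01)  # placeholder for the real load into GPU memory

with timed("download"):
    download_checkpoint()
with timed("load"):
    load_to_gpu()
timings["total"] = timings["download"] + timings["load"]

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.3f} s")
```

Recording the stages separately shows whether the network or the in-server load path dominates, which decides which of the later steps is worth trying first.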

Optimization Features

Token Efficiency: migrate tokens instead of kv-cache
Infra Optimization: use in-server DRAM/SSD RAID bandwidth, startup-time-aware scheduling
Model Optimization: loading-optimized checkpoint format, chunk-based partitioning per-GPU
System Optimization: direct I/O reads, pinned memory GPU DMA, multi-stage pipelined loader
Inference Optimization: token-only live migration, kv-cache recomputation at destination

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Performance depends on available host DRAM/SSD and PCIe bandwidth; constrained clusters limit gains

Scheduler assumes accurate bandwidth/queue estimates; CUDA driver cleanup can cause it to occasionally underestimate startup time

When Not To Use

Small single-GPU deployments with no spare host memory or SSD bandwidth

Environments where moving tokens and recomputing kv-cache is more expensive than raw transfer (very high-bandwidth networks and very short prompts)
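A back-of-the-envelope comparison shows where the second boundary lies. The kv-cache size formula (keys and values for every layer, kv head, and head dimension) is standard; the model shape, network bandwidths, and recompute throughput below are illustrative assumptions, not measurements from the paper.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values (factor 2) for every layer, kv head and head dimension."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

def kv_transfer_s(n_tokens, net_bytes_per_s, **shape):
    """Time to ship the whole kv-cache over the network."""
    return kv_cache_bytes(n_tokens, **shape) / net_bytes_per_s

def token_migration_s(n_tokens, net_bytes_per_s, recompute_tokens_per_s,
                      bytes_per_token=4):
    """Time to ship token ids, then rebuild the kv-cache at the destination."""
    send_s = n_tokens * bytes_per_token / net_bytes_per_s
    return send_s + n_tokens / recompute_tokens_per_s

shape = dict(n_layers=80, n_kv_heads=8, head_dim=128)  # illustrative model shape
n = 2000           # tokens processed so far
slow_net = 1.25e9  # 10 Gb/s expressed in bytes/s
fast_net = 50e9    # 400 Gb/s expressed in bytes/s
recompute = 5000   # hypothetical recompute throughput in tokens/s

for label, net in (("slow", slow_net), ("fast", fast_net)):
    print(label,
          round(kv_transfer_s(n, net, **shape), 4),
          round(token_migration_s(n, net, recompute), 4))
```

With these numbers token migration wins on the slower network and loses on the faster one, because recomputation time does not shrink as the network gets faster.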

Failure Modes

Network or destination failure during migration, causing loss of the migrated state or extra recomputation

Scheduler underestimation of cleanup/migration costs causing unexpected latency spikes

Core Entities

Models

OPT, LLaMA-2, Falcon

Metrics

model startup latency, first-token latency, per-token latency, P95/P99 latency, bandwidth utilization

Datasets

GSM8K, ShareGPT, Azure Serverless Trace (workload trace)

Context Entities

Benchmarks

Azure Serverless Trace