Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
6
Why It Matters For Business
Cutting cold-start time from minutes to seconds reduces user-visible latency, lowers GPU idle costs, and increases successful request completion for large models, improving SLA attainment and capacity efficiency.
Summary TLDR
ServerlessLLM reduces serverless LLM startup latency with four techniques: it stores model checkpoints across a GPU server's DRAM/SSD tiers; it loads them with a loading-optimized checkpoint format and a pipelined loader; it migrates live inferences by sending small token states rather than large kv-caches; and it picks the server with the lowest estimated startup time. In microbenchmarks it loads checkpoints 3.6–8.2× faster than PyTorch/Safetensors and speeds up LoRA adapter loads by 4.4×. In cluster workloads it cuts end-to-end latency by 10–200× versus Ray Serve / KServe on the evaluated traces and datasets.
Problem Statement
Serverless LLMs face very long and unpredictable cold-starts. Checkpoints can reach hundreds of GB. Downloading one can take tens of seconds (e.g., 130GB ≈ 26s at 5GB/s), and loading it into GPU memory can take tens of seconds more (e.g., OPT-30B 34s, LLaMA-2-70B 84s). These delays break interactive SLOs, raise GPU costs, and limit request throughput for serverless LLM services.
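The download figure in the problem statement is plain arithmetic: transfer time is checkpoint size divided by sustained bandwidth. A minimal sketch, where the function name is illustrative and latency and protocol overhead are ignored:

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to move size_gb at a sustained bandwidth (no latency or protocol overhead)."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


# The 130GB checkpoint at 5GB/s from the example above.
download_s = transfer_seconds(130, 5)
print(f"download: {download_s:.0f}s")
```

The same function can be reused for any storage tier by substituting that tier's measured bandwidth.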
Main Contribution
A loading-optimized checkpoint format and a multi-stage, chunked loader that fully uses in-server DRAM/SSD/PCIe bandwidth to speed model loads
A token-only multi-round live migration for LLM inference that transfers small token state and recomputes large kv-caches at destination to avoid heavy network transfers
A startup-time-optimized scheduler that models queueing, per-tier bandwidth, and migration/resume cost to pick servers that minimize time-to-first-token
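The scheduling idea in the last contribution can be sketched as a small cost model. This is an illustration under stated assumptions, not the paper's implementation: the `ServerState` fields, the tier names, and the bandwidth numbers are all hypothetical, and the estimate is simply queue wait plus load time plus any migration cost.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    name: str
    queue_wait_s: float            # estimated wait behind loads already queued on this server
    tier: str                      # fastest tier holding the checkpoint: "dram", "ssd", or "remote"
    migration_cost_s: float = 0.0  # extra delay if a running inference must be migrated away first


# Illustrative per-tier bandwidths in GB/s (assumed values, not measurements).
TIER_BANDWIDTH_GB_PER_S = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}


def estimate_startup_s(server: ServerState, model_size_gb: float) -> float:
    """Estimated startup time: queueing + loading from the server's tier + migration."""
    load_s = model_size_gb / TIER_BANDWIDTH_GB_PER_S[server.tier]
    return server.queue_wait_s + load_s + server.migration_cost_s


def pick_server(servers: list, model_size_gb: float) -> ServerState:
    """Choose the server with the lowest estimated startup time."""
    return min(servers, key=lambda s: estimate_startup_s(s, model_size_gb))


servers = [
    ServerState("a", queue_wait_s=0.0, tier="remote"),
    ServerState("b", queue_wait_s=2.0, tier="ssd"),
    ServerState("c", queue_wait_s=0.0, tier="dram", migration_cost_s=3.0),
]
best = pick_server(servers, model_size_gb=20.0)
print(best.name, estimate_startup_s(best, 20.0))
```

In this toy setup, the server that holds the checkpoint in DRAM can win even after paying a migration cost, which is the trade-off the scheduler is meant to capture.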
Key Findings
ServerlessLLM's checkpoint loader speeds cold loads 3.6–8.2× over existing loaders
LoRA adapter loading drops from 370ms to 83.5ms
End-to-end serverless latency improves dramatically versus Ray Serve / KServe
Concrete model start example: OPT-6.7B
Large-model example: OPT-30B startup reduced from minutes to seconds
Request completion within timeout improves for large models
Locality-aware scheduling plus token migration reduces P99 compared to preemption
Results
Checkpoint cold-load speedup
LoRA
OPT-6.7B average startup
OPT-30B average startup
End-to-end latency improvement vs Ray Serve family
Who Should Care
What To Try In 7 Days
Measure current model cold-start time end-to-end (download + load) to get a baseline
Convert one popular model to a sequential chunked, loading-optimized checkpoint and test direct-I/O + pinned memory copy
Prototype token-only migration for long-running inferences to avoid transferring large kv-caches over the network
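For the first step above, a baseline can be collected with a small timing harness. This is a sketch: `download` and `load` are placeholder callables that you would replace with your own fetch and model-load code.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage(name, timings):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


def measure_cold_start(download, load):
    """Time the download and load stages separately and report their sum."""
    timings = {}
    with stage("download", timings):
        path = download()
    with stage("load", timings):
        load(path)
    timings["total"] = timings["download"] + timings["load"]
    return timings


# Placeholder stages; swap in real checkpoint download and GPU load calls.
result = measure_cold_start(lambda: "/tmp/checkpoint", lambda path: None)
print(result)
```

Timing the two stages separately shows whether the network or the in-server load path dominates, which determines which optimization to try first.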
Optimization Features
Token Efficiency
- migrate tokens instead of kv-cache
Infra Optimization
- use in-server DRAM/SSD RAID bandwidth
- startup-time-aware scheduling
Model Optimization
- loading-optimized checkpoint format
- chunk-based partitioning per-GPU
System Optimization
- direct I/O reads
- pinned memory GPU DMA
- multi-stage pipelined loader
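The pipelining idea behind the loader can be illustrated with a two-stage sketch in plain Python. It is a simplification under stated assumptions: an unbuffered file read stands in for direct I/O, a `bytearray` stands in for pinned memory and the GPU copy, and the function name and parameters are hypothetical.

```python
import queue
import threading


def pipelined_load(path, chunk_bytes=1 << 20, depth=4):
    """Read a file in fixed-size chunks on one thread while another consumes them."""
    chunks = queue.Queue(maxsize=depth)  # bounded: the reader runs at most `depth` chunks ahead
    errors = []

    with open(path, "rb", buffering=0) as f:
        def reader():
            try:
                while True:
                    chunk = f.read(chunk_bytes)
                    if not chunk:
                        break
                    chunks.put(chunk)
            except Exception as exc:  # surface read failures to the caller
                errors.append(exc)
            finally:
                chunks.put(None)  # sentinel so the consumer always terminates

        thread = threading.Thread(target=reader, daemon=True)
        thread.start()

        dest = bytearray()  # stand-in for the destination GPU buffer
        while (chunk := chunks.get()) is not None:
            dest.extend(chunk)  # stand-in for the host-to-GPU copy stage
        thread.join()

    if errors:
        raise errors[0]
    return bytes(dest)
```

Because the queue is bounded, the read stage and the copy stage overlap without the reader buffering the whole checkpoint in memory.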
Inference Optimization
- token-only live migration
- kv-cache recomputation at destination
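Token-only migration with recomputation at the destination can be sketched with a toy model. Everything here is illustrative: `toy_kv_entry` stands in for the real per-token key/value tensors, and the class and function names are hypothetical. The point is that only token IDs cross the network and the destination rebuilds the cache from them.

```python
from dataclasses import dataclass, field


def toy_kv_entry(token_id: int, position: int) -> int:
    """Stand-in for the per-token key/value tensors, which are large in a real model."""
    return (token_id * 31 + position) % 1000


@dataclass
class InferenceState:
    tokens: list = field(default_factory=list)
    kv_cache: list = field(default_factory=list)

    def append(self, new_tokens):
        """Add tokens and compute cache entries only for the new positions."""
        start = len(self.tokens)
        self.tokens.extend(new_tokens)
        self.kv_cache.extend(toy_kv_entry(t, start + i) for i, t in enumerate(new_tokens))


def export_for_migration(state: InferenceState) -> list:
    """Migration payload: token IDs only; the kv-cache stays on the source."""
    return list(state.tokens)


def resume_on_destination(tokens: list) -> InferenceState:
    """Rebuild the kv-cache at the destination from the received tokens."""
    state = InferenceState()
    state.append(tokens)
    return state


# Round 1: send all tokens produced so far.
source = InferenceState()
source.append([5, 7, 11])
dest = resume_on_destination(export_for_migration(source))

# Round 2: the source generated one more token meanwhile; send only the delta.
source.append([13])
dest.append(source.tokens[len(dest.tokens):])
print(dest.tokens, dest.kv_cache == source.kv_cache)
```

Each later round carries only the tokens generated since the previous round, so the gap between source and destination shrinks until the destination can take over.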
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Performance depends on available host DRAM/SSD and PCIe bandwidth; constrained clusters limit gains
- Scheduler assumes accurate bandwidth/queue estimates; CUDA driver cleanup can cause occasional underestimates
- Paper does not optimize global checkpoint placement across the cluster
- Live migration relies on recomputation cost being cheaper than transfer; cost model may vary by model and hardware
When Not To Use
- Small single-GPU deployments with no spare host memory or SSD bandwidth
- Environments where moving tokens and recomputing kv-cache is more expensive than raw transfer (very high-bandwidth networks and very short prompts)
- Systems that already keep all hot models permanently resident on GPUs (no cold-starts)
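The recompute-versus-transfer boundary described above can be checked with a simple linear cost model. This is an assumption-laden sketch, not the paper's model: all parameter names and example numbers are hypothetical, and the cost of sending the token IDs themselves is treated as negligible.

```python
def migration_choice(
    n_tokens: int,
    kv_bytes_per_token: float,
    network_bytes_per_s: float,
    prefill_tokens_per_s: float,
    recompute_overhead_s: float = 0.0,
    transfer_overhead_s: float = 0.0,
):
    """Compare recomputing the kv-cache at the destination with transferring it."""
    transfer_s = transfer_overhead_s + n_tokens * kv_bytes_per_token / network_bytes_per_s
    recompute_s = recompute_overhead_s + n_tokens / prefill_tokens_per_s
    choice = "recompute" if recompute_s <= transfer_s else "transfer"
    return choice, recompute_s, transfer_s


# Illustrative numbers only: 1 MB of kv-cache per token, 5000 tokens/s prefill.
typical = migration_choice(1000, 1e6, 1e9, 5000)        # roughly 1 GB/s network
fast_network = migration_choice(1000, 1e6, 1e11, 5000)  # roughly 100 GB/s network
short_prompt = migration_choice(10, 1e6, 1e9, 5000, recompute_overhead_s=0.05)
print(typical[0], fast_network[0], short_prompt[0])
```

With these made-up inputs, recomputation wins on an ordinary network, while a very fast network or a very short prompt with fixed recompute overhead tips the choice toward raw transfer.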
Failure Modes
- Network or destination failure during migration, causing loss of the resumed state or extra recomputation
- Scheduler underestimation of cleanup/migration costs causing unexpected latency spikes
- Exhaustion of server DRAM/SSD leading to remote downloads and long delays
Core Entities
Models
- OPT
- LLaMA-2
- Falcon
Metrics
- model startup latency
- first-token latency
- per-token latency
- P95/P99 latency
- bandwidth utilization
Datasets
- GSM8K
- ShareGPT
- Azure Serverless Trace (workload trace)
Context Entities
Benchmarks
- Azure Serverless Trace

