Overview
The system is implemented and evaluated on real clusters and traces; it shows large, reproducible speedups, but depends on available host storage/bandwidth and scheduler integration.
Citations: 6
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 3/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Cutting cold-start time from minutes to seconds reduces user-visible latency, lowers GPU idle costs, and increases successful request completion for large models, improving SLA and capacity efficiency.
Summary TLDR
ServerlessLLM reduces startup latency for serverless LLM inference by combining four techniques: it stores model checkpoints across a GPU server's DRAM and SSD tiers, loads them with a loading-optimized checkpoint format and a pipelined loader, migrates live inferences by sending small token states instead of large kv-caches, and places each request on the server with the lowest estimated startup time. In microbenchmarks it loads checkpoints 3.6–8.2× faster than PyTorch/Safetensors and speeds LoRA adapter loads 4.4×. In cluster workloads it cuts end-to-end latency by 10–200× versus Ray Serve / KServe on the evaluated traces and datasets.
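The server-selection idea can be illustrated with a minimal sketch. The cost model here (queue wait plus checkpoint size divided by the bandwidth of the tier holding the checkpoint) is an assumption made for illustration, as are the names `ServerEstimate`, `estimated_startup_s`, and `pick_server`; the summary only states that the scheduler picks the lowest estimated startup time and relies on bandwidth and queue estimates.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerEstimate:
    """Hypothetical per-server inputs to the startup-time estimate."""
    name: str
    queue_wait_s: float        # assumed: time until this server can begin loading
    bandwidth_gb_per_s: float  # assumed: bandwidth of the tier that holds the checkpoint


def estimated_startup_s(server: ServerEstimate, checkpoint_gb: float) -> float:
    # Assumed cost model: wait in the queue, then stream the checkpoint.
    return server.queue_wait_s + checkpoint_gb / server.bandwidth_gb_per_s


def pick_server(servers: List[ServerEstimate], checkpoint_gb: float) -> ServerEstimate:
    # Choose the server whose estimated startup time is lowest.
    return min(servers, key=lambda s: estimated_startup_s(s, checkpoint_gb))


servers = [
    ServerEstimate("dram-hit", queue_wait_s=2.0, bandwidth_gb_per_s=20.0),
    ServerEstimate("ssd-hit", queue_wait_s=0.0, bandwidth_gb_per_s=5.0),
]
best = pick_server(servers, checkpoint_gb=60.0)
```

With these made-up numbers, a busy server that holds the checkpoint in a faster tier can still win over an idle server with a slower tier, which is why the estimate includes both queue and bandwidth terms.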
Problem Statement
Serverless LLMs face long, unpredictable cold starts. Checkpoints can reach hundreds of GB, so downloading one can take tens of seconds (e.g., 130 GB ≈ 26 s at 5 GB/s), and loading it into GPU memory can take tens of seconds more (e.g., 34 s for OPT-30B, 84 s for LLaMA-2-70B). These delays break interactive SLOs, raise GPU costs, and limit request throughput for serverless LLM services.
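The download figure above is plain size-over-bandwidth arithmetic. A small helper makes it easy to redo for other checkpoint sizes and links; `transfer_seconds` is a hypothetical name, and the model assumes sustained bandwidth with no protocol or storage overhead.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal time to move `size_gb` at a sustained bandwidth (overheads ignored)."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


# The example from the text: a 130 GB checkpoint over a 5 GB/s link.
download_s = transfer_seconds(130, 5)  # 26.0 seconds
```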
Main Contribution
A loading-optimized checkpoint format and a multi-stage, chunked loader that fully uses in-server DRAM/SSD/PCIe bandwidth to speed model loads
A token-only multi-round live migration for LLM inference that transfers small token state and recomputes large kv-caches at destination to avoid heavy network transfers
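The multi-round, token-only migration can be sketched as a toy simulation. Everything below is an illustrative assumption rather than the system's implementation: the function name `migrate_token_only`, the `next_token` callback, and in particular the convergence model, which assumes the destination recomputes the kv-cache `recompute_speedup` times faster than the source decodes new tokens.

```python
from typing import Callable, List, Tuple


def migrate_token_only(
    tokens: List[int],
    next_token: Callable[[List[int]], int],
    recompute_speedup: int = 10,  # assumed: kv-cache recompute is 10x faster than decoding
    max_gap: int = 1,             # hand over once a round sends at most this many tokens
    max_rounds: int = 8,
) -> Tuple[List[int], int]:
    """Simulate migration that ships token ids only; returns (dest_tokens, rounds)."""
    source = list(tokens)
    dest: List[int] = []  # destination's token view; its kv-cache is rebuilt from these
    rounds = 0
    while True:
        rounds += 1
        # Send only the tokens the destination has not seen yet.
        delta = source[len(dest):]
        dest.extend(delta)
        if len(delta) <= max_gap or rounds >= max_rounds:
            # Final round: the source is paused, so the destination is fully caught up.
            break
        # While the destination recomputes, the source keeps decoding (ceil division).
        produced = (len(delta) + recompute_speedup - 1) // recompute_speedup
        for _ in range(produced):
            source.append(next_token(source))
    return dest, rounds


# Toy decoder: the next token id is simply the current sequence length.
dest_tokens, rounds = migrate_token_only(list(range(100)), lambda seq: len(seq))
```

Under this model each round sends a shrinking delta (100, then 10, then 1 token here), so the pause at handover stays short and only token ids cross the network.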
Key Findings
ServerlessLLM's checkpoint loader speeds cold loads 3.6–8.2× over existing loaders
LoRA adapter loading drops from 370 ms to 83.5 ms (4.4× faster)
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Checkpoint cold-load speedup | 3.6–8.2× faster | PyTorch / Safetensors | — | OPT, LLaMA-2, Falcon checkpoints (FP16) on RAID0 NVMe | §7.2, Fig.6a | §7.2 |
| LoRA adapter load time | 83.5 ms | Safetensors 370 ms | 4.4× faster | 1GB LoRA adapter for LLaMA-70B | §7.2 (LoRA experiment) | §7.2 |
What To Try In 7 Days
Measure current model cold-start time end-to-end (download + load) to get a baseline
Convert one popular model to a sequential chunked, loading-optimized checkpoint and test direct-I/O + pinned memory copy
Prototype token-only migration for long-running inferences to avoid transferring large kv-caches over the network
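As a starting point for the chunked-loading experiment, the following is a minimal sketch of a two-stage pipeline using only the Python standard library. It is not the system's loader: a `bytearray` stands in for a pinned host or GPU buffer, direct I/O is omitted, and the name `load_chunked` and its parameters are hypothetical.

```python
import os
import queue
import threading


def load_chunked(path: str, chunk_size: int = 1 << 20, depth: int = 4) -> bytearray:
    """Read `path` sequentially in chunks while a second stage copies them out."""
    dest = bytearray(os.path.getsize(path))  # stand-in for a pinned host / GPU buffer
    chunks: queue.Queue = queue.Queue(maxsize=depth)  # bounded queue applies backpressure

    def reader() -> None:
        try:
            # Unbuffered sequential reads; real direct I/O would need aligned buffers.
            with open(path, "rb", buffering=0) as f:
                offset = 0
                while True:
                    data = f.read(chunk_size)
                    if not data:
                        break
                    chunks.put((offset, data))
                    offset += len(data)
        finally:
            chunks.put(None)  # always signal end of stream so the consumer can stop

    worker = threading.Thread(target=reader)
    worker.start()
    while True:
        item = chunks.get()
        if item is None:
            break
        offset, data = item
        dest[offset:offset + len(data)] = data  # "copy" stage, overlapped with the next read
    worker.join()
    return dest
```

Timing this against a single `open(path, "rb").read()` on the same file gives a rough first comparison before investing in direct I/O and pinned-memory copies.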
Risks & Boundaries
Limitations
Performance depends on available host DRAM/SSD and PCIe bandwidth; constrained clusters limit gains
Scheduler assumes accurate bandwidth/queue estimates; CUDA driver cleanup can cause occasional underestimates
When Not To Use
Small single-GPU deployments with no spare host memory or SSD bandwidth
Environments where moving tokens and recomputing kv-cache is more expensive than raw transfer (very high-bandwidth networks and very short prompts)
Failure Modes
Network or destination failure during migration, causing loss of the resumed inference state or extra recomputation
Scheduler underestimation of cleanup/migration costs causing unexpected latency spikes

