Cut serverless LLM cold-starts 10–200× with local checkpoint formats, token-only live migration, and startup-aware scheduling

Overview

Decision Snapshot: Ready For Pilot

The system is implemented and evaluated on real clusters and traces; it shows large, reproducible speedups, but depends on available host storage/bandwidth and scheduler integration.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai

Why It Matters For Business

Cutting cold-start time from minutes to seconds reduces user-visible latency, lowers GPU idle costs, and increases successful request completion for large models, improving SLA and capacity efficiency.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Founder

Summary TLDR

ServerlessLLM reduces serverless LLM startup latency with four techniques: it stores model checkpoints across a GPU server's DRAM and SSD tiers, loads them with a loading-optimized checkpoint format and a pipelined loader, migrates live inferences by sending small token states instead of large kv-caches, and schedules each request on the server with the lowest estimated startup time. In microbenchmarks it loads checkpoints 3.6–8.2× faster than PyTorch/Safetensors and loads LoRA adapters 4.4× faster; in cluster workloads it cuts end-to-end latency by 10–200× versus Ray Serve / KServe on the evaluated traces and datasets.

Problem Statement

Serverless LLMs face very long and unpredictable cold-starts. Checkpoints can reach hundreds of GB, so downloading one takes tens of seconds even on a fast network (e.g., 130 GB ≈ 26 s at 5 GB/s), and loading it into GPU memory takes tens of seconds more (e.g., OPT-30B 34 s, LLaMA-2-70B 84 s). These delays break interactive SLOs, raise GPU costs, and limit request throughput for serverless LLM services.
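The download figure quoted above is plain size-over-bandwidth arithmetic, and a short sketch makes it easy to redo for other checkpoints. The function names are hypothetical, and pairing the 130 GB download with the 84 s LLaMA-2-70B load time is an assumption made only to illustrate a combined cold-start figure; the model also ignores any overlap between download and load.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to move size_gb at a sustained bandwidth, ignoring latency and contention."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


def cold_start_seconds(download_s: float, load_s: float) -> float:
    """Cold start modelled as a download followed by a load, with no overlap."""
    return download_s + load_s


# Figures quoted in the problem statement: 130 GB over a 5 GB/s network.
download_s = transfer_seconds(130, 5)

# Assumption for illustration: the 130 GB checkpoint is the one that takes 84 s to load.
total_s = cold_start_seconds(download_s, 84)

print(f"download: {download_s:.0f} s, download + load: {total_s:.0f} s")
```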

Main Contribution

A loading-optimized checkpoint format and a multi-stage, chunked loader that fully uses in-server DRAM/SSD/PCIe bandwidth to speed model loads

A token-only multi-round live migration for LLM inference that transfers small token state and recomputes large kv-caches at destination to avoid heavy network transfers
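The token-only migration idea can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the system's implementation: `Server`, `next_token`, `migrate`, the `prefill_speedup` ratio, and the `gap_threshold` stopping rule are all hypothetical, and the "kv-cache" is a list of tuples standing in for real tensors. Each round, the destination rebuilds its cache from token ids alone while the source keeps decoding; once the gap is small, the source stops and the last few tokens are sent.

```python
from dataclasses import dataclass, field

VOCAB = 97  # toy vocabulary size (hypothetical)


def next_token(tokens):
    """Toy stand-in for one decoding step: a deterministic function of the prefix."""
    return (sum(tokens) * 31 + len(tokens)) % VOCAB


@dataclass
class Server:
    tokens: list = field(default_factory=list)
    kv_cache: list = field(default_factory=list)  # one toy entry per token

    def recompute(self, new_tokens):
        """Extend the kv-cache from token ids only (stand-in for a prefill pass)."""
        for tok in new_tokens:
            self.kv_cache.append((len(self.tokens), tok))
            self.tokens.append(tok)

    def decode(self, steps):
        """Generate `steps` tokens, extending the cache as a real decoder would."""
        for _ in range(steps):
            self.recompute([next_token(self.tokens)])


def migrate(src, dst, prefill_speedup=4, gap_threshold=2, max_rounds=10):
    """Multi-round migration; returns the number of token ids sent over the network."""
    sent = 0
    for _ in range(max_rounds):
        delta = src.tokens[len(dst.tokens):]
        if len(delta) <= gap_threshold:
            break
        dst.recompute(delta)  # only token ids cross the network
        sent += len(delta)
        # Assumption: while dst prefills len(delta) tokens, src decodes 1/prefill_speedup as many.
        src.decode(-(-len(delta) // prefill_speedup))
    final = src.tokens[len(dst.tokens):]  # src is paused here; send the remaining gap
    dst.recompute(final)
    sent += len(final)
    return sent


src, dst = Server(), Server()
src.recompute(list(range(1, 17)))  # 16-token prompt
src.decode(4)                      # inference already in progress
sent = migrate(src, dst)
dst.decode(3)                      # destination resumes generation
print(f"token ids sent: {sent}, tokens at destination: {len(dst.tokens)}")
```

Because the gap shrinks each round only when recomputation outpaces decoding, the `prefill_speedup` assumption is what makes the loop converge; with a ratio at or below 1 the loop would run until `max_rounds` and the final pause would be long.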

Key Findings

ServerlessLLM's checkpoint loader speeds cold loads 3.6–8.2× over existing loaders

Numbers: 3.6–8.2× faster loading (OPT-2.7B to LLaMA-2-70B)

Practical Use: Replace naive torch/safetensors loading with a chunked direct-I/O loader to cut model startup time severalfold

Evidence Ref: §7.2, Fig. 6a
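A chunked, pipelined loader of the kind this finding recommends can be sketched with the standard library alone. This is a simplified illustration, not the ServerlessLLM loader: it uses unbuffered reads rather than true direct I/O (which needs platform-specific flags and aligned buffers), a `bytearray` stands in for GPU memory, and `pipelined_load`, `CHUNK_SIZE`, and `depth` are hypothetical names. What it does show is the overlap: one thread reads fixed-size chunks while another copies earlier chunks to their final offsets.

```python
import os
import queue
import tempfile
import threading

CHUNK_SIZE = 1 << 20  # 1 MiB; hypothetical value, tune per device


def pipelined_load(path, chunk_size=CHUNK_SIZE, depth=4):
    """Read `path` in chunks on one thread while another copies them into place."""
    device = bytearray(os.path.getsize(path))  # stand-in for GPU memory
    chunks = queue.Queue(maxsize=depth)        # bounded, like a fixed pool of staging buffers

    def reader():
        try:
            with open(path, "rb", buffering=0) as f:  # unbuffered, not true direct I/O
                offset = 0
                while True:
                    buf = f.read(chunk_size)
                    if not buf:
                        break
                    chunks.put((offset, buf))
                    offset += len(buf)
        finally:
            chunks.put(None)  # always signal completion so the consumer cannot hang

    thread = threading.Thread(target=reader)
    thread.start()
    while True:
        item = chunks.get()
        if item is None:
            break
        offset, buf = item
        device[offset:offset + len(buf)] = buf  # stand-in for a host-to-GPU copy
    thread.join()
    return device


# Usage: write a small fake checkpoint and load it back through the pipeline.
payload = os.urandom(3 * 4096 + 123)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
loaded = pipelined_load(tmp.name, chunk_size=4096)
os.remove(tmp.name)
print("loaded", len(loaded), "bytes; match:", bytes(loaded) == payload)
```

The bounded queue is the design choice that matters: it caps host memory use and lets the reader stall when the copy stage falls behind, which is how a real staging-buffer pool would behave.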

LoRA adapter loading drops from 370 ms to 83.5 ms

Numbers: 4.4× faster (1 GB LoRA adapter for LLaMA-70B)

Practical Use: Using the loader speeds even small adapter loads, enabling faster dynamic adapter swaps

Evidence Ref: §7.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Checkpoint cold-load speedup	3.6–8.2× faster	PyTorch / Safetensors	—	OPT, LLaMA-2, Falcon checkpoints (FP16) on RAID0 NVMe	§7.2, Fig.6a	§7.2
LoRA adapter load time	83.5 ms	Safetensors 370 ms	4.4× faster	1 GB LoRA adapter for LLaMA-70B	§7.2 (LoRA experiment)	§7.2

What To Try In 7 Days

Measure current model cold-start time end-to-end (download + load) to get a baseline

Convert one popular model to a sequential chunked, loading-optimized checkpoint and test direct-I/O + pinned memory copy

Prototype token-only migration for long-running inferences to avoid transferring large kv-caches over the network
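For the first step in the list above, a small timing harness is usually enough to get a per-stage baseline. The sketch below is one possible shape, assuming the download and load steps can be wrapped as separate calls; `stage`, `download_checkpoint`, and `load_into_gpu` are hypothetical names, and the two placeholder functions only sleep and should be replaced with the real steps.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage(timings, name):
    """Record the wall-clock duration of one startup stage under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


def download_checkpoint():
    """Hypothetical placeholder for fetching the checkpoint from remote storage."""
    time.sleep(0.02)


def load_into_gpu():
    """Hypothetical placeholder for loading the checkpoint into GPU memory."""
    time.sleep(0.01)


timings = {}
with stage(timings, "download"):
    download_checkpoint()
with stage(timings, "load"):
    load_into_gpu()
timings["total"] = timings["download"] + timings["load"]

for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```

Recording the stages separately shows whether the network or the loader dominates, which in turn indicates whether local checkpoint storage or a faster loader is the better first experiment.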

Optimization Features

Token Efficiency

migrate tokens instead of kv-cache

Infra Optimization

use in-server DRAM/SSD RAID bandwidth; startup-time-aware scheduling

Model Optimization

loading-optimized checkpoint format; chunk-based partitioning per GPU

System Optimization

direct I/O reads; pinned memory GPU DMA; multi-stage pipelined loader

Inference Optimization

token-only live migration; kv-cache recomputation at destination

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ServerlessLLM/ServerlessLLM

Risks & Boundaries

Limitations

Performance depends on available host DRAM/SSD and PCIe bandwidth; constrained clusters limit gains

Scheduler assumes accurate bandwidth/queue estimates; CUDA driver cleanup can cause occasional underestimates
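The dependence on accurate bandwidth and queue estimates is easier to see with a sketch of a startup-time-aware placement rule. This is a minimal model, assuming startup time is queue wait plus checkpoint size over the bandwidth of the fastest tier that holds the checkpoint; `ServerState`, `estimated_startup_s`, `pick_server`, and all numbers are hypothetical, and migration costs are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerState:
    name: str
    queue_wait_s: float        # estimated wait for loads already queued on this server
    bandwidth_gb_per_s: float  # bandwidth of the fastest tier holding the checkpoint


def estimated_startup_s(server: ServerState, model_size_gb: float) -> float:
    """Queue wait plus size over bandwidth; assumes both estimates are accurate."""
    return server.queue_wait_s + model_size_gb / server.bandwidth_gb_per_s


def pick_server(servers, model_size_gb: float) -> ServerState:
    """Choose the server with the lowest estimated startup time."""
    return min(servers, key=lambda s: estimated_startup_s(s, model_size_gb))


# Hypothetical cluster: checkpoint in DRAM on A, on SSD on B, only in remote storage for C.
servers = [
    ServerState("A", queue_wait_s=5.0, bandwidth_gb_per_s=20.0),
    ServerState("B", queue_wait_s=0.0, bandwidth_gb_per_s=6.0),
    ServerState("C", queue_wait_s=0.0, bandwidth_gb_per_s=1.0),
]
best = pick_server(servers, model_size_gb=60.0)
print(best.name, estimated_startup_s(best, 60.0))
```

In this example the busy server with the checkpoint in DRAM still wins over an idle one reading from SSD. If the queue wait on A were underestimated by a few seconds, for instance because of driver cleanup, the choice would be wrong, which is the failure the limitation above describes.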

When Not To Use

Small single-GPU deployments with no spare host memory or SSD bandwidth

Environments where moving tokens and recomputing kv-cache is more expensive than raw transfer (very high-bandwidth networks and very short prompts)
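The second case above can be checked with a rough break-even calculation. The sketch assumes a standard attention kv-cache of 2 × layers × kv heads × head dimension × bytes per element for each token, 4-byte token ids, and a fixed prefill throughput at the destination; the model shape, throughput, and network speeds below are hypothetical, and fixed per-migration costs, which matter most for very short prompts, are ignored.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Keys and values (the factor 2) for every layer and kv head of one token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_transfer_seconds(n_tokens, per_token_bytes, network_bytes_per_s):
    """Time to ship the whole kv-cache over the network."""
    return n_tokens * per_token_bytes / network_bytes_per_s


def token_migration_seconds(n_tokens, prefill_tokens_per_s, network_bytes_per_s,
                            token_id_bytes=4):
    """Time to send token ids and recompute the kv-cache at the destination."""
    send_s = n_tokens * token_id_bytes / network_bytes_per_s
    return send_s + n_tokens / prefill_tokens_per_s


# Hypothetical FP16 model shape and destination prefill throughput.
per_token = kv_bytes_per_token(n_layers=48, n_kv_heads=56, head_dim=128, bytes_per_elem=2)
n_tokens, prefill_rate = 1000, 5000.0

for label, net in [("10 Gb/s network", 1.25e9), ("400 Gb/s network", 50e9)]:
    raw = kv_transfer_seconds(n_tokens, per_token, net)
    tok = token_migration_seconds(n_tokens, prefill_rate, net)
    winner = "token-only" if tok < raw else "raw kv-cache transfer"
    print(f"{label}: raw {raw:.3f} s, token-only {tok:.3f} s -> {winner}")
```

Under these assumed numbers, token-only migration wins on the slower network and raw transfer wins on the faster one, so the decision hinges on the ratio of network bandwidth to prefill throughput for the model in question.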

Failure Modes

Network or destination failure during migration, causing loss of the migrated inference state or extra recomputation

Scheduler underestimation of cleanup/migration costs causing unexpected latency spikes

Core Entities

Models

OPT, LLaMA-2, Falcon

Metrics

model startup latency, first-token latency, per-token latency, P95/P99 latency, bandwidth utilization
