Overview
Production Readiness
0.8
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Flying Serving reduces tail latency during bursts while keeping throughput close to data-parallel (DP) levels, so operators can serve mixed-priority and long-context traffic without costly restarts or idle GPUs.
Summary TLDR
Flying Serving is a vLLM-based runtime that switches between data-parallel (DP, many independent replicas) and tensor-parallel (TP, sharded operators) modes in milliseconds. It avoids copying weights, moving the KV cache, or recreating communicators through (1) zero-copy logical weight views, (2) an adaptor that keeps a shared KV block pool while changing logical block sizing, and (3) an eagerly initialized communicator pool plus a scheduler with soft and hard preempt modes. On three models and bursty workloads it cuts tail latency by up to 4.79×, stays near TP latency at low load, retains ≈95–96% of DP throughput, expands single-node context to 1.9M tokens, and completes a live switch in ~15 ms.
Problem Statement
LLM serving must maximize throughput under strict latency and memory (context) constraints. Current stacks force a static choice between data parallelism (good for throughput) and tensor parallelism (good for per-request latency and context length). Workloads are bursty, mixed-priority, and long-context, so a fixed parallelism mode often wastes capacity or breaks SLOs. The challenge is to make DP↔TP switching cheap and safe without moving large tensors or rebuilding communication groups.
Main Contribution
Design and implement Flying Serving, a middleware on vLLM v1 that allows live DP↔TP switching without restarting workers.
Model Weights Manager that provides zero-copy logical TP shards on top of DP-loaded weights so no weight movement is needed at switch time.
KV Cache Adaptor that keeps a shared physical KV block pool and changes logical block sizing to preserve KV state across modes.
Communicator Pool that pre-initializes topology-aware NCCL/Gloo groups to avoid runtime communicator creation.
Workload-aware scheduler with soft preempt (speculative progress) and hard preempt (latency-first interruption) to coordinate safe transitions.
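The zero-copy idea behind the Model Weights Manager can be illustrated with a small host-memory analogy. This is a minimal sketch, not the paper's implementation: it assumes a row-major weight matrix that every DP replica already holds in full, and it uses Python's `memoryview` in place of GPU tensor views. The function name `make_tp_views` and the row-wise split are hypothetical choices made for the illustration.

```python
from array import array


def make_tp_views(weights: memoryview, rows: int, cols: int, tp_degree: int) -> list:
    """Split a row-major (rows x cols) matrix into tp_degree row shards.

    Each shard is a memoryview slice, so it shares storage with the
    original buffer instead of copying it.
    """
    if rows % tp_degree != 0:
        raise ValueError("rows must divide evenly across TP ranks")
    shard_len = (rows // tp_degree) * cols
    return [weights[r * shard_len:(r + 1) * shard_len] for r in range(tp_degree)]


# A full 4 x 2 weight matrix, as a DP replica would hold it (illustrative values).
full = array("f", [float(i) for i in range(8)])
views = make_tp_views(memoryview(full), rows=4, cols=2, tp_degree=2)

# A write through the original buffer is visible through the shard view,
# which shows that the shard is a view and not a copy.
full[4] = 99.0
```

Because the shards are views, entering TP mode in this analogy only changes which slice each rank reads; no weight bytes move.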
Key Findings
Live DP↔TP switching reduces burst P90 TTFT up to 4.79× vs static TP on tested models.
At low load Flying Serving achieves near-TP latency with small overheads while keeping high throughput.
Peak throughput stays close to DP (≈95–96% retained) while per-token latency improves over DP.
Live switching and context scaling are orders of magnitude faster than cold restarts.
Dynamic merging raises the maximum single-node context substantially, up to 1.9M tokens.
Results
Burst P90 TTFT reduction vs static TP: up to 4.79×
Low-load average TTFT vs static TP
Median TPOT improvement vs static DP
Peak throughput retained vs DP: ≈95–96%
Live switching latency: ~15 ms
Max single-node context support: 1.9M tokens
Mixed-priority TTFT (all requests)
Who Should Care
What To Try In 7 Days
Run vLLM v1 with the Flying Serving patches on a single multi-GPU node and replay a bursty trace to compare TTFT and throughput.
Pre-initialize communicators for contiguous GPU groups and measure live switch latency versus a cold restart.
Enable soft- and hard-preempt policies and compare priority request latency and background throughput.
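To compare runs in the experiments above, TTFT and TPOT can be computed from per-request timestamps. The sketch below assumes a hypothetical `Trace` record with arrival, first-token, and last-token times in seconds, and uses a nearest-rank percentile; the paper's exact measurement method is not stated here, so treat the definitions as common conventions rather than the authors' own.

```python
import math
from dataclasses import dataclass


@dataclass
class Trace:
    arrival: float       # time the request entered the queue (seconds)
    first_token: float   # time the first output token was emitted
    last_token: float    # time the final output token was emitted
    output_tokens: int   # number of generated tokens


def ttft(t: Trace) -> float:
    """Time To First Token: queueing plus prefill delay."""
    return t.first_token - t.arrival


def tpot(t: Trace) -> float:
    """Time Per Output Token, averaged over tokens after the first."""
    if t.output_tokens < 2:
        return 0.0
    return (t.last_token - t.first_token) / (t.output_tokens - 1)


def percentile(values, p: int):
    """Nearest-rank percentile, e.g. p=90 for P90."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p90_ttft(traces) -> float:
    return percentile([ttft(t) for t in traces], 90)
```

Running `p90_ttft` on the same replayed trace under static DP, static TP, and live switching gives directly comparable tail-latency numbers.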
Optimization Features
Infra Optimization
- topology-aware communicator initialization (contiguous NVLink groups)
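One way to picture eager communicator pooling is to enumerate the contiguous GPU groups up front and create one handle per group before any traffic arrives. The sketch below is an illustration under assumptions: it considers only aligned power-of-two groups, and `create_group` is a hypothetical stand-in for whatever actually builds an NCCL or Gloo group. It does not call any real communication library.

```python
from typing import Callable, Dict, List, Tuple

Group = Tuple[int, ...]


def contiguous_groups(num_gpus: int) -> List[Group]:
    """Enumerate aligned, contiguous GPU groups of size 2, 4, 8, ..."""
    groups: List[Group] = []
    size = 2
    while size <= num_gpus:
        for start in range(0, num_gpus - size + 1, size):
            groups.append(tuple(range(start, start + size)))
        size *= 2
    return groups


class CommunicatorPool:
    """Creates every pooled group once at startup; lookups never create."""

    def __init__(self, num_gpus: int, create_group: Callable[[Group], object]):
        self._pool: Dict[Group, object] = {
            g: create_group(g) for g in contiguous_groups(num_gpus)
        }

    def get(self, ranks) -> object:
        # Raises KeyError for groups outside the pool, e.g. non-contiguous ones.
        return self._pool[tuple(ranks)]


created = []


def fake_create_group(ranks: Group) -> str:
    # Hypothetical placeholder: records the call instead of building a communicator.
    created.append(ranks)
    return "comm" + str(ranks)


pool = CommunicatorPool(8, fake_create_group)
```

With this layout a switch only performs a dictionary lookup, and a request for a non-contiguous group such as `(0, 2)` fails fast instead of triggering runtime communicator creation.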
Model Optimization
- zero-copy logical weight views (no weight movement)
System Optimization
- KV Cache Adaptor with adaptive block sizing
- shared physical KV pool with logical remapping
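The effect of changing logical block sizing over a fixed physical pool can be shown with simple arithmetic. The sketch rests on an assumption that is not spelled out in this summary: under TP each GPU stores only its share of the per-token KV data, so a physical block of fixed byte size holds proportionally more tokens. The class name `SharedKVPool` and all numbers are hypothetical, and the sketch covers sizing only, not how existing KV state is remapped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedKVPool:
    num_blocks: int          # physical blocks per GPU, fixed at startup
    block_bytes: int         # bytes per physical block, never changes
    kv_bytes_per_token: int  # full-model KV bytes per token (illustrative figure)

    def tokens_per_block(self, tp_degree: int) -> int:
        """Logical block size in tokens for a given TP degree (1 means DP)."""
        if self.kv_bytes_per_token % tp_degree != 0:
            raise ValueError("KV bytes per token must split evenly across ranks")
        per_gpu_bytes = self.kv_bytes_per_token // tp_degree
        return self.block_bytes // per_gpu_bytes

    def capacity_tokens(self, tp_degree: int) -> int:
        """Tokens one GPU's pool can hold for a given TP degree."""
        return self.num_blocks * self.tokens_per_block(tp_degree)


pool = SharedKVPool(num_blocks=1000, block_bytes=65536, kv_bytes_per_token=4096)
dp_capacity = pool.capacity_tokens(tp_degree=1)
tp4_capacity = pool.capacity_tokens(tp_degree=4)
```

Under these assumed numbers the same physical pool holds four times as many tokens per GPU at TP degree 4 as in DP mode, which is one plausible reading of how merging GPUs extends the usable context.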
Inference Optimization
- dynamic DP↔TP switching
- pre-initialized communicator pool
- soft preempt (speculative progress)
- hard preempt (latency-first interruption)
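The difference between the two preempt modes can be modeled with a toy bookkeeping example. It assumes that under soft preempt a background request keeps decoding for a few steps while the switch is prepared, with that progress tracked as speculative until it is reconciled, and that hard preempt stops the request immediately. All names (`BackgroundRequest`, `preempt`, `reconcile`, `grace_steps`) are hypothetical and the scheduler described in the paper may differ.

```python
from dataclasses import dataclass
from enum import Enum


class PreemptMode(Enum):
    SOFT = "soft"
    HARD = "hard"


@dataclass
class BackgroundRequest:
    req_id: str
    committed_tokens: int = 0
    speculative_tokens: int = 0
    paused: bool = False


def preempt(requests, mode: PreemptMode, grace_steps: int) -> None:
    """Record each request's state at the moment the switch takes effect."""
    for req in requests:
        if mode is PreemptMode.SOFT:
            # Soft: the request kept decoding during the grace window.
            req.speculative_tokens += grace_steps
        # Hard: no extra progress; the request stops right away.
        req.paused = True


def reconcile(req: BackgroundRequest, accepted: int) -> int:
    """Commit the accepted speculative tokens; return how many were wasted."""
    if not 0 <= accepted <= req.speculative_tokens:
        raise ValueError("accepted must be between 0 and speculative_tokens")
    wasted = req.speculative_tokens - accepted
    req.committed_tokens += accepted
    req.speculative_tokens = 0
    req.paused = False
    return wasted


soft = [BackgroundRequest("a", committed_tokens=10)]
preempt(soft, PreemptMode.SOFT, grace_steps=4)
wasted = reconcile(soft[0], accepted=3)

hard = [BackgroundRequest("b", committed_tokens=10)]
preempt(hard, PreemptMode.HARD, grace_steps=4)
```

The `wasted` count corresponds to work that would need recomputation, which is the cost soft preempt trades for background progress; hard preempt avoids it by giving up that progress in favor of the priority request's latency.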
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Designed for intra-node multi-GPU setups; multi-node model-parallel models fall outside scope.
- Requires contiguous GPU topology (NVLink) for communicator pooling; non-contiguous topologies are unsupported.
- The paper presents vLLM patches but does not provide a public release or deployment recipes.
When Not To Use
- For extremely large models that require model parallelism across multiple nodes.
- In clusters without NVLink or with arbitrary non-contiguous device layouts.
- When regulatory or operational constraints forbid runtime scheduler-driven preemption of requests.
Failure Modes
- Soft preempt can cause recomputation overhead if speculative progress must be reconciled at merge time.
- If communicator pool assumptions (contiguous groups) are violated, switching may be impossible or slower.
- Scheduler bugs or mismatched global ordering could cause stalls; careful testing is needed before use in production orchestration.
Core Entities
Models
- Llama-3-70B
- GPT-OSS-120B
- Nemotron-8B
Metrics
- Time To First Token (TTFT)
- Time Per Output Token (TPOT)
- Peak generation throughput (tokens/s)
- Queue time
- Inter-Token Latency (ITL)
Datasets
- ShareGPT
- CodeActInstruct
- HumanEval
Benchmarks
- Shift-Parallelism (baseline)
- Static DP
- Static TP

