Overview
The system is implemented as a set of vLLM patches and evaluated on real H200 hardware across three models and workloads. Its main limitations are an intra-node focus and the lack of a public code release.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 7/7
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 80%
Novelty: 70%
Why It Matters For Business
Flying Serving reduces tail latency during bursts and keeps throughput close to DP levels, letting operators serve mixed-priority and long-context traffic without costly restarts or wasted GPUs.
Summary TLDR
Flying Serving is a vLLM-based runtime that switches between data-parallel (many independent replicas) and tensor-parallel (sharded operators) modes in milliseconds. It avoids copying weights, moving the KV cache, or recreating communicators by using (1) zero-copy logical weight views, (2) an adaptor that keeps a shared KV block pool while changing logical block sizing, and (3) an eagerly initialized communicator pool plus a scheduler with soft and hard preempt modes. On three models and bursty workloads it cuts burst P90 TTFT by up to 4.79× versus static TP, stays close to TP latency at low load, retains ≈95–96% of DP throughput, expands single-node context to 1.9M tokens, and performs a live switch in ~15 ms.
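One way to picture the "shared KV block pool with changing logical block sizing" idea is the arithmetic below. It is a minimal sketch under stated assumptions, not the adaptor from the paper: the class `KVPoolView`, its fields, and the rule that a TP rank stores only its share of the KV heads are all hypothetical, chosen to show why a fixed-size physical block can hold more tokens per rank once the heads are sharded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVPoolView:
    """Hypothetical model of one layer's KV pool on a single GPU."""
    block_bytes: int    # fixed size of one physical block in the shared pool
    num_kv_heads: int   # KV heads in the full (unsharded) model
    head_dim: int       # dimension of each head
    dtype_bytes: int    # bytes per element, e.g. 2 for fp16/bf16

    def tokens_per_block(self, tp_degree: int) -> int:
        """Tokens that fit in one physical block when KV heads are split
        across `tp_degree` ranks (tp_degree=1 corresponds to DP mode)."""
        if self.num_kv_heads % tp_degree != 0:
            raise ValueError("num_kv_heads must be divisible by tp_degree")
        heads_per_rank = self.num_kv_heads // tp_degree
        # Each token stores one K and one V vector per local head.
        bytes_per_token = 2 * heads_per_rank * self.head_dim * self.dtype_bytes
        return self.block_bytes // bytes_per_token


# Illustrative numbers only: 8 KV heads, head_dim 128, 2-byte elements,
# and a physical block sized to hold 16 tokens in DP mode.
pool = KVPoolView(block_bytes=16 * 4096, num_kv_heads=8, head_dim=128, dtype_bytes=2)
dp_tokens = pool.tokens_per_block(tp_degree=1)
tp4_tokens = pool.tokens_per_block(tp_degree=4)
```

Under these assumptions the physical blocks never move or resize; only the logical token count per block changes with the parallelism mode.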
Problem Statement
LLM serving must maximize throughput under strict latency and memory (context) constraints. Current stacks force a static choice between data parallelism (good for throughput) and tensor parallelism (good for per-request latency and context length). Workloads are bursty, mixed-priority, and long-context, so a fixed parallelism mode often wastes capacity or breaks SLOs. The challenge is to make DP↔TP switching cheap and safe without moving large tensors or rebuilding communication groups.
Main Contribution
Design and implementation of Flying Serving, a middleware on vLLM v1 that allows live DP↔TP switching without restarting workers.
A Model Weights Manager that provides zero-copy logical TP shards on top of DP-loaded weights, so no weight movement is needed at switch time.
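The zero-copy idea can be illustrated with array views. This is a minimal sketch, assuming a column-wise split of a linear layer's weight and using NumPy as a stand-in for GPU tensors; the function `logical_tp_shards` is a hypothetical name and is not the Model Weights Manager's actual interface.

```python
import numpy as np


def logical_tp_shards(weight: np.ndarray, tp_degree: int) -> list[np.ndarray]:
    """Return column-wise shards of `weight` as views; no data is copied."""
    out_features = weight.shape[1]
    if out_features % tp_degree != 0:
        raise ValueError("output dimension must be divisible by tp_degree")
    width = out_features // tp_degree
    # Basic slicing in NumPy returns a view that shares the parent's buffer.
    return [weight[:, r * width:(r + 1) * width] for r in range(tp_degree)]


# Stand-in for a weight matrix that was loaded once in DP mode.
full = np.arange(32, dtype=np.float32).reshape(4, 8)
shards = logical_tp_shards(full, tp_degree=2)

# Writing through the full weight is visible in the shard, which shows
# that both names refer to the same memory.
full[0, 4] = 99.0
```

In this picture, switching to TP only changes which slice each rank reads from the weights it already holds, which is why no bytes need to move at switch time.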
Key Findings
Live DP↔TP switching reduces burst P90 TTFT by up to 4.79× vs static TP on the tested models.
At low load Flying Serving achieves near-TP latency with small overheads while keeping high throughput.
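For readers reproducing the P90 comparison, the metric can be computed as sketched here. The helper names and the nearest-rank percentile definition are assumptions (the paper's exact percentile method is not stated in this summary), and the sample values are synthetic placeholders, not measurements.

```python
def percentile_nearest_rank(values, percent: int) -> float:
    """Nearest-rank percentile: smallest value with at least `percent`
    percent of the samples at or below it. `percent` is an integer 1..100."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100 avoids floating-point rounding.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


def reduction_factor(baseline_ttft_ms, ours_ttft_ms, percent: int = 90) -> float:
    """How many times smaller our percentile TTFT is than the baseline's."""
    return (percentile_nearest_rank(baseline_ttft_ms, percent)
            / percentile_nearest_rank(ours_ttft_ms, percent))


# Synthetic per-request TTFTs in milliseconds, for illustration only.
static_tp = [10.0 * i for i in range(1, 11)]
flying = [5.0 * i for i in range(1, 11)]
factor = reduction_factor(static_tp, flying)
```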
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Burst P90 TTFT reduction vs static TP | up to 4.79× (Nemotron-8B) | static TP | 4.79× reduction | bursty synthetic trace (Section 6.1) | Section 6.2, Figure 8 | Fig.8 & text |
| Low-load average TTFT vs static TP | Llama-70B 223 ms (ours) vs 212 ms (TP) | static TP | ≈5.19% overhead | low-load windows | Section 6.2 'Light loads (flat periods)' | text |
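The ≈5.19% figure in the low-load row follows directly from the two TTFT values in the table. A short check, with a hypothetical helper name:

```python
def relative_overhead_pct(ours_ms: float, baseline_ms: float) -> float:
    """Overhead of `ours_ms` relative to `baseline_ms`, in percent."""
    return (ours_ms - baseline_ms) / baseline_ms * 100.0


# Llama-70B low-load average TTFT: 223 ms (Flying Serving) vs 212 ms (static TP).
overhead = relative_overhead_pct(223.0, 212.0)
print(f"{overhead:.2f}%")  # prints 5.19%
```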
What To Try In 7 Days
Run vLLM v1 with the Flying Serving patches on a single multi-GPU node and replay a bursty trace to compare TTFT and throughput.
Pre-initialize communicators for contiguous GPU groups and measure live switch latency versus a cold restart.
Enable soft- and hard-preempt policies and compare priority request latency and background throughput.
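As a starting point for the communicator experiment, the sketch below enumerates the aligned, contiguous GPU groups one might pre-initialize on a single node. The function name and the choice of group sizes are assumptions for illustration; in a PyTorch-based stack each tuple could then be handed to a process-group constructor such as `torch.distributed.new_group(ranks=...)`, but how Flying Serving builds its pool is not described in this summary.

```python
def contiguous_groups(num_gpus: int, sizes=(2, 4, 8)) -> dict[int, list[tuple[int, ...]]]:
    """Map each TP degree to the aligned, contiguous GPU index groups
    of that size on a node with `num_gpus` devices."""
    groups: dict[int, list[tuple[int, ...]]] = {}
    for size in sizes:
        # Skip sizes that do not tile the node evenly.
        if size > num_gpus or num_gpus % size != 0:
            continue
        groups[size] = [
            tuple(range(start, start + size))
            for start in range(0, num_gpus, size)
        ]
    return groups


pool_plan = contiguous_groups(num_gpus=8)
for degree, members in pool_plan.items():
    print(degree, members)
```

Creating every group once at startup and timing only the switch itself (for example with `time.perf_counter()`) keeps communicator setup cost out of the live-switch measurement.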
Risks & Boundaries
Limitations
Designed for intra-node multi-GPU setups; multi-node model-parallel models fall outside scope.
Requires contiguous GPU topology (NVLink) for communicator pooling; non-contiguous topologies are unsupported.
When Not To Use
For extremely large models requiring multi-node model parallelism across machines.
In clusters without NVLink or with arbitrary non-contiguous device layouts.
Failure Modes
Soft preempt can cause recomputation overhead if speculative progress must be reconciled at merge time.
If communicator pool assumptions (contiguous groups) are violated, switching may be impossible or slower.

