Flying Serving: instant DP↔TP switching to cut tail latency, keep throughput, and scale context

February 26, 2026 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The system is implemented as vLLM patches and evaluated on real H200 hardware across three models and workloads; its limitations are an intra-node focus and no public code release.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 7/7

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 70%

Authors

Shouwei Gao, Junqi Yin, Feiyi Wang, Wenqian Dong

Why It Matters For Business

Flying Serving reduces tail latency during bursts and keeps throughput near DP, letting operators serve mixed-priority and long-context traffic without costly restarts or wasted GPUs.

Summary TLDR

Flying Serving is a vLLM-based runtime that switches between data-parallel (many independent replicas) and tensor-parallel (sharded operators) modes in milliseconds. It avoids copying weights, moving the KV cache, or recreating communicators through (1) zero-copy logical weight views, (2) an adaptor that keeps a shared KV block pool while changing logical block sizing, and (3) an eagerly initialized communicator pool plus a scheduler with soft and hard preempt modes. On three models under bursty workloads it cuts P90 TTFT by up to 4.79× relative to static TP, stays near TP latency at low load, retains ≈95–96% of DP throughput, expands single-node context to 1.9M tokens, and performs a live switch in ~15 ms.
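The KV cache adaptor idea, one physical block pool whose logical block size changes with the parallel mode, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `KVPool` class, its field names, the block size, and the per-token KV footprint are hypothetical, and the sketch assumes that under TP each GPU stores 1/tp of the KV heads, so the same physical block holds tp times as many tokens.

```python
from dataclasses import dataclass


@dataclass
class KVPool:
    """Toy model of one GPU's physical KV block pool (all values hypothetical)."""
    num_blocks: int          # physical blocks, fixed at startup
    block_bytes: int         # bytes per physical block, never changes
    kv_bytes_per_token: int  # full (unsharded) KV footprint of one token

    def tokens_per_block(self, tp: int) -> int:
        # Assumption: with TP degree `tp`, each GPU keeps 1/tp of the KV heads,
        # so one token costs kv_bytes_per_token / tp bytes on this GPU.
        per_gpu_bytes = self.kv_bytes_per_token // tp
        return self.block_bytes // per_gpu_bytes

    def capacity_tokens(self, tp: int) -> int:
        # Same physical pool, reinterpreted under a different logical block size.
        return self.num_blocks * self.tokens_per_block(tp)


pool = KVPool(num_blocks=1024, block_bytes=2 * 1024 * 1024,
              kv_bytes_per_token=128 * 1024)
dp_tokens = pool.capacity_tokens(tp=1)  # DP: each replica stores full KV per token
tp_tokens = pool.capacity_tokens(tp=4)  # TP=4: each GPU stores a quarter of the heads
print(dp_tokens, tp_tokens)             # 16384 65536
```

In this toy model a mode switch changes only the arithmetic used to interpret the pool; no block is moved or reallocated.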

Problem Statement

Serving LLMs must maximize throughput under strict latency and memory (context) constraints. Current stacks force a static choice between data parallelism (good for throughput) and tensor parallelism (good for per-request latency and context). Workloads are bursty, mixed-priority, and long-context, so a fixed parallelism often wastes capacity or breaks SLOs. The challenge: make DP↔TP switching cheap and safe without moving large tensors or rebuilding communication groups.

Main Contribution

Design and implement Flying Serving, a middleware on vLLM v1 that allows live DP↔TP switching without restarting workers.

Model Weights Manager that provides zero-copy logical TP shards on top of DP-loaded weights so no weight movement is needed at switch time.
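A zero-copy logical shard can be illustrated with Python's built-in `memoryview`, which slices a buffer without copying it. This is only an analogy for the idea, not the paper's Model Weights Manager: the helper `logical_shard` and the flat row-major layout are assumptions.

```python
from array import array


def logical_shard(weights: memoryview, rows: int, tp: int, rank: int) -> memoryview:
    """Return rank's contiguous block of rows as a view (hypothetical helper).

    `weights` is a flat row-major buffer; slicing a memoryview shares memory
    with the underlying buffer instead of copying it.
    """
    cols = len(weights) // rows
    rows_per_rank = rows // tp
    start = rank * rows_per_rank * cols
    return weights[start:start + rows_per_rank * cols]


# A 4x2 "weight matrix" loaded once, as a DP replica would hold it.
full = array("f", [0, 1, 2, 3, 4, 5, 6, 7])
view = memoryview(full)

shard0 = logical_shard(view, rows=4, tp=2, rank=0)  # rows 0..1
shard1 = logical_shard(view, rows=4, tp=2, rank=1)  # rows 2..3
print(list(shard1))                                 # [4.0, 5.0, 6.0, 7.0]

# A write to the full buffer is visible through the shard: no copy was made.
full[4] = 42.0
print(shard1[0])                                    # 42.0
```

Because each shard is a view, presenting the weights as TP shards costs no data movement in this sketch, and the full buffer remains usable for DP.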

Key Findings

Live DP↔TP switching reduces burst P90 TTFT up to 4.79× vs static TP on tested models.

Numbers: P90 TTFT reduction up to 4.79× (Nemotron-8B) under a bursty trace

Practical Use: If your service faces bursts, switching to Flying Serving can sharply reduce tail latency during spikes without losing throughput.

Evidence Ref: Section 6.2 and Figure 8; sentence: 'Relative to static TP, Flying Serving ... 4

At low load Flying Serving achieves near-TP latency with small overheads while keeping high throughput.

Numbers: Low-load TTFT overheads 5–10% (e.g., Llama-70B 223 ms vs TP 212 ms; 5.19% overhead)

Practical Use: Keep the system in TP during light load to give users fast responses; the runtime only costs a small per-request penalty.

Evidence Ref: Section 6.2, paragraph 'Light loads (flat periods)'
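The quoted overhead follows directly from the two TTFT figures. A minimal check, where the function name `overhead_pct` is just an illustrative choice:

```python
def overhead_pct(ours_ms: float, baseline_ms: float) -> float:
    """Relative overhead of `ours_ms` over `baseline_ms`, in percent."""
    return (ours_ms - baseline_ms) / baseline_ms * 100.0


# Llama-70B low-load TTFT: 223 ms (Flying Serving) vs 212 ms (static TP).
print(f"{overhead_pct(223, 212):.2f}%")  # 5.19%
```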

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Burst P90 TTFT reduction vs static TP | up to 4.79× (Nemotron-8B) | static TP | 4.79× reduction | bursty synthetic trace (Section 6.1) | Section 6.2, Figure 8 | Fig. 8 & text |
| Low-load average TTFT vs static TP | Llama-70B 223 ms (ours) vs 212 ms (TP) | static TP | ≈5.19% overhead | low-load windows | Section 6.2 'Light loads (flat periods)' | text |

What To Try In 7 Days

Run vLLM v1 with the Flying Serving patches on a single multi-GPU node and replay a bursty trace to compare TTFT and throughput.

Pre-initialize communicators for contiguous GPU groups and measure live switch latency versus a cold restart.

Enable soft- and hard-preempt policies and compare priority request latency and background throughput.
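For the trace-replay experiment, a small harness is enough to generate a bursty arrival pattern and summarize TTFT. The sketch below is a starting point under stated assumptions: `bursty_arrivals`, `percentile`, and all rates and windows are hypothetical, the TTFT samples are placeholders, and nothing here calls vLLM or reproduces the paper's trace.

```python
import math
import random


def bursty_arrivals(duration_s, base_rps, burst_rps, burst_windows, seed=0):
    """Approximate Poisson arrivals whose rate jumps inside burst_windows.

    burst_windows is a list of (start_s, end_s) tuples. The rate is picked at
    the start of each gap, so window edges are only approximate.
    """
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        in_burst = any(start <= t < end for start, end in burst_windows)
        rate = burst_rps if in_burst else base_rps
        t += rng.expovariate(rate)  # exponential inter-arrival gap
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


def percentile(values, p):
    """Nearest-rank percentile, with p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


trace = bursty_arrivals(60, base_rps=2, burst_rps=40, burst_windows=[(20, 25)])
ttft_ms = [100 + i for i in range(10)]  # placeholder TTFT samples, 100..109 ms
print(percentile(ttft_ms, 90))          # 108
```

Replaying the same seeded trace against static DP, static TP, and the patched runtime keeps the comparison of P90 TTFT and throughput like for like.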

Optimization Features

Infra Optimization: topology-aware communicator initialization (contiguous NVLink groups)
Model Optimization: zero-copy logical weight views (no weight movement)
System Optimization: KV Cache Adaptor with adaptive block sizing; shared physical KV pool with logical remapping
Inference Optimization: dynamic DP↔TP switching; pre-initialized communicator pool; soft preempt (speculative progress); hard preempt (latency-first interruption)
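The pre-initialized communicator pool can be pictured as a lookup table built once at startup. The sketch below assumes aligned, contiguous groups on a single 8-GPU node; `contiguous_groups` and `build_pool` are hypothetical names, and the string handles are placeholders standing in for real communicators.

```python
def contiguous_groups(num_gpus: int, tp: int) -> list[tuple[int, ...]]:
    """Aligned, contiguous GPU groups of size tp, e.g. (0, 1, 2, 3), (4, 5, 6, 7)."""
    if num_gpus % tp != 0:
        raise ValueError("tp must divide num_gpus")
    return [tuple(range(s, s + tp)) for s in range(0, num_gpus, tp)]


def build_pool(num_gpus: int, tp_sizes: list[int]) -> dict[tuple[int, ...], str]:
    """Eagerly create one placeholder communicator per group, keyed by its ranks."""
    pool = {}
    for tp in tp_sizes:
        for group in contiguous_groups(num_gpus, tp):
            # Placeholder handle; a real runtime would create a collective
            # communication group here, once, at startup.
            pool[group] = f"comm{group}"
    return pool


pool = build_pool(num_gpus=8, tp_sizes=[2, 4, 8])
print(len(pool))           # 7 groups: four pairs, two quads, one octet
comm = pool[(4, 5, 6, 7)]  # switch-time lookup, no initialization on the hot path
```

Restricting the pool to contiguous groups keeps it small (7 entries here), which is the trade-off behind the topology requirement listed under Limitations.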

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Designed for intra-node multi-GPU setups; multi-node model-parallel models fall outside scope.

Requires contiguous GPU topology (NVLink) for communicator pooling; non-contiguous topologies are unsupported.

When Not To Use

For extremely large models requiring multi-node model parallelism across machines.

In clusters without NVLink or with arbitrary non-contiguous device layouts.

Failure Modes

Soft preempt can cause recomputation overhead if speculative progress must be reconciled at merge time.

If communicator pool assumptions (contiguous groups) are violated, switching may be impossible or slower.

Core Entities

Models

Llama-3-70B, GPT-OSS-120B, Nemotron-8B

Metrics

Time To First Token (TTFT), Time Per Output Token (TPOT), Peak generation throughput (tokens/s), Queue time, Inter-Token Latency (ITL)

Datasets

ShareGPT, CodeActInstruct, HumanEval

Benchmarks

Shift-Parallelism (baseline), Static DP, Static TP