Flying Serving: instant DP↔TP switching to cut tail latency, keep throughput, and scale context

Overview

Decision Snapshot: Ready For Pilot

The system is implemented as vLLM patches and evaluated on real H200 hardware across three models and workloads. Its limitations are an intra-node focus and no public code release.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 7/7

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 70%

Authors

Shouwei Gao, Junqi Yin, Feiyi Wang, Wenqian Dong


Why It Matters For Business

Flying Serving reduces tail latency during bursts and keeps throughput near DP, letting operators serve mixed-priority and long-context traffic without costly restarts or wasted GPUs.

Who Should Care

CTO, Engineering Lead, ML Engineer, Data Scientist

Summary TLDR

Flying Serving is a vLLM-based runtime that switches between data-parallel (many independent replicas) and tensor-parallel (sharded operators) modes in milliseconds. It avoids copying weights, moving KV cache, or recreating communicators by using (1) zero-copy logical weight views, (2) an adaptor that keeps a shared KV block pool while changing logical block sizing, and (3) an eagerly initialized communicator pool plus a scheduler with soft and hard preempt modes. Across three models and bursty workloads, it cuts burst P90 TTFT by up to 4.79× relative to static TP, keeps near-TP latency at low load, retains ≈95–96% of DP throughput, expands single-node context to 1.9M tokens, and performs a live switch in ~15 ms.

Problem Statement

Serving LLMs must maximize throughput under strict latency and memory (context) constraints. Current stacks force a static choice between data parallelism (good for throughput) and tensor parallelism (good for per-request latency and context). Workloads are bursty, mixed-priority, and long-context, so a fixed parallelism often wastes capacity or breaks SLOs. The challenge: make DP↔TP switching cheap and safe without moving large tensors or rebuilding communication groups.

Main Contribution

Design and implement Flying Serving, a middleware on vLLM v1 that allows live DP↔TP switching without restarting workers.

Model Weights Manager that provides zero-copy logical TP shards on top of DP-loaded weights so no weight movement is needed at switch time.
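The zero-copy idea can be illustrated with a minimal sketch using only the Python standard library. This is not the Model Weights Manager itself: the function `row_shards` and the 4x2 matrix are hypothetical, and row-wise sharding is assumed purely because row slices of a row-major buffer are contiguous. In a PyTorch-based stack the same effect would normally come from tensor views rather than `memoryview`.

```python
from array import array


def row_shards(weights: memoryview, rows: int, cols: int, tp_degree: int):
    """Split a row-major (rows x cols) buffer into tp_degree row-wise views.

    Slicing a memoryview does not copy: every shard aliases the original buffer.
    """
    if rows % tp_degree != 0:
        raise ValueError("rows must divide evenly across TP ranks")
    step = (rows // tp_degree) * cols  # elements owned by each rank
    return [weights[r * step:(r + 1) * step] for r in range(tp_degree)]


# Hypothetical 4x2 weight matrix, loaded once as a DP replica would load it.
full = array("f", [float(i) for i in range(8)])
shards = row_shards(memoryview(full), rows=4, cols=2, tp_degree=2)

full[4] = 99.0               # mutate the base buffer after the shards exist
shard1_first = shards[1][0]  # the second shard observes the change
```

Because the shards are views, creating them costs no weight movement, which is the property the contribution above relies on at switch time.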

Key Findings

Live DP↔TP switching reduces burst P90 TTFT up to 4.79× vs static TP on tested models.

Numbers: P90 TTFT reduction up to 4.79× (Nemotron-8B) under bursty trace

Practical Use: If your service faces bursts, switching to Flying Serving can sharply reduce tail latency during spikes without losing throughput.

Evidence Ref: Section 6.2 and Figure 8; sentence: 'Relative to static TP, Flying Serving ... 4

At low load Flying Serving achieves near-TP latency with small overheads while keeping high throughput.

Numbers: Low-load TTFT overheads 5–10% (e.g., Llama-70B 223ms vs TP 212ms; 5.19% overhead)

Practical Use: Keep the system in TP during light load to give users fast responses; the runtime only costs a small per-request penalty.

Evidence Ref: Section 6.2, paragraph 'Light loads (flat periods)'
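The two findings together suggest a load-dependent mode choice: TP while traffic is light, DP once a burst builds a queue. The sketch below is a deliberately simple illustration of that idea. The names `Mode`, `choose_mode`, and `burst_threshold` are hypothetical, and the single queue-depth threshold is an assumption; the summary does not describe the actual scheduler's switching rule.

```python
from enum import Enum


class Mode(Enum):
    DP = "data-parallel"
    TP = "tensor-parallel"


def choose_mode(queued_requests: int, burst_threshold: int) -> Mode:
    """Hypothetical policy: TP while the queue is short, DP once it builds up."""
    return Mode.DP if queued_requests >= burst_threshold else Mode.TP


light = choose_mode(queued_requests=2, burst_threshold=16)   # light load
burst = choose_mode(queued_requests=64, burst_threshold=16)  # bursty load
```

A production policy would likely also need hysteresis to avoid flapping between modes near the threshold.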

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Burst P90 TTFT reduction vs static TP	up to 4.79× (Nemotron-8B)	static TP	4.79× reduction	bursty synthetic trace (Section 6.1)	Section 6.2, Figure 8	Fig.8 & text
Low-load average TTFT vs static TP	Llama-70B 223 ms (ours) vs 212 ms (TP)	static TP	≈5.19% overhead	low-load windows	Section 6.2 'Light loads (flat periods)'	text
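The overhead figure in the second row follows directly from the two TTFT values. A short check, with `relative_overhead_pct` as an illustrative helper name:

```python
def relative_overhead_pct(ours_ms: float, baseline_ms: float) -> float:
    """Extra latency relative to the baseline, as a percentage."""
    return (ours_ms - baseline_ms) / baseline_ms * 100.0


overhead = relative_overhead_pct(223.0, 212.0)  # Llama-70B vs static TP
print(f"{overhead:.2f}%")  # prints "5.19%"
```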

What To Try In 7 Days

Run vLLM v1 with the Flying Serving patches on a single multi-GPU node and replay a bursty trace to compare TTFT and throughput.

Pre-initialize communicators for contiguous GPU groups and measure live switch latency versus a cold restart.

Enable soft- and hard-preempt policies and compare priority request latency and background throughput.
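For the comparisons above, a small script can turn per-request TTFT samples into a P90 figure and a reduction factor. This is a sketch using Python's `statistics` module; the sample values are made up for illustration and are not results from the paper, and the helper names are hypothetical.

```python
import statistics


def p90(samples_ms):
    """90th percentile: the last of the nine decile cut points."""
    return statistics.quantiles(samples_ms, n=10, method="inclusive")[-1]


def reduction_factor(baseline_ms, candidate_ms):
    """How many times lower the candidate's P90 is than the baseline's."""
    return p90(baseline_ms) / p90(candidate_ms)


# Made-up TTFT samples in milliseconds, for illustration only.
static_tp = [200, 210, 220, 400, 900, 1500, 2100, 2600, 3000, 3200]
flying = [205, 215, 225, 300, 420, 510, 560, 600, 640, 700]
factor = reduction_factor(static_tp, flying)
```

Collect samples separately for burst windows and flat windows, since the findings above report different behavior for each.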

Optimization Features

Infra Optimization

topology-aware communicator initialization (contiguous NVLink groups)
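One way to picture the pool of contiguous groups is to enumerate them up front for every TP degree the node might use. The sketch below assumes aligned, power-of-two group sizes on a single node; both function names are hypothetical. In a PyTorch-based stack, each rank list could be passed to `torch.distributed.new_group(ranks=...)` at startup, though the summary does not say how the patches create their communicators.

```python
def contiguous_groups(num_gpus: int, group_size: int):
    """Aligned, contiguous GPU index groups, e.g. [0, 1], [2, 3] for size 2."""
    if group_size <= 0 or num_gpus % group_size != 0:
        raise ValueError("group_size must evenly divide num_gpus")
    return [list(range(s, s + group_size)) for s in range(0, num_gpus, group_size)]


def communicator_pool_layout(num_gpus: int):
    """Map each power-of-two TP degree (2 .. num_gpus) to its GPU groups."""
    layout = {}
    size = 2
    while size <= num_gpus:
        if num_gpus % size == 0:
            layout[size] = contiguous_groups(num_gpus, size)
        size *= 2
    return layout


layout = communicator_pool_layout(8)  # hypothetical 8-GPU node
```

Building every group eagerly trades a little startup time for avoiding communicator construction on the switch path.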

Model Optimization

zero-copy logical weight views (no weight movement)

System Optimization

KV Cache Adaptor with adaptive block sizing

shared physical KV pool with logical remapping
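A rough way to see why the logical block size can change while the physical pool stays put: if TP splits KV heads evenly across GPUs, each GPU stores a smaller slice per token, so a fixed-size physical block covers more tokens. The sketch below encodes only that arithmetic. The even head split, the byte sizes, and the name `tokens_per_block` are assumptions for illustration, not details taken from the paper.

```python
def tokens_per_block(block_bytes: int, kv_bytes_per_token: int, tp_degree: int) -> int:
    """Tokens that fit in one physical block on one GPU.

    Assumption: TP splits KV heads evenly, so each GPU stores
    kv_bytes_per_token / tp_degree bytes for every token.
    """
    if tp_degree <= 0 or kv_bytes_per_token % tp_degree != 0:
        raise ValueError("KV bytes per token must split evenly across TP ranks")
    per_gpu_bytes = kv_bytes_per_token // tp_degree
    return block_bytes // per_gpu_bytes


BLOCK_BYTES = 64 * 1024    # hypothetical physical block size
KV_BYTES_PER_TOKEN = 4096  # hypothetical full per-token KV footprint

dp_tokens = tokens_per_block(BLOCK_BYTES, KV_BYTES_PER_TOKEN, tp_degree=1)
tp4_tokens = tokens_per_block(BLOCK_BYTES, KV_BYTES_PER_TOKEN, tp_degree=4)
```

Under these assumptions the same physical block holds 16 tokens in DP mode and 64 tokens at TP degree 4, so only the logical mapping has to change on a switch.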

Inference Optimization

dynamic DP↔TP switching

pre-initialized communicator pool

soft preempt (speculative progress)

hard preempt (latency-first interruption)

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Designed for intra-node multi-GPU setups; multi-node model-parallel models fall outside scope.

Requires contiguous GPU topology (NVLink) for communicator pooling; non-contiguous topologies are unsupported.

When Not To Use

For extremely large models requiring multi-node model parallelism across machines.

In clusters without NVLink or with arbitrary non-contiguous device layouts.

Failure Modes

Soft preempt can cause recomputation overhead if speculative progress must be reconciled at merge time.

If communicator pool assumptions (contiguous groups) are violated, switching may be impossible or slower.

Core Entities

Models

Llama-3-70B

GPT-OSS-120B

Nemotron-8B

Metrics

Time To First Token (TTFT)

Time Per Output Token (TPOT)

Peak generation throughput (tokens/s)

Queue time

Inter-Token Latency (ILT)

Datasets

ShareGPT

CodeActInstruct

HumanEval

Benchmarks

Shift-Parallelism (baseline)

Static DP

Static TP

