Flying Serving: instant DP↔TP switching to cut tail latency, keep throughput, and scale context

February 26, 2026 · 8 min

Overview

Production Readiness

0.8

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

0

Authors

Shouwei Gao, Junqi Yin, Feiyi Wang, Wenqian Dong

Why It Matters For Business

Flying Serving reduces tail latency during bursts and keeps throughput near DP, letting operators serve mixed-priority and long-context traffic without costly restarts or wasted GPUs.

Summary TLDR

Flying Serving is a vLLM-based runtime that switches between data-parallel (many independent replicas) and tensor-parallel (sharded operators) modes in milliseconds. It avoids copying weights, moving the KV cache, or recreating communicators by using (1) zero-copy logical weight views, (2) an adaptor that keeps a shared KV block pool while changing logical block sizing, and (3) an eagerly initialized communicator pool plus a scheduler with soft and hard preempt modes. On three models and bursty workloads it cuts tail latency by up to 4.79×, keeps near-TP latency at low load, retains ≈95–96% of DP throughput, expands single-node context to 1.9M tokens, and performs a live switch in ~15 ms.

Problem Statement

Serving LLMs must maximize throughput under strict latency and memory (context) constraints. Current stacks force a static choice between data parallelism (good for throughput) and tensor parallelism (good for per-request latency and context). Workloads are bursty, mixed-priority, and long-context, so a fixed parallelism often wastes capacity or breaks SLOs. The challenge: make DP↔TP switching cheap and safe without moving large tensors or rebuilding communication groups.

Main Contribution

Design and implement Flying Serving, a middleware on vLLM v1 that allows live DP↔TP switching without restarting workers.

Model Weights Manager that provides zero-copy logical TP shards on top of DP-loaded weights so no weight movement is needed at switch time.

KV Cache Adaptor that keeps a shared physical KV block pool and changes logical block sizing to preserve KV state across modes.

Communicator Pool that pre-initializes topology-aware NCCL/Gloo groups to avoid runtime communicator creation.

Workload-aware scheduler with soft preempt (speculative progress) and hard preempt (latency-first interruption) to coordinate safe transitions.
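To make the scheduler's role more concrete, here is a minimal sketch of a mode and preempt decision for one incoming request. Everything in it is an assumption for illustration: the names `Mode`, `Preempt`, `Request`, and `plan`, the thresholds, and the rule that only high-priority traffic triggers a hard preempt are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    DP = "data_parallel"
    TP = "tensor_parallel"


class Preempt(Enum):
    NONE = "none"
    SOFT = "soft"  # in-flight work continues speculatively until the switch
    HARD = "hard"  # in-flight work is interrupted, latency first


@dataclass
class Request:
    prompt_tokens: int
    high_priority: bool = False


def plan(req, current, queue_depth, per_replica_ctx, low_load_depth=2):
    """Return (target mode, preempt policy) for one incoming request.

    All thresholds are hypothetical; the paper's policy may differ.
    """
    if req.prompt_tokens > per_replica_ctx:
        target = Mode.TP  # context does not fit in a single replica
    elif req.high_priority or queue_depth <= low_load_depth:
        target = Mode.TP  # favor per-request latency
    else:
        target = Mode.DP  # favor aggregate throughput

    if target == current:
        return target, Preempt.NONE
    # Assumption: only high-priority traffic justifies interrupting others.
    return target, (Preempt.HARD if req.high_priority else Preempt.SOFT)


if __name__ == "__main__":
    print(plan(Request(1_000, high_priority=True), Mode.DP, 50, 100_000))
```

The point of the sketch is the shape of the decision (mode first, then how aggressively to reach it), not the specific thresholds.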

Key Findings

Live DP↔TP switching reduces burst P90 TTFT up to 4.79× vs static TP on tested models.

Numbers: P90 TTFT reduction up to 4.79× (Nemotron-8B) under a bursty trace

At low load Flying Serving achieves near-TP latency with small overheads while keeping high throughput.

Numbers: Low-load TTFT overheads of 5–10% (e.g., Llama-70B 223 ms vs TP 212 ms; 5.19% overhead)

Peak throughput is preserved close to DP while improving per-token latency over DP.

Numbers: Retains ≈95–96% of DP peak throughput (e.g., 3,059 vs 3,169 tokens/s); median TPOT improved 2.31× for Llama-70B

Live switching and context scaling are orders of magnitude faster than cold restarts.

Numbers: Live switch 15 ms vs cold restart 146–292 s (∼10,000× faster)

Single-node max context increases substantially by dynamic merging.

Numbers: Flying Serving supports up to 1.9M tokens vs static configs at 264K and 959K (7.2× and 2.0× gains)
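The ratios quoted in these findings can be re-derived from the raw figures. The short script below does only that arithmetic; the variable names are illustrative, and the context sizes are the rounded values given above, so the gains are approximate.

```python
# Figures as quoted in the findings above.
ttft_flying_ms, ttft_tp_ms = 223, 212
tput_flying, tput_dp = 3059, 3169          # tokens/s
switch_s = 0.015                           # 15 ms live switch
restart_s = (146, 292)                     # cold restart range, seconds
ctx_flying = 1_900_000                     # rounded "1.9M"
ctx_static = (264_000, 959_000)            # rounded "264K" and "959K"

ttft_overhead_pct = (ttft_flying_ms / ttft_tp_ms - 1) * 100
tput_retained_pct = tput_flying / tput_dp * 100
restart_speedup = tuple(r / switch_s for r in restart_s)
ctx_gain = tuple(ctx_flying / c for c in ctx_static)

print(f"TTFT overhead:       {ttft_overhead_pct:.2f}%")
print(f"Throughput retained: {tput_retained_pct:.1f}%")
print(f"Restart speedup:     {restart_speedup[0]:,.0f}x to {restart_speedup[1]:,.0f}x")
print(f"Context gain:        {ctx_gain[0]:.1f}x and {ctx_gain[1]:.1f}x")
```

With these inputs the TTFT overhead comes out at 5.19% and the context gains at about 7.2× and 2.0×, matching the quoted values. The restart speedup spans roughly 9,700× to 19,500×, so the quoted ∼10,000× corresponds to the fast end of the cold-restart range.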

Results

Burst P90 TTFT reduction vs static TP

Value: up to 4.79× (Nemotron-8B)

Baseline: static TP

Low-load average TTFT vs static TP

Value: Llama-70B 223 ms (Flying Serving) vs 212 ms (TP)

Baseline: static TP

Median TPOT improvement vs static DP

Value: Llama-70B 2.31×, GPT-OSS-120B 1.28×, Nemotron-8B 1.30×

Baseline: static DP

Peak throughput retained vs DP

Value: ≈95–96% retained (e.g., 3,059 vs 3,169 tokens/s, Llama-70B)

Baseline: static DP

Live switching latency

Value: 15 ms (Flying Serving)

Baseline: cold restart

Max single-node context support

Value: 1.9M tokens (Flying Serving)

Baseline: static configs

Mixed-priority TTFT (all requests)

Value: Flying Serving mean TTFT (all) = 142 ms

Baseline: static TP mean TTFT (all) = 2130 ms

Who Should Care

What To Try In 7 Days

Run vLLM v1 with the Flying Serving patches on a single multi-GPU node and replay a bursty trace to compare TTFT and throughput.

Pre-initialize communicators for contiguous GPU groups and measure live switch latency versus a cold restart.

Enable soft- and hard-preempt policies and compare priority request latency and background throughput.
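For the trace-replay comparisons suggested above, a small helper for tail percentiles is enough to get started. This is a sketch using the nearest-rank definition of a percentile; the function names `percentile` and `compare` are hypothetical, and the TTFT samples are assumed to be collected by whatever benchmark harness is in use.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def compare(baseline_ms, candidate_ms, p=90):
    """Summarize the tail-latency ratio between two runs of the same trace."""
    base = percentile(baseline_ms, p)
    cand = percentile(candidate_ms, p)
    return {"baseline": base, "candidate": cand, "speedup": base / cand}


if __name__ == "__main__":
    # Made-up TTFT samples in milliseconds, for illustration only.
    static_tp = [120, 150, 900, 2100, 2500]
    switching = [110, 130, 140, 160, 400]
    print(compare(static_tp, switching))
```

Replaying the identical trace against both configurations keeps the comparison fair; differing arrival patterns would otherwise dominate the tail.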

Optimization Features

Infra Optimization

  • topology-aware communicator initialization (contiguous NVLink groups)
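One way to picture a pre-initialized communicator pool is to enumerate every contiguous GPU group up front. The sketch below only computes the rank lists; the function name `contiguous_groups` and the restriction to aligned groups of sizes 2, 4, and 8 are assumptions for illustration, not details confirmed by the paper.

```python
def contiguous_groups(num_gpus, sizes=(2, 4, 8)):
    """Map each group size to its aligned, contiguous rank tuples on one node."""
    groups = {}
    for size in sizes:
        # Skip sizes that do not tile the node evenly.
        if size > num_gpus or num_gpus % size:
            continue
        groups[size] = [
            tuple(range(start, start + size))
            for start in range(0, num_gpus, size)
        ]
    return groups


if __name__ == "__main__":
    for size, ranks in contiguous_groups(8).items():
        print(size, ranks)
```

In a PyTorch-based stack, each rank tuple could then be handed once at startup to a group-creation call such as `torch.distributed.new_group(ranks=...)`, so that no communicator has to be built at switch time.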

Model Optimization

  • zero-copy logical weight views (no weight movement)
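The idea of a zero-copy logical view can be shown on the CPU with the standard library alone. The sketch below slices one row-major weight buffer into per-rank row blocks using `memoryview`, which shares memory with the original buffer. It is an analogy under stated assumptions: the real system operates on GPU tensors, the helper name `row_shards` is hypothetical, and the paper's actual shard layout is not described here.

```python
from array import array


def row_shards(weights, rows, cols, tp):
    """Split a row-major rows x cols matrix into tp row-block views without copying."""
    if rows % tp:
        raise ValueError("rows must divide evenly across TP ranks")
    view = memoryview(weights)
    block = (rows // tp) * cols  # elements per shard
    return [view[r * block:(r + 1) * block] for r in range(tp)]


# An 8 x 4 matrix, standing in for weights already loaded by one DP replica.
full = array("f", range(8 * 4))
shards = row_shards(full, rows=8, cols=4, tp=2)

# Writing through the original buffer is visible in the shard view,
# which shows that no copy was made.
full[16] = -1.0
print(shards[1][0])
```

Because the shards are views, switching between the full matrix and its shards costs only the bookkeeping of offsets and lengths, not a transfer of the weights themselves.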

System Optimization

  • KV Cache Adaptor with adaptive block sizing
  • shared physical KV pool with logical remapping
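A rough way to reason about adaptive block sizing: if the physical block size in bytes stays fixed and each GPU under TP stores only its share of the KV heads, then one physical block covers proportionally more tokens. The sketch below encodes that assumption. The function names and the example sizes are hypothetical, and the paper's actual remapping scheme may differ.

```python
def tokens_per_block(block_bytes, kv_bytes_per_token, tp_degree):
    """Logical block size (in tokens) for a fixed physical block on one GPU.

    Assumes KV heads are split evenly across TP ranks, so each GPU stores
    kv_bytes_per_token / tp_degree bytes per token.
    """
    if kv_bytes_per_token % tp_degree:
        raise ValueError("KV bytes per token must split evenly across ranks")
    per_gpu_bytes = kv_bytes_per_token // tp_degree
    return block_bytes // per_gpu_bytes


def max_context(num_blocks, block_bytes, kv_bytes_per_token, tp_degree):
    """Tokens one GPU's physical pool can address at a given TP degree."""
    return num_blocks * tokens_per_block(block_bytes, kv_bytes_per_token, tp_degree)


if __name__ == "__main__":
    BLOCK = 2 * 1024 * 1024   # hypothetical 2 MiB physical block
    PER_TOKEN = 128 * 1024    # hypothetical 128 KiB of KV per token
    for tp in (1, 2, 4):
        print(tp, tokens_per_block(BLOCK, PER_TOKEN, tp), max_context(1000, BLOCK, PER_TOKEN, tp))
```

Under this model the physical pool is never reallocated; only the tokens-per-block figure changes with the TP degree, which is consistent with the reported context growth when replicas merge.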

Inference Optimization

  • dynamic DP↔TP switching
  • pre-initialized communicator pool
  • soft preempt (speculative progress)
  • hard preempt (latency-first interruption)

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Designed for intra-node multi-GPU setups; multi-node model-parallel models fall outside scope.
  • Requires contiguous GPU topology (NVLink) for communicator pooling; non-contiguous topologies are unsupported.
  • The paper describes vLLM patches but does not provide a public release or deployment recipes.

When Not To Use

  • For extremely large models requiring multi-node model parallelism across machines.
  • In clusters without NVLink or with arbitrary non-contiguous device layouts.
  • When regulatory or operational constraints forbid runtime scheduler-driven preemption of requests.

Failure Modes

  • Soft preempt can cause recomputation overhead if speculative progress must be reconciled at merge time.
  • If communicator pool assumptions (contiguous groups) are violated, switching may be impossible or slower.
  • Scheduler bugs or mismatched global ordering could cause stalls; careful testing is needed before production orchestration.

Core Entities

Models

  • Llama-3-70B
  • GPT-OSS-120B
  • Nemotron-8B

Metrics

  • Time To First Token (TTFT)
  • Time Per Output Token (TPOT)
  • Peak generation throughput (tokens/s)
  • Queue time
  • Inter-Token Latency (ITL)

Datasets

  • ShareGPT
  • CodeActInstruct
  • HumanEval

Benchmarks

  • Shift-Parallelism (baseline)
  • Static DP
  • Static TP