Cut model-reload downtime by preserving GPU state and doing small P2P migrations

November 5, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.75

Citation Count

0

Authors

Wendong Xu, Chujie Chen, He Xiao, Kuan Li, Jing Xiong, Chen Zhang, Wenyong Zhou, Chaofan Tao, Yang Bai, Bei Yu, Ngai Wong

Why It Matters For Business

AnchorTP cuts recovery downtime from tens of seconds or minutes to seconds (4.5 s versus 48.4 s on Qwen3-30B-A3B and 18.7 s versus 195.8 s on Mixtral-8×22B in single-node tests) and shortens the time to regain peak throughput. That helps user-facing latency SLOs, reduces the redundancy required, and lowers cost compared with always-running replicas.

Summary TLDR

AnchorTP speeds up recovery from single-GPU failures by keeping model weights and KV caches pinned in GPU memory and using elastic tensor sharding plus a planner that minimizes host reloads. In single-node tests it cuts time-to-first-success by ~10× and time-to-peak by up to 59% versus restart-and-reload.

Problem Statement

Tensor-parallel LLM inference stops or takes minutes to recover after a GPU or link failure because existing systems require full model reloads and assume fixed equal-width sharding. This causes high downtime and slow stabilization after failures.

Main Contribution

State-preserving daemon (state plane) that pins model parameters and KV caches in GPU memory so surviving shards can be reused without full reload.

Elastic Tensor Parallelism (ETP) that allows unequal-width shards and arbitrary TP sizes so remapping works with any surviving GPU count.

Continuous Minimal Migration (CMM) planner that minimizes bytes reloaded from host by mapping interval overlaps, plus a topology-aware scheduler that pipelines P2P transfers with host reloads.
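
The interval-overlap accounting behind these contributions can be illustrated with a minimal Python sketch. This is not the paper's CMM planner: it assumes a one-dimensional sharded axis, assigns survivors to new shards in their original order, and only counts how many bytes are already local, must move over P2P, or must come from the host. The names `elastic_shards`, `overlap`, and `plan_bytes` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) over the sharded axis


def elastic_shards(width: int, n_gpus: int) -> List[Interval]:
    """Split `width` columns into contiguous shards whose sizes differ by at most 1."""
    base, extra = divmod(width, n_gpus)
    shards, start = [], 0
    for i in range(n_gpus):
        size = base + (1 if i < extra else 0)
        shards.append((start, start + size))
        start += size
    return shards


def overlap(a: Interval, b: Interval) -> int:
    """Number of columns shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def plan_bytes(old: List[Interval], failed: int, new: List[Interval],
               bytes_per_col: int = 1) -> Tuple[int, int, int]:
    """Return (local, p2p, host) bytes for a naive in-order survivor mapping.

    Assumption: survivor k keeps its GPU and hosts new shard k. Data that
    lived only on the failed GPU must be reloaded from the host.
    """
    survivors = [s for i, s in enumerate(old) if i != failed]
    assert len(survivors) == len(new), "one new shard per surviving GPU"
    lost = old[failed]
    local = p2p = host = 0
    for j, target in enumerate(new):
        host += overlap(target, lost)
        for k, src in enumerate(survivors):
            shared = overlap(target, src)
            if k == j:
                local += shared   # already resident on the right GPU
            else:
                p2p += shared     # resident on another surviving GPU
    return (local * bytes_per_col, p2p * bytes_per_col, host * bytes_per_col)


old_layout = elastic_shards(56, 8)   # 8 equal shards of 7 columns
new_layout = elastic_shards(56, 7)   # 7 shards of 8 columns after one failure
print(plan_bytes(old_layout, failed=3, new=new_layout))  # (40, 9, 7)
```

In this toy case only the 7 columns held by the failed GPU need a host reload; everything else is either already in place or reachable over P2P. A real planner would also choose which survivor hosts which new shard, which this sketch does not attempt.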

Key Findings

AnchorTP reduces time-to-first-success (TFS) by about 10× on evaluated models compared to restart-and-reload elastic TP.

Numbers: Qwen3-30B-A3B: 4.5s vs 48.4s (≈10.8×) at 25% failure point

CMM planner minimizes host reload volume and yields much lower reload times than naive approaches.

Numbers: Reload: CM 17.6s vs Greedy 26.2s vs Full Reload 197.3s; P2P ≈1.9s

Elastic recovery plus expert load balancing (EPLB) restores throughput and shortens time-to-peak.

Numbers: Mixtral-8×22B throughput: 436.6→562.3 tokens/s (+29%) with EPLB; TTP reduced up to 59%

Results

TFS (Qwen3-30B-A3B) @25% failure

Value: AnchorTP 4.48 ± 0.35 s; Elastic TP (restart-only) 48.43 ± 2.95 s

Baseline: Elastic TP (restart-only)

TFS (Mixtral-8×22B) @25% failure

Value: AnchorTP 18.71 ± 1.44 s; Elastic TP (restart-only) 195.82 ± 5.07 s

Baseline: Elastic TP (restart-only)

Planner reload time (8→7, Mixtral)

Value: CM 17.56 ± 1.39 s; Greedy 26.15 ± 1.73 s; Full Reload 197.31 ± 7.98 s

Baseline: Full Reload

P2P transfer time (CM)

Value: ≈1.87 ± 0.33 s

Throughput (Mixtral MoE) with/without EPLB

Value: Without EPLB 436.61 tokens/s; With EPLB 562.32 tokens/s (+29%)

Baseline: no EPLB

Total runtime overhead reduction

Value: Qwen3-30B-A3B total runtime reduced by 5.6×; Mixtral reduced by 4.7× (versus restart-only)

Baseline: Elastic TP (restart-only)
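
As a quick sanity check, the headline ratios can be recomputed from the mean values reported above. The short Python snippet below uses only those means (it ignores the ± spreads), and the dictionary labels are just illustrative names.

```python
# (baseline seconds, improved seconds), taken from the reported means above
reported = {
    "TFS Qwen3-30B-A3B": (48.43, 4.48),
    "TFS Mixtral-8x22B": (195.82, 18.71),
    "Reload, Full vs CM": (197.31, 17.56),
    "Reload, Greedy vs CM": (26.15, 17.56),
}

ratios = {name: baseline / value for name, (baseline, value) in reported.items()}
eplb_gain = 562.32 / 436.61 - 1.0  # relative throughput gain with EPLB

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
print(f"EPLB throughput gain: {eplb_gain:.0%}")
```

This prints roughly 10.8x, 10.5x, 11.2x, and 1.5x for the four ratios and 29% for the EPLB gain, which matches the "about 10×" and "+29%" figures quoted in the summary.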

Who Should Care

What To Try In 7 Days

Prototype a daemon that pins a small model's weights and KV cache in GPU memory and exposes IPC handles to a simple serving process.

Implement a minimal interval-mapping planner that maps surviving blocks to new shards and measure bytes read from host versus P2P.

Run a single-node failure injection and track TFS/TTP to quantify improvement before investing in full ETP integration.
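
For the failure-injection step, a small helper for deriving TFS and TTP from logs can look like the Python sketch below. The summary does not spell out exact metric definitions, so the ones used here are assumptions: TFS is the delay from the failure to the first successful request completion, and TTP is the delay until sampled throughput first reaches a chosen fraction of the pre-failure peak. The function names, the log shapes, and the 0.95 threshold are all hypothetical.

```python
from typing import List, Optional, Tuple


def time_to_first_success(failure_t: float,
                          completions: List[Tuple[float, bool]]) -> Optional[float]:
    """completions: (finish_time_s, succeeded) pairs in any order."""
    after = [t for t, ok in completions if ok and t >= failure_t]
    return min(after) - failure_t if after else None


def time_to_peak(failure_t: float,
                 samples: List[Tuple[float, float]],
                 peak_tps: float,
                 frac: float = 0.95) -> Optional[float]:
    """samples: (time_s, tokens_per_s) pairs sorted by time."""
    for t, tps in samples:
        if t >= failure_t and tps >= frac * peak_tps:
            return t - failure_t
    return None


# Toy log: failure injected at t = 100 s.
completions = [(99.5, True), (101.0, False), (104.5, True), (105.0, True)]
samples = [(100.0, 0.0), (105.0, 300.0), (110.0, 500.0), (120.0, 540.0), (130.0, 560.0)]

print(time_to_first_success(100.0, completions))    # 4.5
print(time_to_peak(100.0, samples, peak_tps=562.0))  # 20.0
```

Running the same helpers against a restart-and-reload baseline gives a like-for-like comparison before committing to a full ETP integration.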

Optimization Features

Infra Optimization

  • Prefer intra-node high-bandwidth links (Infinity Fabric/xGMI) over PCIe
  • Planner assumes P2P transfers are cheaper per byte than host reloads

System Optimization

  • Pre-allocate destination GPU buffers before migration
  • LRU eviction for KV cache to retain hot sessions
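
The LRU eviction idea can be sketched in a few lines of Python. This is an illustration of generic byte-budgeted LRU bookkeeping, not the paper's daemon: the class name `KVCacheLRU`, the `touch` method, and the per-session byte sizes are assumptions, and no GPU memory is actually managed here.

```python
from collections import OrderedDict
from typing import List


class KVCacheLRU:
    """Tracks per-session KV cache sizes and evicts least recently used sessions."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget = budget_bytes
        self.used = 0
        self._entries: "OrderedDict[str, int]" = OrderedDict()  # oldest first

    def touch(self, session_id: str, size_bytes: int) -> List[str]:
        """Insert or refresh a session; return the session ids evicted to fit it."""
        if size_bytes > self.budget:
            raise ValueError("single session exceeds the cache budget")
        if session_id in self._entries:
            self.used -= self._entries.pop(session_id)
        self._entries[session_id] = size_bytes  # now the most recently used
        self.used += size_bytes
        evicted = []
        while self.used > self.budget:
            victim, victim_size = self._entries.popitem(last=False)
            self.used -= victim_size
            evicted.append(victim)
        return evicted

    def sessions(self) -> List[str]:
        return list(self._entries)


cache = KVCacheLRU(budget_bytes=100)
cache.touch("a", 40)
cache.touch("b", 40)
cache.touch("a", 40)          # refresh "a", so "b" becomes the oldest
print(cache.touch("c", 40))   # ['b']
print(cache.sessions())       # ['a', 'c']
```

Sessions that are touched often stay resident, so hot sessions survive a recovery while cold ones are the first to be dropped and replayed on demand.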

Inference Optimization

  • Elastic Tensor Parallelism (unequal-width sharding)
  • State-preserving daemon to avoid full model reloads
  • Topology-aware execution scheduling (prioritize high-bandwidth P2P paths)
  • Overlap host reloads with P2P transfers to hide latency
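
A back-of-the-envelope timing model shows why overlapping host reloads with P2P transfers helps, and when the planner's cost assumption holds. The Python sketch below treats P2P and host I/O as two independent channels with fixed bandwidths; that simplification, the function names, and the example bandwidths are all assumptions rather than measurements from the paper.

```python
def transfer_seconds(n_bytes: float, gb_per_s: float) -> float:
    """Time to move n_bytes over a link with the given bandwidth in GB/s."""
    return n_bytes / (gb_per_s * 1e9)


def recovery_seconds(p2p_bytes: float, host_bytes: float,
                     p2p_gb_per_s: float, host_gb_per_s: float,
                     overlapped: bool = True) -> float:
    """Makespan of a migration when the two channels run in parallel or in sequence."""
    p2p_s = transfer_seconds(p2p_bytes, p2p_gb_per_s)
    host_s = transfer_seconds(host_bytes, host_gb_per_s)
    return max(p2p_s, host_s) if overlapped else p2p_s + host_s


def p2p_is_cheaper(p2p_gb_per_s: float, host_gb_per_s: float) -> bool:
    """The planner's cost assumption: a byte moved over P2P costs less than a host reload."""
    return p2p_gb_per_s > host_gb_per_s


# Hypothetical numbers: 20 GB over a 50 GB/s P2P link, 10 GB from host at 5 GB/s.
serial = recovery_seconds(20e9, 10e9, 50.0, 5.0, overlapped=False)
parallel = recovery_seconds(20e9, 10e9, 50.0, 5.0, overlapped=True)
print(f"serial {serial:.1f} s, overlapped {parallel:.1f} s")
print(p2p_is_cheaper(50.0, 5.0))
```

With these made-up figures the overlapped schedule is bounded by the slower host channel (about 2.0 s) rather than the sum (about 2.4 s). If `p2p_is_cheaper` returns False on a given platform, the cost assumption noted above does not hold there.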

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation focuses on single-node multi-GPU setups; multi-node planning with cross-node reachability is not addressed.
  • Planner assumes per-byte reload cost strictly higher than P2P; this may not hold on platforms with slow P2P or fast host IO.
  • Requires spare VRAM to pin weights/KV; not suitable if GPUs operate at full memory capacity.

When Not To Use

  • Multi-node clusters where inter-node links dominate and link reachability must be planned into migration.
  • Environments with no spare GPU memory to host pinned state.
  • Settings where P2P bandwidth is worse than host reload (invalidating the cost assumption).

Failure Modes

  • If the pinned KV cache exceeds the daemon's memory budget, eviction and on-demand replay can cause session cold starts and temporary cache misses.
  • Planner ignores link reachability; scheduler must detect unreachable P2P paths or recovery will fall back to reloads.
  • Stale expert routing in MoE can cause temporary load imbalance before EPLB rebalancing.

Core Entities

Models

  • Qwen3-30B-A3B
  • Mixtral-8×22B
  • Qwen3-8B
  • Qwen3-14B

Metrics

  • Time to First Success (TFS)
  • Time to Peak (TTP)
  • reload time
  • P2P transfer time
  • total runtime / overhead
  • tokens/sec (throughput)

Datasets

  • ShareGPT replay (1,000 requests)

Benchmarks

  • single-node TP degradation: 4→2, 8→6, 8→7
  • end-to-end failure injections at 25% and 50% of request stream

Context Entities

Models

  • MoE