Cut model-reload downtime by preserving GPU state and doing small P2P migrations

November 5, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Solid single-node results on realistic models. Planner optimality proven for reload bytes under the assumption that host reload cost per byte dominates P2P. Multi-node and heterogeneous links remain future work.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 75%

Production readiness: 70%

Novelty: 70%

Authors

Wendong Xu, Chujie Chen, He Xiao, Kuan Li, Jing Xiong, Chen Zhang, Wenyong Zhou, Chaofan Tao, Yang Bai, Bei Yu, Ngai Wong

Why It Matters For Business

AnchorTP cuts recovery downtime from tens of seconds or minutes to a few seconds and shortens time to regain peak throughput. That improves user-facing latency SLOs, reduces required redundancy, and lowers cost compared to always-running replicas.

Summary TLDR

AnchorTP speeds up recovery from single-GPU failures by keeping model weights and KV caches pinned in GPU memory and using elastic tensor sharding plus a planner that minimizes host reloads. In single-node tests it cuts time-to-first-success by ~10× and time-to-peak by up to 59% versus restart-and-reload.

Problem Statement

Tensor-parallel LLM inference stops or takes minutes to recover after a GPU or link failure because existing systems require full model reloads and assume fixed equal-width sharding. This causes high downtime and slow stabilization after failures.

Main Contribution

State-preserving daemon (state plane) that pins model parameters and KV caches in GPU memory so surviving shards can be reused without full reload.

Elastic Tensor Parallelism (ETP) that allows unequal-width shards and arbitrary TP sizes so remapping works with any surviving GPU count.
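As a rough illustration of unequal-width sharding, the sketch below splits a tensor dimension into contiguous shards for any TP size, including sizes that do not divide the dimension evenly. The near-equal split rule and the function name `elastic_shard_bounds` are assumptions made for illustration; this summary does not state how AnchorTP actually chooses shard widths.

```python
def elastic_shard_bounds(width: int, tp_size: int) -> list[tuple[int, int]]:
    """Split `width` columns into `tp_size` contiguous half-open intervals.

    Shard widths differ by at most one column, so the split is defined for
    any TP size up to `width`, not only for divisors of `width`.
    (Hypothetical helper; the near-equal rule is an assumption.)
    """
    if tp_size <= 0 or width < tp_size:
        raise ValueError("need 1 <= tp_size <= width")
    base, extra = divmod(width, tp_size)
    bounds = []
    start = 0
    for rank in range(tp_size):
        # The first `extra` ranks each take one additional column.
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


# 8 columns over 3 surviving GPUs: widths 3, 3, 2.
print(elastic_shard_bounds(8, 3))  # [(0, 3), (3, 6), (6, 8)]
```

Contiguous intervals keep the remapping problem simple: each new shard is a single range, so reuse can be decided by comparing range endpoints.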

Key Findings

AnchorTP reduces time-to-first-success (TFS) by about 10× on evaluated models compared to restart-and-reload elastic TP.

Numbers: Qwen3-30B-A3B: 4.5 s vs 48.4 s (≈10.8×) at 25% failure point

Practical Use: If you pin GPU state and reuse shards, recoveries that took tens of seconds to minutes can drop to a few seconds, improving SLOs for serving systems.

Evidence Ref: Table I; evaluation section

CMM planner minimizes host reload volume and yields much lower reload times than naive approaches.

Numbers: Reload: CMM 17.6 s vs Greedy 26.2 s vs Full Reload 197.3 s; P2P ≈1.9 s

Practical Use: Using interval-based reuse planning and P2P transfers can cut bytes read from host by an order of magnitude, reducing wall-clock recovery time.

Evidence Ref: Table II; planner comparison
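The idea of interval-based reuse can be sketched with simple overlap accounting: for each new shard, count the columns already resident on the same GPU, the columns held by another surviving GPU (P2P), and the remainder that must be reloaded from host. This is not the paper's planner; the `Shard` type, the `plan_transfers` function, and the assumption that surviving shards are disjoint column ranges are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    gpu: int
    start: int  # inclusive column index
    end: int    # exclusive column index


def overlap(a: Shard, b: Shard) -> int:
    """Number of columns shared by two half-open column ranges."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def plan_transfers(survivors, targets, bytes_per_col):
    """Classify target bytes as local reuse, P2P copy, or host reload.

    Assumes survivor shards are pairwise disjoint, so overlaps never
    double count a column.
    """
    totals = {"local": 0, "p2p": 0, "host": 0}
    for tgt in targets:
        local = sum(overlap(tgt, s) for s in survivors if s.gpu == tgt.gpu)
        p2p = sum(overlap(tgt, s) for s in survivors if s.gpu != tgt.gpu)
        host = (tgt.end - tgt.start) - local - p2p
        totals["local"] += local * bytes_per_col
        totals["p2p"] += p2p * bytes_per_col
        totals["host"] += host * bytes_per_col
    return totals


# Hypothetical case: 12 columns on 4 GPUs, GPU 3 fails, remap to 3 GPUs.
survivors = [Shard(0, 0, 3), Shard(1, 3, 6), Shard(2, 6, 9)]
targets = [Shard(0, 0, 4), Shard(1, 4, 8), Shard(2, 8, 12)]
print(plan_transfers(survivors, targets, bytes_per_col=1024))
# {'local': 6144, 'p2p': 3072, 'host': 3072}
```

In this toy case only the three columns that lived on the failed GPU come from host; everything else is reused in place or copied between survivors.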

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
TFS (Qwen3-30B-A3B) @25% failure | AnchorTP 4.48 ± 0.35 s; Elastic TP (restart-only) 48.43 ± 2.95 s | Elastic TP (restart-only) | ≈10.8× faster | ShareGPT replay, 25% injection | Table I in paper | Table I
TFS (Mixtral-8×22B) @25% failure | AnchorTP 18.71 ± 1.44 s; Elastic TP (restart-only) 195.82 ± 5.07 s | Elastic TP (restart-only) | ≈10.5× faster | ShareGPT replay, 25% injection | Table I in paper | Table I
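The deltas in the table follow directly from the mean TFS values; a quick check, using only the numbers reported above (the `speedup` helper is a name chosen here):

```python
def speedup(baseline_s: float, system_s: float) -> float:
    """Ratio of baseline time to system time (higher is better)."""
    return baseline_s / system_s


qwen = speedup(48.43, 4.48)       # Qwen3-30B-A3B, restart-only vs AnchorTP
mixtral = speedup(195.82, 18.71)  # Mixtral-8x22B, restart-only vs AnchorTP
print(f"{qwen:.1f}x, {mixtral:.1f}x")  # 10.8x, 10.5x
```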

What To Try In 7 Days

Prototype a daemon that pins a small model's weights and KV cache in GPU memory and expose IPC handles to a simple serving process.

Implement a minimal interval-mapping planner that maps surviving blocks to new shards and measure bytes read from host versus P2P.

Run a single-node failure injection and track TFS/TTP to quantify improvement before investing in full ETP integration.
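For the measurement step, a minimal sketch of computing TFS and TTP from a request log might look like the following. The definitions used here are assumptions, since this summary does not spell them out: TFS is taken as the delay from the failure to the first successful completion after it, and TTP as the delay until windowed throughput returns to a chosen fraction of the pre-failure peak. The function names and the 95% threshold are hypothetical.

```python
def time_to_first_success(events, failure_t):
    """events: iterable of (completion_time_s, succeeded) pairs.

    Returns seconds from the failure to the first successful completion
    at or after it, or None if no request succeeded.
    """
    after = [t for t, ok in events if ok and t >= failure_t]
    return min(after) - failure_t if after else None


def time_to_peak(throughput, failure_t, peak_fraction=0.95):
    """throughput: list of (window_end_s, tokens_per_s), sorted by time.

    Returns seconds from the failure until throughput first reaches
    `peak_fraction` of the best pre-failure window, or None.
    """
    before = [tps for t, tps in throughput if t < failure_t]
    if not before:
        return None
    target = peak_fraction * max(before)
    for t, tps in throughput:
        if t >= failure_t and tps >= target:
            return t - failure_t
    return None


# Made-up log for illustration only.
events = [(1.0, True), (10.0, False), (14.5, True), (15.0, True)]
windows = [(5, 1000), (9, 980), (12, 0), (15, 400), (20, 960), (25, 1000)]
print(time_to_first_success(events, failure_t=10.0))  # 4.5
print(time_to_peak(windows, failure_t=10))            # 10
```

Running the same script against a restart-and-reload baseline gives a like-for-like comparison before committing to deeper integration.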

Optimization Features

Infra Optimization

Exploit intra-node high-bandwidth links (IF/XGMI) over PCIe

Plan assumes P2P cheaper than host reload

System Optimization

Pre-allocate destination GPU buffers before migration

LRU eviction for KV cache to retain hot sessions

Inference Optimization

Elastic Tensor Parallelism (unequal-width sharding)

State-preserving daemon to avoid full model reloads

Topology-aware execution scheduling (prioritize high-bandwidth P2P paths)

Overlap host reloads with P2P transfers to hide latency
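The LRU eviction policy for pinned KV caches can be sketched as byte-budget bookkeeping over sessions. This is an assumption-laden illustration, not the daemon's implementation: the class name `KVCacheLRU`, the per-session granularity, and the `touch` interface are hypothetical, and no GPU memory is actually managed here.

```python
from collections import OrderedDict


class KVCacheLRU:
    """Tracks which sessions' KV caches stay pinned under a byte budget."""

    def __init__(self, budget_bytes: int):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self._sessions = OrderedDict()  # session_id -> size in bytes, oldest first

    def touch(self, session_id, size_bytes: int):
        """Record a use of a session; return the session ids evicted to fit it."""
        if size_bytes > self.budget_bytes:
            raise ValueError("session larger than the whole budget")
        # Drop any previous entry so the session moves to the "hot" end.
        if session_id in self._sessions:
            self.used_bytes -= self._sessions.pop(session_id)
        evicted = []
        while self.used_bytes + size_bytes > self.budget_bytes:
            victim, victim_size = self._sessions.popitem(last=False)  # coldest
            self.used_bytes -= victim_size
            evicted.append(victim)
        self._sessions[session_id] = size_bytes
        self.used_bytes += size_bytes
        return evicted

    def resident(self):
        """Session ids currently pinned, coldest first."""
        return list(self._sessions)


cache = KVCacheLRU(budget_bytes=100)
cache.touch("a", 40)
cache.touch("b", 40)
cache.touch("a", 40)         # "a" becomes the hottest session
print(cache.touch("c", 40))  # ['b']
print(cache.resident())      # ['a', 'c']
```

Evicted sessions would need their KV state rebuilt on the next request, which is the cold-start cost this policy tries to confine to inactive sessions.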

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation focuses on single-node multi-GPU setups; multi-node planning with cross-node reachability is not addressed.

Planner assumes per-byte reload cost strictly higher than P2P; this may not hold on platforms with slow P2P or fast host IO.
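Before adopting the planner on a new platform, the cost assumption can be checked with measured bandwidths. The sketch below uses a bandwidth-only model that ignores latency and link contention; the function names and the example bandwidth figures are hypothetical and are not taken from the paper.

```python
def p2p_cheaper_per_byte(host_gb_per_s: float, p2p_gb_per_s: float) -> bool:
    """Per-byte cost is 1/bandwidth, so P2P is cheaper when it is faster."""
    return p2p_gb_per_s > host_gb_per_s


def recovery_time_s(host_bytes, p2p_bytes, host_gb_per_s, p2p_gb_per_s,
                    overlapped=True):
    """Estimated transfer time under a bandwidth-only model.

    With overlap, host reloads and P2P copies run concurrently and the
    slower stream bounds the total; otherwise the two times add up.
    """
    host_t = host_bytes / (host_gb_per_s * 1e9)
    p2p_t = p2p_bytes / (p2p_gb_per_s * 1e9)
    return max(host_t, p2p_t) if overlapped else host_t + p2p_t


# Hypothetical platform: 2 GB/s effective host reload, 50 GB/s P2P.
print(p2p_cheaper_per_byte(2.0, 50.0))         # True
print(recovery_time_s(4e9, 10e9, 2.0, 50.0))   # 2.0
```

If the check returns False on a given platform, minimizing host reload bytes no longer minimizes recovery time, and the planner's optimality argument does not carry over.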

When Not To Use

Multi-node clusters where inter-node links dominate and link reachability must be planned into migration.

Environments with no spare GPU memory to host pinned state.

Failure Modes

If the pinned KV cache exceeds the daemon's memory budget, eviction and on-demand replay can cause cold starts for evicted sessions and temporary cache misses.

Planner ignores link reachability; scheduler must detect unreachable P2P paths or recovery will fall back to reloads.
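One way a scheduler could guard against the reachability gap is to filter planned P2P copies against a set of known-good links and send the rest to host reload. The sketch below is hypothetical: `route_transfers` and the `reachable` set are illustrative names, and how the set is populated (for example, from a peer-access query such as CUDA's `cudaDeviceCanAccessPeer`) is left as an assumption.

```python
def route_transfers(requests, reachable):
    """Split planned copies into P2P transfers and host-reload fallbacks.

    requests:  list of (src_gpu, dst_gpu, nbytes) planned by the planner.
    reachable: set of (src_gpu, dst_gpu) pairs with a working P2P path.
    """
    p2p, host = [], []
    for src, dst, nbytes in requests:
        if (src, dst) in reachable:
            p2p.append((src, dst, nbytes))
        else:
            # No usable link: the destination reloads these bytes from host.
            host.append((dst, nbytes))
    return p2p, host


planned = [(0, 1, 4096), (2, 1, 2048), (2, 0, 1024)]
links = {(0, 1), (1, 0), (2, 0), (0, 2)}  # no direct path between GPUs 1 and 2
p2p, host = route_transfers(planned, links)
print(p2p)   # [(0, 1, 4096), (2, 0, 1024)]
print(host)  # [(1, 2048)]
```

Logging the fallback bytes makes the cost of missing links visible, since each fallback moves traffic onto the slower host path.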

Core Entities

Models

Qwen3-30B-A3B
Mixtral-8×22B
Qwen3-8B
Qwen3-14B

Metrics

Time to First Success (TFS)
Time to Peak (TTP)
reload time
P2P transfer time
total runtime / overhead
tokens/sec (throughput)

Datasets

ShareGPT replay (1,000 requests)

Benchmarks

single-node TP degradation: 4→2, 8→6, 8→7
end-to-end failure injections at 25% and 50% of request stream

Context Entities

Models

MoE