Cut model-reload downtime by preserving GPU state and doing small P2P migrations

Overview

Decision Snapshot: Ready For Pilot

Solid single-node results on realistic models. Planner optimality proven for reload bytes under the assumption that host reload cost per byte dominates P2P. Multi-node and heterogeneous links remain future work.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 75%

Production readiness: 70%

Novelty: 70%

Authors

Wendong Xu, Chujie Chen, He Xiao, Kuan Li, Jing Xiong, Chen Zhang, Wenyong Zhou, Chaofan Tao, Yang Bai, Bei Yu, Ngai Wong

Why It Matters For Business

AnchorTP cuts recovery downtime from tens of seconds or minutes to a few seconds and shortens time to regain peak throughput. That improves user-facing latency SLOs, reduces required redundancy, and lowers cost compared to always-running replicas.

Who Should Care

Engineering Lead, ML Engineer, CTO, Product Manager

Summary TLDR

AnchorTP speeds up recovery from single-GPU failures by keeping model weights and KV caches pinned in GPU memory and using elastic tensor sharding plus a planner that minimizes host reloads. In single-node tests it cuts time-to-first-success by ~10× and time-to-peak by up to 59% versus restart-and-reload.

Problem Statement

Tensor-parallel LLM inference stops or takes minutes to recover after a GPU or link failure because existing systems require full model reloads and assume fixed equal-width sharding. This causes high downtime and slow stabilization after failures.

Main Contribution

State-preserving daemon (state plane) that pins model parameters and KV caches in GPU memory so surviving shards can be reused without full reload.

Elastic Tensor Parallelism (ETP) that allows unequal-width shards and arbitrary TP sizes so remapping works with any surviving GPU count.
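
Unequal-width sharding can be illustrated with a small helper that splits a weight dimension into contiguous intervals for any TP size. This is a minimal sketch, not the paper's partitioning rule: the near-equal split and the name `shard_bounds` are assumptions made for illustration.

```python
def shard_bounds(width: int, tp_size: int) -> list:
    """Split `width` columns into `tp_size` contiguous half-open intervals.

    Assumption: widths differ by at most one column. The summarized paper
    may use a different rule; this only shows that the TP size does not
    have to divide the dimension evenly.
    """
    if tp_size <= 0 or width < tp_size:
        raise ValueError("need 1 <= tp_size <= width")
    base, extra = divmod(width, tp_size)
    bounds = []
    start = 0
    for rank in range(tp_size):
        # The first `extra` ranks each take one additional column.
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


# Ten columns over three surviving GPUs: one shard is wider than the others.
print(shard_bounds(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Expressing shards as intervals rather than equal slices is what lets a remap target 7 or 6 GPUs after starting from 8.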

Key Findings

AnchorTP reduces time-to-first-success (TFS) by about 10× on evaluated models compared to restart-and-reload elastic TP.

Numbers: Qwen3-30B-A3B: 4.5 s vs 48.4 s (≈10.8×) at the 25% failure point

Practical Use: If you pin GPU state and reuse shards, recoveries that took tens of seconds to minutes can drop to a few seconds, improving SLOs for serving systems.

Evidence Ref: Table I; evaluation section

CMM planner minimizes host reload volume and yields much lower reload times than naive approaches.

Numbers: Reload: CM 17.6 s vs Greedy 26.2 s vs Full Reload 197.3 s; P2P ≈1.9 s

Practical Use: Using interval-based reuse planning and P2P transfers can cut bytes read from host by an order of magnitude, reducing wall-clock recovery time.

Evidence Ref: Table II; planner comparison
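
The accounting behind interval-based reuse can be sketched as follows. The code does not choose a mapping and is not the paper's planner; it only classifies, for a given old and new sharding, how many columns each surviving GPU already holds, can fetch from a surviving peer, or must reload from host. The function names and the column-count units are assumptions.

```python
def overlap(a, b):
    """Length of the intersection of two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def account(old, new):
    """Classify the columns needed after a remap.

    old: gpu -> interval held before the failure, surviving GPUs only.
    new: gpu -> interval required after the remap.
    Assumes the old intervals are disjoint, as in a valid sharding.
    """
    totals = {"local": 0, "p2p": 0, "host": 0}
    for gpu, need in new.items():
        local = overlap(need, old.get(gpu, (0, 0)))
        p2p = sum(overlap(need, held) for src, held in old.items() if src != gpu)
        # Whatever no survivor holds lived only on a failed GPU.
        host = (need[1] - need[0]) - local - p2p
        totals["local"] += local
        totals["p2p"] += p2p
        totals["host"] += host
    return totals


# Four GPUs held three columns each; GPU 3 failed, so 12 columns are
# re-split over three survivors.
old = {0: (0, 3), 1: (3, 6), 2: (6, 9)}
new = {0: (0, 4), 1: (4, 8), 2: (8, 12)}
print(account(old, new))  # {'local': 6, 'p2p': 3, 'host': 3}
```

In this toy case the host reload equals the width of the lost shard, which is the least any plan could read from host.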

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
TFS (Qwen3-30B-A3B) @25% failure	AnchorTP 4.48 ± 0.35 s; Elastic TP (restart-only) 48.43 ± 2.95 s	Elastic TP (restart-only)	≈10.8× faster	ShareGPT replay, 25% injection	Table I in paper	Table I
TFS (Mixtral-8×22B) @25% failure	AnchorTP 18.71 ± 1.44 s; Elastic TP (restart-only) 195.82 ± 5.07 s	Elastic TP (restart-only)	≈10.5× faster	ShareGPT replay, 25% injection	Table I in paper	Table I
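
The Delta column can be reproduced from the mean values in the two rows above. The snippet below is a small arithmetic check; the dictionary layout is an illustrative choice and the ± spreads are ignored.

```python
# Mean TFS in seconds, copied from the table: (restart-only baseline, AnchorTP).
tfs = {
    "Qwen3-30B-A3B": (48.43, 4.48),
    "Mixtral-8x22B": (195.82, 18.71),
}


def speedup(baseline_s: float, anchored_s: float) -> float:
    """Ratio of baseline time to AnchorTP time."""
    return baseline_s / anchored_s


for model, (baseline_s, anchored_s) in tfs.items():
    print(f"{model}: {speedup(baseline_s, anchored_s):.1f}x")
# Qwen3-30B-A3B: 10.8x
# Mixtral-8x22B: 10.5x
```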

What To Try In 7 Days

Prototype a daemon that pins a small model's weights and KV cache in GPU memory and expose IPC handles to a simple serving process.

Implement a minimal interval-mapping planner that maps surviving blocks to new shards and measure bytes read from host versus P2P.

Run a single-node failure injection and track TFS/TTP to quantify improvement before investing in full ETP integration.
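
For the failure-injection step, TFS and TTP can be computed from a simple request log. The sketch below assumes TFS is the delay from the injected failure to the first successful completion, and TTP is the delay until sampled throughput first reaches a chosen fraction of the pre-failure peak. These definitions, the 0.95 threshold, and the log formats are assumptions, not taken from the paper.

```python
from typing import Optional


def time_to_first_success(failure_t: float, completions: list) -> Optional[float]:
    """completions: (timestamp_s, succeeded) pairs for finished requests."""
    after = [t for t, ok in completions if ok and t >= failure_t]
    return min(after) - failure_t if after else None


def time_to_peak(failure_t: float, samples: list, peak_tps: float,
                 fraction: float = 0.95) -> Optional[float]:
    """samples: (timestamp_s, tokens_per_sec) pairs from a throughput monitor."""
    for t, tps in sorted(samples):
        if t >= failure_t and tps >= fraction * peak_tps:
            return t - failure_t
    return None


# Hypothetical log with a failure injected at t = 10.0 s.
completions = [(9.0, True), (10.5, False), (14.5, True), (15.0, True)]
samples = [(8.0, 1000.0), (11.0, 0.0), (15.0, 600.0), (20.0, 960.0), (25.0, 1000.0)]
print(time_to_first_success(10.0, completions))  # 4.5
print(time_to_peak(10.0, samples, peak_tps=1000.0))  # 10.0
```

Running the same log analysis against a restart-and-reload baseline gives the before/after comparison the pilot needs.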

Optimization Features

Infra Optimization

Exploit intra-node high-bandwidth links (IF/XGMI) over PCIe

Plan assumes P2P cheaper than host reload

System Optimization

Pre-allocate destination GPU buffers before migration

LRU eviction for KV cache to retain hot sessions
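
LRU eviction under a byte budget can be sketched with the standard library's `OrderedDict`. This is an illustrative bookkeeping model only: it tracks session sizes rather than real GPU buffers, and the class name `KVCacheLRU` and its methods are hypothetical.

```python
from collections import OrderedDict


class KVCacheLRU:
    """Tracks per-session KV cache sizes and evicts the least recently used."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self._sizes = OrderedDict()  # session_id -> bytes, oldest first

    def touch(self, session_id: str, nbytes: int) -> list:
        """Insert or refresh a session; return the ids evicted to make room."""
        if nbytes > self.budget_bytes:
            raise ValueError("session is larger than the whole budget")
        if session_id in self._sizes:
            # Remove the old entry so re-inserting moves it to the newest slot.
            self.used_bytes -= self._sizes.pop(session_id)
        self._sizes[session_id] = nbytes
        self.used_bytes += nbytes
        evicted = []
        while self.used_bytes > self.budget_bytes:
            victim, size = self._sizes.popitem(last=False)  # oldest entry
            self.used_bytes -= size
            evicted.append(victim)
        return evicted

    def sessions(self) -> list:
        """Session ids from least to most recently used."""
        return list(self._sizes)


cache = KVCacheLRU(budget_bytes=100)
cache.touch("a", 40)
cache.touch("b", 40)
cache.touch("a", 40)         # "a" becomes the hottest session
print(cache.touch("c", 40))  # ['b']
```

An evicted session is the one that would need on-demand replay after recovery, so the budget directly controls how many sessions stay warm.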

Inference Optimization

Elastic Tensor Parallelism (unequal-width sharding)

State-preserving daemon to avoid full model reloads

Topology-aware execution scheduling (prioritize high-bandwidth P2P paths)

Overlap host reloads with P2P transfers to hide latency
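
Two of these ideas can be made concrete with a rough model: ordering P2P transfers by link bandwidth, and estimating the benefit of overlapping host reloads with P2P. The sketch assumes host and P2P traffic use independent paths and overlap perfectly; the link bandwidth figures and function names are hypothetical, while the 17.6 s and 1.9 s inputs are the reload and P2P times quoted under Key Findings.

```python
def order_transfers(transfers: list, link_gbps: dict) -> list:
    """Sort (src, dst, nbytes) transfers so the fastest links are used first."""
    return sorted(transfers, key=lambda t: link_gbps[(t[0], t[1])], reverse=True)


def recovery_seconds(host_s: float, p2p_s: float, overlap: bool = True) -> float:
    """Wall-clock estimate: the longer phase if overlapped, else the sum."""
    return max(host_s, p2p_s) if overlap else host_s + p2p_s


# Hypothetical per-link bandwidths in GB/s, keyed by (src GPU, dst GPU).
link_gbps = {(0, 1): 32.0, (2, 1): 64.0, (0, 2): 48.0}
transfers = [(0, 1, 100), (2, 1, 100), (0, 2, 50)]
print(order_transfers(transfers, link_gbps))
# [(2, 1, 100), (0, 2, 50), (0, 1, 100)]

print(recovery_seconds(17.6, 1.9))  # 17.6
```

With perfect overlap the P2P phase is hidden entirely behind the host reload, so the remaining lever is shrinking host bytes.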

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Evaluation focuses on single-node multi-GPU setups; multi-node planning with cross-node reachability is not addressed.

Planner assumes per-byte reload cost strictly higher than P2P; this may not hold on platforms with slow P2P or fast host IO.
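
Before adopting the planner on a new platform, the bandwidth assumption can be checked against measured numbers. The sketch below flags P2P links that are no faster than the host reload path. All bandwidth values are hypothetical placeholders for your own measurements, and the function name is illustrative.

```python
def links_violating_assumption(host_gbps: float, p2p_gbps: dict) -> list:
    """Return (src, dst) links whose P2P bandwidth does not beat host reload.

    Per-byte cost is the inverse of bandwidth, so "reload costs strictly
    more per byte than P2P" means host_gbps < p2p_gbps for every link.
    """
    return sorted(link for link, bw in p2p_gbps.items() if bw <= host_gbps)


# Hypothetical measurements in GB/s.
host_gbps = 12.0
p2p_gbps = {(0, 1): 50.0, (0, 2): 10.0, (1, 2): 48.0}
print(links_violating_assumption(host_gbps, p2p_gbps))  # [(0, 2)]
```

An empty result suggests the planner's cost ordering holds on that node; any flagged link is a case where reloading from host may be the cheaper choice.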

When Not To Use

Multi-node clusters where inter-node links dominate and link reachability must be planned into migration.

Environments with no spare GPU memory to host pinned state.

Failure Modes

If the pinned KV cache exceeds the daemon's memory budget, evicted sessions must be replayed on demand, so those sessions see cold starts until their cache is rebuilt.

Planner ignores link reachability; scheduler must detect unreachable P2P paths or recovery will fall back to reloads.

Core Entities

Models

Qwen3-30B-A3B

Mixtral-8×22B

Qwen3-8B

Qwen3-14B

Metrics

Time to First Success (TFS)

Time to Peak (TTP)

reload time

P2P transfer time

total runtime / overhead

tokens/sec (throughput)

Datasets

ShareGPT replay (1,000 requests)

Benchmarks

single-node TP degradation: 4→2, 8→6, 8→7

end-to-end failure injections at 25% and 50% of request stream

Context Entities

Models

MoE

