Overview
Solid single-node results on realistic models. The planner is proven optimal with respect to reload bytes, under the assumption that the per-byte cost of reloading from host exceeds the per-byte cost of P2P transfer. Multi-node and heterogeneous links remain future work.
Citations: 0
Evidence Strength: 0.75
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 4/6
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 75%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
AnchorTP cuts recovery downtime by roughly 10× in the reported tests (from about 48 s to 4.5 s on Qwen3-30B-A3B and from about 196 s to 19 s on Mixtral-8×22B) and shortens the time to regain peak throughput. That improves user-facing latency SLOs, reduces required redundancy, and lowers cost compared to always-running replicas.
Who Should Care
Summary TLDR
AnchorTP speeds up recovery from single-GPU failures by keeping model weights and KV caches pinned in GPU memory and using elastic tensor sharding plus a planner that minimizes host reloads. In single-node tests it cuts time-to-first-success by ~10× and time-to-peak by up to 59% versus restart-and-reload.
Problem Statement
Tensor-parallel LLM inference stops or takes minutes to recover after a GPU or link failure because existing systems require full model reloads and assume fixed equal-width sharding. This causes high downtime and slow stabilization after failures.
Main Contribution
State-preserving daemon (state plane) that pins model parameters and KV caches in GPU memory so surviving shards can be reused without full reload.
Elastic Tensor Parallelism (ETP) that allows unequal-width shards and arbitrary TP sizes so remapping works with any surviving GPU count.
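As a minimal sketch of what unequal-width sharding can look like, the helper below splits a dimension of `width` columns across `n_gpus` contiguous shards whose sizes differ by at most one. The function name `shard_bounds` and the near-equal split policy are illustrative assumptions; this summary does not describe how AnchorTP actually chooses shard widths.

```python
def shard_bounds(width: int, n_gpus: int) -> list[tuple[int, int]]:
    """Split [0, width) into n_gpus contiguous half-open intervals.

    Hypothetical policy: the first (width % n_gpus) shards get one extra column.
    """
    if n_gpus <= 0:
        raise ValueError("n_gpus must be positive")
    base, extra = divmod(width, n_gpus)
    bounds = []
    start = 0
    for rank in range(n_gpus):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds
```

For instance, `shard_bounds(4096, 3)` produces shards of width 1366, 1365, and 1365, a layout that a fixed equal-width scheme cannot express.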
Key Findings
AnchorTP reduces time-to-first-success (TFS) by about 10× on evaluated models compared to restart-and-reload elastic TP.
CMM planner minimizes host reload volume and yields much lower reload times than naive approaches.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| TFS (Qwen3-30B-A3B) @25% failure | AnchorTP 4.48 ± 0.35 s; Elastic TP (restart-only) 48.43 ± 2.95 s | Elastic TP (restart-only) | ≈10.8× faster | ShareGPT replay, 25% injection | Table I in paper | Table I |
| TFS (Mixtral-8×22B) @25% failure | AnchorTP 18.71 ± 1.44 s; Elastic TP (restart-only) 195.82 ± 5.07 s | Elastic TP (restart-only) | ≈10.5× faster | ShareGPT replay, 25% injection | Table I in paper | Table I |
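The Delta column can be checked directly from the mean values in the table. The short sketch below recomputes the speedups; it uses only the reported means and ignores the ± spreads, and the names `speedup`, `rows`, and `deltas` are illustrative.

```python
def speedup(baseline_s: float, anchortp_s: float) -> float:
    """Ratio of baseline TFS to AnchorTP TFS (higher means faster recovery)."""
    return baseline_s / anchortp_s

# (baseline mean TFS in seconds, AnchorTP mean TFS in seconds), from Table I
rows = {
    "Qwen3-30B-A3B": (48.43, 4.48),
    "Mixtral-8x22B": (195.82, 18.71),
}

deltas = {model: round(speedup(b, a), 1) for model, (b, a) in rows.items()}
```

Both ratios land at the "about 10×" figure quoted in the summary.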
What To Try In 7 Days
Prototype a daemon that pins a small model's weights and KV cache in GPU memory and expose IPC handles to a simple serving process.
Implement a minimal interval-mapping planner that maps surviving blocks to new shards and measure bytes read from host versus P2P.
Run a single-node failure injection and track TFS/TTP to quantify improvement before investing in full ETP integration.
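The second step above can be started with a very small interval-overlap calculation. The sketch below is not the paper's CMM planner: it only classifies, for a given old and new shard layout, how many bytes each new shard could reuse locally, fetch over P2P from a surviving GPU, or would have to reload from host. The `Block` type, the `transfer_volumes` function, and the assumption that the old layout is a non-replicated partition of the column range are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    gpu: int    # GPU that owns (old layout) or will own (new layout) the block
    start: int  # inclusive column index
    end: int    # exclusive column index


def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of columns shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def transfer_volumes(
    old: list[Block], new: list[Block], failed: set[int], bytes_per_col: int
) -> dict[str, int]:
    """Classify the bytes each new shard needs by where they can come from."""
    totals = {"local": 0, "p2p": 0, "host": 0}
    for dst in new:
        for src in old:
            cols = overlap(src.start, src.end, dst.start, dst.end)
            if cols == 0:
                continue
            if src.gpu in failed:
                kind = "host"   # data lived only on a failed GPU
            elif src.gpu == dst.gpu:
                kind = "local"  # already pinned on the destination GPU
            else:
                kind = "p2p"    # pinned on another surviving GPU
            totals[kind] += cols * bytes_per_col
    return totals


# Toy example: 8 columns on 4 GPUs, GPU 3 fails, remap onto GPUs 0-2.
old_layout = [Block(0, 0, 2), Block(1, 2, 4), Block(2, 4, 6), Block(3, 6, 8)]
new_layout = [Block(0, 0, 3), Block(1, 3, 6), Block(2, 6, 8)]
volumes = transfer_volumes(old_layout, new_layout, failed={3}, bytes_per_col=1)
```

In this toy layout only the two columns that lived on the failed GPU come from host, whereas a restart-and-reload baseline would fetch all eight.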
Optimization Features
Infra Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Evaluation focuses on single-node multi-GPU setups; multi-node planning with cross-node reachability is not addressed.
The planner assumes that the per-byte cost of reloading from host is strictly higher than the per-byte cost of P2P transfer; this may not hold on platforms with slow P2P or fast host IO.
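Whether the planner's cost assumption holds on a given platform can be checked with measured bandwidths. The sketch below models per-byte cost as the reciprocal of sustained bandwidth, which is a simplifying assumption (it ignores latency and contention); the function names are hypothetical and the bandwidth figures in the usage note are placeholders, not measurements.

```python
def per_byte_cost(bandwidth_gb_s: float) -> float:
    """Seconds per byte for a link with the given sustained bandwidth in GB/s."""
    if bandwidth_gb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return 1.0 / (bandwidth_gb_s * 1e9)


def reload_dominates(host_gb_s: float, p2p_gb_s: float) -> bool:
    """True when host reload is strictly more expensive per byte than P2P."""
    return per_byte_cost(host_gb_s) > per_byte_cost(p2p_gb_s)
```

With placeholder values, `reload_dominates(5.0, 50.0)` is `True`, while a platform where the host path outruns P2P, such as `reload_dominates(50.0, 5.0)`, returns `False` and falls outside the assumption.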
When Not To Use
Multi-node clusters where inter-node links dominate and link reachability must be planned into migration.
Environments with no spare GPU memory to host pinned state.
Failure Modes
If the pinned KV cache exceeds the daemon's memory budget, eviction and on-demand replay can cause cold starts for the affected sessions and temporary cache misses.
Planner ignores link reachability; scheduler must detect unreachable P2P paths or recovery will fall back to reloads.
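To make the KV cache budget failure mode concrete, the sketch below tracks pinned sessions against a byte budget and evicts the least recently used ones when a new session does not fit. The class name `PinnedKVBudget` and the LRU policy are assumptions for illustration; this summary does not state which eviction policy the daemon uses.

```python
from collections import OrderedDict


class PinnedKVBudget:
    """Track pinned KV cache sizes per session under a fixed byte budget."""

    def __init__(self, budget_bytes: int):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self._sessions: OrderedDict[str, int] = OrderedDict()

    def touch(self, session_id: str, size_bytes: int) -> list[str]:
        """Pin or refresh a session; return the session ids evicted to make room."""
        if size_bytes > self.budget_bytes:
            raise ValueError("session larger than the whole budget")
        # Refreshing an existing session releases its old footprint first.
        if session_id in self._sessions:
            self.used_bytes -= self._sessions.pop(session_id)
        evicted = []
        while self.used_bytes + size_bytes > self.budget_bytes:
            victim, victim_size = self._sessions.popitem(last=False)  # oldest first
            self.used_bytes -= victim_size
            evicted.append(victim)
        self._sessions[session_id] = size_bytes
        self.used_bytes += size_bytes
        return evicted
```

Every id returned by `touch` corresponds to a session that would need replay on its next request, which is where the cold starts come from.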

