Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.75
Citation Count
0
Why It Matters For Business
AnchorTP cuts recovery downtime from tens of seconds or minutes to a few seconds and shortens time to regain peak throughput. That improves user-facing latency SLOs, reduces required redundancy, and lowers cost compared to always-running replicas.
Summary TLDR
AnchorTP speeds up recovery from single-GPU failures by keeping model weights and KV caches pinned in GPU memory and using elastic tensor sharding plus a planner that minimizes host reloads. In single-node tests it cuts time-to-first-success by ~10× and time-to-peak by up to 59% versus restart-and-reload.
Problem Statement
Tensor-parallel LLM inference stops or takes minutes to recover after a GPU or link failure because existing systems require full model reloads and assume fixed equal-width sharding. This causes high downtime and slow stabilization after failures.
Main Contribution
State-preserving daemon (state plane) that pins model parameters and KV caches in GPU memory so surviving shards can be reused without full reload.
Elastic Tensor Parallelism (ETP) that allows unequal-width shards and arbitrary TP sizes so remapping works with any surviving GPU count.
Continuous Minimal Migration (CMM) planner that minimizes bytes reloaded from host by mapping interval overlaps, plus a topology-aware scheduler that pipelines P2P transfers with host reloads.
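To make the interval-overlap idea concrete, here is a minimal sketch of the bookkeeping. It is not AnchorTP's planner. It assumes a single tensor dimension of `width` units, contiguous shards, and that survivors keep their original order; `split_even` and `classify_migration` are hypothetical names. It only classifies each unit of a new shard as already resident, fetched from a peer, or reloaded from host, and makes no attempt to optimize the assignment.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end) over one tensor dimension


def split_even(width: int, parts: int) -> List[Interval]:
    """Split [0, width) into contiguous shards whose widths differ by at most 1."""
    base, extra = divmod(width, parts)
    bounds, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def overlap(a: Interval, b: Interval) -> int:
    """Length of the intersection of two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def classify_migration(width: int, old_parts: int, failed: Set[int]) -> Dict[str, int]:
    """Count units that stay in place, move GPU-to-GPU, or reload from host.

    Assumption: survivor i (in original order) receives the i-th new shard.
    """
    old = split_even(width, old_parts)
    survivors = [g for g in range(old_parts) if g not in failed]
    new = split_even(width, len(survivors))
    totals = {"kept": 0, "p2p": 0, "host": 0}
    for gpu, target in zip(survivors, new):
        for src, src_interval in enumerate(old):
            n = overlap(target, src_interval)
            if n == 0:
                continue
            if src == gpu:
                totals["kept"] += n   # already resident on this GPU
            elif src in failed:
                totals["host"] += n   # owner is gone, must come from host
            else:
                totals["p2p"] += n    # fetch from a surviving peer
    return totals


# 8 -> 7 degradation on a 56-unit dimension, GPU 3 failed (illustrative sizes)
print(classify_migration(56, 8, {3}))
```

In this toy case most of the tensor never moves, which is the effect the state-preserving daemon and the planner rely on.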
Key Findings
AnchorTP reduces time-to-first-success (TFS) by about 10× on evaluated models compared to restart-and-reload elastic TP.
CMM planner minimizes host reload volume and yields much lower reload times than naive approaches.
Elastic recovery plus expert load balancing (EPLB) restores throughput and shortens time-to-peak.
Results
TFS (Qwen3-30B-A3B) @25% failure
TFS (Mixtral-8×22B) @25% failure
Planner reload time (8→7, Mixtral)
P2P transfer time (CM)
Throughput (Mixtral MoE) with/without EPLB
Total runtime overhead reduction
Who Should Care
What To Try In 7 Days
Prototype a daemon that pins a small model's weights and KV cache in GPU memory and expose IPC handles to a simple serving process.
Implement a minimal interval-mapping planner that maps surviving blocks to new shards and measure bytes read from host versus P2P.
Run a single-node failure injection and track TFS/TTP to quantify improvement before investing in full ETP integration.
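For the measurement step, a small helper along these lines could compute both metrics from a request log. The definitions are assumptions made for this sketch, not taken from the paper: TFS is the delay from failure injection to the first successful request completion, and TTP is the delay until windowed throughput first reaches a chosen fraction (`frac`, hypothetical) of the pre-failure peak.

```python
from typing import List, Optional, Tuple


def time_to_first_success(
    failure_t: float, completions: List[Tuple[float, bool]]
) -> Optional[float]:
    """completions holds (finish_time_s, succeeded) pairs in any order."""
    after = [t for t, ok in completions if ok and t >= failure_t]
    return min(after) - failure_t if after else None


def time_to_peak(
    failure_t: float,
    throughput: List[Tuple[float, float]],
    peak_tps: float,
    frac: float = 0.95,
) -> Optional[float]:
    """throughput holds (window_end_s, tokens_per_sec) pairs sorted by time."""
    for t, tps in throughput:
        if t >= failure_t and tps >= frac * peak_tps:
            return t - failure_t
    return None  # throughput never recovered within the trace


# Made-up trace: failure injected at t = 10 s
completions = [(9.0, True), (10.5, False), (12.0, True), (13.0, True)]
windows = [(10.0, 0.0), (12.0, 400.0), (14.0, 900.0), (16.0, 980.0)]
print(time_to_first_success(10.0, completions))
print(time_to_peak(10.0, windows, peak_tps=1000.0))
```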
Optimization Features
Infra Optimization
- Prefer intra-node high-bandwidth links (Infinity Fabric/xGMI) over PCIe
- Planner assumes a P2P transfer is cheaper per byte than a host reload
System Optimization
- Pre-allocate destination GPU buffers before migration
- LRU eviction for KV cache to retain hot sessions
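The LRU policy for pinned KV state can be sketched with an ordered map and a byte budget. This is an illustrative toy that only tracks sizes, not the daemon's actual data structure; `KVCacheLRU` and its methods are hypothetical names. It also assumes a single session larger than the budget is kept rather than evicted immediately.

```python
from collections import OrderedDict
from typing import List


class KVCacheLRU:
    """Tracks per-session KV cache sizes and evicts least recently used sessions."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget = budget_bytes
        self.used = 0
        self._entries: "OrderedDict[str, int]" = OrderedDict()  # oldest first

    def touch(self, session_id: str, size_bytes: int) -> List[str]:
        """Record an access (or growth) and return the sessions evicted by it."""
        if session_id in self._entries:
            self.used -= self._entries.pop(session_id)
        self._entries[session_id] = size_bytes  # re-insert as most recent
        self.used += size_bytes
        evicted = []
        # Never evict the session that was just touched.
        while self.used > self.budget and len(self._entries) > 1:
            victim, victim_size = self._entries.popitem(last=False)
            self.used -= victim_size
            evicted.append(victim)
        return evicted

    def sessions(self) -> List[str]:
        """Resident sessions, least recently used first."""
        return list(self._entries)


cache = KVCacheLRU(budget_bytes=100)
cache.touch("a", 40)
cache.touch("b", 40)
cache.touch("a", 40)         # "a" becomes the hottest session
print(cache.touch("c", 40))  # over budget, so the coldest session is evicted
```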
Inference Optimization
- Elastic Tensor Parallelism (unequal-width sharding)
- State-preserving daemon to avoid full model reloads
- Topology-aware execution scheduling (prioritize high-bandwidth P2P paths)
- Overlap host reloads with P2P transfers to hide latency
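A back-of-the-envelope model shows why overlapping the two transfer types helps. The sketch below assumes each lane (for example `host` and `p2p`) has a fixed bandwidth and that lanes do not contend with each other, which real hardware only approximates. The function name, lane names, and bandwidth figures are placeholders, not measurements.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def estimate_recovery_time(
    transfers: List[Tuple[str, float]],
    bandwidth_gb_per_s: Dict[str, float],
    overlapped: bool = True,
) -> float:
    """Estimate transfer time in seconds for (lane, gigabytes) transfers.

    Overlapped: lanes run concurrently, so the slowest lane sets the time.
    Serial: every lane waits for the previous one, so lane times add up.
    """
    per_lane: Dict[str, float] = defaultdict(float)
    for lane, gigabytes in transfers:
        per_lane[lane] += gigabytes / bandwidth_gb_per_s[lane]
    if not per_lane:
        return 0.0
    return max(per_lane.values()) if overlapped else sum(per_lane.values())


# Placeholder numbers: 8 GB from host, two 4 GB peer-to-peer copies
transfers = [("host", 8.0), ("p2p", 4.0), ("p2p", 4.0)]
bandwidth = {"host": 16.0, "p2p": 64.0}
print(estimate_recovery_time(transfers, bandwidth, overlapped=False))
print(estimate_recovery_time(transfers, bandwidth, overlapped=True))
```

Under these assumptions the host reload dominates, so the P2P copies are hidden entirely behind it once the two are pipelined.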
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation focuses on single-node multi-GPU setups; multi-node planning with cross-node reachability is not addressed.
- The planner assumes the per-byte cost of a host reload is strictly higher than that of a P2P transfer; this may not hold on platforms with slow P2P or fast host I/O.
- Requires spare VRAM to pin weights/KV; not suitable if GPUs operate at full memory capacity.
When Not To Use
- Multi-node clusters where inter-node links dominate and link reachability must be planned into migration.
- Environments with no spare GPU memory to host pinned state.
- Settings where P2P bandwidth is worse than host reload (invalidating the cost assumption).
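These three conditions could be checked before a rollout with a small preflight function. This is only a sketch: `NodeProfile`, its fields, and `blockers` are hypothetical, and the values would have to come from your own measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeProfile:
    nodes: int                   # number of nodes the TP group spans
    free_vram_gb: float          # spare GPU memory available for pinned state
    pinned_state_gb: float       # weights plus KV cache that must stay pinned
    p2p_gb_per_s: float          # measured GPU-to-GPU bandwidth
    host_reload_gb_per_s: float  # measured host-to-GPU reload bandwidth


def blockers(p: NodeProfile) -> List[str]:
    """Return the reasons, if any, that this setup falls outside the listed bounds."""
    reasons = []
    if p.nodes > 1:
        reasons.append("multi-node: link reachability is not planned")
    if p.free_vram_gb < p.pinned_state_gb:
        reasons.append("not enough spare VRAM for pinned state")
    if p.p2p_gb_per_s <= p.host_reload_gb_per_s:
        reasons.append("P2P is not faster than host reload")
    return reasons


print(blockers(NodeProfile(1, 20.0, 10.0, 64.0, 16.0)))
print(blockers(NodeProfile(2, 5.0, 10.0, 8.0, 16.0)))
```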
Failure Modes
- If the pinned KV cache exceeds the daemon's memory budget, eviction and on-demand replay can cause temporary cold starts for the affected sessions.
- The planner ignores link reachability, so the scheduler must detect unreachable P2P paths itself; otherwise recovery falls back to host reloads.
- Stale expert routing in MoE can cause temporary load imbalance before EPLB rebalancing.
Core Entities
Models
- Qwen3-30B-A3B
- Mixtral-8×22B
- Qwen3-8B
- Qwen3-14B
Metrics
- Time to First Success (TFS)
- Time to Peak (TTP)
- reload time
- P2P transfer time
- total runtime / overhead
- tokens/sec (throughput)
Datasets
- ShareGPT replay (1,000 requests)
Benchmarks
- single-node TP degradation: 4→2, 8→6, 8→7
- end-to-end failure injections at 25% and 50% of request stream
Context Entities
Models
- MoE

