Overview
Production Readiness
0.8
Novelty Score
0.45
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
ThunderServe lets you run more LLM replicas for the same cloud spend by mapping compute-heavy and memory-bound phases to different GPU types and by cutting KV transfer costs, so you can increase throughput and meet SLOs without relying exclusively on high-end GPUs.
Summary TLDR
ThunderServe is a serving system that combines phase splitting (separate prefill and decode replicas), heterogeneity-aware scheduling, lightweight rescheduling, and one-shot KV-cache compression to run LLMs across mixed cloud GPUs. In cloud tests (32 mixed GPUs priced ≈$13.54/hr), ThunderServe serves up to 12 replicas, delivers up to 2.1× higher throughput, and meets latency deadlines up to 2.5× tighter than state-of-the-art baselines on the evaluated workloads. Lightweight rescheduling adapts to failures or workload shifts in seconds instead of minutes. Packing the KV cache to 4 bits sharply cuts communication volume while keeping accuracy drops under 2% on the evaluated tasks.
Problem Statement
Cloud GPU pools are heterogeneous in compute, memory bandwidth and network links. Existing serving systems either assume homogeneous high-speed interconnects or only handle hardware heterogeneity, which leads to poor utilization, high KV-cache communication cost, and slow plan updates when workloads or resources change. The paper targets cost-efficient, high-throughput LLM serving on such real-world cloud clusters.
Main Contribution
Design of a two-level scheduler tuned for heterogeneous cloud GPUs and network bandwidths: an upper level that uses tabu search for GPU grouping and phase designation, and a lower level that derives parallel configurations and request orchestration.
Lightweight rescheduling that flips phase roles and re-orchestrates request routing in seconds without reloading model parameters.
Integration of phase splitting with KV-cache one-shot compression (quantize/pack to 4-bit for transfer, then dequantize) to cut inter-replica communication while preserving model quality.
A full implementation (Python/C++/CUDA, 20k LOC) and end-to-end evaluation on mixed cloud GPUs against HexGen, DistServe, and vLLM under the same price budget.
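The quantize, pack, and dequantize cycle named in the third contribution can be sketched in a few lines. This is a minimal illustration, not the paper's CUDA kernel: it assumes a single min/max scale for the whole buffer, and the function names `quantize_pack_4bit` and `unpack_dequantize_4bit` are hypothetical.

```python
def quantize_pack_4bit(values):
    """Quantize floats to 4-bit codes (0..15) and pack two codes per byte.

    Assumption: one min/max scale for the whole buffer; a real system
    would likely use per-group or per-channel scales.
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant input
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad so codes pair up evenly
    packed = bytes((codes[i] << 4) | codes[i + 1]
                   for i in range(0, len(codes), 2))
    return packed, lo, scale, len(values)


def unpack_dequantize_4bit(packed, lo, scale, n):
    """Unpack nibbles and map the codes back to floats on the receiver."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)     # high nibble
        codes.append(byte & 0x0F)   # low nibble
    return [lo + c * scale for c in codes[:n]]


kv_slice = [-1.0, -0.5, 0.0, 0.25, 1.0]
packed, lo, scale, n = quantize_pack_4bit(kv_slice)
restored = unpack_dequantize_4bit(packed, lo, scale, n)
```

Five values become three bytes on the wire, and each restored value differs from the original by at most half a quantization step (`scale / 2`).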
Key Findings
Throughput gains over heterogeneous-cloud baseline
End-to-end latency reductions under same price
KV-cache compression cuts communication share and keeps quality
Lightweight rescheduling is fast and effective
Scheduler converges quickly for practical cluster sizes
Cloud deployment can serve many more replicas than a single in-house A100 server for the same spend
Results
throughput (cloud vs HexGen)
throughput (cloud vs in-house DistServe/vLLM)
E2E latency deadline (cost-normalized)
Accuracy
scheduling runtime
rescheduling overhead
Who Should Care
What To Try In 7 Days
Profile your workload for average prompt and output length; split replicas into prefill vs decode based on bottleneck.
Run a small cluster with mixed GPUs and test 4-bit KV-cache packing to measure network savings and quality impact.
Integrate a lightweight phase-rescheduling step into deployment to recover quickly from node loss without reloading models.
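As a starting point for the first step above, the sketch below profiles a request trace and suggests how many replicas to dedicate to each phase. It is a rough heuristic under stated assumptions, not ThunderServe's scheduler: the function name `suggest_phase_split` and the per-replica throughput figures are hypothetical and should be replaced with numbers measured on your own GPUs.

```python
from statistics import mean


def suggest_phase_split(requests, n_replicas, prefill_tok_per_s, decode_tok_per_s):
    """Split replicas between prefill and decode in proportion to load.

    requests: list of (prompt_tokens, output_tokens) pairs from a trace.
    The two throughput arguments are per-replica estimates (assumed inputs).
    Requires n_replicas >= 2 so that each phase gets at least one replica.
    """
    avg_prompt = mean(p for p, _ in requests)
    avg_output = mean(o for _, o in requests)
    prefill_s = avg_prompt / prefill_tok_per_s   # time per request in prefill
    decode_s = avg_output / decode_tok_per_s     # time per request in decode
    share = prefill_s / (prefill_s + decode_s)
    n_prefill = min(n_replicas - 1, max(1, round(share * n_replicas)))
    return {
        "avg_prompt": avg_prompt,
        "avg_output": avg_output,
        "prefill": n_prefill,
        "decode": n_replicas - n_prefill,
    }


trace = [(1000, 100), (2000, 300)]
plan = suggest_phase_split(trace, n_replicas=6,
                           prefill_tok_per_s=6000, decode_tok_per_s=400)
```

With this toy trace, decode takes about twice as long per request as prefill, so the heuristic assigns two replicas to prefill and four to decode.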
Agent Features
Memory
- KV-cache packing on send and dequantization on receive
Planning
- hierarchical scheduling (tabu search + lower-level deduction)
- two-stage transportation orchestration (LP routing)
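To make the tabu-search idea concrete, here is a toy version that only decides phase designation for pre-formed GPU groups. It is a sketch under simplifying assumptions: each group is reduced to a hypothetical `(prefill_capacity, decode_capacity)` pair, the objective is the bottleneck capacity across the two phases, and the real scheduler's grouping step and lower-level deduction are omitted.

```python
from collections import deque


def score(assign, groups):
    """Bottleneck capacity: the slower of the two phases limits throughput."""
    prefill = sum(g[0] for g, a in zip(groups, assign) if a == "P")
    decode = sum(g[1] for g, a in zip(groups, assign) if a == "D")
    return min(prefill, decode)


def tabu_phase_search(groups, iters=50, tenure=2):
    """Flip one group's phase per step; recently flipped groups are tabu."""
    current = ["P"] * len(groups)          # naive initial solution
    best, best_score = current[:], score(current, groups)
    tabu = deque(maxlen=tenure)            # indices that may not be flipped
    for _ in range(iters):
        candidates = []
        for i in range(len(groups)):
            if i in tabu:
                continue
            neighbor = current[:]
            neighbor[i] = "D" if neighbor[i] == "P" else "P"
            candidates.append((score(neighbor, groups), i, neighbor))
        if not candidates:
            break
        # Take the best non-tabu move even if it is worse than the current
        # solution; this is what lets tabu search leave local optima.
        s, i, current = max(candidates, key=lambda c: c[0])
        tabu.append(i)
        if s > best_score:
            best, best_score = current[:], s
    return best, best_score


# Hypothetical groups: (prefill capacity, decode capacity) in requests/s.
groups = [(10, 2), (8, 3), (3, 9), (2, 7)]
best, best_score = tabu_phase_search(groups)
```

On this input the search assigns the two compute-strong groups to prefill and the other two to decode, which matches the brute-force optimum for these four groups.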
Tool Use
- NCCL for comm groups
- libp2p for decentralized task coordination
Collaboration
- peer-to-peer request dispatch among replicas
Optimization Features
Infra Optimization
- group GPUs by interconnect bandwidth
- limit cross-node tensor parallelism to reduce network pressure
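Grouping GPUs by interconnect bandwidth can be approximated with a union-find pass over measured links. The sketch below is an illustration only: the function name `group_by_bandwidth`, the link table, and the 50 Gbit/s threshold are assumptions, and the paper's own grouping is driven by its scheduler rather than by a fixed threshold.

```python
def group_by_bandwidth(gpus, bandwidth_gbps, threshold_gbps):
    """Merge GPUs connected by links at or above the threshold.

    bandwidth_gbps maps (gpu_a, gpu_b) pairs to measured bandwidth.
    Grouping is transitive: a-b and b-c fast links put a, b, c together.
    """
    parent = {g: g for g in gpus}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), bw in bandwidth_gbps.items():
        if bw >= threshold_gbps:
            parent[find(a)] = find(b)

    groups = {}
    for g in gpus:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in groups.values())


gpus = ["a0", "a1", "b0", "b1", "c0"]
links = {
    ("a0", "a1"): 300,  # same node, fast link (hypothetical figures)
    ("b0", "b1"): 64,   # same node, slower link
    ("a1", "b0"): 5,    # cross-node Ethernet
    ("b1", "c0"): 1,    # cross-region link
}
bw_groups = group_by_bandwidth(gpus, links, threshold_gbps=50)
```

Tensor parallelism would then be confined to GPUs within one group, while slower cross-group links carry only pipeline or KV-cache traffic.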
System Optimization
- heterogeneity-aware scheduling
- lightweight rescheduling (phase flips only)
- orchestration to minimize KV transfer cost
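The phase-flip idea behind lightweight rescheduling can be illustrated with a small hill-climbing loop: after a replica fails, surviving replicas keep their loaded weights and only their roles change. This is a simplified sketch, not the paper's algorithm; `reschedule_by_flipping` and the capacity numbers are hypothetical.

```python
def reschedule_by_flipping(roles, capacity, failed):
    """Rebalance phases after failures by flipping roles only.

    roles: replica name -> "prefill" or "decode".
    capacity: replica name -> (prefill_capacity, decode_capacity).
    No replica is re-grouped or reloaded; only roles change.
    """
    alive = {n: r for n, r in roles.items() if n not in failed}

    def bottleneck(assign):
        p = sum(capacity[n][0] for n, r in assign.items() if r == "prefill")
        d = sum(capacity[n][1] for n, r in assign.items() if r == "decode")
        return min(p, d)

    improved = True
    while improved:
        improved = False
        base = bottleneck(alive)
        for name in sorted(alive):
            trial = dict(alive)
            trial[name] = "decode" if alive[name] == "prefill" else "prefill"
            if bottleneck(trial) > base:   # accept strictly better flips only
                alive, improved = trial, True
                break
    return alive


roles = {"r0": "prefill", "r1": "prefill", "r2": "decode", "r3": "decode"}
capacity = {"r0": (10, 4), "r1": (10, 4), "r2": (3, 8), "r3": (3, 8)}
new_roles = reschedule_by_flipping(roles, capacity, failed={"r3"})
```

Losing decode replica `r3` leaves decode as the bottleneck, so the loop flips `r0` from prefill to decode and then stops because no further single flip helps.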
Inference Optimization
- phase splitting (separate prefill/decode replicas)
- KV-cache one-shot compression (quantize/pack to 4-bit then dequantize)
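A back-of-the-envelope calculation shows why compressing the KV cache matters when prefill and decode run on different replicas. The sketch assumes LLaMA-7B's published shape (32 layers, hidden size 4096, full multi-head attention), ignores quantization metadata such as scales, and uses a hypothetical 10 Gbit/s link; the helper names are illustrative.

```python
def kv_bytes_per_token(n_layers, hidden_size, bits_per_value):
    """Keys and values: 2 tensors per layer, hidden_size values each."""
    return 2 * n_layers * hidden_size * bits_per_value // 8


def transfer_seconds(prompt_tokens, bytes_per_token, link_gbit_per_s):
    """Time to ship one prompt's KV cache over a link of the given speed."""
    return prompt_tokens * bytes_per_token * 8 / (link_gbit_per_s * 1e9)


fp16_bytes = kv_bytes_per_token(32, 4096, 16)   # 524288 bytes per token
int4_bytes = kv_bytes_per_token(32, 4096, 4)    # 131072 bytes per token

# A 1024-token prompt over an assumed 10 Gbit/s link.
t_fp16 = transfer_seconds(1024, fp16_bytes, 10)
t_int4 = transfer_seconds(1024, int4_bytes, 10)
```

Under these assumptions a 1024-token prompt carries 512 MiB of FP16 KV cache, roughly 0.43 s on the wire, and 4-bit packing brings that down to about a quarter before metadata overhead.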
Reproducibility
Open Source Status
- no
Risks & Boundaries
Limitations
- Phase splitting relies on KV transfer; extremely low inter-instance bandwidth (e.g., cross-datacenter links) can erase its benefits.
- Lightweight rescheduling flips phase roles only and can be suboptimal versus full re-deployment in some cases.
- No public code or deployment scripts are provided in the paper, limiting immediate reproducibility.
When Not To Use
- When inter-node network bandwidth is consistently extremely low (KV transfer infeasible).
- When you must avoid any additional inter-replica communication (strict single-node co-location requirement).
- When you need an open-source, turnkey deployment and cannot reimplement the scheduler and task coordinator.
Failure Modes
- If the KV-transfer path chosen by orchestration suffers a sudden bandwidth drop, throughput and SLO attainment can degrade sharply.
- Quantization errors can corrupt outputs if the dequantize step is omitted or buggy; the paper relies on dequantizing immediately before compute.
- Tabu search may settle on a suboptimal grouping for very large or sparse resource mixes without good initialization.
Core Entities
Models
- LLaMA-7B
- LLaMA-13B
- LLaMA-30B
Metrics
- throughput (tokens/s)
- SLO attainment (%)
- time to first token (TTFT)
- time per output token (TPOT)
- end-to-end latency (E2E)
- perplexity (PPL)
- ROUGE-1/2/L
Datasets
- Azure Conversation (coding, conversation workloads)
- CoQA
- TruthfulQA
- GSM8K
- WikiText2
- PTB
- CBT
Benchmarks
- SLO attainment
- throughput
- TTFT
- TPOT
- E2E latency
- PPL
- ROUGE
Context Entities
Models
- GPT-4 (referenced)
- OPT
- Falcon
Datasets
- BurstGPT (workload dataset referenced)

