Overview
The system integrates known techniques (phase splitting, tabu search, KV quantization) into a practical scheduler for heterogeneous cloud GPUs, and backs its claims with real-cluster experiments and ablations.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 6/6
Reproducibility
Status: No open assets linked
Open source: No
At A Glance
Cost impact: 80%
Production readiness: 80%
Novelty: 45%
Why It Matters For Business
ThunderServe lets you run more LLM replicas for the same cloud spend by mapping compute-heavy and memory-bound phases to different GPU types and by cutting KV transfer costs. You can raise throughput and meet SLOs without relying solely on high-end GPUs.
Summary TLDR
ThunderServe is a serving system that combines phase splitting (separate prefill and decode replicas), heterogeneity-aware scheduling, lightweight rescheduling, and one-shot KV-cache compression to run LLMs across mixed cloud GPUs. In cloud tests (32 mixed GPUs priced at ≈$13.54/hr), ThunderServe serves up to 12 replicas and, on the evaluated workloads, delivers up to 2.1× higher throughput and meets latency deadlines up to 2.5× lower than state-of-the-art baselines. Lightweight rescheduling adapts to failures or workload shifts in seconds instead of minutes. 4-bit KV-cache packing sharply cuts communication volume while keeping accuracy drops under 2% on evaluated tasks.
Problem Statement
Cloud GPU pools are heterogeneous in compute, memory bandwidth and network links. Existing serving systems either assume homogeneous high-speed interconnects or only handle hardware heterogeneity, which leads to poor utilization, high KV-cache communication cost, and slow plan updates when workloads or resources change. The paper targets cost-efficient, high-throughput LLM serving on such real-world cloud clusters.
Main Contribution
A two-level scheduler tuned for heterogeneous cloud GPUs and network bandwidths: the upper level uses tabu search to group GPUs and designate each group's phase (prefill or decode); the lower level chooses the parallel configuration and orchestration.
Lightweight rescheduling that flips phase roles and re-orchestrates request routing in seconds, without reloading model parameters.
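The upper-level idea of searching over phase designations can be illustrated with a minimal tabu search sketch. This is not the paper's algorithm: the per-group `prefill` and `decode` rates, the `throughput` objective (the slower phase limits the pipeline), and the single-flip neighborhood are all assumptions made for illustration.

```python
from collections import deque

def throughput(roles, groups):
    """Toy objective (assumed): the pipeline is limited by its slower phase."""
    prefill = sum(g["prefill"] for g, r in zip(groups, roles) if r == "P")
    decode = sum(g["decode"] for g, r in zip(groups, roles) if r == "D")
    return min(prefill, decode)

def tabu_search(groups, iters=50, tenure=2):
    """Search phase designations by flipping one group's role per step."""
    current = tuple("P" for _ in groups)      # start with every group on prefill
    best, best_score = current, throughput(current, groups)
    tabu = deque(maxlen=tenure)               # indices flipped most recently
    for _ in range(iters):
        candidates = []
        for i in range(len(groups)):
            flipped = list(current)
            flipped[i] = "D" if flipped[i] == "P" else "P"
            cand = tuple(flipped)
            score = throughput(cand, groups)
            # Aspiration: a tabu move is still allowed if it beats the best so far.
            if i not in tabu or score > best_score:
                candidates.append((score, i, cand))
        if not candidates:
            break
        # Take the best admissible neighbor, even if it is worse than current.
        score, i, current = max(candidates)
        tabu.append(i)
        if score > best_score:
            best, best_score = current, score
    return best, best_score

# Hypothetical GPU groups with made-up requests/s for each phase.
groups = [
    {"name": "A", "prefill": 10.0, "decode": 4.0},
    {"name": "B", "prefill": 3.0, "decode": 8.0},
    {"name": "C", "prefill": 6.0, "decode": 6.0},
]
best_roles, best_score = tabu_search(groups)
print(best_roles, best_score)  # ('P', 'D', 'D') 10.0
```

The single-flip move used here is also the kind of change that lightweight rescheduling applies: a role flip changes routing but, per the summary above, does not require reloading model parameters.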
Key Findings
Throughput gains over heterogeneous-cloud baseline
End-to-end latency reductions under same price
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| throughput (cloud vs HexGen) | 1.3–1.5× higher depending on workload | HexGen (heterogeneous cloud baseline) | up to 1.5× (coding), 1.3× (conversation) | coding/conversation workloads (Azure Conversation) | §5.2; Figure 9 | Figure 9 |
| throughput (cloud vs in-house DistServe/vLLM) | 1.5–2.1× higher depending on workload | DistServe, vLLM (in-house A100) | 1.5× (coding), 2.1× (conversation) in some tests | coding/conversation | §5.2; Figure 9 | Figure 9 |
What To Try In 7 Days
Profile your workload for average prompt and output length; split replicas into prefill vs decode based on bottleneck.
Run a small cluster with mixed GPUs and test 4-bit KV-cache packing to measure network savings and quality impact.
Integrate a lightweight phase-rescheduling step into deployment to recover quickly from node loss without reloading models.
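For the 4-bit KV-cache packing experiment, a minimal pure-Python sketch of quantize, pack, unpack, and dequantize is shown here. The min-max scheme with one scale and offset per vector is an assumption for illustration, not necessarily the paper's quantizer; all function names are hypothetical.

```python
def quantize_4bit(values):
    """Map floats to 4-bit codes (0..15) with one scale and offset (assumed scheme)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0             # fall back to 1.0 for constant input
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def pack_nibbles(codes):
    """Pack two 4-bit codes per byte; pad odd-length input with a zero code."""
    if len(codes) % 2:
        codes = codes + [0]
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))

def unpack_nibbles(packed, n):
    """Recover the first n 4-bit codes from packed bytes."""
    codes = []
    for b in packed:
        codes.extend((b >> 4, b & 0x0F))
    return codes[:n]

def dequantize(codes, scale, lo):
    """Restore approximate floats; must run before any attention compute."""
    return [c * scale + lo for c in codes]

# Sender side: quantize and pack before the KV transfer.
kv_values = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, scale, lo = quantize_4bit(kv_values)
packed = pack_nibbles(codes)

# Receiver side: unpack and dequantize immediately.
restored = dequantize(unpack_nibbles(packed, len(kv_values)), scale, lo)
max_error = max(abs(a - b) for a, b in zip(kv_values, restored))
print(len(packed), max_error)
```

Comparing `max_error` against your quality tolerance, and `len(packed)` against the unquantized size, gives a first estimate of the network savings versus quality trade-off before testing on a real cluster.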
Agent Features
Memory
Planning
Tool Use
Collaboration
Optimization Features
Infra Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Phase splitting relies on KV transfer between prefill and decode replicas; extremely low inter-instance bandwidth (e.g., cross-datacenter links) can erase its benefits.
Lightweight rescheduling flips phase roles only and can be suboptimal versus full re-deployment in some cases.
When Not To Use
When inter-node network bandwidth is consistently extremely low (KV transfer infeasible).
When you must avoid any additional inter-replica communication (strict single-node co-location requirement).
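To judge whether a link is too slow for KV transfer, a back-of-the-envelope estimate can help. The sketch below uses the standard KV-cache size formula (keys plus values, per layer, per KV head); the model shape and the 10 Gbps link are hypothetical, and quantization metadata such as scales is ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_value=16):
    """Size of one request's KV cache; the factor 2 covers keys and values."""
    total_bits = 2 * num_layers * num_kv_heads * head_dim * num_tokens * bits_per_value
    return total_bits // 8

def transfer_seconds(num_bytes, bandwidth_gbps):
    """Ideal transfer time over a link of the given bandwidth (no protocol overhead)."""
    return num_bytes * 8 / (bandwidth_gbps * 1e9)

# Hypothetical model shape: 32 layers, 32 KV heads, head dim 128, 1024-token prompt.
fp16_bytes = kv_cache_bytes(32, 32, 128, 1024, bits_per_value=16)
int4_bytes = kv_cache_bytes(32, 32, 128, 1024, bits_per_value=4)

for label, size in (("16-bit", fp16_bytes), ("4-bit", int4_bytes)):
    print(label, size, "bytes,", round(transfer_seconds(size, 10), 3), "s at 10 Gbps")
```

If the estimated transfer time is a large fraction of your latency deadline even at 4 bits, phase splitting across that link is unlikely to pay off.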
Failure Modes
If the KV-transfer path chosen by the orchestrator suffers a sudden bandwidth drop, throughput and SLO attainment can degrade sharply.
Quantization can corrupt results if the dequantize step is omitted or buggy; the paper relies on dequantizing immediately before compute.

