ThunderServe: schedule and split LLM inference across diverse cloud GPUs to raise throughput and cut latency and cost

February 13, 2025 · 9 min

Overview

Decision Snapshot: Ready For Pilot

System integrates known techniques (phase splitting, tabu search, KV quantization) into a practical scheduler for cloud heterogeneity and backs claims with real-cluster experiments and ablations.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 45%

Authors

Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, Eiko Yoneki

Links

Abstract / PDF

Why It Matters For Business

ThunderServe lets you run more LLM replicas for the same cloud spend by mapping compute-heavy and memory-bound phases to different GPU types and by cutting KV transfer costs. This lets you increase throughput and meet SLOs without relying solely on high-end GPUs.

Who Should Care

Summary TLDR

ThunderServe is a serving system that combines phase splitting (separate prefill and decode replicas), heterogeneity-aware scheduling, lightweight re-scheduling, and one-shot KV-cache compression to run LLMs across mixed cloud GPUs. In cloud tests (32 mixed GPUs priced ≈$13.54/hr) ThunderServe serves up to 12 replicas and shows up to 2.1× higher throughput and up to 2.5× lower latency deadlines versus state-of-the-art baselines on evaluated workloads. Lightweight re-scheduling adapts to failures or workload shifts in seconds instead of minutes. KV-cache 4-bit packing cuts communication volume sharply while keeping accuracy drops under 2% on evaluated tasks.

Problem Statement

Cloud GPU pools are heterogeneous in compute, memory bandwidth and network links. Existing serving systems either assume homogeneous high-speed interconnects or only handle hardware heterogeneity, which leads to poor utilization, high KV-cache communication cost, and slow plan updates when workloads or resources change. The paper targets cost-efficient, high-throughput LLM serving on such real-world cloud clusters.

Main Contribution

A two-level scheduler tuned for heterogeneous cloud GPUs and network bandwidths: the upper level uses tabu search to group GPUs and designate each group's phase, and the lower level chooses the parallel configuration and orchestrates request routing.

Lightweight rescheduling that flips phase roles and re-orchestrates request routing in seconds without reloading model parameters.
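
The upper-level search can be illustrated with a minimal tabu search sketch. It covers only phase designation over already-formed GPU groups, not the grouping step itself. The `GROUPS` capacities, the `throughput` objective (the slower of the two phases limits the pipeline), and the `tenure` value are all hypothetical choices made for illustration and are not taken from the paper.

```python
# Hypothetical per-group capacities (requests/s) when a GPU group serves as a
# prefill replica or as a decode replica. Numbers are illustrative only.
GROUPS = {
    "A": {"prefill": 8.0, "decode": 3.0},
    "B": {"prefill": 7.0, "decode": 4.0},
    "C": {"prefill": 2.0, "decode": 6.0},
    "D": {"prefill": 3.0, "decode": 5.0},
}


def throughput(assignment):
    """Toy objective: every request needs both phases, so the pipeline is
    limited by whichever phase has less total capacity."""
    prefill = sum(GROUPS[g]["prefill"] for g, p in assignment.items() if p == "prefill")
    decode = sum(GROUPS[g]["decode"] for g, p in assignment.items() if p == "decode")
    return min(prefill, decode)


def flip(assignment, group):
    """Return a copy of the assignment with one group's phase flipped."""
    new = dict(assignment)
    new[group] = "decode" if new[group] == "prefill" else "prefill"
    return new


def tabu_search(initial, iterations=20, tenure=2):
    """Single-flip neighborhood search with a short-term tabu list."""
    current = dict(initial)
    best, best_score = dict(current), throughput(current)
    tabu = {}  # group -> last iteration at which flipping it is still forbidden
    for it in range(iterations):
        candidates = []
        for g in sorted(current):
            neighbor = flip(current, g)
            score = throughput(neighbor)
            # Skip tabu moves unless they beat the best score so far (aspiration).
            if tabu.get(g, -1) >= it and score <= best_score:
                continue
            candidates.append((score, g, neighbor))
        if not candidates:
            continue
        # Take the best admissible move even if it is worse than the current state.
        score, g, current = max(candidates, key=lambda c: c[0])
        tabu[g] = it + tenure
        if score > best_score:
            best, best_score = dict(current), score
    return best, best_score


start = {g: "prefill" for g in GROUPS}
plan, rate = tabu_search(start)
print(plan, rate)
```

In this toy instance the search moves the two decode-friendly groups to the decode phase. A real scheduler would replace `throughput` with a cost model built from profiled GPU and network measurements.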

Key Findings

Throughput gains over heterogeneous-cloud baseline

Numbers: up to 2.1×, average 1.7× (throughput) vs state-of-the-art on tested cloud setup

Practical Use: If you run LLM inference on mixed cloud GPUs, using ThunderServe's scheduler and phase split can materially raise tokens/sec for the same cloud spend; try grouping GPUs by phase (compute-heavy vs bandwidth-heavy).

Evidence Ref: Abstract; §5.2; Figure 9

End-to-end latency reductions under same price

Numbers: up to 2.5×, average 1.5× (E2E latency deadlines) vs baselines under same budget

Practical Use: For latency-sensitive services, spend the budget on heterogeneous cloud GPUs and use ThunderServe to meet tighter latency deadlines, rather than buying only high-end GPUs.

Evidence Ref: Abstract; §5.2; Figure 8

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| throughput (cloud vs HexGen) | 1.3–1.5× higher depending on workload | HexGen (heterogeneous cloud baseline) | up to 1.5× (coding), 1.3× (conversation) | coding/conversation workloads (Azure Conversation) | §5.2; Figure 9 | Figure 9 |
| throughput (cloud vs in-house DistServe/vLLM) | 1.5–2.1× higher depending on workload | DistServe, vLLM (in-house A100) | 1.5× (coding), 2.1× (conversation) in some tests | coding/conversation | §5.2; Figure 9 | Figure 9 |

What To Try In 7 Days

Profile your workload for average prompt and output length; split replicas into prefill vs decode based on bottleneck.

Run a small cluster with mixed GPUs and test 4-bit KV-cache packing to measure network savings and quality impact.

Integrate a lightweight phase-rescheduling step into deployment to recover quickly from node loss without reloading models.
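
For the rescheduling step in the last item, a minimal sketch of the idea follows: after a replica is lost, the survivors keep their loaded weights and only their phase roles are reconsidered. The replica names, capacity numbers, the `pipeline_rate` objective, and the greedy one-flip-at-a-time loop are assumptions made for illustration; the paper's actual procedure may differ.

```python
def pipeline_rate(replicas):
    """Toy objective: min of total prefill capacity and total decode capacity."""
    prefill = sum(r["prefill"] for r in replicas.values() if r["phase"] == "prefill")
    decode = sum(r["decode"] for r in replicas.values() if r["phase"] == "decode")
    return min(prefill, decode)


def reschedule_after_loss(replicas, failed):
    """Drop the failed replica, then greedily flip phase roles while the toy
    rate improves. Only the 'phase' field changes, mirroring the idea that no
    model parameters are reloaded."""
    survivors = {n: dict(r) for n, r in replicas.items() if n != failed}
    while True:
        best_name, best_rate = None, pipeline_rate(survivors)
        for name in sorted(survivors):
            trial = {n: dict(r) for n, r in survivors.items()}
            old = trial[name]["phase"]
            trial[name]["phase"] = "decode" if old == "prefill" else "prefill"
            rate = pipeline_rate(trial)
            if rate > best_rate:
                best_name, best_rate = name, rate
        if best_name is None:
            return survivors  # no single flip helps any more
        old = survivors[best_name]["phase"]
        survivors[best_name]["phase"] = "decode" if old == "prefill" else "prefill"


# Hypothetical cluster: capacities in requests/s for each possible role.
cluster = {
    "P1": {"phase": "prefill", "prefill": 8, "decode": 3},
    "P2": {"phase": "prefill", "prefill": 7, "decode": 4},
    "D1": {"phase": "decode", "prefill": 2, "decode": 6},
    "D2": {"phase": "decode", "prefill": 3, "decode": 5},
}
after = reschedule_after_loss(cluster, failed="D1")
print({n: r["phase"] for n, r in after.items()}, pipeline_rate(after))
```

In this example losing `D1` leaves decode as the bottleneck, so one prefill replica is flipped to decode. As the Limitations section notes, a flip-only plan can still be worse than a full re-deployment.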

Agent Features

Memory
KV cache packing and dequantize on receive
Planning
hierarchical scheduling (tabu search + lower-level deduction); two-stage transportation orchestration (LP routing)
Tool Use
NCCL for comm groups; libP2P for decentralized task coordination
Collaboration
peer-to-peer request dispatch among replicas
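
The orchestration item above treats routing between prefill and decode replicas as a transportation problem. The sketch below is a greedy cheapest-link-first heuristic standing in for the LP; it is not the paper's formulation and is not optimal in general. The replica names, rates, and per-request KV transfer costs are hypothetical.

```python
def greedy_route(supply, capacity, cost):
    """Assign request flow from prefill replicas to decode replicas.

    supply:   {prefill_id: requests/s produced}
    capacity: {decode_id: requests/s it can absorb}
    cost:     {(prefill_id, decode_id): KV transfer cost per request}
    Returns {(prefill_id, decode_id): routed requests/s}.
    """
    remaining_supply = dict(supply)
    remaining_capacity = dict(capacity)
    flows = {}
    # Visit links from cheapest to most expensive and fill each as far as possible.
    for (p, d), _ in sorted(cost.items(), key=lambda item: item[1]):
        amount = min(remaining_supply[p], remaining_capacity[d])
        if amount > 0:
            flows[(p, d)] = amount
            remaining_supply[p] -= amount
            remaining_capacity[d] -= amount
    return flows


# Hypothetical example: two prefill replicas, two decode replicas.
supply = {"p0": 4, "p1": 2}
capacity = {"d0": 3, "d1": 3}
cost = {
    ("p0", "d0"): 1,  # fast link, cheap KV transfer
    ("p0", "d1"): 4,  # slow link
    ("p1", "d0"): 2,
    ("p1", "d1"): 3,
}
flows = greedy_route(supply, capacity, cost)
total_cost = sum(cost[link] * amount for link, amount in flows.items())
print(flows, total_cost)
```

An LP solver would minimize the same total cost subject to the same supply and capacity constraints, and would also handle cases where the greedy order leads to a worse plan.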

Optimization Features

Infra Optimization
group GPUs by interconnect bandwidth; limit cross-node tensor parallelism to reduce network pressure
System Optimization
heterogeneity-aware scheduling; lightweight rescheduling (phase flips only); orchestration to minimize KV transfer cost
Inference Optimization
phase splitting (separate prefill/decode replicas); KV-cache one-shot compression (quantize/pack to 4-bit then dequantize)
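
The quantize, pack, then dequantize round trip named above can be sketched in pure Python. This is a minimal illustration assuming simple min/max quantization over a flat list of floats; the function names and the quantization scheme are assumptions, and a real system would operate on GPU tensors with per-group scales.

```python
def quantize_pack_4bit(values):
    """Quantize floats to 4-bit codes (0..15) and pack two codes per byte.

    Returns (packed_bytes, lo, scale, count). The receiver needs lo, scale,
    and count to undo the transform.
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 when all values are equal
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad to an even number of 4-bit codes
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, lo, scale, len(values)


def unpack_dequantize_4bit(packed, lo, scale, count):
    """Inverse step run on receive: unpack the 4-bit codes and map to floats."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)
        codes.append(byte & 0x0F)
    return [lo + c * scale for c in codes[:count]]


# Hypothetical slice of a KV cache.
kv = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0, 2.0]
packed, lo, scale, count = quantize_pack_4bit(kv)
restored = unpack_dequantize_4bit(packed, lo, scale, count)
max_error = max(abs(a - b) for a, b in zip(kv, restored))
print(len(packed), "bytes sent, max abs error", max_error)
```

Here 7 values travel as 4 bytes instead of 14 bytes at 16-bit precision, and the round-trip error is bounded by half a quantization step. The Failure Modes section points out why the dequantize step must always run before compute.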

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Phase splitting relies on KV transfer; extremely low inter-instance bandwidth (e.g., cross-datacenter links) can break benefits.

Lightweight rescheduling flips phase roles only and can be suboptimal versus full re-deployment in some cases.

When Not To Use

When inter-node network bandwidth is consistently extremely low (KV transfer infeasible).

When you must avoid any additional inter-replica communication (strict single-node co-location requirement).

Failure Modes

If KV-transfer path chosen by orchestration experiences sudden bandwidth drop, throughput and SLO attainment can degrade sharply.

Quantization can corrupt outputs if the dequantize step is omitted or buggy; the paper relies on dequantizing immediately before compute.

Core Entities

Models

LLaMA-7B, LLaMA-13B, LLaMA-30B

Metrics

throughput (tokens/s), SLO attainment (%), time to first token (TTFT), time per output token (TPOT), end-to-end latency (E2E), perplexity (PPL), ROUGE-1/2/L

Datasets

Azure Conversation (coding, conversation workloads); CoQA; TruthfulQA; GSM8K; WikiText2; PTB; CBT

Benchmarks

SLO attainment, throughput, TTFT, TPOT, E2E latency, PPL, ROUGE

Context Entities

Models

GPT-4 (referenced), OPT, Falcon

Datasets

BurstGPT (workload dataset referenced)