ThunderServe: schedule and split LLM inference across diverse cloud GPUs to raise throughput and cut latency and cost

Overview

Decision Snapshot: Ready For Pilot

System integrates known techniques (phase splitting, tabu search, KV quantization) into a practical scheduler for cloud heterogeneity and backs claims with real-cluster experiments and ablations.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 45%

Authors

Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, Eiko Yoneki

Why It Matters For Business

ThunderServe lets you run more LLM replicas for the same cloud spend by mapping compute-heavy and memory-bound phases to different GPU types and by cutting KV transfer costs—so you can increase throughput and meet SLOs without buying only high-end GPUs.

Who Should Care

Product Manager, CTO, Engineering Lead, ML Engineer

Summary TLDR

ThunderServe is a serving system that combines phase splitting (separate prefill and decode replicas), heterogeneity-aware scheduling, lightweight re-scheduling, and one-shot KV-cache compression to run LLMs across mixed cloud GPUs. In cloud tests (32 mixed GPUs priced ≈$13.54/hr) ThunderServe serves up to 12 replicas and shows up to 2.1× higher throughput and up to 2.5× lower latency deadlines versus state-of-the-art baselines on evaluated workloads. Lightweight re-scheduling adapts to failures or workload shifts in seconds instead of minutes. KV-cache 4-bit packing cuts communication volume sharply while keeping accuracy drops under 2% on evaluated tasks.

Problem Statement

Cloud GPU pools are heterogeneous in compute, memory bandwidth and network links. Existing serving systems either assume homogeneous high-speed interconnects or only handle hardware heterogeneity, which leads to poor utilization, high KV-cache communication cost, and slow plan updates when workloads or resources change. The paper targets cost-efficient, high-throughput LLM serving on such real-world cloud clusters.

Main Contribution

Design of a two-level scheduler (tabu search upper-level for grouping/phase designation; lower-level for parallel config and orchestration) tuned for heterogeneous cloud GPUs and network bandwidths.

Lightweight rescheduling that flips phase roles and re-orchestrates request routing in seconds without reloading model parameters.
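As a rough illustration of the upper-level search, the sketch below runs a tabu search over phase designations only. It is not the paper's implementation: the toy objective (`plan_throughput`), the per-group rates, and every function name are assumptions made for illustration, and GPU grouping and the lower-level parallel configuration are left out.

```python
from collections import deque


def plan_throughput(phases, prefill_rate, decode_rate):
    """Toy objective (assumption): the plan is limited by its slower phase."""
    p = sum(r for ph, r in zip(phases, prefill_rate) if ph == "prefill")
    d = sum(r for ph, r in zip(phases, decode_rate) if ph == "decode")
    return min(p, d)


def tabu_search_phases(prefill_rate, decode_rate, iters=50, tabu_len=2):
    """Search over phase designations by flipping one group per step."""
    n = len(prefill_rate)
    current = ["prefill"] * n
    best = list(current)
    best_score = plan_throughput(current, prefill_rate, decode_rate)
    tabu = deque(maxlen=tabu_len)  # recently flipped groups may not flip again yet

    for _ in range(iters):
        candidates = []
        for i in range(n):
            if i in tabu:
                continue
            neighbor = list(current)
            neighbor[i] = "decode" if neighbor[i] == "prefill" else "prefill"
            score = plan_throughput(neighbor, prefill_rate, decode_rate)
            candidates.append((score, i, neighbor))
        if not candidates:
            break
        # Take the best non-tabu move even if it is worse than the current
        # plan; this is what lets tabu search leave local optima.
        score, i, current = max(candidates, key=lambda c: (c[0], -c[1]))
        tabu.append(i)
        if score > best_score:
            best, best_score = list(current), score
    return best, best_score


# Hypothetical rates: groups 0-1 are compute-strong, groups 2-3 bandwidth-strong.
plan, score = tabu_search_phases([10, 10, 4, 4], [3, 3, 8, 8])
```

With these made-up rates the search assigns the compute-strong groups to prefill and the bandwidth-strong groups to decode.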

Key Findings

Throughput gains over heterogeneous-cloud baseline

Numbers: up to 2.1×, average 1.7× (throughput) vs state-of-the-art on tested cloud setup

Practical Use: If you run LLM inference on mixed cloud GPUs, using ThunderServe's scheduler and phase split can materially raise tokens/sec for the same cloud spend; try grouping GPUs by phase (compute-heavy vs bandwidth-heavy).

Evidence Ref: Abstract; §5.2; Figure 9

End-to-end latency reductions under same price

Numbers: up to 2.5×, average 1.5× (E2E latency deadlines) vs baselines under same budget

Practical Use: For latency-sensitive services, spend the budget on heterogeneous cloud GPUs and use ThunderServe to meet tighter latency SLOs, rather than buying only high-end GPUs.

Evidence Ref: Abstract; §5.2; Figure 8

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
throughput (cloud vs HexGen)	1.3–1.5× higher depending on workload	HexGen (heterogeneous cloud baseline)	up to 1.5× (coding), 1.3× (conversation)	coding/conversation workloads (Azure Conversation)	§5.2; Figure 9	Figure 9
throughput (cloud vs in-house DistServe/vLLM)	1.5–2.1× higher depending on workload	DistServe, vLLM (in-house A100)	1.5× (coding), 2.1× (conversation) in some tests	coding/conversation	§5.2; Figure 9	Figure 9

What To Try In 7 Days

Profile your workload for average prompt and output length; split replicas into prefill vs decode based on bottleneck.

Run a small cluster with mixed GPUs and test 4-bit KV-cache packing to measure network savings and quality impact.

Integrate a lightweight phase-rescheduling step into deployment to recover quickly from node loss without reloading models.
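For the first step above, a back-of-the-envelope split can be computed from a request log. The sketch below is only a starting heuristic under stated assumptions: the function name, the per-replica token rates, and the proportional-time rule are all hypothetical and are not taken from the paper, whose scheduler searches over plans instead.

```python
def split_replicas(requests, total_replicas, prefill_tok_per_s, decode_tok_per_s):
    """Split replicas between phases in proportion to time spent per request.

    requests: list of (prompt_tokens, output_tokens) pairs from a workload log.
    prefill_tok_per_s / decode_tok_per_s: measured per-replica rates (assumed
    to be known from a quick benchmark on your own hardware).
    """
    n = len(requests)
    avg_prompt = sum(p for p, _ in requests) / n
    avg_output = sum(o for _, o in requests) / n

    prefill_time = avg_prompt / prefill_tok_per_s
    decode_time = avg_output / decode_tok_per_s
    prefill_share = prefill_time / (prefill_time + decode_time)

    # Keep at least one replica in each phase.
    prefill_replicas = min(total_replicas - 1,
                           max(1, round(total_replicas * prefill_share)))
    return avg_prompt, avg_output, prefill_replicas, total_replicas - prefill_replicas


# Hypothetical log and rates, for illustration only.
log = [(1000, 100), (3000, 300)]
print(split_replicas(log, total_replicas=10,
                     prefill_tok_per_s=4000, decode_tok_per_s=100))
```

A decode-heavy result like this one suggests most replicas should serve the decode phase; rerun the estimate whenever the prompt or output length distribution shifts.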

Agent Features

Memory

KV cache packing and dequantize on receive
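The pack-then-dequantize idea can be sketched in plain Python. This is a minimal illustration assuming uniform min/max quantization over one group of values; the paper's actual quantization scheme, group size, and function names are not specified here, so everything below is hypothetical.

```python
def quantize_pack_4bit(values):
    """Quantize floats to 4-bit codes (0..15) and pack two codes per byte."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # avoid a zero scale for constant input
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad to an even count so codes pair up into bytes
    packed = bytes((codes[i] << 4) | codes[i + 1]
                   for i in range(0, len(codes), 2))
    return packed, lo, scale, len(values)


def unpack_dequantize_4bit(packed, lo, scale, n):
    """Receiver side: unpack the nibbles and map codes back to floats."""
    codes = []
    for b in packed:
        codes.append(b >> 4)
        codes.append(b & 0x0F)
    return [lo + c * scale for c in codes[:n]]


kv_slice = [-1.0, -0.5, 0.0, 0.25, 1.0]  # stand-in for a slice of KV cache
packed, lo, scale, n = quantize_pack_4bit(kv_slice)
restored = unpack_dequantize_4bit(packed, lo, scale, n)
```

Packing two 4-bit codes per byte shrinks a 16-bit KV tensor to roughly a quarter of its size on the wire, plus the small `lo` and `scale` metadata per group. The reconstruction error of each value is bounded by half the quantization step.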

Planning

hierarchical scheduling (tabu search + lower-level deduction)

two-stage transportation orchestration (LP routing)
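Routing requests from prefill to decode replicas can be viewed as a transportation problem. The sketch below swaps the LP for the classical least-cost greedy heuristic so that it runs with the standard library alone; it yields a feasible routing, not necessarily the optimal one an LP solver would return. The supplies, capacities, and costs are hypothetical placeholders for per-pair KV transfer cost.

```python
def least_cost_routing(supply, capacity, cost):
    """Greedy least-cost method for a transportation problem.

    supply[i]:   request rate produced by prefill replica i
    capacity[j]: request rate decode replica j can absorb
    cost[i][j]:  assumed KV transfer cost from prefill i to decode j
    Returns flow[i][j], the rate routed from i to j.
    """
    supply, capacity = list(supply), list(capacity)
    flow = [[0.0] * len(capacity) for _ in supply]
    # Visit links from cheapest to most expensive.
    edges = sorted((cost[i][j], i, j)
                   for i in range(len(supply))
                   for j in range(len(capacity)))
    for _, i, j in edges:
        moved = min(supply[i], capacity[j])
        if moved > 0:
            flow[i][j] = moved
            supply[i] -= moved
            capacity[j] -= moved
    return flow


# Two prefill replicas, two decode replicas, made-up transfer costs.
flow = least_cost_routing(supply=[4, 6], capacity=[5, 5],
                          cost=[[1, 3], [2, 1]])
```

Here most traffic stays on the two cheap links and only the leftover rate from the second prefill replica crosses the more expensive one.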

Tool Use

NCCL for comm groups

libP2P for decentralized task coordination

Collaboration

peer-to-peer request dispatch among replicas

Optimization Features

Infra Optimization

group GPUs by interconnect bandwidth

limit cross-node tensor parallelism to reduce network pressure

System Optimization

heterogeneity-aware scheduling

lightweight rescheduling (phase flips only)

orchestration to minimize KV transfer cost

Inference Optimization

phase splitting (separate prefill/decode replicas)

KV-cache one-shot compression (quantize/pack to 4-bit then dequantize)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: No

License: Unknown

Risks & Boundaries

Limitations

Phase splitting relies on KV transfer; extremely low inter-instance bandwidth (e.g., cross-datacenter links) can break benefits.

Lightweight rescheduling flips phase roles only and can be suboptimal versus full re-deployment in some cases.
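The flip-only trade-off can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not the paper's algorithm: each replica has hypothetical prefill and decode rates, throughput is the slower phase's total rate, and rescheduling greedily flips one replica's phase at a time. No model weights move, which is why such a step can be fast, but the result can fall short of a full re-deployment.

```python
def throughput(replicas):
    """Toy objective (assumption): limited by the slower phase."""
    p = sum(r["prefill_rate"] for r in replicas if r["phase"] == "prefill")
    d = sum(r["decode_rate"] for r in replicas if r["phase"] == "decode")
    return min(p, d)


def flip(phase):
    return "decode" if phase == "prefill" else "prefill"


def reschedule_by_flips(replicas):
    """Greedily flip single replicas while throughput keeps improving."""
    while True:
        base = throughput(replicas)
        best_gain, best_replica = 0, None
        for r in replicas:
            r["phase"] = flip(r["phase"])      # try the flip
            gain = throughput(replicas) - base
            r["phase"] = flip(r["phase"])      # undo it
            if gain > best_gain:
                best_gain, best_replica = gain, r
        if best_replica is None:
            return replicas
        best_replica["phase"] = flip(best_replica["phase"])


# Hypothetical cluster: replica D has just failed and is already removed.
cluster = [
    {"name": "A", "phase": "prefill", "prefill_rate": 10, "decode_rate": 6},
    {"name": "B", "phase": "prefill", "prefill_rate": 10, "decode_rate": 6},
    {"name": "C", "phase": "decode", "prefill_rate": 5, "decode_rate": 8},
]
before = throughput(cluster)
after = throughput(reschedule_by_flips(cluster))
```

In this example the flip recovers part of the lost throughput by moving one prefill replica to decode; a full re-deployment that also regroups GPUs might do better at the cost of reloading parameters.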

When Not To Use

When inter-node network bandwidth is consistently extremely low (KV transfer infeasible).

When you must avoid any additional inter-replica communication (strict single-node co-location requirement).

Failure Modes

If KV-transfer path chosen by orchestration experiences sudden bandwidth drop, throughput and SLO attainment can degrade sharply.

Quantization-induced issues if dequantize step is omitted or buggy; paper relies on immediate dequantize before compute.

Core Entities

Models

LLaMA-7B, LLaMA-13B, LLaMA-30B

Metrics

throughput (tokens/s), SLO attainment (%), time to first token (TTFT), time per output token (TPOT), end-to-end latency (E2E), perplexity (PPL), ROUGE-1/2/L

Datasets

Azure Conversation (coding, conversation workloads), CoQA, TruthfulQA, GSM8K, WikiText2, PTB, CBT

Benchmarks

SLO attainment, throughput, TTFT, TPOT, E2E latency, PPL, ROUGE

Context Entities

Models

GPT-4 (referenced), OPT, Falcon

Datasets

BurstGPT (workload dataset referenced)

