ThunderServe: schedule and split LLM inference across diverse cloud GPUs to raise throughput and cut latency and cost

February 13, 2025 · 9 min

Overview

Production Readiness

0.8

Novelty Score

0.45

Cost Impact Score

0.8

Citation Count

0

Authors

Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, Eiko Yoneki

Why It Matters For Business

ThunderServe lets you run more LLM replicas for the same cloud spend by mapping compute-heavy and memory-bound phases to different GPU types and by cutting KV transfer costs, so you can increase throughput and meet SLOs without relying exclusively on high-end GPUs.

Summary TLDR

ThunderServe is a serving system that combines phase splitting (separate prefill and decode replicas), heterogeneity-aware scheduling, lightweight rescheduling, and one-shot KV-cache compression to run LLMs across mixed cloud GPUs. In cloud tests (32 mixed GPUs priced ≈$13.54/hr) ThunderServe serves up to 12 replicas and shows up to 2.1× higher throughput and up to 2.5× lower latency deadlines versus state-of-the-art baselines on the evaluated workloads. Lightweight rescheduling adapts to failures or workload shifts in seconds instead of minutes. Packing the KV cache to 4 bits for transfer cuts the KV communication share of end-to-end cost from 16–30% to 4–9% while keeping accuracy drops under 2% on the evaluated tasks.

Problem Statement

Cloud GPU pools are heterogeneous in compute, memory bandwidth, and network links. Existing serving systems either assume homogeneous high-speed interconnects or only handle hardware heterogeneity, which leads to poor utilization, high KV-cache communication cost, and slow plan updates when workloads or resources change. The paper targets cost-efficient, high-throughput LLM serving on such real-world cloud clusters.

Main Contribution

Design of a two-level scheduler tuned for heterogeneous cloud GPUs and network bandwidths: an upper level that uses tabu search for GPU grouping and phase designation, and a lower level that determines the parallel configuration and orchestration.

Lightweight rescheduling that flips phase roles and re-orchestrates request routing in seconds without reloading model parameters.

Integration of phase splitting with KV-cache one-shot compression (quantize/pack to 4-bit for transfer, then dequantize) to cut inter-replica communication while preserving model quality.

A full implementation (Python/C++/CUDA, 20k LOC) and an end-to-end evaluation on mixed cloud GPUs against HexGen, DistServe, and vLLM under the same price budgets.

Key Findings

Throughput gains over heterogeneous-cloud baseline

Numbers: up to 2.1×, average 1.7× (throughput) vs state-of-the-art on tested cloud setup

End-to-end latency reductions under same price

Numbers: up to 2.5×, average 1.5× (E2E latency deadlines) vs baselines under same budget

KV-cache compression cuts communication share and keeps quality

Numbers: KV comm reduced from 16–30% to 4–9% of E2E cost; accuracy drop <2% on evaluated tasks
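A back-of-the-envelope calculation helps show where the saving comes from. The sketch below assumes the public LLaMA-7B shape (32 layers, hidden size 4096) and an illustrative 1024-token prompt, and it ignores the scale and offset metadata that a real quantizer would send alongside the packed values. The function name `kv_bytes_per_token` is hypothetical, not taken from the paper.

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, bits: int) -> int:
    """KV-cache bytes for one token: one key and one value vector per layer."""
    return 2 * num_layers * hidden_size * bits // 8


# Assumed model shape (LLaMA-7B: 32 layers, hidden size 4096); not from the paper.
LAYERS, HIDDEN = 32, 4096
PROMPT_TOKENS = 1024  # illustrative prompt length

fp16_bytes = kv_bytes_per_token(LAYERS, HIDDEN, 16) * PROMPT_TOKENS
int4_bytes = kv_bytes_per_token(LAYERS, HIDDEN, 4) * PROMPT_TOKENS

print(f"16-bit KV payload: {fp16_bytes / 2**20:.0f} MiB")
print(f" 4-bit KV payload: {int4_bytes / 2**20:.0f} MiB")
print(f"reduction: {fp16_bytes / int4_bytes:.1f}x")
```

Under these assumptions the payload shrinks from 512 MiB to 128 MiB per request. The reported share of E2E cost depends on everything else in the request path as well, so the 4× payload reduction should be read as an upper bound on the transfer saving rather than a prediction of the measured share.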

Lightweight rescheduling is fast and effective

Numbers: lightweight rescheduling ≈13s vs full rescheduling ≈157s total (includes reload); achieves similar SLO attainment

Scheduler converges quickly for practical cluster sizes

Numbers: convergence times ≈21s (16 GPUs), 36s (24 GPUs), 54s (32 GPUs)

Cloud deployment can serve many more replicas than a single in-house A100 server given same spend

Numbers: cloud setup serves up to 12 replicas vs 4 replicas on 8×A100 in-house under similar price

Results

throughput (cloud vs HexGen)

Value: 1.3–1.5× higher depending on workload

Baseline: HexGen (heterogeneous cloud baseline)

throughput (cloud vs in-house DistServe/vLLM)

Value: 1.5–2.1× higher depending on workload

Baseline: DistServe, vLLM (in-house A100)

E2E latency deadline (cost-normalized)

Value: up to 2.5× lower, avg 1.5–1.8× lower

Baseline: state-of-the-art systems under same price budget

Accuracy

Value: KV comm share reduced to 4–9% of E2E cost with 4-bit; accuracy drop <2%

Baseline: 16-bit KV transfer

scheduling runtime

Value: ≈21s (16 GPUs), 36s (24 GPUs), 54s (32 GPUs)

Baseline: N/A

rescheduling overhead

Value: lightweight rescheduling ≈13s total; full ≈157s (includes reload)

Baseline: full rescheduling with reload

Who Should Care

What To Try In 7 Days

Profile your workload for average prompt and output length; split replicas into prefill vs decode based on bottleneck.
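One simple way to turn that profile into a starting split is to balance the time a request spends in each phase. The sketch below is a heuristic of our own, not the paper's scheduler: it assumes identical replicas and takes `prefill_tps` and `decode_tps` (tokens per second per replica in each phase) as hypothetical measurements you collect yourself.

```python
from statistics import mean
from typing import Sequence


def prefill_fraction(
    prompt_lens: Sequence[int],
    output_lens: Sequence[int],
    prefill_tps: float,
    decode_tps: float,
) -> float:
    """Fraction of replicas to give to prefill so both phases keep pace.

    Assumes identical replicas; prefill_tps and decode_tps are measured
    per-replica token rates (hypothetical inputs, not values from the paper).
    """
    prefill_seconds = mean(prompt_lens) / prefill_tps  # per-request prefill time
    decode_seconds = mean(output_lens) / decode_tps    # per-request decode time
    return prefill_seconds / (prefill_seconds + decode_seconds)


# Illustrative trace: long prompts, short outputs.
fraction = prefill_fraction(
    prompt_lens=[1000, 3000],
    output_lens=[100, 300],
    prefill_tps=8000.0,
    decode_tps=400.0,
)
replicas = 12
print(f"prefill replicas: {round(fraction * replicas)} of {replicas}")
```

With mixed GPU types the per-replica rates differ, so treat the result as an initial guess to refine with measurements rather than a final plan.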

Run a small cluster with mixed GPUs and test 4-bit KV-cache packing to measure network savings and quality impact.
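For a first experiment, the quantize, pack, and dequantize round trip can be prototyped without any GPU code. The sketch below uses plain uniform min/max quantization over one slice of values, which is an assumption on our part; the paper's exact quantization scheme and grouping are not described in this summary. Function names are hypothetical.

```python
from typing import List, Tuple


def quantize_pack_4bit(values: List[float]) -> Tuple[bytes, float, float]:
    """Uniformly quantize floats to 4-bit codes and pack two codes per byte."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # 16 levels; fall back to 1.0 for constant input
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad so the codes fill whole bytes
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, scale, lo


def unpack_dequantize_4bit(packed: bytes, scale: float, lo: float, n: int) -> List[float]:
    """Unpack two 4-bit codes per byte and map them back to floats."""
    codes: List[int] = []
    for byte in packed:
        codes.append(byte >> 4)
        codes.append(byte & 0x0F)
    return [lo + c * scale for c in codes[:n]]  # drop the padding code, if any


# Sender side: pack before transfer. Receiver side: dequantize before compute.
kv_slice = [-1.0, -0.5, 0.0, 0.25, 1.0]
packed, scale, lo = quantize_pack_4bit(kv_slice)
restored = unpack_dequantize_4bit(packed, scale, lo, len(kv_slice))
max_err = max(abs(a - b) for a, b in zip(kv_slice, restored))
print(len(packed), "bytes on the wire; max abs error", max_err)
```

The reconstruction error of this scheme is bounded by half a quantization step, so comparing `max_err` against `scale / 2` is a quick sanity check before measuring task-level quality.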

Integrate a lightweight phase-rescheduling step into deployment to recover quickly from node loss without reloading models.
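The core idea of a phase flip can be sketched as a change of role on replicas whose weights are already resident. The example below is a minimal illustration under our own assumptions: the `Replica` record, the `target_prefill` input, and the rule of always keeping at least one replica in each phase are hypothetical, and the request re-routing step that the paper pairs with the flip is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str
    phase: str                   # "prefill" or "decode"
    alive: bool = True
    weights_loaded: bool = True  # untouched by a flip: no model reload


def lightweight_reschedule(replicas: List[Replica], target_prefill: int) -> List[str]:
    """Flip phase roles among live replicas until target_prefill serve prefill.

    Returns the names of the replicas whose role changed.
    """
    live = [r for r in replicas if r.alive]
    prefill = [r for r in live if r.phase == "prefill"]
    decode = [r for r in live if r.phase == "decode"]
    flipped: List[str] = []
    # Too few prefill replicas: promote decode replicas, keeping at least one.
    while len(prefill) < target_prefill and len(decode) > 1:
        r = decode.pop()
        r.phase = "prefill"
        prefill.append(r)
        flipped.append(r.name)
    # Too many prefill replicas: demote some, keeping at least one.
    while len(prefill) > target_prefill and len(prefill) > 1:
        r = prefill.pop()
        r.phase = "decode"
        decode.append(r)
        flipped.append(r.name)
    return flipped


cluster = [
    Replica("p0", "prefill"), Replica("p1", "prefill"),
    Replica("d0", "decode"), Replica("d1", "decode"), Replica("d2", "decode"),
]
cluster[0].alive = cluster[1].alive = False  # both prefill nodes are lost
changed = lightweight_reschedule(cluster, target_prefill=1)
print("flipped:", changed)
```

Because only the `phase` field changes, recovery cost in this sketch is independent of model size, which is the property that makes a flip cheaper than a full redeployment.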

Agent Features

Memory

  • KV cache packing and dequantize on receive

Planning

  • hierarchical scheduling (tabu search + lower-level deduction)
  • two-stage transportation orchestration (LP routing)
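To make the upper-level search concrete, here is a toy tabu search over phase designations only. It is a sketch under stated assumptions, not the paper's scheduler: GPU groups are fixed in advance, each group has made-up prefill and decode rates, and the objective (`pipeline_rate`, the slower of total prefill and total decode capacity) is a hypothetical stand-in for the paper's cost model.

```python
from collections import deque
from typing import List, Sequence, Tuple


def pipeline_rate(assign: Sequence[int], prefill: Sequence[float], decode: Sequence[float]) -> float:
    """Toy objective: the slower phase limits the pipeline (0 = prefill, 1 = decode)."""
    p = sum(rate for a, rate in zip(assign, prefill) if a == 0)
    d = sum(rate for a, rate in zip(assign, decode) if a == 1)
    return min(p, d)


def tabu_search(
    prefill: Sequence[float], decode: Sequence[float], iters: int = 50, tenure: int = 2
) -> Tuple[List[int], float]:
    n = len(prefill)
    current = [i % 2 for i in range(n)]  # arbitrary alternating start
    best, best_score = current[:], pipeline_rate(current, prefill, decode)
    tabu: deque = deque(maxlen=tenure)   # indices flipped in the last `tenure` moves
    for _ in range(iters):
        candidates = []
        for i in range(n):
            neighbor = current[:]
            neighbor[i] ^= 1             # move: flip one group's phase
            score = pipeline_rate(neighbor, prefill, decode)
            # Tabu moves are skipped unless they beat the best so far (aspiration).
            if i not in tabu or score > best_score:
                candidates.append((score, -i, neighbor))
        if not candidates:
            break
        # Take the best admissible move even if it is worse than the current state.
        score, neg_i, current = max(candidates, key=lambda c: (c[0], c[1]))
        tabu.append(-neg_i)
        if score > best_score:
            best, best_score = current[:], score
    return best, best_score


# Hypothetical per-group rates: groups 0-1 are compute-strong, groups 2-3 favor decode.
plan, rate = tabu_search(prefill=[10, 10, 3, 3], decode=[4, 4, 6, 6])
print(plan, rate)
```

On this input the search settles on assigning the two compute-strong groups to prefill and the other two to decode. Accepting non-improving moves while forbidding recent ones is what lets tabu search step off the initial plateau instead of stalling there.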

Tool Use

  • NCCL for comm groups
  • libP2P for decentralized task coordination

Collaboration

  • peer-to-peer request dispatch among replicas

Optimization Features

Infra Optimization

  • group GPUs by interconnect bandwidth
  • limit cross-node tensor parallelism to reduce network pressure
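Grouping by interconnect bandwidth can be illustrated with a connected-components pass over the fast links. This is a minimal sketch, assuming a pairwise bandwidth table and a single threshold `min_gbps`; both the threshold rule and the function name are our own illustration, and the paper's grouping is driven by its search procedure rather than a fixed cutoff.

```python
from typing import Dict, List, Sequence, Tuple


def group_by_bandwidth(
    gpus: Sequence[str],
    bandwidth_gbps: Dict[Tuple[str, str], float],
    min_gbps: float,
) -> List[List[str]]:
    """Union GPUs joined by links of at least min_gbps (connected components)."""
    parent = {g: g for g in gpus}

    def find(g: str) -> str:
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for (a, b), bw in bandwidth_gbps.items():
        if bw >= min_gbps:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for g in gpus:
        groups.setdefault(find(g), []).append(g)
    return sorted(groups.values())


# Hypothetical link speeds: fast links inside each node, a slow link across nodes.
links = {("a0", "a1"): 300.0, ("b0", "b1"): 64.0, ("a1", "b0"): 5.0}
groups = group_by_bandwidth(["a0", "a1", "b0", "b1"], links, min_gbps=50.0)
print(groups)
```

Keeping tensor parallelism inside one such group and using the slow cross-group links only for KV transfer is one way to act on the two bullets above. Note that components are transitive, so two GPUs can share a group without a fast direct link between them.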

System Optimization

  • heterogeneity-aware scheduling
  • lightweight rescheduling (phase flips only)
  • orchestration to minimize KV transfer cost

Inference Optimization

  • phase splitting (separate prefill/decode replicas)
  • KV-cache one-shot compression (quantize/pack to 4-bit then dequantize)

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Phase splitting relies on KV transfer; extremely low inter-instance bandwidth (e.g., cross-datacenter links) can erase the benefits.
  • Lightweight rescheduling flips phase roles only and can be suboptimal versus full re-deployment in some cases.
  • No public code or deployment scripts provided in paper, limiting immediate reproducibility.

When Not To Use

  • When inter-node network bandwidth is consistently extremely low (KV transfer infeasible).
  • When you must avoid any additional inter-replica communication (strict single-node co-location requirement).
  • When you need an open-source, turnkey deployment and cannot reimplement the scheduler and task coordinator.

Failure Modes

  • If KV-transfer path chosen by orchestration experiences sudden bandwidth drop, throughput and SLO attainment can degrade sharply.
  • Quantization-induced errors if the dequantize step is omitted or buggy; the paper relies on immediate dequantization before compute.
  • Tabu search may find a suboptimal grouping for very large or sparse resource mixes without good initialization.

Core Entities

Models

  • LLaMA-7B
  • LLaMA-13B
  • LLaMA-30B

Metrics

  • throughput (tokens/s)
  • SLO attainment (%)
  • time to first token (TTFT)
  • time per output token (TPOT)
  • end-to-end latency (E2E)
  • perplexity (PPL)
  • ROUGE-1/2/L
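The latency metrics in this list can be computed from three timestamps per request. The sketch below uses the commonly used definitions (TTFT from arrival to first token, TPOT averaged over the tokens after the first, SLO attainment as the percentage of requests meeting a deadline); the paper's exact definitions may differ, and the `RequestTrace` record is hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestTrace:
    arrival: float       # seconds
    first_token: float   # time the first output token was emitted
    finish: float        # time the last output token was emitted
    output_tokens: int


def ttft(r: RequestTrace) -> float:
    """Time to first token."""
    return r.first_token - r.arrival


def tpot(r: RequestTrace) -> float:
    """Average time per output token after the first one."""
    if r.output_tokens <= 1:
        return 0.0
    return (r.finish - r.first_token) / (r.output_tokens - 1)


def e2e(r: RequestTrace) -> float:
    """End-to-end latency."""
    return r.finish - r.arrival


def slo_attainment(traces: Sequence[RequestTrace], e2e_deadline: float) -> float:
    """Percentage of requests whose E2E latency meets the deadline."""
    met = sum(1 for r in traces if e2e(r) <= e2e_deadline)
    return 100.0 * met / len(traces)


traces = [
    RequestTrace(arrival=0.0, first_token=0.5, finish=2.5, output_tokens=11),
    RequestTrace(arrival=1.0, first_token=1.25, finish=6.0, output_tokens=20),
]
print(ttft(traces[0]), tpot(traces[0]), e2e(traces[0]))
print(slo_attainment(traces, e2e_deadline=3.0), "% of requests within 3.0 s")
```

With phase splitting, TTFT is mostly determined by the prefill replica plus the KV transfer, while TPOT is determined by the decode replica, which is why the two are reported separately.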

Datasets

  • Azure Conversation (coding, conversation workloads)
  • CoQA
  • TruthfulQA
  • GSM8K
  • WikiText2
  • PTB
  • CBT

Benchmarks

  • SLO attainment
  • throughput
  • TTFT
  • TPOT
  • E2E latency
  • PPL
  • ROUGE

Context Entities

Models

  • GPT-4 (referenced)
  • OPT
  • Falcon

Datasets

  • BurstGPT (workload dataset referenced)