DeepSpeed-FastGen: up to 2.3x effective throughput and much lower tail latency for LLM serving

January 9, 2024 · 7 min

Overview

Production Readiness

0.8

Novelty Score

0.45

Cost Impact Score

0.65

Citation Count

5

Authors

Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He

Links

Abstract / PDF

Why It Matters For Business

FastGen can double effective throughput and sharply cut tail latency on long-prompt, streaming chat workloads, which lowers GPU cost per successful request and improves user-perceived responsiveness.

Summary TLDR

DeepSpeed-FastGen is a serving system that combines a new scheduling strategy, Dynamic SplitFuse, with DeepSpeed-MII and DeepSpeed-Inference. It splits long prompts across forward passes and fuses short prompts together so that each GPU forward pass processes a consistent number of tokens, which improves utilization and responsiveness. Reported gains vs. vLLM: up to 2.3x effective throughput, about 2x lower average latency, and up to 3.7x lower P95 token-level latency on LLaMA-family models across A100/H100/A6000 hardware. The alpha release supports LLaMA/LLaMA-2, Mistral, and OPT and includes both short-lived and persistent deployment modes.

Problem Statement

Existing LLM serving stacks stall or preempt generation when handling long prompts, which hurts throughput and raises tail latency. Systems either do large prompt-only forwards or interrupt generation to process prompts, causing inconsistent latency and poor utilization. The paper proposes a scheduling/composition fix to avoid these stalls.

Main Contribution

Dynamic SplitFuse: a token composition policy that splits long prompts and fuses short prompts so forward passes keep a target token size.

A production-ready implementation combining DeepSpeed-MII and DeepSpeed-Inference with tuned kernels and blocked (non-contiguous) KV cache support.

Benchmarking that reports up to 2.3x effective throughput, 2x average latency reduction, and up to 3.7x P95 token-latency reduction versus vLLM.

Alpha release with easy install, interactive and persistent deployment examples, and replica-level load balancing for near-linear scaling.
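The Dynamic SplitFuse policy described above can be illustrated with a small scheduler sketch. This is not the DeepSpeed implementation: the `Request` record, the `compose_forward_pass` function, and the `token_budget` parameter are hypothetical names chosen for illustration, and the sketch simplifies by counting the first generated token as a separate decode step.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    req_id: str
    prompt_remaining: int  # prompt tokens not yet processed
    gen_remaining: int     # tokens still to be generated


def compose_forward_pass(decoding, prefilling, token_budget):
    """Return (req_id, n_tokens) pairs whose total is at most token_budget.

    decoding:   list of requests in the generation phase (1 token per pass).
    prefilling: deque of requests that still have prompt tokens to process.
    """
    batch = []
    budget = token_budget

    # 1. Each decoding request contributes one token, so ongoing generation
    #    is not preempted by newly arrived prompts.
    for req in list(decoding):
        if budget == 0:
            break
        batch.append((req.req_id, 1))
        budget -= 1
        req.gen_remaining -= 1
        if req.gen_remaining == 0:
            decoding.remove(req)

    # 2. Fill the rest of the budget with prompt tokens. Short prompts are
    #    fused into the same pass; a prompt that does not fit is split and
    #    continues in the next pass.
    finished_prefill = []
    while budget > 0 and prefilling:
        req = prefilling[0]
        chunk = min(req.prompt_remaining, budget)
        batch.append((req.req_id, chunk))
        req.prompt_remaining -= chunk
        budget -= chunk
        if req.prompt_remaining == 0:
            finished_prefill.append(prefilling.popleft())

    # Requests whose prompt is complete start decoding on the next pass.
    decoding.extend(finished_prefill)
    return batch


decoding = [Request("a", 0, 3)]
prefilling = deque([Request("b", 700, 2), Request("c", 100, 2)])
for step in range(3):
    print(step, compose_forward_pass(decoding, prefilling, token_budget=512))
```

In this toy run the 700-token prompt is split over two passes, and the 100-token prompt is fused into the second pass alongside the remainder, while request "a" keeps receiving one token per pass.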

Key Findings

Effective throughput improved up to 2.3× versus vLLM under chat-style SLAs.

Numbers: up to 2.3x effective throughput (Section 4.3)

Average latency reduced by about 2× on some configurations.

Numbers: 2x lower average latency (abstract; Section 4.2)

Token-level P95 (tail) latency reduced up to 3.7× by avoiding generation preemption.

Numbers: up to 3.7x lower P95 (Section 4.4)

Measured a concrete throughput example: Llama-2 70B on 4x A100s achieved 1.36 rps vs. vLLM's 0.67 rps at identical latency.

Numbers: 1.36 rps vs 0.67 rps (Llama-2 70B, 4 A100s)

Near-linear scaling with replica-level load balancing: 1.46 rps per replica scaled to 23.7 rps with 16 replicas (≈16×).

Numbers: 1.46 rps → 23.7 rps with 16 replicas (≈16×)
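The ratios behind the last two findings can be checked with a few lines of arithmetic. The figures are the ones quoted above; the variable names are only for illustration. The scaling efficiency comes out marginally above 100%, which is most plausibly rounding or measurement noise in the reported numbers rather than true superlinear scaling.

```python
# Figures quoted in the findings above.
single_replica_rps = 1.46
sixteen_replica_rps = 23.7
replicas = 16

fastgen_rps = 1.36  # Llama-2 70B on 4x A100
vllm_rps = 0.67     # same setup, same latency target

speedup = sixteen_replica_rps / single_replica_rps
efficiency = speedup / replicas          # 1.0 would be perfectly linear scaling
throughput_ratio = fastgen_rps / vllm_rps

print(f"replica speedup:    {speedup:.2f}x")
print(f"scaling efficiency: {efficiency:.1%}")
print(f"FastGen vs vLLM:    {throughput_ratio:.2f}x")
```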

Results

Effective throughput (chat SLA)

Value: up to 2.3×

Baseline: vLLM

Average latency

Value: ≈2× lower

Baseline: vLLM

Token-level P95 latency

Value: up to 3.7× lower

Baseline: vLLM

Example throughput point

Value: 1.36 rps (DeepSpeed-FastGen)

Baseline: 0.67 rps (vLLM)

Scaling with replicas

Value: 1.46 rps → 23.7 rps (16 replicas)

Baseline: single-replica throughput

Who Should Care

What To Try In 7 Days

Install DeepSpeed-MII (pip) and run the provided pipeline example.

Compare effective throughput vs. your current vLLM/TGI setup on a representative prompt distribution.

Test persistent deployment with the built-in gRPC server and load balancer on 1–4 nodes.
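As a starting point for the first and third steps, the sketch below wraps the two deployment modes using the `mii.pipeline` and `mii.serve` entry points from the DeepSpeed-MII README (installed with `pip install deepspeed-mii`). The wrapper function names, the prompts, and the default model `mistralai/Mistral-7B-v0.1` are assumptions made for illustration, and the API may have changed since the alpha release, so check the current README before relying on it. Both functions need a supported NVIDIA GPU to run.

```python
def run_pipeline_example(model_name="mistralai/Mistral-7B-v0.1"):
    """Short-lived deployment: the model lives only inside this process.

    The default model name is an assumption; use any supported model.
    """
    import mii  # imported lazily so this file loads without DeepSpeed-MII

    pipe = mii.pipeline(model_name)
    return pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)


def run_persistent_example(model_name="mistralai/Mistral-7B-v0.1",
                           tensor_parallel=1, replica_num=1):
    """Persistent deployment: start a server, query it, then shut it down."""
    import mii

    client = mii.serve(model_name,
                       tensor_parallel=tensor_parallel,
                       replica_num=replica_num)
    try:
        return client.generate(["DeepSpeed is", "Seattle is"],
                               max_new_tokens=128)
    finally:
        # Always release the GPUs, even if generation raised an error.
        client.terminate_server()


# Example calls (require a GPU machine with DeepSpeed-MII installed):
#   print(run_pipeline_example())
#   print(run_persistent_example(tensor_parallel=2, replica_num=2))
```

Raising `replica_num` is the knob to vary when testing the load balancer, while `tensor_parallel` controls how many GPUs each replica spans.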

Optimization Features

Token Efficiency

  • Composes prompt and generation tokens to meet target forward size

Infra Optimization

  • Supports Tensor Parallelism and Ampere+ GPUs (A100/H100/A6000)
  • Works with DeepSpeed-MII backend for model hosting

System Optimization

  • Replica-level load balancing for near-linear scaling
  • Precompiled DeepSpeed-Kernels to reduce install/compile friction

Inference Optimization

  • Dynamic SplitFuse continuous batching (split long prompts, fuse short prompts)
  • Blocked/non-contiguous KV cache to reduce memory fragmentation
  • Optimized CUDA kernels (FlashAttention-derived)
  • Consistent forward-pass token counts to keep the GPU in its throughput-saturated region
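The blocked KV cache idea in the list above can be sketched as a simple block allocator. This is a toy model under stated assumptions, not DeepSpeed's implementation: the `BlockedKVCache` class and its methods are hypothetical, and it tracks block ids only rather than real key/value tensors. The point is that a sequence's cache grows one fixed-size block at a time and its blocks need not be adjacent, so freed blocks can be reused by any sequence.

```python
class BlockedKVCache:
    """Toy block allocator; stores block ids only, not real tensors."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size               # tokens per block
        self.free_blocks = list(range(num_blocks)) # ids of unused blocks
        self.block_tables = {}                     # seq_id -> list of block ids
        self.seq_lens = {}                         # seq_id -> tokens cached

    def append_tokens(self, seq_id, n_tokens):
        """Reserve space for n_tokens more tokens of sequence seq_id."""
        table = self.block_tables.get(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + n_tokens
        blocks_required = -(-new_len // self.block_size)  # ceiling division
        needed = blocks_required - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("not enough free KV blocks")
        # Any free block will do; the table records the logical order.
        table.extend(self.free_blocks.pop() for _ in range(needed))
        self.block_tables[seq_id] = table
        self.seq_lens[seq_id] = new_len

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = BlockedKVCache(num_blocks=4, block_size=16)
cache.append_tokens("a", 20)   # needs 2 blocks
cache.append_tokens("b", 16)   # needs 1 block
cache.free("a")                # a's blocks go back to the pool
cache.append_tokens("b", 40)   # b grows into the blocks a released
print(cache.block_tables["b"])
```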

Reproducibility

Code Urls

  • DeepSpeed-MII GitHub landing page (referenced in paper)
  • DeepSpeed GitHub (project referenced in paper)

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Alpha release with limited model family support at time of paper (LLaMA, Mistral, OPT).
  • Benchmarks use synthetic prompt/generation distributions (normal with set means and 30% variance).
  • Improvements shown mainly against vLLM; other real-world stacks might differ.
  • Requires NVIDIA Ampere+ GPUs and CUDA 11.6+ for prebuilt kernels.

When Not To Use

  • Small CPU-only setups or non-NVIDIA hardware where DeepSpeed kernels don't apply.
  • Workloads dominated by tiny single-token prompts where batching gains are minimal.
  • Environments that cannot run precompiled CUDA kernels or need strict binary compatibility.

Failure Modes

  • Uneven or adversarial prompt distributions may reduce the benefit of fixed target forward sizes.
  • Scheduling overheads could hurt very low-latency, single-request scenarios.
  • Incompatibilities with unsupported model architectures until added to the alpha release.

Core Entities

Models

  • LLaMA
  • LLaMA-2
  • Mistral
  • OPT

Metrics

  • Effective throughput (queries/s)
  • Latency (P50/P90/P95)
  • Per-token generation latency
  • Throughput-latency curve

Benchmarks

  • Throughput-latency curves
  • Effective throughput (chat SLA: prompt 512 tokens/s, gen EMA 2/4/6 tokens/s)
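One way to compute the effective-throughput metric from your own request logs is sketched below. It uses the SLA figures listed above (512 tokens/s for the prompt phase and a generation rate of 2, 4, or 6 tokens/s measured as an exponential moving average). Several details are assumptions, not taken from the paper: the function names, the smoothing factor `alpha`, the record layout, and the rule that a request fails as soon as its EMA drops below the target.

```python
def meets_sla(prompt_len, first_token_latency_s, inter_token_gaps_s,
              prompt_rate=512.0, gen_rate=4.0, alpha=0.5):
    """Return True if one request satisfies both parts of the chat SLA.

    alpha and the 'EMA must never drop below gen_rate' rule are assumptions.
    """
    # Prompt phase: the first token must arrive within prompt_len / prompt_rate.
    if first_token_latency_s > prompt_len / prompt_rate:
        return False

    # Generation phase: track an EMA of the instantaneous token rate.
    ema = None
    for gap in inter_token_gaps_s:  # gaps are assumed to be positive seconds
        rate = 1.0 / gap
        ema = rate if ema is None else alpha * rate + (1 - alpha) * ema
        if ema < gen_rate:
            return False
    return True


def effective_throughput(requests, wall_time_s, **sla):
    """Requests per second that met the SLA over the measurement window."""
    satisfied = sum(meets_sla(**req, **sla) for req in requests)
    return satisfied / wall_time_s


# Hypothetical log records, one dict per completed request.
log = [
    dict(prompt_len=1024, first_token_latency_s=1.5,
         inter_token_gaps_s=[0.1, 0.2, 0.1]),
    dict(prompt_len=512, first_token_latency_s=1.5,
         inter_token_gaps_s=[0.1, 0.1]),
    dict(prompt_len=512, first_token_latency_s=0.5,
         inter_token_gaps_s=[0.1, 1.0]),
]
for target in (2.0, 4.0, 6.0):
    print(target, effective_throughput(log, wall_time_s=10.0, gen_rate=target))
```

Running the same function over logs from two serving stacks under the same load gives a like-for-like comparison, since requests that miss the SLA do not count toward either system's throughput.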

Context Entities

Models

  • GPT-4 (citation)
  • MPT-StoryWriter (context)