DeepSpeed-FastGen: up to 2.3x effective throughput and much lower tail latency for LLM serving

January 9, 2024 (7 min read)

Overview

Decision Snapshot: Ready For Pilot

Strong systems engineering and measured speedups on multiple GPUs give practical value now, but results are from an alpha release and synthetic workload distributions.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 80%

Novelty: 45%

Authors

Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He

Links

Abstract / PDF / Code

Why It Matters For Business

FastGen can roughly double effective throughput (up to 2.3x in the reported benchmarks) and sharply cut tail latency on long-prompt, streaming chat workloads. That lowers GPU cost per successful request and improves user-perceived responsiveness.

Who Should Care

Summary TLDR

DeepSpeed-FastGen is an LLM serving system that combines a new scheduling strategy, Dynamic SplitFuse, with DeepSpeed-MII and DeepSpeed-Inference. Dynamic SplitFuse splits long prompts across several forward passes and fuses short prompts together so that every GPU forward pass processes a consistent number of tokens, which improves utilization and responsiveness. Reported gains over vLLM: up to 2.3x effective throughput, about 2x lower average latency, and up to 3.7x lower P95 token latency on LLaMA-family models across A100, H100, and A6000 GPUs. The alpha release supports LLaMA/LLaMA-2, Mistral, and OPT, and offers both short-lived (non-persistent) and persistent deployment modes.

Problem Statement

Existing LLM serving stacks stall or preempt generation when handling long prompts, which hurts throughput and raises tail latency. They either run large prompt-only forward passes or interrupt ongoing generation to process new prompts, and both choices cause inconsistent latency and poor utilization. The paper proposes a scheduling and batch-composition fix that avoids these stalls.

Main Contribution

Dynamic SplitFuse: a token composition policy that splits long prompts and fuses short prompts so forward passes keep a target token size.

A deployable alpha-release implementation combining DeepSpeed-MII and DeepSpeed-Inference with tuned kernels and blocked (non-contiguous) KV cache support.
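The split-and-fuse idea can be illustrated with a small scheduler sketch. This is a simplified illustration and not the DeepSpeed implementation: the `Request` class, the `compose_forward_pass` function, and the 512-token budget are hypothetical names and values chosen for the example. Requests that are already generating contribute one token each, and the remaining budget is filled with prompt tokens, splitting a prompt whenever it does not fit.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record used only for this sketch."""
    name: str
    prompt_remaining: int      # prompt tokens not yet fed through the model
    generating: bool = False   # True once the whole prompt has been processed


def compose_forward_pass(requests, token_budget):
    """Choose how many tokens each request contributes to one forward pass.

    Returns (request name, token count) pairs whose counts sum to at most
    token_budget.
    """
    batch = []
    budget = token_budget

    # 1. Every request that is already generating contributes one token.
    for req in requests:
        if req.generating and budget > 0:
            batch.append((req.name, 1))
            budget -= 1

    # 2. Fill what is left with prompt tokens. A prompt that does not fit is
    #    split; several short prompts can share (be fused into) one pass.
    for req in requests:
        if budget == 0:
            break
        if req.generating or req.prompt_remaining == 0:
            continue
        chunk = min(req.prompt_remaining, budget)
        batch.append((req.name, chunk))
        req.prompt_remaining -= chunk
        budget -= chunk
        if req.prompt_remaining == 0:
            req.generating = True  # generation starts on the next pass

    return batch


requests = [Request("long", 1000), Request("short_a", 100), Request("short_b", 200)]
passes = []
while any(not r.generating for r in requests):
    passes.append(compose_forward_pass(requests, token_budget=512))

for i, batch in enumerate(passes, start=1):
    print(f"pass {i}: {batch} (total {sum(n for _, n in batch)} tokens)")
```

In this run the 1000-token prompt is split over two passes, and the second pass tops itself up with part of a short prompt, so the first two passes both carry exactly 512 tokens. A real scheduler would also handle request arrival, completion, and KV cache limits, which this sketch leaves out.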

Key Findings

Effective throughput improved up to 2.3× versus vLLM under chat-style SLAs.

Numbers: up to 2.3x effective throughput (Section 4.3)

Practical Use: Benchmark FastGen against the current serving backend of your chat services, and switch if it delivers more successful requests per GPU under the same SLA.

Evidence Ref: Section 4.3, Figure 6

Average latency reduced by about 2× on some configurations.

Numbers: 2x lower average latency (abstract; Section 4.2)

Practical Use: Expect noticeably faster first-token and interactive responses for long-prompt workloads.

Evidence Ref: Section 4.2, Llama-2 70B experiment

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Effective throughput (chat SLA) | up to 2.3× | vLLM | 2.3× | Llama-2 7B/13B/70B across A100/H100/A6000 | Section 4.3; plotted in Figure 6 | Section 4.3, Figure 6 |
| Average latency | ≈2× lower | vLLM | 2× | Llama-2 70B (4 A100s) | Section 4.2; 1.36 rps vs 0.67 rps at same latency | Section 4.2 |
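The "≈2×" entry can be checked against the two request rates quoted in the Evidence column. The short calculation below uses only those numbers plus the GPU count from the same row. The GPU-seconds-per-request figure is a derived illustration that assumes all four GPUs are fully occupied at steady state; it is not a number reported in the paper.

```python
# Request rates quoted in the Results table (Llama-2 70B on 4 A100s).
fastgen_rps = 1.36
vllm_rps = 0.67
num_gpus = 4

# Throughput ratio at the same latency.
speedup = fastgen_rps / vllm_rps

# Derived illustration (assumption: all GPUs busy at steady state).
gpu_seconds_per_request = {
    "FastGen": num_gpus / fastgen_rps,
    "vLLM": num_gpus / vllm_rps,
}

print(f"throughput ratio at equal latency: {speedup:.2f}x")  # prints 2.03x
for name, cost in gpu_seconds_per_request.items():
    print(f"{name}: {cost:.2f} GPU-seconds per request")
```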

What To Try In 7 Days

Install DeepSpeed-MII (pip) and run the provided pipeline example.

Compare effective throughput vs. your current vLLM/TGI setup on a representative prompt distribution.

Test persistent deployment with the built-in gRPC server and load balancer on 1–4 nodes.
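The first and third steps above might look like the sketch below. It follows the usage shown in the DeepSpeed-MII README around the time of the alpha release (`mii.pipeline`, `mii.serve`, `client.generate`, `client.terminate_server`); the API may have changed since, so check the current documentation before relying on it. The model name and prompts are illustrative assumptions, and both functions need a supported NVIDIA GPU and a model download to run. The `mii` import is deferred into the functions so that the file still loads on a machine without the package.

```python
# Install first (shell):  pip install deepspeed-mii

def run_pipeline_example(model_name="mistralai/Mistral-7B-v0.1"):
    """Short-lived (non-persistent) deployment: load, generate, return."""
    import mii  # deferred import: requires deepspeed-mii and a supported GPU

    pipe = mii.pipeline(model_name)
    return pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)


def run_persistent_example(model_name="mistralai/Mistral-7B-v0.1"):
    """Persistent deployment: start a server, query it, then shut it down."""
    import mii  # deferred import: requires deepspeed-mii and a supported GPU

    client = mii.serve(model_name)
    try:
        return client.generate(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
    finally:
        # Stop the background server so it does not keep holding GPU memory.
        client.terminate_server()


if __name__ == "__main__":
    print(run_pipeline_example())
```

For the comparison step, send the same prompt set and request rate to both systems and count only the requests that meet your SLA.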

Optimization Features

Token Efficiency
Composes prompt and generation tokens to meet target forward size

Infra Optimization
Supports Tensor Parallelism and Ampere+ GPUs (A100/H100/A6000)
Works with DeepSpeed-MII backend for model hosting

System Optimization
Replica-level load balancing for near-linear scaling
Precompiled DeepSpeed-Kernels to reduce install/compile friction

Inference Optimization
Dynamic SplitFuse continuous batching (split long prompts, fuse short prompts)
Blocked/non-contiguous KV cache to reduce memory fragmentation
Optimized CUDA kernels (FlashAttention-derived)
Per-token forward sizing to keep GPU in throughput-saturated region
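The blocked KV cache item can be pictured with a toy allocator. This is a minimal sketch of the general technique, not DeepSpeed's data structure: the `BlockedKVCache` class, the block size of 16, and the pool of 8 blocks are hypothetical. Each sequence owns a list of fixed-size blocks that need not be adjacent, so a growing sequence never has to be copied into a larger contiguous region.

```python
from collections import deque


class BlockedKVCache:
    """Toy allocator: each sequence owns a list of fixed-size cache blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = deque(range(num_blocks))
        self.block_table = {}  # sequence id -> list of block ids
        self.seq_len = {}      # sequence id -> number of tokens stored

    def append_tokens(self, seq_id, num_tokens):
        """Reserve room for num_tokens more tokens and return the block ids."""
        blocks = self.block_table.get(seq_id, [])
        new_len = self.seq_len.get(seq_id, 0) + num_tokens
        # Ceiling division gives the total blocks needed for new_len tokens.
        needed = -(-new_len // self.block_size) - len(blocks)
        if needed > len(self.free_blocks):
            raise MemoryError("not enough free KV blocks")
        for _ in range(needed):
            blocks.append(self.free_blocks.popleft())
        self.block_table[seq_id] = blocks
        self.seq_len[seq_id] = new_len
        return blocks

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_table.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)


cache = BlockedKVCache(num_blocks=8, block_size=16)
cache.append_tokens("a", 20)   # 20 tokens need 2 blocks
cache.append_tokens("b", 16)   # 16 tokens need 1 block
cache.append_tokens("a", 20)   # "a" grows to 40 tokens and takes a 3rd block
print(cache.block_table)       # {'a': [0, 1, 3], 'b': [2]}
```

Sequence "a" ends up with blocks 0, 1, and 3, with a block of "b" in between. Only the last block of each sequence can be partly empty, which bounds the wasted memory per sequence to less than one block.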

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Code URLs

DeepSpeed-MII GitHub landing page (referenced in paper)

DeepSpeed GitHub (project referenced in paper)

Risks & Boundaries

Limitations

Alpha release with limited model family support at time of paper (LLaMA, Mistral, OPT).

Benchmarks use synthetic prompt/generation distributions (normal with set means and 30% variance).
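To see how such a synthetic workload differs from your own traffic, it can help to generate one and compare its length histogram with production logs. The sketch below is an assumption-laden illustration: it reads "30% variance" as a standard deviation equal to 30% of the mean, and the means (1200 prompt tokens, 128 generated tokens), the sample count, and the `sample_lengths` helper are hypothetical choices for the example.

```python
import random


def sample_lengths(mean, n, rel_std=0.30, seed=0):
    """Draw n token counts from a normal distribution, clipped to at least 1.

    Assumption: the spread is modelled as std = rel_std * mean.
    """
    rng = random.Random(seed)
    return [max(1, round(rng.gauss(mean, rel_std * mean))) for _ in range(n)]


# Hypothetical means chosen for illustration.
prompt_lengths = sample_lengths(mean=1200, n=1000, seed=0)
generation_lengths = sample_lengths(mean=128, n=1000, seed=1)

print("prompt tokens:    min", min(prompt_lengths), "max", max(prompt_lengths))
print("generated tokens: min", min(generation_lengths), "max", max(generation_lengths))
```

Real chat traffic is often skewed or multi-modal rather than normal, so it is worth replaying logged prompt lengths alongside a synthetic set like this one.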

When Not To Use

Small CPU-only setups or non-NVIDIA hardware where DeepSpeed kernels don't apply.

Workloads dominated by tiny single-token prompts where batching gains are minimal.

Failure Modes

Uneven or adversarial prompt distributions may reduce the benefit of fixed target forward sizes.

Scheduling overheads could hurt very low-latency, single-request scenarios.

Core Entities

Models

LLaMA, LLaMA-2, Mistral, OPT

Metrics

Effective throughput (queries/s)

Latency (P50/P90/P95)

Per-token generation latency

Throughput-latency curve

Benchmarks

Throughput-latency curves

Effective throughput (chat SLA: prompt 512 tokens/s, gen EMA 2/4/6 tokens/s)
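The effective-throughput metric counts only requests that meet the chat SLA. The sketch below shows one way to compute it from logged timings. Several details are assumptions rather than the paper's exact definition: the EMA smoothing factor `alpha`, the rule that the EMA must stay at or above the target after every token, and the record layout (prompt tokens, time to first token, gaps between generated tokens). The function names are hypothetical.

```python
def meets_sla(prompt_tokens, first_token_latency_s, token_gaps_s,
              prompt_rate=512.0, gen_rate=4.0, alpha=0.5):
    """Return True if one request satisfies both parts of the chat SLA."""
    # Prompt SLA: the prompt must be processed at prompt_rate tokens/s or faster.
    if first_token_latency_s > prompt_tokens / prompt_rate:
        return False

    # Generation SLA: an exponential moving average of the token rate must
    # stay at or above gen_rate (assumption: checked after every token).
    ema = None
    for gap in token_gaps_s:  # gaps are assumed to be positive
        rate = 1.0 / gap
        ema = rate if ema is None else alpha * rate + (1 - alpha) * ema
        if ema < gen_rate:
            return False
    return True


def effective_throughput(requests, wall_clock_s, **sla):
    """Requests per second that met the SLA over the measurement window."""
    satisfied = sum(meets_sla(*request, **sla) for request in requests)
    return satisfied / wall_clock_s


# (prompt tokens, time to first token in s, gaps between generated tokens in s)
requests = [
    (1024, 1.5, [0.1, 0.1, 0.1]),  # meets both parts of the SLA
    (1024, 3.0, [0.1, 0.1]),       # prompt too slow: budget is 1024 / 512 = 2.0 s
    (512, 0.8, [0.1, 1.0, 1.0]),   # generation slows: EMA falls to 3.25 tokens/s
]

print(effective_throughput(requests, wall_clock_s=10.0, gen_rate=4.0))  # 0.1
print(effective_throughput(requests, wall_clock_s=10.0, gen_rate=2.0))  # 0.2
```

Relaxing the generation target from 4 to 2 tokens/s lets the third request pass, which is why the metric is reported separately for each of the 2, 4, and 6 tokens/s targets.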

Context Entities

Models

GPT-4 (citation)

MPT-StoryWriter (context)