DeepSpeed-FastGen: up to 2.3x effective throughput and much lower tail latency for LLM serving

Overview

Decision Snapshot: Ready For Pilot

Strong systems engineering and measured speedups on multiple GPUs give practical value now, but results are from an alpha release and synthetic workload distributions.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 80%

Novelty: 45%

Authors

Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He

Links

Abstract / PDF / Code

Why It Matters For Business

FastGen can double effective throughput and sharply cut tail latency on long-prompt, streaming chat workloads, which lowers GPU cost per successful request and improves user-perceived responsiveness.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager

Summary TLDR

DeepSpeed-FastGen is a serving system that combines a new scheduling strategy, Dynamic SplitFuse, with DeepSpeed-MII and DeepSpeed-Inference. It splits long prompts across forward passes and fuses short prompts into the same pass so that each GPU forward pass processes a consistent number of tokens, which improves utilization and responsiveness. Reported gains vs. vLLM: up to 2.3x effective throughput, about 2x lower average latency, and up to 3.7x lower P95 token latency on LLaMA-family models across A100/H100/A6000 hardware. The alpha release supports LLaMA/LLaMA-2, Mistral, and OPT and includes both short-lived and persistent deployment modes.

Problem Statement

Existing LLM serving stacks stall or preempt generation when handling long prompts, which hurts throughput and raises tail latency. Systems either do large prompt-only forwards or interrupt generation to process prompts, causing inconsistent latency and poor utilization. The paper proposes a scheduling/composition fix to avoid these stalls.

Main Contribution

Dynamic SplitFuse: a token composition policy that splits long prompts and fuses short prompts so forward passes keep a target token size.

An alpha-release implementation combining DeepSpeed-MII and DeepSpeed-Inference, with tuned kernels and blocked (non-contiguous) KV cache support.
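The split-and-fuse idea can be illustrated with a minimal scheduling sketch. This is not the DeepSpeed implementation: the `Request` record, the `compose_forward` function, and the decode-first ordering are all assumptions made for illustration. The sketch only shows how one forward pass could be filled up to a fixed token budget.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; not a DeepSpeed type."""
    rid: str
    prompt_left: int  # prompt tokens not yet processed


def compose_forward(pending: deque, running: list, token_budget: int) -> list:
    """Pick (request id, token count) pairs for one forward pass.

    Each in-flight generation contributes one token. The remaining budget is
    filled with prompt tokens: a long prompt is split across passes, and
    several short prompts are fused into the same pass.
    """
    batch = []
    budget = token_budget

    # 1) One decode token per running generation (ASSUMPTION: decode first).
    for req in running:
        if budget == 0:
            break
        batch.append((req.rid, 1))
        budget -= 1

    # 2) Fill what is left with prompt chunks, in arrival order.
    while pending and budget > 0:
        req = pending[0]
        take = min(req.prompt_left, budget)
        batch.append((req.rid, take))
        req.prompt_left -= take
        budget -= take
        if req.prompt_left == 0:
            # Prompt fully consumed: the request starts decoding next pass.
            running.append(pending.popleft())
    return batch
```

With a budget of 512 tokens and prompts of 1200, 100, and 60 tokens, the first two passes each carry 512 tokens of the long prompt, the third carries its 176-token tail fused with both short prompts, and later passes carry one decode token per request.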

Key Findings

Effective throughput improved up to 2.3× versus vLLM under chat-style SLAs.

Numbers: up to 2.3x effective throughput (Section 4.3)

Practical Use: Benchmark FastGen against the serving stack currently in front of your chat services, or swap it in, to increase successful requests per GPU under the same SLA.

Evidence Ref: Section 4.3, Figure 6

Average latency reduced by about 2× on some configurations.

Numbers: 2x lower average latency (abstract; Section 4.2)

Practical Use: Expect noticeably faster first-token and interactive responses for long-prompt workloads.

Evidence Ref: Section 4.2, Llama-2 70B experiment

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Effective throughput (chat SLA)	up to 2.3×	vLLM	2.3×	Llama-2 7B/13B/70B across A100/H100/A6000	Section 4.3; plotted in Figure 6	Section 4.3, Figure 6
Average latency	≈2× lower	vLLM	≈2×	Llama-2 70B (4 A100s)	Section 4.2; 1.36 rps vs 0.67 rps at same latency	Section 4.2

What To Try In 7 Days

Install DeepSpeed-MII (pip) and run the provided pipeline example.

Compare effective throughput vs. your current vLLM/TGI setup on a representative prompt distribution.

Test persistent deployment with the built-in GRPC server and load balancer on 1–4 nodes.
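For the comparison step, a small helper can turn per-request measurements into an effective-throughput number. The sketch below is an assumption-laden reading of the chat SLA listed under Benchmarks (prompt 512 tokens/s, generation EMA of 2/4/6 tokens/s): the `RequestTrace` record, the smoothing factor `ema_alpha`, and the exact pass/fail rule are hypothetical, so adapt them to however your load generator records timings.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request measurements from your own load generator."""
    prompt_tokens: int
    first_token_latency_s: float  # submit time to first generated token
    token_gaps_s: list            # seconds between consecutive generated tokens


def meets_sla(trace, prompt_rate=512.0, gen_rate=4.0, ema_alpha=0.1):
    """Return True if the request satisfies both parts of the assumed SLA."""
    # Prompt phase: prompt tokens per second up to the first generated token.
    if trace.prompt_tokens / trace.first_token_latency_s < prompt_rate:
        return False
    # Generation phase: exponential moving average of tokens per second
    # must never fall below gen_rate (ASSUMPTION: checked at every token).
    ema = None
    for gap in trace.token_gaps_s:
        rate = 1.0 / gap
        ema = rate if ema is None else ema_alpha * rate + (1 - ema_alpha) * ema
        if ema < gen_rate:
            return False
    return True


def effective_throughput(traces, wall_clock_s, **sla):
    """Requests per second that met the SLA over the whole benchmark run."""
    ok = sum(meets_sla(t, **sla) for t in traces)
    return ok / wall_clock_s
```

Running the same traces through this helper for both serving stacks, at the same offered load, gives a like-for-like number to compare.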

Optimization Features

Token Efficiency

Composes prompt and generation tokens to meet target forward size

Infra Optimization

Supports Tensor Parallelism and Ampere+ GPUs (A100/H100/A6000)

Works with DeepSpeed-MII backend for model hosting

System Optimization

Replica-level load balancing for near-linear scaling

Precompiled DeepSpeed-Kernels to reduce install/compile friction

Inference Optimization

Dynamic SplitFuse continuous batching (split long prompts, fuse short prompts)

Blocked/non-contiguous KV cache to reduce memory fragmentation

Optimized CUDA kernels (FlashAttention-derived)

Per-token forward sizing to keep GPU in throughput-saturated region
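To make the blocked KV cache idea concrete, here is a toy block-table allocator. It is illustrative only and does not reflect DeepSpeed's actual data structures: the class name, method names, and block size are assumptions. The point is that a sequence's cache grows one fixed-size block at a time, and those blocks need not be adjacent in memory.

```python
class BlockedKVAllocator:
    """Toy block-table allocator (hypothetical; not a DeepSpeed class)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused blocks
        self.tables = {}                     # sequence id -> list of block ids
        self.lengths = {}                    # sequence id -> tokens stored

    def append_tokens(self, seq_id: str, n: int) -> list:
        """Reserve room for n more tokens and return the block table."""
        table = self.tables.get(seq_id, [])
        length = self.lengths.get(seq_id, 0) + n
        # Ceil division gives the total blocks the sequence now needs.
        needed = -(-length // self.block_size) - len(table)
        if needed > len(self.free):
            raise MemoryError("KV cache exhausted")
        for _ in range(needed):
            table.append(self.free.pop())
        self.tables[seq_id] = table
        self.lengths[seq_id] = length
        return table

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are handed out and returned individually, a finished sequence frees memory that any other sequence can reuse immediately, without needing one large contiguous region per request.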

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

DeepSpeed-MII GitHub landing page (referenced in paper)

DeepSpeed GitHub (project referenced in paper)

Risks & Boundaries

Limitations

Alpha release with limited model family support at time of paper (LLaMA, Mistral, OPT).

Benchmarks use synthetic prompt/generation distributions (normal with set means and 30% variance).
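A synthetic workload of this shape is easy to reproduce when checking whether the reported gains carry over to your own traffic. The sketch below is a guess at the setup, not the paper's generator: it reads "30% variance" as a standard deviation of 0.3 times the mean, and the function name and mean values passed in are placeholders you would replace.

```python
import random


def sample_lengths(n, prompt_mean, gen_mean, rel_spread=0.3, seed=0):
    """Draw n (prompt_len, gen_len) pairs from independent normal distributions.

    ASSUMPTION: "30% variance" is read as standard deviation = 0.3 * mean.
    Lengths are rounded and clamped to at least one token.
    """
    rng = random.Random(seed)  # seeded so benchmark runs are repeatable
    pairs = []
    for _ in range(n):
        p = max(1, round(rng.gauss(prompt_mean, rel_spread * prompt_mean)))
        g = max(1, round(rng.gauss(gen_mean, rel_spread * gen_mean)))
        pairs.append((p, g))
    return pairs
```

Replaying logged production lengths alongside such samples shows how sensitive the results are to the distribution assumption.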

When Not To Use

Small CPU-only setups or non-NVIDIA hardware where DeepSpeed kernels don't apply.

Workloads dominated by tiny single-token prompts where batching gains are minimal.

Failure Modes

Uneven or adversarial prompt distributions may reduce the benefit of fixed target forward sizes.

Scheduling overheads could hurt very low-latency, single-request scenarios.

Core Entities

Models

LLaMA, LLaMA-2, Mistral, OPT

Metrics

Effective throughput (queries/s)

Latency (P50/P90/P95)

Per-token generation latency

Throughput-latency curve

Benchmarks

Throughput-latency curves

Effective throughput (chat SLA: prompt 512 tokens/s, gen EMA 2/4/6 tokens/s)

Context Entities

Models

GPT-4 (citation)

MPT-StoryWriter (context)

