Overview
Strong systems engineering and measured speedups across several GPU types (A100, H100, A6000) make FastGen practically useful today, but the results come from an alpha release and synthetic workload distributions.
Citations: 5
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 65%
Production readiness: 80%
Novelty: 45%
Why It Matters For Business
FastGen can double effective throughput and sharply cut tail latency on long-prompt, streaming chat workloads, which lowers GPU cost per successful request and improves user-perceived responsiveness.
Who Should Care
Summary TLDR
DeepSpeed-FastGen is a serving system that combines a new scheduling strategy, Dynamic SplitFuse, with DeepSpeed-MII and DeepSpeed-Inference. It splits long prompts across forward passes and fuses short prompts together so that each GPU forward pass processes a consistent number of tokens, which improves utilization and responsiveness. Reported gains versus vLLM: up to 2.3× effective throughput, about 2× lower average latency, and up to 3.7× lower P95 token latency on LLaMA-family models across A100, H100, and A6000 hardware. The alpha release supports LLaMA/LLaMA-2, Mistral, and OPT, and includes both short-lived and persistent deployment modes.
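The two deployment modes mentioned above can be sketched with the DeepSpeed-MII Python API. This is a minimal sketch, assuming `deepspeed-mii` is installed and a supported NVIDIA GPU is available; the wrapper function names, the model name in the usage comment, and the `max_new_tokens` default are illustrative choices, not values taken from the paper.

```python
def run_short_lived(model_name, prompts, max_new_tokens=128):
    """Non-persistent mode: the model lives only as long as this process."""
    import mii  # imported lazily; needs deepspeed-mii and a supported GPU

    pipe = mii.pipeline(model_name)
    return pipe(prompts, max_new_tokens=max_new_tokens)


def run_persistent(model_name, prompts, max_new_tokens=128):
    """Persistent mode: start a server, query it through a client, shut it down."""
    import mii  # imported lazily; needs deepspeed-mii and a supported GPU

    client = mii.serve(model_name)
    try:
        return client.generate(prompts, max_new_tokens=max_new_tokens)
    finally:
        # Always release the GPU, even if generation raises.
        client.terminate_server()


# Example usage (hypothetical model choice):
# print(run_short_lived("mistralai/Mistral-7B-v0.1", ["DeepSpeed is", "Seattle is"]))
```

In a real persistent deployment the server would normally stay up between requests; it is terminated here only so the sketch cleans up after itself.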
Problem Statement
Existing LLM serving stacks stall or preempt generation when handling long prompts, which hurts throughput and raises tail latency. Systems either do large prompt-only forwards or interrupt generation to process prompts, causing inconsistent latency and poor utilization. The paper proposes a scheduling/composition fix to avoid these stalls.
Main Contribution
Dynamic SplitFuse: a token composition policy that splits long prompts and fuses short prompts so forward passes keep a target token size.
A deployable implementation (currently an alpha release) combining DeepSpeed-MII and DeepSpeed-Inference with tuned kernels and blocked (non-contiguous) KV cache support.
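The token composition idea can be illustrated with a toy scheduler. This is a simplified sketch of the split-and-fuse policy as described in this summary, not the DeepSpeed-FastGen implementation: the `Request` type, `compose_forward_pass`, and the `token_budget` parameter are hypothetical names, and the sketch ignores KV cache management, request completion, and SLA handling.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_remaining: int  # prompt tokens not yet processed


def compose_forward_pass(decoding, waiting, token_budget):
    """Build one forward pass as a list of (req_id, n_tokens) pairs.

    decoding: list of requests already generating (one token each per pass)
    waiting:  deque of requests whose prompts are not fully processed
    """
    batch = []
    budget = token_budget

    # 1) Ongoing generations are never stalled: each gets one token slot.
    for req in decoding:
        if budget == 0:
            break
        batch.append((req.req_id, 1))
        budget -= 1

    # 2) Fill the remaining budget with prompt tokens. Long prompts are
    #    split across passes; short prompts are fused into the same pass.
    while budget > 0 and waiting:
        req = waiting[0]
        chunk = min(req.prompt_remaining, budget)
        batch.append((req.req_id, chunk))
        req.prompt_remaining -= chunk
        budget -= chunk
        if req.prompt_remaining == 0:
            waiting.popleft()
            decoding.append(req)  # starts generating from the next pass

    return batch


decoding = [Request("a", 0)]
waiting = deque([Request("long", 12), Request("s1", 3), Request("s2", 2)])
pass1 = compose_forward_pass(decoding, waiting, token_budget=8)
pass2 = compose_forward_pass(decoding, waiting, token_budget=8)
print(pass1)  # [('a', 1), ('long', 7)]
print(pass2)  # [('a', 1), ('long', 5), ('s1', 2)]
```

With a budget of 8 tokens, the 12-token prompt is split over two passes, and the leftover budget in the second pass is used to start the next short prompt, so both passes run at the full target size while request `a` keeps generating.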
Key Findings
Effective throughput improved up to 2.3× versus vLLM under chat-style SLAs.
Average latency reduced by about 2× versus vLLM (Llama-2 70B on 4 A100s).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Effective throughput (chat SLA) | up to 2.3× | vLLM | 2.3× | Llama-2 7B/13B/70B across A100/H100/A6000 | Plotted in Figure 6 | Section 4.3, Figure 6 |
| Average latency | ≈2× lower | vLLM | ≈2× | Llama-2 70B (4 A100s) | 1.36 rps vs 0.67 rps at the same latency | Section 4.2 |
What To Try In 7 Days
Install DeepSpeed-MII with pip (`pip install deepspeed-mii`) and run the provided pipeline example.
Compare effective throughput vs. your current vLLM/TGI setup on a representative prompt distribution.
Test persistent deployment with the built-in gRPC server and load balancer on 1–4 nodes.
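For the throughput comparison, one way to compute an effective-throughput figure from your own request logs is sketched below. This is a sketch under stated assumptions: effective throughput is taken to mean requests per second that meet an SLA, and the SLA itself (a first-token deadline plus a minimum generation rate), the `RequestLog` fields, and the threshold values are hypothetical, not the paper's exact definition.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    first_token_s: float  # seconds from arrival to first generated token
    gen_tokens: int       # number of tokens generated
    gen_time_s: float     # seconds spent generating those tokens


def meets_sla(log, max_first_token_s, min_tokens_per_s):
    """True if the request met both (assumed) SLA conditions."""
    if log.gen_time_s > 0:
        rate = log.gen_tokens / log.gen_time_s
    else:
        rate = float("inf")
    return log.first_token_s <= max_first_token_s and rate >= min_tokens_per_s


def effective_throughput(logs, wall_clock_s, max_first_token_s, min_tokens_per_s):
    """Requests per second that satisfied the SLA over the measurement window."""
    ok = sum(meets_sla(r, max_first_token_s, min_tokens_per_s) for r in logs)
    return ok / wall_clock_s


logs = [
    RequestLog(first_token_s=0.5, gen_tokens=100, gen_time_s=20.0),  # meets SLA
    RequestLog(first_token_s=3.0, gen_tokens=100, gen_time_s=20.0),  # slow first token
    RequestLog(first_token_s=0.4, gen_tokens=100, gen_time_s=50.0),  # slow generation
]
result = effective_throughput(
    logs, wall_clock_s=10.0, max_first_token_s=1.0, min_tokens_per_s=4.0
)
print(result)  # 0.1
```

Running the same calculation on logs from both serving stacks, with identical thresholds and the same prompt distribution, keeps the comparison like for like.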
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Alpha release with limited model family support at time of paper (LLaMA, Mistral, OPT).
Benchmarks use synthetic prompt/generation distributions (normal with set means and 30% variance).
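To check how sensitive results are to this choice, a similar synthetic workload can be generated and then replaced with lengths from real traffic. This is a minimal sketch: it interprets "30% variance" as a standard deviation equal to 30% of the mean, which is an assumption, and the mean lengths in the example are illustrative rather than the paper's settings.

```python
import random


def sample_lengths(n, mean, rel_std=0.30, seed=0, min_len=1):
    """Draw n token lengths from a normal distribution, clamped to min_len.

    rel_std is the standard deviation as a fraction of the mean
    (assumed reading of "30% variance").
    """
    rng = random.Random(seed)
    return [
        max(min_len, round(rng.gauss(mean, rel_std * mean)))
        for _ in range(n)
    ]


# Illustrative means for prompt and generation lengths (hypothetical values).
prompt_lengths = sample_lengths(1000, mean=1200, seed=0)
gen_lengths = sample_lengths(1000, mean=128, seed=1)

avg_prompt = sum(prompt_lengths) / len(prompt_lengths)
print(f"average prompt length: {avg_prompt:.0f} tokens")
```

Fixing the seed makes each run reproducible, so two serving stacks can be driven with an identical request stream.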
When Not To Use
Small CPU-only setups or non-NVIDIA hardware where DeepSpeed kernels don't apply.
Workloads dominated by tiny single-token prompts where batching gains are minimal.
Failure Modes
Uneven or adversarial prompt distributions may reduce the benefit of fixed target forward sizes.
Scheduling overheads could hurt very low-latency, single-request scenarios.

