Overview
The dataset and benchmark are practical and already used in demos and an industry prototype, but the results derive from a single regional provider and a replay-based methodology.
Citations: 7
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/7
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 40%
Why It Matters For Business
Using real LLM traces uncovers burst-driven failures and KV-cache pressure that synthetic tests miss, letting teams fix reliability and reduce wasted GPU costs before production.
Summary TLDR
BurstGPT is an open dataset of 10.31 million real LLM service traces collected from an Azure OpenAI regional provider over 213 days. It captures request concurrency, conversation structure, request/response token lengths, and service failures for both the API and conversational services of ChatGPT and GPT-4. The authors provide BurstGPT-Perf, a lightweight benchmark and workload generator that replays observed workloads or models their burstiness (Gamma) and token-length distributions (Zipf). Demo evaluations show that real LLM workloads are far burstier than common synthetic traces, that they expose KV-cache pressure and higher failure rates (>5% for some services), and that they change which scheduling and disaggregation choices are optimal.
Problem Statement
Most LLM serving research uses synthetic or non-LLM traces that fail to reflect real request bursts, conversation patterns, variable response lengths, and failure behaviors. This mismatch can hide bottlenecks (KV cache, scheduling, disaggregation) and lead to poor performance when systems are deployed.
Main Contribution
BurstGPT dataset: 10.31M traces from Azure OpenAI regional GPT services over 213 days, with request times, token lengths, model and service type, and failures
BurstGPT-Perf: open-source workload generator and benchmark to replay scaled or modeled (Gamma/Zipf) LLM workloads
Key Findings
Real LLM traces are highly bursty and differ from common cloud/function workloads.
Dataset size and composition allow realistic evaluation across models and interfaces.
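One common way to quantify "bursty" is the coefficient of variation (CV) of inter-arrival times: a Poisson process gives CV near 1, and a Gamma process with shape below 1 gives CV above 1. The sketch below is illustrative only. The function names and the synthetic generator are assumptions, and the metric the BurstGPT authors use is not stated in this summary.

```python
import random
import statistics


def interarrival_cv(timestamps):
    """Coefficient of variation (stdev / mean) of gaps between sorted arrival times."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three timestamps")
    mean_gap = statistics.fmean(gaps)
    if mean_gap == 0:
        raise ValueError("all timestamps are identical")
    return statistics.pstdev(gaps) / mean_gap


def synthetic_arrivals(n, shape, mean_gap, seed=0):
    """Hypothetical helper: arrival times with Gamma-distributed gaps.

    shape < 1 produces clustered (bursty) arrivals; shape == 1 is Poisson.
    """
    rng = random.Random(seed)
    scale = mean_gap / shape  # Gamma mean = shape * scale
    t, out = 0.0, []
    for _ in range(n):
        t += rng.gammavariate(shape, scale)
        out.append(t)
    return out


if __name__ == "__main__":
    bursty = synthetic_arrivals(20000, shape=0.25, mean_gap=1.0)
    poisson = synthetic_arrivals(20000, shape=1.0, mean_gap=1.0)
    print("bursty CV:", round(interarrival_cv(bursty), 2))
    print("poisson CV:", round(interarrival_cv(poisson), 2))
```

Applying `interarrival_cv` to the request timestamps of your own trace and of a BurstGPT window gives a quick, comparable burstiness number before any replay is run.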
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| dataset size (total traces) | 10.31M traces over 213 days | — | — | BurstGPT (Azure regional provider) | Abstract; Sec 3 | — |
| trace breakdown (by model & interface) | 8.69M ChatGPT API; 0.95M GPT-4 API; 0.30M ChatGPT conv; 0.16M GPT-4 conv | — | — | BurstGPT | Sec 3.1; Abstract | — |
What To Try In 7 Days
Replay a scaled BurstGPT period against your serving stack to find KV-cache and scheduling hotspots
Use modeled scaling (Gamma for arrivals, Zipf for token lengths) to emulate burstiness at your size
Run simple XGBoost predictors on request-count and mean-token features for 10-min autoscaling decisions
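The modeled-scaling step above can be sketched with the standard library alone: Gamma-distributed inter-arrival gaps for burstiness and a bounded Zipf distribution for token lengths. This is a minimal sketch, not the BurstGPT-Perf implementation. The function names and every default parameter value (`gamma_shape`, `zipf_exponent`, `max_tokens`) are placeholders, not values fitted by the authors.

```python
import random
from itertools import accumulate


def zipf_weights(max_tokens, exponent):
    """Unnormalized bounded-Zipf weights: P(k) proportional to 1 / k**exponent."""
    return [1.0 / (k ** exponent) for k in range(1, max_tokens + 1)]


def generate_workload(n_requests, rate_per_s, gamma_shape=0.5,
                      zipf_exponent=1.1, max_tokens=4096, seed=0):
    """Return a list of (arrival_time_s, token_length) tuples.

    All defaults are illustrative assumptions, not fitted BurstGPT values.
    """
    rng = random.Random(seed)
    # Choose the Gamma scale so that the mean gap is 1 / rate_per_s.
    scale = 1.0 / (rate_per_s * gamma_shape)
    cum = list(accumulate(zipf_weights(max_tokens, zipf_exponent)))
    lengths = rng.choices(range(1, max_tokens + 1), cum_weights=cum, k=n_requests)
    t = 0.0
    workload = []
    for tokens in lengths:
        t += rng.gammavariate(gamma_shape, scale)
        workload.append((t, tokens))
    return workload


if __name__ == "__main__":
    for arrival, tokens in generate_workload(5, rate_per_s=10.0):
        print(f"{arrival:8.3f}s  {tokens} tokens")
```

Changing `rate_per_s` while holding `gamma_shape` fixed rescales the request rate without changing the relative burstiness, which is the point of modeling arrivals rather than replaying them verbatim.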
Risks & Boundaries
Limitations
Traces come from a single regional Azure OpenAI provider; usage patterns may differ across regions or customers
The cleaned trace excludes failure logs, so some failure analysis requires the raw trace
When Not To Use
When you need model-quality benchmarks (accuracy, reasoning) rather than serving workload traces
When your deployment is non-Azure or serves a very different user population and you cannot validate against local traces
Failure Modes
KV cache memory bottlenecks during bursts causing request failures
Scheduling optimizations tuned on one service type (conversation) may degrade performance for another (API)

