Overview
Production Readiness
0.7
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
7
Why It Matters For Business
Using real LLM traces uncovers burst-driven failures and KV-cache pressure that synthetic tests miss, letting teams fix reliability and reduce wasted GPU costs before production.
Summary TLDR
BurstGPT is an open dataset of 10.31 million real LLM service traces collected from an Azure OpenAI regional provider over 213 days. It captures request concurrency, conversation structure, request/response token lengths, and service failures for ChatGPT and GPT-4 APIs and conversational services. The authors provide BurstGPT-Perf, a lightweight benchmark and workload generator that replays or models observed burstiness (Gamma) and token-length distributions (Zipf). Demo evaluations show that real LLM workloads are far burstier than common synthetic traces, expose KV-cache pressure and higher failure rates (>5% for some services), and change which scheduling and disaggregation choices are optimal.
Problem Statement
Most LLM serving research uses synthetic or non-LLM traces that fail to reflect real request bursts, conversation patterns, variable response lengths, and failure behaviors. This mismatch can hide bottlenecks (KV cache, scheduling, disaggregation) and lead to poor performance when systems are deployed.
Main Contribution
BurstGPT dataset: 10.31M traces from Azure OpenAI regional GPT services over 213 days, with request times, token lengths, model and service type, and failures
BurstGPT-Perf: open-source workload generator and benchmark to replay scaled or modeled (Gamma/Zipf) LLM workloads
Analysis and demos showing burstiness, conversation statistics, response-length correlations, failure patterns, and concrete impacts on scheduling, KV-cache, and PD disaggregation
Key Findings
Real LLM traces are highly bursty and differ from common cloud/function workloads.
Dataset size and composition allow realistic evaluation across models and interfaces.
Conversation lengths are short and skewed: most conversations are small.
Failure rates are non-trivial and linked to burstiness and KV cache pressure.
Workload forecasting is practical and improves provisioning accuracy at coarser time scales.
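To make the time-scale point concrete, the sketch below buckets a trace into fixed windows and computes a request count and a mean token count per window, the kind of features a forecaster could consume. It is a minimal illustration and not code from BurstGPT-Perf: the tuple layout `(timestamp_s, request_tokens, response_tokens)`, the function name `window_features`, and the 600-second default are all assumptions.

```python
from collections import defaultdict


def window_features(trace, window_s=600):
    """Aggregate a trace into per-window (start_s, request_count, mean_tokens) rows.

    trace: iterable of (timestamp_s, request_tokens, response_tokens) tuples
           (hypothetical layout, not the dataset's actual schema).
    window_s: window width in seconds; larger values give coarser granularity.
    """
    buckets = defaultdict(list)
    for ts, req_tok, resp_tok in trace:
        buckets[int(ts // window_s)].append(req_tok + resp_tok)
    if not buckets:
        return []

    rows = []
    # Emit empty windows too, so idle periods stay visible to a forecaster.
    for idx in range(min(buckets), max(buckets) + 1):
        tokens = buckets.get(idx, [])
        count = len(tokens)
        mean_tokens = sum(tokens) / count if count else 0.0
        rows.append((idx * window_s, count, mean_tokens))
    return rows
```

Rerunning the same trace with a larger `window_s` smooths out short bursts, which is one plausible reason coarser windows are easier to predict.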
Results
dataset size (total traces)
trace breakdown (by model & interface)
comparison of average RPS
conversation length distribution
service failure rates
Accuracy
prediction granularity effect
Who Should Care
What To Try In 7 Days
Replay a scaled BurstGPT period against your serving stack to find KV-cache and scheduling hotspots
Use modeled scaling (Gamma for arrivals, Zipf for token lengths) to emulate burstiness at your size
Run simple XGBoost predictors on request-count and mean-token features for 10-min autoscaling decisions
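A modeled workload of the kind described above can be sketched with the Python standard library alone. This is an illustrative sketch, not the BurstGPT-Perf generator: the function names, the truncation of the Zipf law at `max_tokens`, and every parameter value a caller passes in are assumptions, and the fitted values should come from the dataset itself.

```python
import random
from itertools import accumulate


def gamma_arrivals(n, rate_rps, shape, rng):
    """Return n arrival timestamps (seconds) with Gamma-distributed gaps.

    The mean gap is held at 1 / rate_rps. A shape below 1 gives bursty
    traffic (coefficient of variation 1 / sqrt(shape) > 1); shape == 1
    reduces to Poisson arrivals.
    """
    scale = 1.0 / (rate_rps * shape)  # Gamma mean = shape * scale
    gaps = [rng.gammavariate(shape, scale) for _ in range(n)]
    return list(accumulate(gaps))


def zipf_lengths(n, exponent, max_tokens, rng):
    """Sample n token lengths from a Zipf law truncated to 1..max_tokens."""
    support = range(1, max_tokens + 1)
    weights = [k ** -exponent for k in support]
    return rng.choices(support, weights=weights, k=n)


def modeled_workload(n, rate_rps, shape, exponent, max_tokens, seed=0):
    """Return n (arrival_s, request_tokens, response_tokens) tuples."""
    rng = random.Random(seed)
    times = gamma_arrivals(n, rate_rps, shape, rng)
    # Assumption: request and response lengths are drawn independently,
    # which ignores any correlation present in the real traces.
    req = zipf_lengths(n, exponent, max_tokens, rng)
    resp = zipf_lengths(n, exponent, max_tokens, rng)
    return list(zip(times, req, resp))
```

Changing `rate_rps` rescales the load to a given deployment size, while lowering `shape` makes the same average rate arrive in sharper bursts.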
Optimization Features
Infra Optimization
- GPU utilization tuning
- instance-level scaling via RPS or modeled parameters
System Optimization
- workload-aware scheduling
- dynamic PD ratio
- workload provisioning and autoscaling
Inference Optimization
- prefill-decode (PD) disaggregation
- KV cache management
- request scheduling (FCFS/SRF/LRF)
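The three scheduling policies named above differ only in how the pending queue is ordered. The sketch below shows that difference under one loud assumption: the summary does not expand the acronyms, so SRF and LRF are read here as shortest-response-first and longest-response-first. The `Request` fields and the existence of a response-length estimate are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    req_id: str
    arrival_s: float
    expected_response_tokens: int  # hypothetical estimate, not a dataset field


def order_queue(pending, policy):
    """Return pending requests in the order the given policy would serve them."""
    if policy == "FCFS":
        # First come, first served: order by arrival time only.
        key = lambda r: r.arrival_s
    elif policy == "SRF":
        # Assumed meaning: shortest expected response first, ties by arrival.
        key = lambda r: (r.expected_response_tokens, r.arrival_s)
    elif policy == "LRF":
        # Assumed meaning: longest expected response first, ties by arrival.
        key = lambda r: (-r.expected_response_tokens, r.arrival_s)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return sorted(pending, key=key)
```

Replaying the same burst through each ordering is a cheap way to see which policy a given service type favors before touching a real scheduler.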
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Traces come from a single regional Azure OpenAI provider; usage patterns may differ across regions or customers
- Cleaned trace excludes failure logs; some failure analysis requires the raw trace
- Closed-source model internals are not exposed; analysis relies on observable tokens, latencies and failures
- Modeled scaling (Gamma/Zipf) approximates but may not capture all real-world correlations
When Not To Use
- When you need model-quality benchmarks (accuracy, reasoning) rather than serving workload traces
- For non-Azure deployments or markedly different user populations, unless the traces are first validated against local ones
- When strict private data constraints block using even anonymized production logs
Failure Modes
- KV cache memory bottlenecks during bursts causing request failures
- Scheduling optimizations tuned on one service type (conversation) may degrade performance for another (API)
- Short-term parameter shocks in burstiness (alpha changes) can cause sudden reliability drops
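A back-of-the-envelope estimate shows why bursts translate into KV-cache pressure. The sketch assumes standard multi-head attention with 16-bit cache entries and uses the public Llama-2 7B and 13B shapes (32 layers with hidden size 4096, and 40 layers with hidden size 5120). The memory budget passed to `max_cached_tokens` is a hypothetical input, and real serving engines add paging and fragmentation overheads that this ignores.

```python
def kv_bytes_per_token(num_layers, hidden_size, bytes_per_value=2):
    """Bytes of KV cache one token occupies under multi-head attention.

    Each layer stores one key vector and one value vector of hidden_size
    entries per token, hence the leading factor of 2.
    """
    return 2 * num_layers * hidden_size * bytes_per_value


def max_cached_tokens(kv_budget_bytes, num_layers, hidden_size, bytes_per_value=2):
    """Upper bound on tokens that fit in a given KV-cache memory budget."""
    per_token = kv_bytes_per_token(num_layers, hidden_size, bytes_per_value)
    return kv_budget_bytes // per_token


# Public model shapes: (num_layers, hidden_size).
LLAMA2_SHAPES = {
    "Llama-2-7b-chat": (32, 4096),    # 524,288 bytes per token at fp16
    "Llama-2-13b-chat": (40, 5120),   # 819,200 bytes per token at fp16
}
```

Dividing the token bound by the typical context length per request gives a rough ceiling on concurrent requests, which is the number a burst has to exceed before requests start failing or queueing.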
Core Entities
Models
- ChatGPT
- GPT-4
- Llama-2-13b-chat
- Llama-2-7b-chat
Metrics
- request failure rate
- token latency
- latency jitter (stddev)
- throughput
- NMAE
- NMSE
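The two forecasting error metrics can be computed as below. The summary does not say which normalization is used, so this sketch assumes one common convention: absolute error divided by the sum of absolute observations for NMAE, and squared error divided by the sum of squared observations for NMSE.

```python
def nmae(actual, predicted):
    """Normalized mean absolute error: sum|y - y_hat| / sum|y| (assumed convention)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    denom = sum(abs(y) for y in actual)
    if denom == 0:
        raise ValueError("NMAE is undefined when all observations are zero")
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / denom


def nmse(actual, predicted):
    """Normalized mean squared error: sum(y - y_hat)^2 / sum(y^2) (assumed convention)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    denom = sum(y * y for y in actual)
    if denom == 0:
        raise ValueError("NMSE is undefined when all observations are zero")
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / denom
```

Both values are scale-free, so errors on request counts and on token counts can be compared across window sizes.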
Datasets
- BurstGPT
- BurstGPT-Perf
- ShareGPT
Benchmarks
- BurstGPT-Perf

