BurstGPT: 10.3M real-world LLM traces (Azure) to test and tune serving systems

January 31, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The dataset and benchmark are practical and already used in demos and an industry prototype, but the results derive from a single regional provider and a replay-based methodology.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/7

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 40%

Authors

Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, Xiaowen Chu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Using real LLM traces uncovers burst-driven failures and KV-cache pressure that synthetic tests miss, letting teams fix reliability and reduce wasted GPU costs before production.

Summary TLDR

BurstGPT is an open dataset of 10.31 million real LLM service traces collected from a regional Azure OpenAI provider over 213 days. It captures request concurrency, conversation structure, request/response token lengths, and service failures for both the API and conversational services of ChatGPT and GPT-4. The authors provide BurstGPT-Perf, a lightweight benchmark and workload generator that replays observed traces or models their burstiness (Gamma) and token-length distributions (Zipf). Demo evaluations show that real LLM workloads are far burstier than common synthetic traces, expose KV-cache pressure and higher failure rates (>5% for some services), and change which scheduling and disaggregation choices are optimal.

Problem Statement

Most LLM serving research uses synthetic or non-LLM traces that fail to reflect real request bursts, conversation patterns, variable response lengths, and failure behaviors. This mismatch can hide bottlenecks (KV cache, scheduling, disaggregation) and lead to poor performance when systems are deployed.

Main Contribution

BurstGPT dataset: 10.31M traces from Azure OpenAI regional GPT services over 213 days, with request times, token lengths, model and service type, and failures

BurstGPT-Perf: open-source workload generator and benchmark to replay scaled or modeled (Gamma/Zipf) LLM workloads
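As a rough illustration of the modeled (Gamma/Zipf) mode, the sketch below draws inter-arrival gaps from a Gamma distribution and token lengths from a bounded Zipf-like distribution. It is not BurstGPT-Perf's code: the function names, the shape and exponent values, and the 2048-token cap are assumptions chosen for illustration, and only the 0.21 mean RPS comes from the ChatGPT API figure reported in this summary.

```python
import random
from itertools import accumulate


def gamma_arrival_times(n, mean_rps, shape, rng):
    """Return n arrival timestamps (seconds) with Gamma-distributed gaps.

    shape < 1 is burstier than Poisson; shape == 1 reduces to Poisson arrivals.
    The scale is chosen so the mean gap stays 1 / mean_rps.
    """
    scale = 1.0 / (mean_rps * shape)  # Gamma mean = shape * scale
    gaps = [rng.gammavariate(shape, scale) for _ in range(n)]
    return list(accumulate(gaps))


def zipf_token_lengths(n, max_tokens, exponent, rng):
    """Sample n lengths in [1, max_tokens] with P(k) proportional to k ** -exponent."""
    support = list(range(1, max_tokens + 1))
    weights = [k ** -exponent for k in support]
    return rng.choices(support, weights=weights, k=n)


def synth_workload(n, mean_rps=0.21, shape=0.5, max_tokens=2048,
                   exponent=1.1, seed=0):
    """Build n synthetic (arrival_time, request_tokens, response_tokens) rows.

    shape, exponent and max_tokens are illustrative assumptions, not values
    fitted to BurstGPT.
    """
    rng = random.Random(seed)
    times = gamma_arrival_times(n, mean_rps, shape, rng)
    req = zipf_token_lengths(n, max_tokens, exponent, rng)
    resp = zipf_token_lengths(n, max_tokens, exponent, rng)
    return list(zip(times, req, resp))
```

Lowering `shape` while holding `mean_rps` fixed keeps the average load constant but concentrates requests into tighter bursts, which is the knob that distinguishes this from a uniform-RPS test.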

Key Findings

Real LLM traces are highly bursty and differ from common cloud/function workloads.

Numbers: Mean RPS: MAF 1.64 vs LLM conv 0.019, LLM API 0.21 (ChatGPT)

Practical Use: Do not rely on non-LLM workloads or simple uniform RPS when testing LLM serving; use bursty traces to reveal real bottlenecks

Evidence Ref: Sec 2.1, Table 1
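Mean RPS alone does not show how bursty a trace is. One common proxy, which may differ from the measure used in the paper, is the coefficient of variation (CV) of inter-arrival gaps: roughly 0 for evenly spaced requests, about 1 for Poisson arrivals, and above 1 for bursty traffic. The sketch below uses made-up timestamps, and the helper name is hypothetical.

```python
import statistics


def interarrival_cv(timestamps):
    """Coefficient of variation (stddev / mean) of gaps between sorted arrivals."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three timestamps")
    mean_gap = statistics.fmean(gaps)
    if mean_gap == 0:
        raise ValueError("all timestamps are identical")
    return statistics.pstdev(gaps) / mean_gap


# Illustrative timestamps (seconds), not taken from BurstGPT.
uniform = [i * 1.0 for i in range(12)]
bursty = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3, 20.0, 20.1, 20.2, 20.3]

cv_uniform = interarrival_cv(uniform)  # evenly spaced, so the CV is 0.0
cv_bursty = interarrival_cv(bursty)    # clustered arrivals push the CV above 1
```

Computing this on your own production logs and on the trace you test with is a quick way to check whether the test traffic is as bursty as the real thing.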

Dataset size and composition allow realistic evaluation across models and interfaces.

Numbers: 10.31M traces over 213 days: 8.69M ChatGPT API, 0.95M GPT-4 API, 0.30M ChatGPT conv, 0.16M GPT-4 conv

Practical Use: Use the dataset subsets to stress-test both API and conversational flows rather than synthetic samples

Evidence Ref: Abstract; Sec 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
dataset size (total traces) | 10.31M traces over 213 days | - | - | BurstGPT (Azure regional provider) | - | Abstract; Sec 3
trace breakdown (by model & interface) | 8.69M ChatGPT API; 0.95M GPT-4 API; 0.30M ChatGPT conv; 0.16M GPT-4 conv | - | - | BurstGPT | - | Sec 3.1; Abstract

What To Try In 7 Days

Replay a scaled BurstGPT period against your serving stack to find KV-cache and scheduling hotspots

Use modeled scaling (Gamma for arrivals, Zipf for token lengths) to emulate burstiness at your size

Run simple XGBoost predictors on request-count and mean-token features for 10-min autoscaling decisions
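The last item can be prototyped in two steps: aggregate the trace into 10-minute windows, then train a regressor on lagged window features. The sketch below covers only the feature step in plain Python. The `(timestamp, request_tokens, response_tokens)` row layout, the function names, and the choice of three lag windows are assumptions, not the paper's pipeline; the resulting `X` and `y` could then be passed to an XGBoost regressor or any other model.

```python
from collections import defaultdict

WINDOW_SECONDS = 600  # 10-minute buckets


def window_features(trace, window=WINDOW_SECONDS):
    """Aggregate (timestamp, request_tokens, response_tokens) rows per window.

    Returns (window_start, request_count, mean_total_tokens) tuples sorted by
    window_start. Windows with no requests are omitted.
    """
    buckets = defaultdict(list)
    for ts, req_tok, resp_tok in trace:
        start = int(ts // window) * window
        buckets[start].append(req_tok + resp_tok)
    return [
        (start, len(toks), sum(toks) / len(toks))
        for start, toks in sorted(buckets.items())
    ]


def lagged_pairs(features, lags=3):
    """Build (X, y): X holds the previous `lags` windows, y the next request count.

    Assumes the windows are contiguous; fill empty windows with zeros first
    if the trace has idle periods.
    """
    X, y = [], []
    for i in range(lags, len(features)):
        row = []
        for _, count, mean_tok in features[i - lags:i]:
            row.extend([count, mean_tok])
        X.append(row)
        y.append(features[i][1])
    return X, y
```

Predicting the next window's request count from the previous few windows is the simplest framing; predicting mean tokens as a second target follows the same pattern.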

Optimization Features

Infra Optimization
GPU utilization tuning; instance-level scaling via RPS or modeled parameters
System Optimization
workload-aware scheduling; dynamic PD ratio; workload provisioning and autoscaling
Inference Optimization
prefill-decode (PD) disaggregation; KV cache management; request scheduling (FCFS/SRF/LRF)

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Traces come from a single regional Azure OpenAI provider; usage patterns may differ across regions or customers

Cleaned trace excludes failure logs; some failure analysis requires the raw trace

When Not To Use

When you need model-quality benchmarks (accuracy, reasoning) rather than serving workload traces

When your deployment is not on Azure or serves a very different user population, unless you first validate against local traces

Failure Modes

KV cache memory bottlenecks during bursts causing request failures

Scheduling optimizations tuned on one service type (conversation) may degrade performance for another (API)
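To see why bursts of long requests can exhaust the KV cache, a back-of-the-envelope estimate helps. The sketch below uses the standard per-token KV size for a decoder-only transformer (keys plus values, for every layer) with Llama-2-13B's published shape of 40 layers, 40 attention heads, and head dimension 128 in fp16. The 10 GiB cache budget and the helper names are hypothetical and not taken from the paper.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(free_bytes, per_token_bytes):
    """How many tokens fit in the cache across all in-flight requests."""
    return free_bytes // per_token_bytes


# Llama-2-13B shape: 40 layers, 40 heads (no grouped-query attention),
# head dimension 128, fp16 values (2 bytes each).
per_token = kv_bytes_per_token(num_layers=40, num_kv_heads=40, head_dim=128)

# Hypothetical budget: 10 GiB of GPU memory left for the KV cache.
budget = 10 * 1024 ** 3
capacity = max_cached_tokens(budget, per_token)  # 13107 tokens in total
```

At roughly 0.78 MiB per token, about 13 thousand cached tokens is only a handful of concurrent long conversations, so a burst can push the server into preemption or request failures well before compute is saturated.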

Core Entities

Models

ChatGPT, GPT-4, Llama-2-13b-chat, Llama-2-7b-chat

Metrics

request failure rate, token latency, latency jitter (stddev), throughput, NMAE, NMSE

Datasets

BurstGPT, BurstGPT-Perf, ShareGPT

Benchmarks

BurstGPT-Perf