BurstGPT: 10.3M real-world LLM traces (Azure) to test and tune serving systems

January 31, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

7

Authors

Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, Xiaowen Chu

Links

Abstract / PDF

Why It Matters For Business

Using real LLM traces uncovers burst-driven failures and KV-cache pressure that synthetic tests miss, letting teams fix reliability and reduce wasted GPU costs before production.

Summary TLDR

BurstGPT is an open dataset of 10.31 million real LLM service traces collected from an Azure OpenAI regional provider over 213 days. It captures request concurrency, conversation structure, request/response token lengths, and service failures for ChatGPT and GPT-4 APIs and conversational services. The authors provide BurstGPT-Perf, a lightweight benchmark and workload generator that replays or models observed burstiness (Gamma) and token-length distributions (Zipf). Demo evaluations show that real LLM workloads are far burstier than common synthetic traces, expose KV-cache pressure and higher failure rates (>5% for some services), and change which scheduling and disaggregation choices are best.

Problem Statement

Most LLM serving research uses synthetic or non-LLM traces that fail to reflect real request bursts, conversation patterns, variable response lengths, and failure behaviors. This mismatch can hide bottlenecks (KV cache, scheduling, disaggregation) and lead to poor performance when systems are deployed.

Main Contribution

BurstGPT dataset: 10.31M traces from Azure OpenAI regional GPT services over 213 days, with request times, token lengths, model and service type, and failures

BurstGPT-Perf: open-source workload generator and benchmark to replay scaled or modeled (Gamma/Zipf) LLM workloads

Analysis and demos showing burstiness, conversation statistics, response-length correlations, failure patterns, and concrete impacts on scheduling, KV-cache, and PD disaggregation
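The modeled workloads named above (Gamma for arrivals, Zipf for token lengths) can be sketched as follows. This is a minimal illustration using only the Python standard library, not BurstGPT-Perf's actual code; the function names, the truncated Zipf support, and every parameter value are assumptions chosen for the example.

```python
import random

def gamma_arrivals(n, rate, shape, rng):
    """Return n arrival timestamps (seconds) with Gamma-distributed gaps.

    The mean gap is 1/rate. With shape < 1 the gaps are more variable than
    exponential gaps (shape == 1), which produces burstier traffic.
    """
    scale = 1.0 / (rate * shape)  # Gamma mean = shape * scale = 1 / rate
    t = 0.0
    timestamps = []
    for _ in range(n):
        t += rng.gammavariate(shape, scale)
        timestamps.append(t)
    return timestamps

def zipf_lengths(n, max_tokens, exponent, rng):
    """Sample n token lengths in 1..max_tokens with P(k) proportional to k**-exponent."""
    support = list(range(1, max_tokens + 1))
    weights = [k ** -exponent for k in support]
    return rng.choices(support, weights=weights, k=n)

# Hypothetical parameters: 2 requests/s on average, bursty gaps (shape 0.5).
rng = random.Random(0)
arrivals = gamma_arrivals(2000, rate=2.0, shape=0.5, rng=rng)
lengths = zipf_lengths(2000, max_tokens=4096, exponent=1.1, rng=rng)
```

In practice the shape, rate, and exponent would be fitted to the trace period being emulated rather than picked by hand.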

Key Findings

Real LLM traces are highly bursty and differ from common cloud/function workloads.

Numbers: Mean RPS: MAF 1.64 vs LLM conv 0.019, LLM API 0.21 (ChatGPT)

Dataset size and composition allow realistic evaluation across models and interfaces.

Numbers: 10.31M traces over 213 days: 8.69M ChatGPT API, 0.95M GPT-4 API, 0.30M ChatGPT conv, 0.16M GPT-4 conv

Conversations are short and skewed: most contain only a few requests.

Numbers: 35% of conversations have one request; the median is 2; 75% have ≤ 4 requests

Failure rates are non-trivial and linked to burstiness and KV cache pressure.

Numbers: Conversation service failure rate for ChatGPT >5% on average; system-level spikes reported

Workload forecasting is practical and improves provisioning accuracy at coarser time scales.

Numbers: Request-count prediction NMSE: 0.50 (1 min) vs 0.21 (10 min); NMAE 0.66 (1 min) vs 0.32 (10 min)
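To make the error metrics and the bin-size comparison concrete, here is a small sketch. The summary does not state how NMAE and NMSE are normalized, so the definitions below are assumptions: NMAE divides absolute error by the sum of absolute actual values, and NMSE divides squared error by the variance term of the actual series (so always predicting the mean scores 1.0). The helper `rebin` is hypothetical and simply merges per-minute counts into coarser bins.

```python
def nmae(actual, predicted):
    """Normalized mean absolute error (assumed normalization: sum of |actual|)."""
    denom = sum(abs(a) for a in actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / denom

def nmse(actual, predicted):
    """Normalized mean squared error (assumed normalization: spread of actual around its mean)."""
    mean = sum(actual) / len(actual)
    denom = sum((a - mean) ** 2 for a in actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / denom

def rebin(counts, factor):
    """Merge consecutive bins, e.g. factor=10 turns 1-minute counts into 10-minute counts."""
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]

# Toy per-minute request counts (made up for illustration only).
per_minute = [1] * 20
per_ten_minutes = rebin(per_minute, 10)
```

Coarser bins average out short bursts, which is one plausible reason the 10-minute errors reported above are lower than the 1-minute ones.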

Results

dataset size (total traces)

Value: 10.31M traces over 213 days

trace breakdown (by model & interface)

Value: 8.69M ChatGPT API; 0.95M GPT-4 API; 0.30M ChatGPT conv; 0.16M GPT-4 conv

comparison of average RPS

Value: MAF 1.64 RPS vs LLM conv 0.019 RPS, LLM API 0.21 RPS (ChatGPT)

Baseline: MAF non-LLM workload

conversation length distribution

Value: 35% single-request conversations; median 2; 75% ≤ 4 requests

service failure rates

Value: ChatGPT conversation failure rate >5% on average

Baseline: regular cloud services (much lower)

Accuracy

Value: Request-count prediction: NMAE 0.73, NMSE 0.48; mean-token prediction: NMAE 0.72, NMSE 0.67

prediction granularity effect

Value: 1-min bins NMSE 0.50 vs 10-min bins NMSE 0.21 (request count)

What To Try In 7 Days

Replay a scaled BurstGPT period against your serving stack to find KV-cache and scheduling hotspots

Use modeled scaling (Gamma for arrivals, Zipf for token lengths) to emulate burstiness at your size

Run simple XGBoost predictors on request-count and mean-token features for 10-min autoscaling decisions
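For the first item, one simple way to "scale" a trace period to your own capacity is to stretch or compress its timestamps linearly until the mean request rate matches a target. The sketch below assumes that approach; it is not necessarily how BurstGPT-Perf scales traces, and `scale_to_rps` is a hypothetical helper. Linear rescaling keeps the relative shape of bursts while changing their absolute intensity.

```python
def scale_to_rps(timestamps, target_rps):
    """Linearly rescale arrival times so the mean rate equals target_rps.

    Mean rate is taken as (number of gaps) / (trace duration). The returned
    timestamps start at 0 and keep the original relative spacing.
    """
    ts = sorted(timestamps)
    if len(ts) < 2:
        raise ValueError("need at least two timestamps")
    start = ts[0]
    duration = ts[-1] - start
    if duration <= 0:
        raise ValueError("trace has zero duration")
    current_rps = (len(ts) - 1) / duration
    factor = current_rps / target_rps  # < 1 compresses, > 1 stretches
    return [(t - start) * factor for t in ts]

# Toy trace: one request every 10 s (0.1 RPS), rescaled to 1 RPS.
scaled = scale_to_rps([0, 10, 20, 30, 40], target_rps=1.0)
```

A replay client would then sleep until each scaled timestamp before sending the corresponding request with its recorded token lengths.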

Optimization Features

Infra Optimization

  • GPU utilization tuning
  • instance-level scaling via RPS or modeled parameters

System Optimization

  • workload-aware scheduling
  • dynamic PD ratio
  • workload provisioning and autoscaling

Inference Optimization

  • prefill-decode (PD) disaggregation
  • KV cache management
  • request scheduling (FCFS/SRF/LRF)
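As a rough illustration of why the scheduling policy matters, the sketch below orders a queue of waiting requests under three policies and compares mean waiting time on a single worker. It assumes SRF and LRF mean shortest-request-first and longest-request-first by token count, and that service time is proportional to tokens; both are simplifying assumptions, and the request fields and function names are hypothetical.

```python
def order_queue(requests, policy):
    """Return requests in the order a single worker would serve them."""
    if policy == "FCFS":
        return sorted(requests, key=lambda r: r["arrival"])
    if policy == "SRF":
        return sorted(requests, key=lambda r: r["tokens"])
    if policy == "LRF":
        return sorted(requests, key=lambda r: r["tokens"], reverse=True)
    raise ValueError(f"unknown policy: {policy}")

def mean_wait(ordered):
    """Mean time spent waiting before service, with service time = token count."""
    clock = 0
    total_wait = 0
    for r in ordered:
        total_wait += clock
        clock += r["tokens"]
    return total_wait / len(ordered)

# Three requests already queued; token counts are made up for illustration.
queue = [
    {"arrival": 0, "tokens": 300},
    {"arrival": 1, "tokens": 100},
    {"arrival": 2, "tokens": 200},
]
waits = {p: mean_wait(order_queue(queue, p)) for p in ("FCFS", "SRF", "LRF")}
```

In this toy queue, serving short requests first gives the lowest mean wait and serving long requests first the highest. Real serving stacks batch requests and do not know response lengths in advance, so the ranking on real traces can differ.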

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Traces come from a single regional Azure OpenAI provider; usage patterns may differ across regions or customers
  • Cleaned trace excludes failure logs; some failure analysis requires the raw trace
  • Closed-source model internals are not exposed; analysis relies on observable tokens, latencies and failures
  • Modeled scaling (Gamma/Zipf) approximates but may not capture all real-world correlations

When Not To Use

  • When you need model-quality benchmarks (accuracy, reasoning) rather than serving workload traces
  • For non-Azure deployments or markedly different user populations, unless validated against local traces
  • When strict private data constraints block using even anonymized production logs

Failure Modes

  • KV cache memory bottlenecks during bursts causing request failures
  • Scheduling optimizations tuned on one service type (conversation) may degrade performance for another (API)
  • Short-term parameter shocks in burstiness (alpha changes) can cause sudden reliability drops

Core Entities

Models

  • ChatGPT
  • GPT-4
  • Llama-2-13b-chat
  • Llama-2-7b-chat

Metrics

  • request failure rate
  • token latency
  • latency jitter (stddev)
  • throughput
  • NMAE
  • NMSE

Datasets

  • BurstGPT
  • BurstGPT-Perf
  • ShareGPT

Benchmarks

  • BurstGPT-Perf