BurstGPT: 10.3M real-world LLM traces (Azure) to test and tune serving systems

Overview

Decision Snapshot: Ready For Pilot

The dataset and benchmark are practical and already used in demos and an industry prototype, but the results derive from a single regional provider and a replay-based methodology.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/7

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 40%

Authors

Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, Xiaowen Chu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Using real LLM traces uncovers burst-driven failures and KV-cache pressure that synthetic tests miss, letting teams fix reliability and reduce wasted GPU costs before production.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Data Scientist

Summary TLDR

BurstGPT is an open dataset of 10.31 million real LLM service traces collected from an Azure OpenAI regional provider over 213 days. It captures request concurrency, conversation structure, request/response token lengths, and service failures for ChatGPT and GPT-4 APIs and conversational services. The authors provide BurstGPT-Perf, a lightweight benchmark and workload generator that replays or models observed burstiness (Gamma) and token-length distributions (Zipf). Demo evaluations show that real LLM workloads are far burstier than common synthetic traces, expose KV-cache pressure and higher failure rates (>5% for some services), and change which scheduling and disaggregation choices are optimal.

Problem Statement

Most LLM serving research uses synthetic or non-LLM traces that fail to reflect real request bursts, conversation patterns, variable response lengths, and failure behaviors. This mismatch can hide bottlenecks (KV cache, scheduling, disaggregation) and lead to poor performance when systems are deployed.

Main Contribution

BurstGPT dataset: 10.31M traces from Azure OpenAI regional GPT services over 213 days, with request times, token lengths, model and service type, and failures

BurstGPT-Perf: open-source workload generator and benchmark to replay scaled or modeled (Gamma/Zipf) LLM workloads
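The modeled mode described above (Gamma for arrivals, Zipf for token lengths) can be illustrated with a minimal standard-library sketch. This is not the BurstGPT-Perf implementation: the function names, the default parameters (`shape`, `mean_gap`, `exponent`, `max_len`), and the truncation of the Zipf law to a finite range are all assumptions chosen for illustration, and real parameters would have to be fitted to the trace.

```python
import random
from itertools import accumulate


def gamma_arrivals(n, shape, mean_gap, rng):
    """Return n arrival timestamps whose inter-arrival gaps are Gamma distributed."""
    scale = mean_gap / shape  # Gamma mean = shape * scale
    gaps = [rng.gammavariate(shape, scale) for _ in range(n)]
    return list(accumulate(gaps))


def zipf_lengths(n, exponent, max_len, rng):
    """Return n token lengths drawn from a Zipf law truncated to 1..max_len."""
    support = range(1, max_len + 1)
    weights = [k ** -exponent for k in support]
    return rng.choices(support, weights=weights, k=n)


def synth_workload(n, shape=0.5, mean_gap=1.0, exponent=1.1, max_len=2048, seed=0):
    """Build a list of (arrival_time_seconds, token_length) pairs.

    All defaults are hypothetical placeholders, not values from the paper.
    """
    rng = random.Random(seed)
    times = gamma_arrivals(n, shape, mean_gap, rng)
    lengths = zipf_lengths(n, exponent, max_len, rng)
    return list(zip(times, lengths))
```

A shape below 1 produces gaps that cluster near zero with occasional long pauses, which is the bursty regime; a shape of 1 reduces to exponential gaps, that is, Poisson arrivals.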

Key Findings

Real LLM traces are highly bursty and differ from common cloud/function workloads.

Numbers: Mean RPS: MAF 1.64 vs LLM conv 0.019, LLM API 0.21 (ChatGPT)

Practical Use: Do not rely on non-LLM workloads or simple uniform RPS when testing LLM serving; use bursty traces to reveal real bottlenecks

Evidence Ref: Sec 2.1, Table 1
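One common way to quantify burstiness in a trace is the coefficient of variation (CV) of inter-arrival gaps. The sketch below is an illustrative measurement, not the paper's own analysis code; the helper names are hypothetical and the input is assumed to be a sorted list of request timestamps in seconds. For Poisson arrivals the CV is 1, and for Gamma-distributed gaps with shape α it is 1/√α, so a CV well above 1 indicates traffic burstier than Poisson.

```python
import statistics


def mean_rps(timestamps):
    """Average requests per second, measured as gaps per second of trace span."""
    span = timestamps[-1] - timestamps[0]
    return (len(timestamps) - 1) / span


def interarrival_cv(timestamps):
    """Coefficient of variation (stddev / mean) of inter-arrival gaps."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)
```

Two traces with the same mean RPS can have very different CVs, which is why a uniform-RPS load test can pass while a replayed burst fails.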

Dataset size and composition allow realistic evaluation across models and interfaces.

Numbers: 10.31M traces over 213 days: 8.69M ChatGPT API, 0.95M GPT-4 API, 0.30M ChatGPT conv, 0.16M GPT-4 conv

Practical Use: Use the dataset subsets to stress-test both API and conversational flows rather than synthetic samples

Evidence Ref: Abstract; Sec 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
dataset size (total traces)	10.31M traces over 213 days	—	—	BurstGPT (Azure regional provider)	Abstract; Sec 3	—
trace breakdown (by model & interface)	8.69M ChatGPT API; 0.95M GPT-4 API; 0.30M ChatGPT conv; 0.16M GPT-4 conv	—	—	BurstGPT	Sec 3.1; Abstract	—

What To Try In 7 Days

Replay a scaled BurstGPT period against your serving stack to find KV-cache and scheduling hotspots

Use modeled scaling (Gamma for arrivals, Zipf for token lengths) to emulate burstiness at your size

Run simple XGBoost predictors on request-count and mean-token features for 10-min autoscaling decisions
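For the autoscaling item, the feature-building step can be sketched as follows. This covers only the windowing, not the predictor: the tuple layout `(timestamp_seconds, request_tokens, response_tokens)` is an assumption about how a trace has been loaded and should be checked against the actual dataset columns, and the function names are hypothetical. The resulting pairs could then be passed to any regressor, XGBoost included.

```python
from collections import defaultdict

WINDOW_SECONDS = 600  # 10-minute windows


def window_features(trace, window=WINDOW_SECONDS):
    """Aggregate a trace into (window_index, request_count, mean_tokens) rows.

    trace: iterable of (timestamp_seconds, request_tokens, response_tokens).
    Windows with no requests are emitted with a count of 0.
    """
    counts = defaultdict(int)
    token_sums = defaultdict(int)
    for ts, req_tok, resp_tok in trace:
        idx = int(ts // window)
        counts[idx] += 1
        token_sums[idx] += req_tok + resp_tok
    if not counts:
        return []
    rows = []
    for idx in range(min(counts), max(counts) + 1):
        n = counts.get(idx, 0)
        mean_tok = token_sums.get(idx, 0) / n if n else 0.0
        rows.append((idx, n, mean_tok))
    return rows


def supervised_pairs(rows):
    """Pair each window's (count, mean_tokens) with the next window's count."""
    return [((n, mean_tok), nxt[1])
            for (_, n, mean_tok), nxt in zip(rows, rows[1:])]
```

Emitting empty windows explicitly matters for bursty traffic: dropping them would hide the idle gaps that precede a burst, which is the signal an autoscaler needs.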

Optimization Features

Infra Optimization

GPU utilization tuning, instance-level scaling via RPS or modeled parameters

System Optimization

workload-aware scheduling, dynamic PD ratio, workload provisioning and autoscaling

Inference Optimization

prefill-decode (PD) disaggregation, KV cache management, request scheduling (FCFS/SRF/LRF)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/HPMLL/BurstGPT

Data URLs

https://github.com/HPMLL/BurstGPT

Risks & Boundaries

Limitations

Traces come from a single regional Azure OpenAI provider; usage patterns may differ across regions or customers

Cleaned trace excludes failure logs; some failure analysis requires the raw trace

When Not To Use

When you need model-quality benchmarks (accuracy, reasoning) rather than serving workload traces

When your deployment is non-Azure or serves a very different user population and you have not validated against local traces

Failure Modes

KV cache memory bottlenecks during bursts causing request failures

Scheduling optimizations tuned on one service type (conversation) may degrade performance for another (API)
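The KV-cache failure mode can be reasoned about with a back-of-the-envelope capacity estimate. The sketch below is an assumption-laden approximation, not a measurement from the paper: it uses the standard per-token KV size for a decoder-only transformer (keys plus values, per layer, per KV head), the published Llama-2 7B and 13B shapes with fp16 values, and a hypothetical memory budget. Real serving engines add allocator and paging overhead on top of this.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache held per token; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(kv_budget_bytes, per_token_bytes):
    """Number of tokens that fit in a given KV-cache memory budget."""
    return kv_budget_bytes // per_token_bytes


def burst_fits(token_lengths, budget_tokens):
    """True if all concurrent requests in a burst fit in the cache at once."""
    return sum(token_lengths) <= budget_tokens


# Model shapes for the Llama-2 chat models named in this summary (fp16 assumed).
LLAMA2_7B = dict(num_layers=32, num_kv_heads=32, head_dim=128)
LLAMA2_13B = dict(num_layers=40, num_kv_heads=40, head_dim=128)

# Hypothetical budget: 20 GiB of GPU memory left for KV cache after weights.
budget = max_cached_tokens(20 * 2**30, kv_bytes_per_token(**LLAMA2_7B))
```

Feeding the total token lengths of requests that overlap in a replayed burst into `burst_fits` gives a quick indication of whether a burst would force preemption or rejection before running a full benchmark.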

Core Entities

Models

ChatGPT, GPT-4, Llama-2-13b-chat, Llama-2-7b-chat

Metrics

request failure rate, token latency, latency jitter (stddev), throughput, NMAE, NMSE

Datasets

BurstGPT, BurstGPT-Perf, ShareGPT

Benchmarks

BurstGPT-Perf
