Overview
Validated against multiple real systems (RTX A6000, H100, TPU) with percent-level errors; relies on one-time operator profiling to generalize to new hardware.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/12
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Run realistic what-if tests for heterogeneous hardware and disaggregated serving to choose cheaper or more energy-efficient deployments before costly hardware changes.
Summary TLDR
LLMServingSim 2.0 is a unified, runtime-driven simulator for modern LLM serving that explicitly models heterogeneous accelerators, multi-tier memory, disaggregated prefill/decode execution, MoE routing, prefix caching, and power. It uses operator-level profiles (collected via a PyTorch/HuggingFace profiler or external simulators) and a single runtime loop to capture batching, routing, placement, KV cache movement, and interconnect contention. Validations against real GPU and TPU deployments show sub-2% aggregated errors on throughput/latency/memory/power and practical simulation times, making it a practical tool to explore hardware-software co-design and energy tradeoffs before building real
Problem Statement
Existing simulators either model hardware microarchitecture or high-level serving policies, but not both together in a runtime-driven way. This gap prevents studying how heterogeneous accelerators, multi-tier memory, and disaggregated serving techniques interact at serving time to affect latency, throughput, and energy.
Main Contribution
A runtime-driven serving loop that jointly models software policies and heterogeneous hardware behavior, enabling interaction-aware evaluation of batching, routing, placement, caching, and offloading.
Profile-based operator modeling for easy integration of new accelerators and PIM devices using a single-device profiler or external profiles.
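The two contributions above can be illustrated with a toy sketch: a table of per-operator latencies stands in for the one-time profile, and a single loop applies a batching policy and advances a simulated clock. This is not the simulator's code or API. The profile values, the `op_latency_ms` and `simulate` helpers, and the interpolation rule are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass

# Hypothetical operator profile: (operator, token count) -> latency in ms.
# In practice these numbers would come from a one-time profiling run.
PROFILE_MS = {
    ("attention", 128): 0.40,
    ("attention", 256): 0.80,
    ("mlp", 128): 0.60,
    ("mlp", 256): 1.10,
}


def op_latency_ms(op, tokens):
    """Look up latency, interpolating linearly between profiled sizes."""
    points = sorted((t, ms) for (name, t), ms in PROFILE_MS.items() if name == op)
    if not points:
        raise KeyError(f"no profile for operator {op!r}")
    if tokens <= points[0][0]:
        # Below the smallest profiled size: scale proportionally (assumption).
        return points[0][1] * tokens / points[0][0]
    for (t0, ms0), (t1, ms1) in zip(points, points[1:]):
        if t0 <= tokens <= t1:
            return ms0 + (ms1 - ms0) * (tokens - t0) / (t1 - t0)
    # Above the largest profiled size: scale proportionally (assumption).
    return points[-1][1] * tokens / points[-1][0]


@dataclass
class Request:
    prompt_tokens: int
    output_tokens: int
    generated: int = 0


def simulate(requests, max_batch=2, num_layers=2):
    """Return the finish time (ms) of each request, in completion order."""
    clock_ms = 0.0
    waiting = list(requests)
    running = []
    finish_times_ms = []
    while waiting or running:
        # Batching policy: admit waiting requests until the batch is full.
        while waiting and len(running) < max_batch:
            running.append(waiting.pop(0))
        # A request processes its whole prompt first, then one token per step.
        tokens = sum(r.prompt_tokens if r.generated == 0 else 1 for r in running)
        clock_ms += num_layers * (
            op_latency_ms("attention", tokens) + op_latency_ms("mlp", tokens)
        )
        for r in running:
            r.generated += 1
        finish_times_ms.extend(
            clock_ms for r in running if r.generated >= r.output_tokens
        )
        running = [r for r in running if r.generated < r.output_tokens]
    return finish_times_ms
```

Swapping `PROFILE_MS` for a table measured on a different device changes the predicted timings without touching the loop, which is the appeal of a profile-based design. Memory tiers, routing, and interconnect contention are deliberately left out of this sketch.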
Key Findings
Simulator reproduces key serving metrics with an average error of 0.97% across the evaluated GPU/TPU setups.
Time-series throughput tracking shows small per-timestep error on common GPUs (5.66% on an RTX A6000 versus a real vLLM deployment).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average error | 0.97% | — | — | Representative GPU/TPU workloads | Abstract; VII.A | Abstract; VII.A |
| Per-timestep throughput error (RTX A6000) | 5.66% | real RTX A6000 vLLM | — | Time-series throughput | VII.A Fig.5 | VII.A Fig.5 |
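
The table reports both an average error and a per-timestep error. This summary does not state how the paper defines either one, so the sketch below assumes a mean absolute percentage error for the per-timestep figure and a comparison of totals for an aggregate figure. The throughput series are invented for illustration and are not results from the paper.

```python
def mean_abs_pct_error(simulated, measured):
    """Mean absolute percentage error across paired samples, in percent."""
    if not measured or len(simulated) != len(measured):
        raise ValueError("series must be non-empty and of equal length")
    total = sum(abs(s - m) / abs(m) for s, m in zip(simulated, measured))
    return 100.0 * total / len(measured)


def aggregate_pct_error(simulated, measured):
    """Percentage error of the summed series, in percent."""
    return 100.0 * abs(sum(simulated) - sum(measured)) / abs(sum(measured))


# Hypothetical tokens/s samples at four timesteps (not from the paper).
measured = [100.0, 110.0, 90.0, 100.0]
simulated = [105.0, 104.5, 94.5, 96.0]

per_step = mean_abs_pct_error(simulated, measured)
overall = aggregate_pct_error(simulated, measured)
print(f"per-timestep error: {per_step:.2f}%, aggregate error: {overall:.2f}%")
```

In this made-up series the over- and under-estimates cancel in the total, so the aggregate error is far smaller than the per-timestep error. That is a general property of the two metrics and worth keeping in mind when comparing figures computed at different granularities.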
What To Try In 7 Days
Profile one model on a target device using the paper's operator profiler and run baseline simulations.
Simulate prefill-decode disaggregation to compare latency and network cost for your workload.
Try prefix caching/shared CPU cache to estimate hit-rate gains and memory savings.
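Before running the disaggregation experiment suggested above, a back-of-the-envelope estimate of KV cache transfer cost can help pick sensible link parameters. The sketch below uses the standard KV cache size of a transformer with grouped-query attention (one key and one value tensor per layer). The model dimensions, link bandwidth, and function names are assumptions for illustration, and the estimate ignores contention and overlap, which a runtime-driven simulator would account for.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes stored per token: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_transfer_ms(prompt_tokens, bytes_per_token, link_gb_per_s, fixed_latency_ms=0.0):
    """First-order time to move a prompt's KV cache over one link.

    link_gb_per_s is in 1e9 bytes per second; contention is not modeled.
    """
    total_bytes = prompt_tokens * bytes_per_token
    return fixed_latency_ms + total_bytes / (link_gb_per_s * 1e9) * 1e3


# Assumed dimensions for an 8B-class model with grouped-query attention,
# FP16 KV cache, and an assumed 25 GB/s link between instances.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
cost_ms = kv_transfer_ms(prompt_tokens=1024, bytes_per_token=per_token, link_gb_per_s=25.0)
print(f"{per_token / 1024:.0f} KiB per token, {cost_ms:.2f} ms for a 1024-token prompt")
```

If this estimate is already a large fraction of the prefill time for typical prompts, disaggregation is likely to be network-bound for that workload, and the simulated comparison should focus on interconnect options.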
Risks & Boundaries
Limitations
Higher simulation time than lightweight tools due to detailed runtime and memory modeling.
TPU validation limited to single-instance dense serving due to current vLLM-TPU support.
When Not To Use
For very quick, coarse estimates where percent-level accuracy is unnecessary and speed matters more.
When you lack operator-level profiles or the ability to collect them for target devices.
Failure Modes
Mismatched or stale operator profiles cause inaccurate latency/power estimates.
Unmodeled workload patterns (different request distributions) can produce larger errors than reported.

