Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Run realistic what-if tests for heterogeneous hardware and disaggregated serving to choose cheaper or more energy-efficient deployments before costly hardware changes.
Summary TLDR
LLMServingSim 2.0 is a unified, runtime-driven simulator for modern LLM serving that explicitly models heterogeneous accelerators, multi-tier memory, disaggregated prefill/decode execution, MoE routing, prefix caching, and power. It uses operator-level profiles (collected via a PyTorch/HuggingFace profiler or external simulators) and a single runtime loop to capture batching, routing, placement, KV cache movement, and interconnect contention. Validations against real GPU and TPU deployments show sub-2% aggregated errors on throughput/latency/memory/power and practical simulation times, making it a practical tool to explore hardware-software co-design and energy tradeoffs before building real
Problem Statement
Existing simulators either model hardware microarchitecture or high-level serving policies, but not both together in a runtime-driven way. This gap prevents studying how heterogeneous accelerators, multi-tier memory, and disaggregated serving techniques interact at serving time to affect latency, throughput, and energy.
Main Contribution
A runtime-driven serving loop that jointly models software policies and heterogeneous hardware behavior, enabling interaction-aware evaluation of batching, routing, placement, caching, and offloading.
Profile-based operator modeling for easy integration of new accelerators and PIM devices using a single-device profiler or external profiles.
A system simulator that includes multi-tier KV cache placement, prefix caching, PD disaggregation (prefill/decode), MoE routing/offloading, and a three-state power model.
Validation against real GPU and TPU systems with low error and practical simulation times, plus public code on GitHub.
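The profile-based operator modeling described above can be pictured as a lookup over latencies measured once per device-model pair. The following is a minimal sketch under stated assumptions, not the simulator's actual code: the `PROFILE` table, the operator names, the latency numbers, and the choice of linear interpolation are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_left

# Hypothetical profile: operator name -> sorted (num_tokens, latency_ms) samples.
# The numbers are placeholders for illustration, not measurements.
PROFILE = {
    "attention": [(1, 0.10), (128, 0.40), (1024, 2.40)],
    "mlp": [(1, 0.20), (128, 0.60), (1024, 4.20)],
}


def op_latency_ms(op: str, num_tokens: int) -> float:
    """Linearly interpolate a profiled latency; clamp outside the profiled range."""
    samples = PROFILE[op]
    sizes = [size for size, _ in samples]
    if num_tokens <= sizes[0]:
        return samples[0][1]
    if num_tokens >= sizes[-1]:
        return samples[-1][1]
    i = bisect_left(sizes, num_tokens)
    (x0, y0), (x1, y1) = samples[i - 1], samples[i]
    return y0 + (y1 - y0) * (num_tokens - x0) / (x1 - x0)


def layer_latency_ms(num_tokens: int) -> float:
    """Sum the interpolated latency of every profiled operator for one layer."""
    return sum(op_latency_ms(op, num_tokens) for op in PROFILE)
```

Adding a new device in this picture means supplying a new table rather than new modeling code, which is the property the contribution list emphasizes.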
Key Findings
The simulator reproduces key serving metrics with very low average error across the evaluated setups.
Time-series throughput tracking shows small per-step error on common GPUs.
The power model closely matches measured system energy and power dynamics.
Memory and prefix cache dynamics are reproduced accurately, including multi-instance sharing.
The simulator integrates non-GPU accelerators with acceptable accuracy when profiles exist.
Processing-in-memory (PIM) can speed decoding and reduce energy in the evaluated case.
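To make the power finding concrete, here is a rough sketch of how a three-state power model can turn a timeline of device states into energy. The source says only that the model has three states; the state names (`idle`, `active`, `standby`) and the wattages below are assumptions made for illustration.

```python
# Assumed state names and wattages; the source only says the model has three states.
STATE_POWER_W = {"idle": 60.0, "active": 300.0, "standby": 20.0}


def energy_joules(intervals):
    """Integrate power over (state, duration_seconds) intervals."""
    return sum(STATE_POWER_W[state] * seconds for state, seconds in intervals)


def energy_per_token(intervals, tokens):
    """Energy in joules divided by the number of generated tokens."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return energy_joules(intervals) / tokens
```

For example, two seconds in the assumed `active` state followed by one second in `idle` costs 300 * 2 + 60 * 1 = 660 joules.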
Results
Accuracy
Per-timestep throughput error (RTX A6000)
Per-timestep throughput error (H100)
Aggregated throughput/latency error (RTX A6000)
Aggregated throughput/latency error (H100)
Energy prediction error (RTX A6000)
TPU throughput per-timestep error
PIM decode throughput improvement
Energy efficiency (watts per token) improvement with GPU+PIM
Simulation time
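The results distinguish per-timestep error from aggregated error, and the two can diverge. The sketch below shows one common way to compute each; the source does not state the exact formulas used, so treating per-timestep error as a mean absolute percentage error and aggregated error as the relative error of the totals is an assumption.

```python
def per_timestep_error(measured, simulated):
    """Mean absolute percentage error over timesteps with nonzero measurements."""
    pairs = [(m, s) for m, s in zip(measured, simulated) if m != 0]
    if not pairs:
        raise ValueError("no nonzero measurements to compare against")
    return sum(abs(s - m) / abs(m) for m, s in pairs) / len(pairs)


def aggregated_error(measured, simulated):
    """Relative error of the summed series (assumed definition)."""
    total = sum(measured)
    if total == 0:
        raise ValueError("measured total is zero")
    return abs(sum(simulated) - total) / abs(total)
```

Over- and under-estimates at individual timesteps can cancel in the aggregate, so a low aggregated error does not by itself imply a low per-timestep error.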
Who Should Care
What To Try In 7 Days
Profile one model on a target device using the paper's operator profiler and run baseline simulations.
Simulate prefill-decode disaggregation to compare latency and network cost for your workload.
Try prefix caching/shared CPU cache to estimate hit-rate gains and memory savings.
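Before running the full simulator, a back-of-the-envelope estimate of prefix-cache hit rate on your own request trace can indicate whether the third experiment is worth the effort. The sketch below is a simplified stand-in, not the simulator's cache model: the block size, the LRU eviction policy, and the function name `prefix_hit_rate` are assumptions, and a real system would hash blocks rather than store whole prefixes as keys.

```python
from collections import OrderedDict


def prefix_hit_rate(requests, block_size=4, capacity_blocks=1024):
    """Fraction of prompt tokens served from a block-granular LRU prefix cache.

    requests: iterable of token-id lists, processed in arrival order.
    A block counts as a hit only if every earlier block of the same
    request also hit, since a prefix must match from the start.
    """
    cache = OrderedDict()  # key: token prefix up to a block boundary
    hit_tokens = 0
    total_tokens = 0
    for tokens in requests:
        total_tokens += len(tokens)
        matching = True
        for end in range(block_size, len(tokens) + 1, block_size):
            key = tuple(tokens[:end])
            if matching and key in cache:
                cache.move_to_end(key)  # refresh recency on a hit
                hit_tokens += block_size
            else:
                matching = False
                cache[key] = None
                cache.move_to_end(key)
                if len(cache) > capacity_blocks:
                    cache.popitem(last=False)  # evict least recently used
    return hit_tokens / total_tokens if total_tokens else 0.0
```

Two eight-token prompts that share only their first four tokens give a hit rate of 4 / 16 = 0.25 with the default block size of 4.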
Agent Features
Memory
- KV cache modeling
- prefix caching
- multi-tier placement
- shared CPU/CXL prefix cache
Planning
- batch scheduling
- operator mapping
- placement and offloading
- prefill-decode routing
Tool Use
- profile-based operator modeling
- PyTorch/HuggingFace profiler
- vLLM integration
Frameworks
- ASTRAsim extension
- Chakra extension
Architectures
- heterogeneous accelerators
- multi-tier memory
- PIM
- CXL-attached memory
- TPU
- GPU
Collaboration
- multi-MSG sharing
- cross-instance prefix cache sharing
Optimization Features
Infra Optimization
- disaggregation (prefill-decode)
- CXL memory pooling
- heterogeneous device pools
- interconnect contention modeling
System Optimization
- device placement
- placement-aware batching
- KV cache placement and migration
- power-state management
Inference Optimization
- operator-level offloading
- sub-batch interleaving (SBI)
- MoE
- parallelism strategies (TP/PP/DP/EP)
- prefix caching and eviction policies
- PIM-accelerated decode
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Higher simulation time than lightweight tools due to detailed runtime and memory modeling.
- TPU validation limited to single-instance dense serving due to current vLLM-TPU support.
- Accuracy depends on quality and completeness of operator-level profiles collected once per device-model pair.
- Network and large-scale cluster effects beyond evaluated topologies may need further validation.
When Not To Use
- For very quick, coarse estimates where percent-level accuracy is unnecessary and speed matters more.
- When you lack operator-level profiles or the ability to collect them for target devices.
- For microarchitecture-level hardware design that needs cycle-accurate detail.
Failure Modes
- Mismatched or stale operator profiles cause inaccurate latency/power estimates.
- Unmodeled workload patterns (different request distributions) can produce larger errors than reported.
- Simplified assumptions about global network topology may understate contention in very large clusters.
Core Entities
Models
- Llama 3.1-8B
- Llama 3.1-70B
- Mixtral 8x7B
- Phi-mini MoE
Metrics
- throughput
- TTFT
- TPOT
- end-to-end latency
- queueing delay
- memory usage
- prefix cache hit rate
- power
- energy
- watts per token
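As a reference for the latency metrics listed above, the sketch below derives TTFT (time to first token), TPOT (time per output token), and end-to-end latency from per-token timestamps, using their commonly used definitions. The paper's exact definitions may differ, and the function name and input layout are hypothetical.

```python
def request_metrics(arrival_s, token_times_s):
    """Compute TTFT, TPOT, and end-to-end latency for one request.

    arrival_s: time the request arrived, in seconds.
    token_times_s: emission time of each output token, in ascending order.
    """
    if not token_times_s:
        raise ValueError("request produced no tokens")
    ttft = token_times_s[0] - arrival_s
    e2e = token_times_s[-1] - arrival_s
    n = len(token_times_s)
    # TPOT averages the gaps after the first token; undefined gaps count as 0.
    tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e}
```

Under these definitions, TTFT includes any queueing delay before prefill starts, while TPOT reflects decode speed only.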
Datasets
- ShareGPT

