A runtime-driven simulator that models heterogeneous accelerators, disaggregated memory, batching, and power for realistic LLM serving

February 26, 2026 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Validated against multiple real systems (RTX A6000, H100, TPU) with percent-level errors; relies on one-time operator profiling to generalize to new hardware.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/12

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Jaehong Cho, Hyunmin Choi, Guseul Heo, Jongse Park

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Run realistic what-if tests for heterogeneous hardware and disaggregated serving to choose cheaper or more energy-efficient deployments before costly hardware changes.

Who Should Care

Summary TLDR

LLMServingSim 2.0 is a unified, runtime-driven simulator for modern LLM serving that explicitly models heterogeneous accelerators, multi-tier memory, disaggregated prefill/decode execution, MoE routing, prefix caching, and power. It uses operator-level profiles (collected via a PyTorch/HuggingFace profiler or external simulators) and a single runtime loop to capture batching, routing, placement, KV cache movement, and interconnect contention. Validations against real GPU and TPU deployments show sub-2% aggregated errors on throughput/latency/memory/power and practical simulation times, making it a practical tool to explore hardware-software co-design and energy tradeoffs before building real

Problem Statement

Existing simulators either model hardware microarchitecture or high-level serving policies, but not both together in a runtime-driven way. This gap prevents studying how heterogeneous accelerators, multi-tier memory, and disaggregated serving techniques interact at serving time to affect latency, throughput, and energy.

Main Contribution

A runtime-driven serving loop that jointly models software policies and heterogeneous hardware behavior, enabling interaction-aware evaluation of batching, routing, placement, caching, and offloading.

Profile-based operator modeling for easy integration of new accelerators and PIM devices using a single-device profiler or external profiles.
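
The two contributions above can be illustrated with a minimal sketch: a runtime loop whose per-iteration cost is read from an operator profile. Everything here is an assumption made for illustration, not the simulator's actual format or API: the `PROFILE` table and its values, the function names, linear interpolation between profiled points, and a scheduler that simply admits waiting requests up to `max_batch`.

```python
from bisect import bisect_left
from collections import deque

# Hypothetical profile: tokens processed in one step -> measured latency (ms).
# The values are made up for illustration.
PROFILE = [(1, 10.0), (64, 12.0), (256, 20.0), (1024, 60.0)]


def op_latency_ms(tokens, profile=PROFILE):
    """Linearly interpolate between profiled points; clamp at both ends."""
    xs = [x for x, _ in profile]
    if tokens <= xs[0]:
        return profile[0][1]
    if tokens >= xs[-1]:
        return profile[-1][1]
    i = bisect_left(xs, tokens)
    (x0, y0), (x1, y1) = profile[i - 1], profile[i]
    return y0 + (y1 - y0) * (tokens - x0) / (x1 - x0)


def simulate(requests, max_batch=4):
    """Toy runtime loop over (prompt_tokens, output_tokens) requests.

    Each iteration admits waiting requests, runs one step whose cost comes
    from the profile, then retires finished requests.
    Returns (total_time_ms, iterations).
    """
    waiting = deque(requests)
    running = []  # each entry: [tokens_to_process_this_step, outputs_left]
    clock_ms, iterations = 0.0, 0
    while waiting or running:
        while waiting and len(running) < max_batch:
            prompt, out = waiting.popleft()
            running.append([prompt, out])  # the first step is the prefill
        step_tokens = sum(r[0] for r in running)
        clock_ms += op_latency_ms(step_tokens)
        iterations += 1
        for r in running:
            r[0] = 1   # after prefill, each step decodes one token
            r[1] -= 1  # the step just run produced one output token
        running = [r for r in running if r[1] > 0]
    return clock_ms, iterations
```

Because batching decisions change `step_tokens`, and `step_tokens` changes the profiled latency, policy and hardware behavior interact inside the same loop, which is the property the contribution describes. Porting to a new device in this sketch means swapping the profile table only.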

Key Findings

Simulator reproduces key serving metrics with very low average error across evaluated setups.

Numbers: Average error 0.97% across throughput, latency, memory, and power

Practical Use: Use the simulator to estimate real-system throughput, latency, memory, and energy with percent-level accuracy before deploying hardware changes.

Evidence Ref: Abstract; VII.A

Time-series throughput tracking shows small per-step error on common GPUs.

Numbers: Per-timestep throughput error 5.66% (RTX A6000) and 2.98% (H100); aggregated error 0.85% and 1.59%

Practical Use: Expect close matching of dynamic behavior (bursts, batching phases) for GPU-based evaluation; use aggregated numbers for steady-state decisions.

Evidence Ref: VII.A Fig.5
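
The gap between per-timestep and aggregated error is expected: over- and under-estimates at individual steps partly cancel in the run total. The sketch below shows the effect with made-up traces. The two error definitions (mean absolute percentage error per step, and percentage error of the totals) are assumptions for illustration; the summary does not state the exact formulas the paper uses.

```python
def per_step_error_pct(sim, real):
    """Mean absolute percentage error across timesteps."""
    return 100.0 * sum(abs(s - r) / r for s, r in zip(sim, real)) / len(real)


def aggregated_error_pct(sim, real):
    """Percentage error of the run totals."""
    return 100.0 * abs(sum(sim) - sum(real)) / sum(real)


# Made-up throughput traces (tokens/s per timestep), not data from the paper.
real = [100.0, 200.0, 100.0, 200.0]
sim = [110.0, 190.0, 95.0, 210.0]

print(f"per-step:   {per_step_error_pct(sim, real):.2f}%")
print(f"aggregated: {aggregated_error_pct(sim, real):.2f}%")
```

With these traces the per-step error is several percent while the aggregated error is below one percent, which mirrors the pattern in the reported numbers.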

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | avg error 0.97% | - | - | Representative GPU/TPU workloads | Abstract; VII.A | Abstract; VII.A
Per-timestep throughput error (RTX A6000) | 5.66% | real RTX A6000 vLLM | - | Time-series throughput | VII.A Fig.5 | VII.A Fig.5

What To Try In 7 Days

Profile one model on a target device using the paper's operator profiler and run baseline simulations.

Simulate prefill-decode disaggregation to compare latency and network cost for your workload.

Try prefix caching/shared CPU cache to estimate hit-rate gains and memory savings.
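
Before running the disaggregation experiment above, it can help to size the KV cache transfer on paper. The sketch below is a back-of-envelope estimate, not part of the simulator. The function names and the 25 GB/s link are hypothetical, and the Llama 3.1-8B shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is taken from the published model configuration and should be checked against the checkpoint you actually serve.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes stored per token.

    The factor 2 covers one key and one value vector per layer per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_transfer_ms(prompt_tokens, bytes_per_token, link_gb_per_s):
    """Time to ship a prompt's KV cache over a link, at full bandwidth.

    Ignores link latency, contention, and any overlap with compute.
    """
    total_bytes = prompt_tokens * bytes_per_token
    return 1000.0 * total_bytes / (link_gb_per_s * 1e9)


# Assumed Llama 3.1-8B shape; verify against your model config.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Hypothetical 25 GB/s prefill-to-decode link and a 2048-token prompt.
transfer_ms = kv_transfer_ms(2048, per_token, link_gb_per_s=25.0)
print(per_token, round(transfer_ms, 2))
```

This estimate is a lower bound on transfer time. The simulator's value over it is that it models interconnect contention and batching effects, which this arithmetic leaves out.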

Agent Features

Memory
KV cache modeling, prefix caching, multi-tier placement, shared CPU/CXL prefix cache
Planning
batch scheduling, operator mapping, placement and offloading, prefill-decode routing
Tool Use
profile-based operator modeling, PyTorch/HuggingFace profiler, vLLM integration
Frameworks
ASTRAsim extension, Chakra extension
Architectures
heterogeneous accelerators, multi-tier memory, PIM, CXL-attached memory, TPU, GPU
Collaboration
multi-MSG sharing, cross-instance prefix cache sharing

Optimization Features

Infra Optimization
disaggregation (prefill-decode), CXL memory pooling, heterogeneous device pools, interconnect contention modeling
System Optimization
device placement, placement-aware batching, KV cache placement and migration, power-state management
Inference Optimization
operator-level offloading, sub-batch interleaving (SBI), MoE, parallelism strategies (TP/PP/DP/EP), prefix caching and eviction policies, PIM-accelerated decode

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Higher simulation time than lightweight tools due to detailed runtime and memory modeling.

TPU validation limited to single-instance dense serving due to current vLLM-TPU support.

When Not To Use

For very quick, coarse estimates where percent-level accuracy is unnecessary and speed matters more.

When you lack operator-level profiles or the ability to collect them for target devices.

Failure Modes

Mismatched or stale operator profiles cause inaccurate latency/power estimates.

Unmodeled workload patterns (different request distributions) can produce larger errors than reported.
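
One way to guard against the stale-profile failure mode is to re-measure a few operators periodically and compare them with the stored profile. The sketch below assumes a hypothetical profile layout (operator name mapped to latency in milliseconds) and an arbitrary 5% tolerance; neither comes from the paper.

```python
def stale_entries(profile_ms, fresh_ms, tolerance=0.05):
    """Return {operator: relative_drift} for operators whose fresh
    measurement deviates from the stored profile by more than `tolerance`.

    Operators missing from `fresh_ms` are skipped, not flagged.
    """
    flagged = {}
    for op, old in profile_ms.items():
        new = fresh_ms.get(op)
        if new is None:
            continue
        drift = abs(new - old) / old
        if drift > tolerance:
            flagged[op] = drift
    return flagged


# Made-up latencies (ms) for illustration only.
stored = {"attention": 2.0, "mlp": 4.0}
fresh = {"attention": 2.02, "mlp": 5.0}
print(stale_entries(stored, fresh))  # only "mlp" exceeds the tolerance
```

A non-empty result suggests re-profiling before trusting new simulation results, for example after a driver, kernel library, or serving framework upgrade.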

Core Entities

Models

Llama 3.1-8B, Llama 3.1-70B, Mixtral 8x7B, Phi-mini MoE

Metrics

throughput, TTFT, TPOT, end-to-end latency, queueing delay, memory usage, prefix cache hit rate, power, energy, watts per token
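
For readers less familiar with the latency metrics, the sketch below computes three of them from per-request timestamps using their common definitions: time to first token (TTFT), time per output token (TPOT), and end-to-end latency. The function name and input layout are hypothetical, and the paper may measure these slightly differently (for example, whether TPOT includes the first token).

```python
def request_metrics(arrival_s, token_times_s):
    """Compute TTFT, TPOT, and end-to-end latency for one request.

    arrival_s: time the request arrived (seconds).
    token_times_s: emission time of each output token, in order.
    """
    if not token_times_s:
        raise ValueError("request produced no tokens")
    first, last = token_times_s[0], token_times_s[-1]
    n = len(token_times_s)
    return {
        # Queueing plus prefill: arrival until the first token is emitted.
        "ttft_s": first - arrival_s,
        # Mean gap between consecutive output tokens after the first.
        "tpot_s": (last - first) / (n - 1) if n > 1 else 0.0,
        # Arrival until the last token is emitted.
        "e2e_s": last - arrival_s,
    }


print(request_metrics(0.0, [0.5, 0.75, 1.0]))
```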

Datasets

ShareGPT