A runtime-driven simulator that models heterogeneous accelerators, disaggregated memory, batching, and power for realistic LLM serving

Overview

Decision Snapshot: Ready For Pilot

Validated against multiple real systems (RTX A6000, H100, TPU) with percent-level errors; relies on one-time operator profiling to generalize to new hardware.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/12

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Jaehong Cho, Hyunmin Choi, Guseul Heo, Jongse Park

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Run realistic what-if tests for heterogeneous hardware and disaggregated serving to choose cheaper or more energy-efficient deployments before costly hardware changes.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

LLMServingSim 2.0 is a unified, runtime-driven simulator for modern LLM serving that explicitly models heterogeneous accelerators, multi-tier memory, disaggregated prefill/decode execution, MoE routing, prefix caching, and power. It uses operator-level profiles (collected via a PyTorch/HuggingFace profiler or external simulators) and a single runtime loop to capture batching, routing, placement, KV cache movement, and interconnect contention. Validations against real GPU and TPU deployments show sub-2% aggregated errors on throughput/latency/memory/power and practical simulation times, making it a practical tool to explore hardware-software co-design and energy tradeoffs before building real

Problem Statement

Existing simulators either model hardware microarchitecture or high-level serving policies, but not both together in a runtime-driven way. This gap prevents studying how heterogeneous accelerators, multi-tier memory, and disaggregated serving techniques interact at serving time to affect latency, throughput, and energy.

Main Contribution

A runtime-driven serving loop that jointly models software policies and heterogeneous hardware behavior, enabling interaction-aware evaluation of batching, routing, placement, caching, and offloading.

Profile-based operator modeling for easy integration of new accelerators and PIM devices using a single-device profiler or external profiles.
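The two contributions above can be pictured with a toy sketch: a lookup table of per-operator latencies stands in for the profile, and an iteration-level loop forms a batch, advances a clock by the profiled latency, and retires finished requests. This is only an illustration under stated assumptions. The operator names, latency numbers, FCFS admission policy, and every function name are hypothetical and are not the LLMServingSim API.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical profile: latency in ms per (operator, batch size) on one device.
# A real profile would come from a profiler run; these numbers are made up.
PROFILE_MS = {
    ("attention", 1): 0.20, ("attention", 2): 0.30, ("attention", 4): 0.50,
    ("mlp", 1): 0.40, ("mlp", 2): 0.50, ("mlp", 4): 0.70,
}
PROFILED_SIZES = (1, 2, 4)

@dataclass
class Request:
    req_id: int
    tokens_left: int       # decode tokens still to generate
    finish_ms: float = 0.0

def step_latency_ms(batch_size: int, num_layers: int) -> float:
    """One decode iteration: sum profiled operator latencies over all layers."""
    per_layer = PROFILE_MS[("attention", batch_size)] + PROFILE_MS[("mlp", batch_size)]
    return per_layer * num_layers

def simulate(requests, max_batch=4, num_layers=2):
    """Iteration-level loop: batch, advance the clock, retire finished requests."""
    waiting, running, done = deque(requests), [], []
    clock_ms = 0.0
    while waiting or running:
        # Admit waiting requests until the batch is full (simple FCFS policy).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # Round the batch size up to the nearest profiled size.
        size = next(s for s in PROFILED_SIZES if s >= len(running))
        clock_ms += step_latency_ms(size, num_layers)
        for r in running:
            r.tokens_left -= 1
            if r.tokens_left == 0:
                r.finish_ms = clock_ms
        done += [r for r in running if r.tokens_left == 0]
        running = [r for r in running if r.tokens_left > 0]
    return clock_ms, done

total_ms, finished = simulate([Request(0, 2), Request(1, 3), Request(2, 1)])
print(f"simulated time: {total_ms:.1f} ms, finish order: {[r.req_id for r in finished]}")
```

Swapping in a different device in this sketch means swapping the profile table while the loop stays unchanged, which is the property the profile-based approach is meant to provide.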

Key Findings

Simulator reproduces key serving metrics with very low average error across evaluated setups.

Numbers: Average error 0.97% across throughput, latency, memory, and power

Practical Use: Use the simulator to estimate real-system throughput, latency, memory, and energy with percent-level accuracy before deploying hardware changes.

Evidence Ref: Abstract; VII.A

Time-series throughput tracking shows small per-step error on common GPUs.

Numbers: Per-timestep throughput error 5.66% (RTX A6000) and 2.98% (H100); aggregated error 0.85% and 1.59%

Practical Use: Expect close matching of dynamic behavior (bursts, batching phases) for GPU-based evaluation; use aggregated numbers for steady-state decisions.

Evidence Ref: VII.A Fig.5
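The gap between per-timestep and aggregated error is easier to see with a small worked example. The sketch below assumes per-timestep error is the mean absolute relative error across samples and aggregated error is the relative error of the run totals; the paper's exact definitions are not given here, and the throughput samples are made up. Over- and under-estimates at individual timesteps cancel in the total, so the aggregated figure comes out smaller.

```python
def per_timestep_error(sim, real):
    """Mean absolute relative error over timesteps (assumed definition)."""
    return sum(abs(s - r) / r for s, r in zip(sim, real)) / len(real)

def aggregated_error(sim, real):
    """Relative error of the totals over the whole run (assumed definition)."""
    return abs(sum(sim) - sum(real)) / sum(real)

# Made-up throughput samples (tokens/s), for illustration only.
real = [100.0, 200.0, 100.0, 200.0]
sim = [110.0, 190.0, 95.0, 210.0]

step_err = per_timestep_error(sim, real)
agg_err = aggregated_error(sim, real)
print(f"per-timestep error: {step_err:.2%}, aggregated error: {agg_err:.2%}")
```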

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	avg error 0.97%	—	—	Representative GPU/TPU workloads	Abstract; VII.A	Abstract; VII.A
Per-timestep throughput error (RTX A6000)	5.66%	real RTX A6000 vLLM	—	Time-series throughput	VII.A Fig.5	VII.A Fig.5

What To Try In 7 Days

Profile one model on a target device using the paper's operator profiler and run baseline simulations.

Simulate prefill-decode disaggregation to compare latency and network cost for your workload.

Try prefix caching/shared CPU cache to estimate hit-rate gains and memory savings.
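Before simulating prefill-decode disaggregation, a back-of-envelope estimate of the KV cache that must move between prefill and decode instances gives a sanity check on network cost. The sketch below uses the standard per-token KV footprint (keys plus values, per layer, per KV head). The model shape is an assumption for a Llama-3.1-8B-like model with an FP16 KV cache, and the prompt length and link bandwidth are hypothetical. It ignores link latency and contention, which the simulator models and this sketch does not.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache footprint of one token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def transfer_ms(prompt_tokens, bytes_per_token, link_gbytes_per_s):
    """Time to move a prompt's KV cache over a link, ignoring latency and contention."""
    total_bytes = prompt_tokens * bytes_per_token
    return total_bytes / (link_gbytes_per_s * 1e9) * 1e3

# Assumed shape: 32 layers, 8 KV heads, head dimension 128, 2 bytes per element.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
# Assumed 2048-token prompt and a hypothetical 25 GB/s link.
ms = transfer_ms(prompt_tokens=2048, bytes_per_token=per_token, link_gbytes_per_s=25)
print(f"{per_token // 1024} KiB per token, about {ms:.1f} ms to transfer the prompt's KV cache")
```

If the estimated transfer time is a large fraction of your time-to-first-token budget, the disaggregation experiment deserves a closer look at interconnect settings.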

Agent Features

Memory

KV cache modeling, prefix caching, multi-tier placement, shared CPU/CXL prefix cache

Planning

batch scheduling, operator mapping, placement and offloading, prefill-decode routing

Tool Use

profile-based operator modeling, PyTorch/HuggingFace profiler, vLLM integration

Frameworks

ASTRAsim extension, Chakra extension

Architectures

heterogeneous accelerators, multi-tier memory, PIM, CXL-attached memory, TPU, GPU

Collaboration

multi-MSG sharing, cross-instance prefix cache sharing

Optimization Features

Infra Optimization

disaggregation (prefill-decode), CXL memory pooling, heterogeneous device pools, interconnect contention modeling

System Optimization

device placement, placement-aware batching, KV cache placement and migration, power-state management

Inference Optimization

operator-level offloading, sub-batch interleaving (SBI), MoE, parallelism strategies (TP/PP/DP/EP), prefix caching and eviction policies, PIM-accelerated decode

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/casys-kaist/LLMServingSim

Data URLs

https://sharegpt.com

Risks & Boundaries

Limitations

Higher simulation time than lightweight tools due to detailed runtime and memory modeling.

TPU validation limited to single-instance dense serving due to current vLLM-TPU support.

When Not To Use

For very quick, coarse estimates where percent-level accuracy is unnecessary and speed matters more.

When you lack operator-level profiles or the ability to collect them for target devices.

Failure Modes

Mismatched or stale operator profiles cause inaccurate latency/power estimates.

Unmodeled workload patterns (different request distributions) can produce larger errors than reported.

Core Entities

Models

Llama 3.1-8B, Llama 3.1-70B, Mixtral 8x7B, Phi-mini MoE

Metrics

throughput, TTFT, TPOT, end-to-end latency, queueing delay, memory usage, prefix cache hit rate, power, energy, watts per token
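For readers less familiar with the latency metrics listed above, the sketch below computes TTFT (time to first token), TPOT (time per output token), and end-to-end latency from a request's arrival time and per-token emission timestamps. It uses commonly used definitions, which may differ in detail from those in the paper, and the function name and timestamps are hypothetical.

```python
def latency_metrics(arrival_s, token_times_s):
    """Per-request latency metrics from an arrival time and token emission times."""
    n = len(token_times_s)
    ttft = token_times_s[0] - arrival_s        # wait until the first output token
    e2e = token_times_s[-1] - arrival_s        # arrival to last output token
    # TPOT averages the gaps between consecutive output tokens after the first.
    tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e}

# Made-up timestamps in seconds, for illustration only.
m = latency_metrics(arrival_s=10.0, token_times_s=[10.5, 10.75, 11.0, 11.25])
print(m)
```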

Datasets

ShareGPT

