A runtime-driven simulator that models heterogeneous accelerators, disaggregated memory, batching, and power for realistic LLM serving

February 26, 2026 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Jaehong Cho, Hyunmin Choi, Guseul Heo, Jongse Park

Links

Abstract / PDF

Why It Matters For Business

Run realistic what-if tests for heterogeneous hardware and disaggregated serving to choose cheaper or more energy-efficient deployments before costly hardware changes.

Summary TLDR

LLMServingSim 2.0 is a unified, runtime-driven simulator for modern LLM serving that explicitly models heterogeneous accelerators, multi-tier memory, disaggregated prefill/decode execution, MoE routing, prefix caching, and power. It uses operator-level profiles (collected via a PyTorch/HuggingFace profiler or external simulators) and a single runtime loop to capture batching, routing, placement, KV cache movement, and interconnect contention. Validations against real GPU and TPU deployments show sub-2% aggregated errors on throughput/latency/memory/power and practical simulation times, making it a practical tool to explore hardware-software co-design and energy tradeoffs before building real

Problem Statement

Existing simulators either model hardware microarchitecture or high-level serving policies, but not both together in a runtime-driven way. This gap prevents studying how heterogeneous accelerators, multi-tier memory, and disaggregated serving techniques interact at serving time to affect latency, throughput, and energy.

Main Contribution

A runtime-driven serving loop that jointly models software policies and heterogeneous hardware behavior, enabling interaction-aware evaluation of batching, routing, placement, caching, and offloading.

Profile-based operator modeling for easy integration of new accelerators and PIM devices using a single-device profiler or external profiles.

A system simulator that includes multi-tier KV cache placement, prefix caching, PD disaggregation (prefill/decode), MoE routing/offloading, and a three-state power model.

Validation against real GPU and TPU systems with low error and practical simulation times, plus public code on GitHub.
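
The profile-based operator modeling listed above can be pictured as a latency lookup table that the runtime loop queries on every step. The following is a minimal sketch under stated assumptions, not the simulator's actual code: the `PROFILE` table, its operator names, the sample latencies, and the linear interpolation between profiled sizes are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_left

# Hypothetical profile: operator name -> sorted (num_tokens, latency_ms) samples.
# The numbers are invented for illustration, not measured on any device.
PROFILE = {
    "attention": [(1, 0.05), (128, 0.40), (512, 1.60)],
    "mlp": [(1, 0.08), (128, 0.50), (512, 1.90)],
}

def op_latency_ms(op: str, num_tokens: int) -> float:
    """Look up one operator's latency, interpolating linearly between samples."""
    samples = PROFILE[op]
    sizes = [n for n, _ in samples]
    i = bisect_left(sizes, num_tokens)
    if i == 0:
        return samples[0][1]   # at or below the smallest profiled size
    if i == len(samples):
        return samples[-1][1]  # clamp beyond the largest profiled size
    (n0, t0), (n1, t1) = samples[i - 1], samples[i]
    return t0 + (t1 - t0) * (num_tokens - n0) / (n1 - n0)

def layer_latency_ms(num_tokens: int) -> float:
    """Sum the latency of every profiled operator for one batch size."""
    return sum(op_latency_ms(op, num_tokens) for op in PROFILE)

print(round(op_latency_ms("attention", 320), 3))
print(round(layer_latency_ms(128), 3))
```

Adding a new device in this picture means supplying another table of samples, which is the appeal of a profile-driven design; clamping outside the profiled range is a simplification that a real tool would likely handle more carefully.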

Key Findings

Simulator reproduces key serving metrics with very low average error across evaluated setups.

Numbers: Average error 0.97% across throughput, latency, memory, and power

Time-series throughput tracking shows small per-step error on common GPUs.

Numbers: Per-timestep throughput error 5.66% (RTX A6000) and 2.98% (H100); aggregated error 0.85% and 1.59%
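
Per-timestep error is larger than aggregated error because over- and under-predictions at individual steps partly cancel in the totals. The sketch below illustrates the effect with invented throughput traces. It assumes per-timestep error is a mean absolute percentage error and aggregated error is the percentage error of the run totals; this summary does not state the paper's exact definitions.

```python
def per_timestep_error(sim, real):
    """Mean absolute percentage error across time steps."""
    return 100.0 * sum(abs(s - r) / r for s, r in zip(sim, real)) / len(real)

def aggregated_error(sim, real):
    """Percentage error of the totals over the whole run."""
    return 100.0 * abs(sum(sim) - sum(real)) / sum(real)

# Invented traces (tokens/s per time step), for illustration only.
real = [100.0, 200.0, 100.0, 200.0]
sim = [110.0, 190.0, 95.0, 208.0]

print(round(per_timestep_error(sim, real), 2))
print(round(aggregated_error(sim, real), 2))
```

In this toy trace the step-level errors average 6% while the totals differ by only 0.5%, the same qualitative gap seen in the reported numbers.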

Power model closely matches measured system energy and power dynamics.

Numbers: Average energy error 1.34% (RTX A6000 experiments)
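
A three-state power model of the kind mentioned under Main Contribution can be sketched as a per-state wattage table integrated over a device timeline. This is an illustration under assumptions: the state names (`active`, `idle`, `standby`) and the wattages are hypothetical, since this summary does not say which three states the simulator uses.

```python
# Hypothetical per-state power draw in watts; names and values are assumptions.
STATE_POWER_W = {"active": 300.0, "idle": 60.0, "standby": 20.0}

def energy_joules(timeline):
    """timeline: list of (state, duration_seconds) entries for one device."""
    return sum(STATE_POWER_W[state] * seconds for state, seconds in timeline)

def avg_power_watts(timeline):
    """Average power over the timeline; 0.0 for an empty timeline."""
    total_s = sum(seconds for _, seconds in timeline)
    return energy_joules(timeline) / total_s if total_s else 0.0

timeline = [("active", 2.0), ("idle", 1.0), ("standby", 1.0)]
print(energy_joules(timeline))    # 300*2 + 60*1 + 20*1 joules
print(avg_power_watts(timeline))  # energy divided by 4 seconds
```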

Memory and prefix cache dynamics are reproduced accurately, including multi-instance sharing.

Numbers: Average memory/prefix error 0.93% (single-instance) and 0.41% (multi-instance)

Simulator integrates non-GPU accelerators with acceptable accuracy when profiles exist.

Numbers: TPU per-timestep throughput error 4.24%; aggregated error <0.2%

Processing-in-memory (PIM) can speed decoding and reduce energy in the evaluated case.

Numbers: GPU+PIM achieved 1.43× throughput post-prefill and reduced watts-per-token by 32.3%

Results

Average error across throughput, latency, memory, and power

Value: 0.97%

Per-timestep throughput error (RTX A6000)

Value: 5.66%

Baseline: real RTX A6000 vLLM

Per-timestep throughput error (H100)

Value: 2.98%

Baseline: real H100 vLLM

Aggregated throughput/latency error (RTX A6000)

Value: 0.85%

Baseline: real RTX A6000 vLLM

Aggregated throughput/latency error (H100)

Value: 1.59%

Baseline: real H100 vLLM

Energy prediction error (RTX A6000)

Value: 1.34%

Baseline: measured power/energy

Memory/prefix cache error (single-instance)

Value: 0.93%

Baseline: real RTX A6000

Memory/prefix cache error (multi-instance)

Value: 0.41%

Baseline: real RTX A6000 with LMCache

TPU throughput per-timestep error

Value: 4.24%

Baseline: real TPU-v6e-1

PIM decode throughput improvement

Value: 1.43×

Baseline: GPU-only system

Energy efficiency (watts per token) improvement with GPU+PIM

Value: 32.3% reduction

Baseline: GPU-only system

Simulation time

Value: on the order of minutes (≈10 min for complex configs)

Who Should Care

What To Try In 7 Days

Profile one model on a target device using the paper's operator profiler and run baseline simulations.

Simulate prefill-decode disaggregation to compare latency and network cost for your workload.
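
Before running a disaggregation experiment, a back-of-the-envelope estimate of the KV cache that must move from the prefill instance to the decode instance helps frame the network cost. The sketch below uses the published Llama 3.1-8B shape (32 layers, 8 KV heads, head dimension 128) with 16-bit KV entries. The 25 GB/s link bandwidth is a placeholder assumption, and the estimate ignores contention and protocol overhead, which a simulator would model.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes for one token; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def kv_transfer_ms(prompt_tokens, link_bytes_per_s, model, bytes_per_elem=2):
    """Idealized time to ship a prompt's KV cache over one link."""
    per_token = kv_bytes_per_token(bytes_per_elem=bytes_per_elem, **model)
    return 1000.0 * prompt_tokens * per_token / link_bytes_per_s

# Llama 3.1-8B attention shape (grouped-query attention with 8 KV heads).
LLAMA_31_8B = dict(layers=32, kv_heads=8, head_dim=128)

# Assumed link speed of 25 GB/s; replace with the target interconnect.
print(kv_bytes_per_token(**LLAMA_31_8B))                   # bytes per token
print(round(kv_transfer_ms(1024, 25e9, LLAMA_31_8B), 2))   # ms for 1024 tokens
```

At 128 KiB per token, a 1024-token prompt carries 128 MiB of KV state, so the transfer cost scales directly with prompt length and is worth comparing against the expected prefill time.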

Try prefix caching/shared CPU cache to estimate hit-rate gains and memory savings.
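
For a rough feel of the hit rate a prefix cache might deliver on a given request trace, a small block-level LRU model is enough. This is a simplified sketch, not the simulator's cache: the block size, capacity, LRU eviction, and keying each block by its full token prefix are all assumptions made for illustration.

```python
from collections import OrderedDict

def prefix_hit_rate(requests, block_size=16, capacity_blocks=1024):
    """Fraction of full token blocks served from an LRU prefix cache.

    requests: iterable of token-id lists. A block counts as a hit only if
    every earlier block of the same request was also a hit.
    """
    cache = OrderedDict()  # key: token prefix ending at a block boundary
    hits = total = 0
    for tokens in requests:
        matching = True
        for end in range(block_size, len(tokens) + 1, block_size):
            key = tuple(tokens[:end])
            total += 1
            if matching and key in cache:
                hits += 1
                cache.move_to_end(key)  # refresh recency
            else:
                matching = False
                cache[key] = None
                cache.move_to_end(key)
                if len(cache) > capacity_blocks:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0

# Two requests sharing a 4-token prefix, using 4-token blocks.
reqs = [[1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 9, 9, 9, 9]]
print(prefix_hit_rate(reqs, block_size=4))
```

Keying by the full prefix tuple keeps the sketch short but costs quadratic memory in prompt length; a practical version would chain block hashes instead.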

Agent Features

Memory

  • KV cache modeling
  • prefix caching
  • multi-tier placement
  • shared CPU/CXL prefix cache

Planning

  • batch scheduling
  • operator mapping
  • placement and offloading
  • prefill-decode routing

Tool Use

  • profile-based operator modeling
  • PyTorch/HuggingFace profiler
  • vLLM integration

Frameworks

  • ASTRA-sim extension
  • Chakra extension

Architectures

  • heterogeneous accelerators
  • multi-tier memory
  • PIM
  • CXL-attached memory
  • TPU
  • GPU

Collaboration

  • multi-MSG sharing
  • cross-instance prefix cache sharing

Optimization Features

Infra Optimization

  • disaggregation (prefill-decode)
  • CXL memory pooling
  • heterogeneous device pools
  • interconnect contention modeling

System Optimization

  • device placement
  • placement-aware batching
  • KV cache placement and migration
  • power-state management

Inference Optimization

  • operator-level offloading
  • sub-batch interleaving (SBI)
  • MoE
  • parallelism strategies (TP/PP/DP/EP)
  • prefix caching and eviction policies
  • PIM-accelerated decode

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Higher simulation time than lightweight tools due to detailed runtime and memory modeling.
  • TPU validation limited to single-instance dense serving due to current vLLM-TPU support.
  • Accuracy depends on quality and completeness of operator-level profiles collected once per device-model pair.
  • Network and large-scale cluster effects beyond evaluated topologies may need further validation.

When Not To Use

  • For very quick, coarse estimates where percent-level accuracy is unnecessary and speed matters more.
  • When you lack operator-level profiles or the ability to collect them for target devices.
  • For microarchitecture-level design work that needs cycle-accurate hardware detail.

Failure Modes

  • Mismatched or stale operator profiles cause inaccurate latency/power estimates.
  • Unmodeled workload patterns (different request distributions) can produce larger errors than reported.
  • Simplified assumptions about global network topology may understate contention in very large clusters.

Core Entities

Models

  • Llama 3.1-8B
  • Llama 3.1-70B
  • Mixtral 8x7B
  • Phi-mini MoE

Metrics

  • throughput
  • TTFT
  • TPOT
  • end-to-end latency
  • queueing delay
  • memory usage
  • prefix cache hit rate
  • power
  • energy
  • watts per token
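
The per-request latency metrics in this list can be derived from an arrival time and the emission times of the output tokens. The sketch below uses the common definitions: TTFT as the delay to the first token, TPOT as the mean gap between subsequent tokens, and end-to-end latency as the delay to the last token. The function name and the dictionary layout are hypothetical, and the paper may define these metrics with minor differences.

```python
def request_metrics(arrival_s, token_times_s):
    """Compute TTFT, TPOT, and end-to-end latency for one request.

    arrival_s: request arrival time in seconds.
    token_times_s: non-empty, increasing list of token emission times.
    """
    n = len(token_times_s)
    ttft = token_times_s[0] - arrival_s
    e2e = token_times_s[-1] - arrival_s
    # TPOT averages the gaps after the first token; undefined for one token.
    tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "e2e": e2e}

m = request_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
print({k: round(v, 3) for k, v in m.items()})
```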

Datasets

  • ShareGPT