APEX: fast, extensible simulator that finds cost- and energy-efficient parallel plans for LLM serving

Overview

Decision Snapshot: Ready For Pilot

APEX is a practical tool: it is validated against live clusters, supports the key parallelisms and quantizations, and is open-source. Readiness is tempered by the one-time profiling it requires and by implementation gaps in serving systems (e.g., the quality of expert-parallelism (EP) implementations varies).

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yi-Chien Lin, Woosuk Kwon, Ronald Pineda, Fanny Nina Paravecino

Why It Matters For Business

APEX lets ops teams find faster or cheaper LLM serving configurations without burning expensive GPU hours, enabling targeted trade-offs between latency, throughput, and energy while meeting SLOs.

Who Should Care

CTO, Engineering Lead, ML Engineer, Product Manager, Founder

Summary TLDR

APEX is a simulator that searches and evaluates parallel execution plans for serving large language models. It models iteration-level batching (requests are added dynamically as memory frees), operation-level profiled cost, and hybrid parallelisms (data/pipeline/tensor/expert). On the evaluated traces and clusters, APEX predicts performance with 10.7% average relative error, finds plans up to 3.37× faster than common heuristics, and can surface energy-optimal plans that cut energy by up to 45% in exchange for higher latency. APEX runs on a CPU, finds plans within ~15 minutes, and reduces time and monetary cost compared to live GPU testing (71× faster and ~1234× cheaper on the reported setup). Code is available.

Problem Statement

Choosing how to parallelize an LLM across many devices is hard. Iteration-level batching makes batch sizes dynamic and interleaves prefill/decode work. The design space explodes with model size, cluster topology, quantization, and hybrid parallelisms. Exhaustive deployment testing is prohibitively slow and expensive, and static heuristics can be far from optimal in practice.
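To make the size of the design space concrete, the following minimal sketch enumerates only the data/pipeline/tensor parallel degrees whose product equals the device count. It is an illustration, not APEX's search procedure: the function names are hypothetical, and it ignores expert parallelism, quantization, and device mapping, each of which multiplies the number of candidates further.

```python
def divisors(n: int) -> list[int]:
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_plans(num_devices: int) -> list[tuple[int, int, int]]:
    """List every (dp, pp, tp) triple with dp * pp * tp == num_devices.

    Hypothetical helper for illustration only; a real planner would also
    consider expert parallelism, quantization, and device placement.
    """
    plans = []
    for dp in divisors(num_devices):
        for pp in divisors(num_devices // dp):
            tp = num_devices // (dp * pp)
            plans.append((dp, pp, tp))
    return plans


if __name__ == "__main__":
    for n in (8, 16, 64):
        print(n, "devices:", len(enumerate_plans(n)), "candidate (dp, pp, tp) plans")
```

Even this restricted enumeration grows with the device count, and every candidate would need its own deployment run if it were tested live rather than simulated.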

Main Contribution

APEX simulator that automatically generates and evaluates parallel execution plans for LLM serving.

Dynamism-aware simulation: models iteration-level batching and mixed prefill/decode stages.
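The idea of iteration-level batching can be sketched with a toy loop: at every iteration, waiting requests are admitted as long as their memory footprint fits, and finished requests release memory immediately. This is a simplified illustration under stated assumptions, not APEX's simulator: the `Request` class, the `simulate` function, and the token-count memory model are hypothetical, and each request's first iteration stands in for its prefill.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    output_tokens: int
    generated: int = 0

    @property
    def kv_tokens(self) -> int:
        # Assumption: reserve memory for the worst-case KV cache up front.
        return self.prompt_tokens + self.output_tokens


def simulate(requests: list[Request], kv_budget_tokens: int) -> int:
    """Return the number of iterations needed to finish all requests."""
    waiting = deque(requests)
    running: list[Request] = []
    used = 0
    iterations = 0
    while waiting or running:
        # Admit requests in arrival order while memory allows.
        while waiting and used + waiting[0].kv_tokens <= kv_budget_tokens:
            req = waiting.popleft()
            used += req.kv_tokens
            running.append(req)
        if not running:
            raise ValueError("a request does not fit in the KV budget")
        # One iteration: every running request produces one token.
        iterations += 1
        for req in running:
            req.generated += 1
        # Finished requests free their memory before the next iteration.
        still_running = []
        for req in running:
            if req.generated >= req.output_tokens:
                used -= req.kv_tokens
            else:
                still_running.append(req)
        running = still_running
    return iterations
```

With a budget of 25 tokens and requests of (10, 2), (10, 3), and (10, 1) prompt/output tokens, the third request joins as soon as the first finishes, so the batch composition changes between iterations instead of being fixed up front.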

Key Findings

APEX prediction fidelity is high.

Numbers: average relative error = 10.7%

Practical Use: Use APEX to compare plans. Predicted relative differences between plans are reliable; expect absolute predictions to deviate from measurements by roughly 10% on the evaluated setups.

Evidence Ref: Section 4.3, Fig. 6
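For readers who want to run the same kind of check against their own measurements, the sketch below computes an average relative error between predicted and measured values. It assumes the standard definition (mean of |predicted - measured| / |measured|); this summary does not state the paper's exact formula, and the function name is hypothetical.

```python
def mean_relative_error(predicted: list[float], measured: list[float]) -> float:
    """Mean of |predicted - measured| / |measured| over paired samples."""
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    if not measured:
        raise ValueError("at least one sample is required")
    if any(m == 0 for m in measured):
        raise ValueError("measured values must be non-zero")
    total = sum(abs(p - m) / abs(m) for p, m in zip(predicted, measured))
    return total / len(measured)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    simulated_latency_s = [9.0, 11.0]
    measured_latency_s = [10.0, 10.0]
    print(mean_relative_error(simulated_latency_s, measured_latency_s))
```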

Optimized parallel plans can greatly reduce latency vs common heuristics.

Numbers: up to 3.37× speedup on evaluated traces and clusters

Practical Use: Run APEX to search hybrid parallelisms; on the evaluated setups this cut end-to-end latency by up to 3.37× versus the common heuristic of TP within a node and PP across nodes.

Evidence Ref: Abstract, Table 2, Section 4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Prediction error	10.7% average relative error	—	—	evaluation tasks in Section 4.3	Section 4.3 reports 10.7% average relative error	Section 4.3
Best latency speedup vs heuristic baseline	3.37×	heuristic TP-within-node + PP-across-nodes	up to 3.37× faster on evaluated traces	Table 2 (various models & traces)	Table 2 reports up to 3.37× improvement over baseline heuristics	Table 2, Section 4.2

What To Try In 7 Days

Run APEX with a 1–2k request trace from your service to compare heuristic vs optimized plans

Collect operation-level profiling for your cluster (one-time) and plug into APEX

Use APEX to find an energy-optimal plan under your SLO and test lowered GPU clock settings in staging
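The last step above amounts to a constrained selection: among the plans whose predicted latency meets the SLO, pick the one with the lowest predicted energy. The sketch below shows that selection on hand-written estimates. The `PlanEstimate` class, the plan names, and the numbers are hypothetical placeholders; it does not call APEX or reproduce its output format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanEstimate:
    name: str             # hypothetical label for a parallel plan
    p95_latency_s: float  # predicted P95 latency in seconds
    energy_kj: float      # predicted energy consumption in kJ


def pick_energy_optimal(
    plans: list[PlanEstimate], slo_latency_s: float
) -> Optional[PlanEstimate]:
    """Return the lowest-energy plan that meets the latency SLO, or None."""
    feasible = [p for p in plans if p.p95_latency_s <= slo_latency_s]
    if not feasible:
        return None
    # Break energy ties in favor of the lower-latency plan.
    return min(feasible, key=lambda p: (p.energy_kj, p.p95_latency_s))


if __name__ == "__main__":
    # Illustrative estimates only, not results from the paper.
    candidates = [
        PlanEstimate("tp8", p95_latency_s=4.0, energy_kj=900.0),
        PlanEstimate("tp4-pp2", p95_latency_s=5.5, energy_kj=700.0),
        PlanEstimate("tp2-pp4", p95_latency_s=9.0, energy_kj=550.0),
    ]
    print(pick_energy_optimal(candidates, slo_latency_s=6.0))
```

Loosening the SLO enlarges the feasible set, which is how a latency-for-energy trade of the kind described in the summary becomes available.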

Optimization Features

Infra Optimization

GPU frequency tuning, multi-node cluster topologies

Model Optimization

quantization (W8A8, FP8, KV quantization)

System Optimization

device mapping aware of interconnect bandwidth, energy-vs-latency trade-off search

Inference Optimization

hybrid parallelism search (DP, PP, TP, EP), iteration-level batching modeling, batch-size constraint tuning

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/microsoft/apex_plus

Risks & Boundaries

Limitations

Operation-level profiling omits smaller ops; APEX can under-estimate absolute latency compared to full deployment.

Some optimal plans require parallelism features not yet supported in production serving systems.

When Not To Use

When you need cycle-accurate, per-operation microbenchmarking on unprofiled hardware

If your deployment uses custom ops that are not captured by the Transformer IR without adding templates

Failure Modes

Prediction gaps when a parallelism's real implementation is poorly optimized (example: EP vs TP mismatch).

Underestimation of absolute latencies because non-key operation overheads are omitted.

Core Entities

Models

Llama-3.1-70B, Llama-3.1-405B, Mistral-Large-123B, Mixtral-8x22B (MoE), Qwen2.5-32B, synthetic trillion-scale (scaled Llama config)

Metrics

Time to first token (TTFT), Time per output token (TPOT), P95 latency, End-to-end latency, Energy consumption (kJ), Model FLOPs Utilization, Model Bandwidth Utilization

Datasets

paper summarization trace (from [9]), modified news-abstract Creation trace, LMSYS-Chat-1M (Chat trace)
