APEX: fast, extensible simulator that finds cost- and energy-efficient parallel plans for LLM serving

November 26, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

APEX is a practical tool: it is validated against live clusters, supports key parallelisms and quantizations, and is open-source. The need for one-time profiling and some implementation gaps (e.g., the quality of expert-parallelism (EP) implementations varies) temper readiness.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yi-Chien Lin, Woosuk Kwon, Ronald Pineda, Fanny Nina Paravecino

Why It Matters For Business

APEX lets ops teams find faster or cheaper LLM serving configurations without burning expensive GPU hours, enabling targeted trade-offs between latency, throughput, and energy while meeting SLOs.

Summary TLDR

APEX is a simulator that searches and evaluates parallel execution plans for serving large language models. It models iteration-level batching (requests are added dynamically as memory frees up), operation-level profiled costs, and hybrid parallelisms (data/pipeline/tensor/expert). On the evaluated traces and clusters, APEX predicts performance with 10.7% average relative error, finds plans up to 3.37× faster than common heuristics, and can surface energy-optimal plans that cut energy by up to 45% in exchange for higher latency. APEX runs on a CPU, finds plans within ~15 minutes, and reduces time and monetary cost compared to live GPU testing (71× faster and ~1234× cheaper on the reported setup). Code is available.

Problem Statement

Choosing how to parallelize an LLM across many devices is hard. Iteration-level batching makes batch sizes dynamic and interleaves prefill/decode work. The design space explodes with model size, cluster topology, quantization, and hybrid parallelisms. Exhaustive deployment testing is prohibitively slow and expensive, and static heuristics can be far from optimal in practice.

Main Contribution

APEX simulator that automatically generates and evaluates parallel execution plans for LLM serving.

Dynamism-aware simulation: models iteration-level batching and mixed prefill/decode stages.
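To make iteration-level batching concrete, here is a minimal sketch; it is not APEX's implementation. It assumes a hypothetical KV-cache budget counted in tokens, first-come-first-served admission, and memory reserved for a request's full sequence at admission time. The names `Request`, `simulate`, and `kv_budget_tokens` are illustrative only.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int      # tokens processed in the prefill step
    output_tokens: int      # tokens to generate in total
    generated: int = 0      # tokens generated so far
    prefilled: bool = False

    @property
    def reserved_tokens(self) -> int:
        # Simplifying assumption: reserve KV space for the whole sequence up front.
        return self.prompt_tokens + self.output_tokens


def simulate(requests, kv_budget_tokens):
    """Return the batch size (number of running requests) of every iteration."""
    waiting = deque(requests)
    running = []
    used = 0
    batch_sizes = []
    while waiting or running:
        # Admit waiting requests in arrival order while memory is available.
        while waiting and used + waiting[0].reserved_tokens <= kv_budget_tokens:
            req = waiting.popleft()
            used += req.reserved_tokens
            running.append(req)
        if not running:
            raise ValueError("a request does not fit in the KV budget")
        batch_sizes.append(len(running))
        # One iteration mixes stages: new requests prefill, the rest decode.
        for req in running:
            if not req.prefilled:
                req.prefilled = True
                req.generated = 1   # prefill produces the first token
            else:
                req.generated += 1  # decode produces one more token
        # Retire finished requests and free their memory for the next iteration.
        used -= sum(r.reserved_tokens for r in running
                    if r.generated >= r.output_tokens)
        running = [r for r in running if r.generated < r.output_tokens]
    return batch_sizes
```

With three requests of different lengths and a budget that fits only two at once, the third request joins in the iteration after the first one finishes, so the batch composition changes from iteration to iteration. That variability is what a static batch-size model would miss.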

Key Findings

APEX prediction fidelity is high.

Numbers: average relative error = 10.7%

Practical Use: Use APEX to compare plans; predicted relative differences between plans are reliable, but expect predicted values to deviate from measured ones by roughly 10% on average on the evaluated setups.

Evidence Ref: Section 4.3, Fig. 6
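For readers who want to check prediction fidelity on their own cluster, the sketch below computes an average relative error. It assumes the common definition, |predicted − measured| / measured averaged over runs; this summary does not confirm that the paper uses exactly this formula. The function name and the sample latencies are hypothetical.

```python
def mean_relative_error(predicted, measured):
    """Average of |predicted - measured| / |measured| over paired samples."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(p - m) / abs(m) for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)


# Hypothetical latencies in seconds for three plans (not from the paper).
predicted = [9.0, 22.0, 33.0]
measured = [10.0, 20.0, 30.0]
error = mean_relative_error(predicted, measured)

# The simulator is still useful for plan selection if it ranks plans
# in the same order as the measurements do.
same_ranking = (sorted(range(3), key=predicted.__getitem__)
                == sorted(range(3), key=measured.__getitem__))
```

In this made-up example every prediction is off by 10%, yet the ranking of the three plans is unchanged, which is the property that matters when the simulator is used to choose between plans.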

Optimized parallel plans can greatly reduce latency vs common heuristics.

Numbers: up to 3.37× speedup on evaluated traces and clusters

Practical Use: Run APEX to search hybrid parallelisms; you may cut end-to-end latency severalfold (up to 3.37× in the reported results) versus common heuristic rules such as TP within a node and PP across nodes.

Evidence Ref: Abstract, Table 2, Section 4.2
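The size of the search space behind this finding can be illustrated with a small sketch. It enumerates every way to split a device count into data-, pipeline-, and tensor-parallel degrees and shows that the heuristic baseline is a single point in that space. This is a simplified illustration, not APEX's plan generator: expert parallelism, quantization, and device mapping are left out, and all function names are hypothetical.

```python
from itertools import product


def divisors(n):
    """All positive divisors of n."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_plans(num_devices):
    """All (dp, pp, tp) degree triples whose product equals num_devices."""
    plans = []
    for dp, pp in product(divisors(num_devices), repeat=2):
        if num_devices % (dp * pp) == 0:
            plans.append((dp, pp, num_devices // (dp * pp)))
    return plans


def heuristic_plan(num_nodes, gpus_per_node):
    """Baseline rule: TP within each node, PP across nodes, no DP."""
    return (1, num_nodes, gpus_per_node)


plans = enumerate_plans(8)         # e.g. 2 nodes with 4 GPUs each
baseline = heuristic_plan(2, 4)    # (dp=1, pp=2, tp=4)
```

Even for 8 devices there are 10 degree combinations before considering expert parallelism or which physical GPUs each group maps to, so a fixed rule evaluates only a small fraction of the candidates that a search can cover.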

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Prediction error | 10.7% average relative error | - | - | evaluation tasks in Section 4.3 | Section 4.3 reports 10.7% average relative error | Section 4.3
Best latency speedup vs heuristic baseline | 3.37× | heuristic TP-within-node + PP-across-nodes | up to 3.37× faster on evaluated traces | Table 2 (various models & traces) | Table 2 reports up to 3.37× improvement over baseline heuristics | Table 2, Section 4.2

What To Try In 7 Days

Run APEX with a 1–2k request trace from your service to compare heuristic vs optimized plans

Collect operation-level profiling for your cluster (one-time) and plug into APEX

Use APEX to find an energy-optimal plan under your SLO and test lowered GPU clock settings in staging
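The last step amounts to a constrained selection: among the plans that meet the latency SLO, pick the one with the lowest energy. The sketch below shows that selection on simulator output you have already collected. The `PlanResult` record, its field names, and the numbers are hypothetical and do not reflect APEX's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanResult:
    name: str
    p95_latency_s: float   # predicted P95 latency in seconds
    energy_kj: float       # predicted energy consumption in kJ


def energy_optimal(results, slo_latency_s):
    """Lowest-energy plan whose P95 latency meets the SLO, or None."""
    feasible = [r for r in results if r.p95_latency_s <= slo_latency_s]
    if not feasible:
        return None
    return min(feasible, key=lambda r: r.energy_kj)


# Made-up predictions for three candidate plans.
results = [
    PlanResult("fast", p95_latency_s=2.0, energy_kj=100.0),
    PlanResult("balanced", p95_latency_s=3.0, energy_kj=70.0),
    PlanResult("frugal", p95_latency_s=5.0, energy_kj=50.0),
]
choice = energy_optimal(results, slo_latency_s=4.0)
```

Here the lowest-energy plan overall violates the 4-second SLO, so the selection falls back to the next cheapest plan that satisfies it. Validate the chosen plan in staging before changing GPU clock settings in production.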

Optimization Features

Infra Optimization: GPU frequency tuning, multi-node cluster topologies
Model Optimization: quantization (W8A8, FP8, KV quantization)
System Optimization: device mapping aware of interconnect bandwidth, energy-vs-latency trade-off search
Inference Optimization: hybrid parallelism search (DP, PP, TP, EP), iteration-level batching modeling, batch-size constraint tuning

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Operation-level profiling omits smaller ops; APEX can under-estimate absolute latency compared to full deployment.

Some optimal plans require parallelism features not yet supported in production serving systems.

When Not To Use

When you need cycle-accurate, per-operation microbenchmarking on unprofiled hardware

When your deployment uses custom ops that the Transformer IR does not capture, unless you first add templates for them

Failure Modes

Prediction gaps when a parallelism's real implementation is poorly optimized (example: EP vs TP mismatch).

Underestimation of absolute latencies because non-key operation overheads are omitted.

Core Entities

Models

Llama-3.1-70B, Llama-3.1-405B, Mistral-Large-123B, Mixtral-8x22B (MoE), Qwen2.5-32B, synthetic trillion-scale (scaled Llama config)

Metrics

Time to first token (TTFT), Time per output token (TPOT), P95 latency, End-to-end latency, Energy consumption (kJ), Model FLOPs Utilization, Model Bandwidth Utilization
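As a reference for the latency metrics above, the following sketch computes TTFT, TPOT, and a percentile from per-token timestamps. It uses commonly used definitions (TPOT as the mean gap between consecutive output tokens, percentile by the nearest-rank method); the paper's exact definitions may differ, and the function names are hypothetical.

```python
import math


def ttft(arrival_time, token_times):
    """Time to first token: first token timestamp minus request arrival."""
    return token_times[0] - arrival_time


def tpot(token_times):
    """Time per output token: mean gap between consecutive output tokens."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile, e.g. pct=95 for P95 latency."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# One request arriving at t=0.0 s with three output tokens.
token_times = [0.5, 0.6, 0.7]
first_token_latency = ttft(0.0, token_times)
per_token_latency = tpot(token_times)

# P95 over twenty hypothetical end-to-end latencies of 1..20 seconds.
p95 = percentile_nearest_rank(list(range(1, 21)), 95)
```

End-to-end latency for a request is the last token timestamp minus its arrival time, so it can be derived from the same inputs.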

Datasets

paper summarization trace (from [9]), modified news-abstract creation trace, LMSYS-Chat-1M (Chat trace)