APEX: fast, extensible simulator that finds cost- and energy-efficient parallel plans for LLM serving

November 26, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Yi-Chien Lin, Woosuk Kwon, Ronald Pineda, Fanny Nina Paravecino

Why It Matters For Business

APEX lets ops teams find faster or cheaper LLM serving configurations without burning expensive GPU hours, enabling targeted trade-offs between latency, throughput, and energy while meeting SLOs.

Summary TLDR

APEX is a simulator that searches and evaluates parallel execution plans for serving large language models. It models iteration-level batching (requests are added dynamically as memory frees), operation-level profiled costs, and hybrid parallelisms (data/pipeline/tensor/expert). On the evaluated traces and clusters, APEX predicts performance with 10.7% average relative error, finds plans up to 3.37× faster than common heuristics, and can surface energy-optimal plans that cut energy by up to 45% at the cost of some latency. APEX runs on a CPU, finds plans within ~15 minutes, and reduces time and monetary cost compared to live GPU testing (71× faster, ~1234× cheaper on the reported setup). Code is available.

Problem Statement

Choosing how to parallelize an LLM across many devices is hard. Iteration-level batching makes batch sizes dynamic and interleaves prefill/decode work. The design space explodes with model size, cluster topology, quantization, and hybrid parallelisms. Exhaustive deployment testing is prohibitively slow and expensive, and static heuristics can be far from optimal in practice.
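To make the dynamism concrete, here is a minimal toy sketch of iteration-level batching in Python. It is not APEX's implementation: the `Request` class, the `simulate` function, and the memory model (each request reserves prompt plus output tokens of KV cache for its whole lifetime) are all simplifying assumptions chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int   # tokens processed in the single prefill iteration
    output_tokens: int   # tokens generated, one per decode iteration
    generated: int = 0
    prefilled: bool = False


def simulate(requests, kv_capacity_tokens):
    """Return the batch size observed at each iteration (toy model)."""
    waiting = deque(requests)
    running = []
    used = 0
    batch_sizes = []
    while waiting or running:
        # Admit waiting requests in arrival order while KV memory is free.
        while waiting:
            need = waiting[0].prompt_tokens + waiting[0].output_tokens
            if used + need > kv_capacity_tokens:
                break
            used += need
            running.append(waiting.popleft())
        if not running:
            raise ValueError("a request does not fit in the KV cache")
        batch_sizes.append(len(running))
        # One iteration: new requests run prefill, the rest decode one token,
        # so prefill and decode work share the same batch.
        for r in running:
            if not r.prefilled:
                r.prefilled = True
            else:
                r.generated += 1
        # Retire finished requests and release their memory.
        for r in running:
            if r.generated >= r.output_tokens:
                used -= r.prompt_tokens + r.output_tokens
        running = [r for r in running if r.generated < r.output_tokens]
    return batch_sizes


if __name__ == "__main__":
    reqs = [Request(4, 1), Request(4, 3), Request(4, 1)]
    print(simulate(reqs, kv_capacity_tokens=12))
```

In this toy run the third request joins as soon as the first one finishes and frees memory, so the batch composition changes between iterations even though no new batch is ever "started".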

Main Contribution

APEX simulator that automatically generates and evaluates parallel execution plans for LLM serving.

Dynamism-aware simulation: models iteration-level batching and mixed prefill/decode stages.

Operation-level profiling + Transformer IR to scale simulation to billion- and trillion-scale models.

Supports hybrid parallelisms (DP, PP, TP, EP), quantizations, and diverse cluster topologies; modular for extension.

Demonstrated gains: accurate predictions (avg. 10.7% error), up to 3.37× latency speedup, and large energy/cost savings.
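The size of the hybrid-parallelism search space can be illustrated with a short enumeration sketch. This is an assumption-laden illustration, not APEX's plan generator: the `Plan` tuple, the `enumerate_plans` function, and the validity rules (the DP, PP, and TP degrees multiply to the device count, TP must divide the attention head count, PP cannot exceed the layer count, and EP splits experts inside each TP group) are hypothetical choices.

```python
from itertools import product
from typing import NamedTuple


class Plan(NamedTuple):
    dp: int  # data-parallel replicas
    pp: int  # pipeline stages
    tp: int  # tensor-parallel degree
    ep: int  # expert-parallel degree inside each TP group (1 for dense models)


def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_plans(num_devices, num_layers, num_heads, num_experts=1):
    """List candidate (dp, pp, tp, ep) plans under the assumed rules."""
    plans = []
    for dp, pp in product(divisors(num_devices), repeat=2):
        if num_devices % (dp * pp) != 0:
            continue
        tp = num_devices // (dp * pp)
        # Assumed constraints: at least one layer per stage, heads split evenly.
        if pp > num_layers or num_heads % tp != 0:
            continue
        for ep in divisors(tp):
            if num_experts % ep == 0:
                plans.append(Plan(dp, pp, tp, ep))
    return plans


if __name__ == "__main__":
    dense = enumerate_plans(num_devices=8, num_layers=80, num_heads=64)
    moe = enumerate_plans(num_devices=8, num_layers=56, num_heads=48, num_experts=8)
    print(len(dense), len(moe))
```

Even before adding quantization choices, batch-size limits, or device mappings, the candidate count grows quickly with device count, which is why evaluating every candidate on real GPUs is impractical.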

Key Findings

APEX prediction fidelity is high.

Numbers: average relative error = 10.7%

Optimized parallel plans can greatly reduce latency vs common heuristics.

Numbers: up to 3.37× speedup on evaluated traces and clusters

Energy-focused plans can cut energy significantly by trading latency or clock frequency.

Numbers: up to 45% energy reduction (when lowering GPU frequency); up to 19% vs. the latency-optimal plan without a frequency change

Simulation is much faster and cheaper than live evaluation.

Numbers: 71× faster; 1234.5× lower cost (estimated)

APEX's simulation overhead stays roughly constant as model size grows.

Numbers: similar simulation overhead from 32B up to synthesized trillion-scale models
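For readers who want to check prediction fidelity on their own measurements, the sketch below shows one common way to compute an average relative error. The summary does not spell out the paper's exact formula, so this definition (mean of |predicted − measured| / measured) and the function name are assumptions.

```python
def average_relative_error(predicted, measured):
    """Mean of |predicted - measured| / measured over paired samples."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical latencies in seconds, not values from the paper.
    print(average_relative_error([110.0, 90.0], [100.0, 100.0]))
```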

Results

Prediction error

Value: 10.7% average relative error

Best latency speedup vs heuristic baseline

Value: 3.37×

Baseline: heuristic TP-within-node + PP-across-nodes

Energy reduction vs latency-optimal plan

Value: up to 45% (with reduced GPU frequency)

Baseline: latency-optimal plans

Plan search time

Value: ≈ 15 minutes on CPU

Baseline: full deployment evaluation on 8 H100 GPUs (hours)

Cost reduction estimate

Value: ≈ 1234×

Baseline: actual GPU deployment costing ~$8,889
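As a back-of-the-envelope check, the two cost figures above imply a simulation cost of only a few dollars. The snippet below just divides the reported baseline by the reported reduction factor; the function name is illustrative, and the result is an implied estimate rather than a number stated in this summary.

```python
def implied_simulation_cost(baseline_cost_usd, reduction_factor):
    """Cost implied by a baseline cost and a reported reduction factor."""
    return baseline_cost_usd / reduction_factor


if __name__ == "__main__":
    # Reported figures: ~$8,889 GPU deployment cost, 1234.5x lower cost.
    print(f"${implied_simulation_cost(8889.0, 1234.5):.2f}")
```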

Who Should Care

What To Try In 7 Days

Run APEX with a 1–2k request trace from your service to compare heuristic vs optimized plans

Collect operation-level profiling for your cluster (one-time) and plug into APEX

Use APEX to find an energy-optimal plan under your SLO and test lowered GPU clock settings in staging
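The last step amounts to a constrained selection: keep only the plans that meet the SLO, then take the one with the lowest energy. A minimal sketch follows, assuming you have exported per-plan results into simple records. The `PlanResult` fields and the sample values are hypothetical and do not reflect APEX's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanResult:
    name: str
    p95_latency_s: float
    energy_kj: float


def pick_energy_optimal(results, slo_p95_s):
    """Return the lowest-energy plan whose P95 latency meets the SLO, or None."""
    feasible = [r for r in results if r.p95_latency_s <= slo_p95_s]
    if not feasible:
        return None
    # Break energy ties in favor of the lower-latency plan.
    return min(feasible, key=lambda r: (r.energy_kj, r.p95_latency_s))


if __name__ == "__main__":
    # Hypothetical simulated results, not values from the paper.
    results = [
        PlanResult("tp8", p95_latency_s=2.1, energy_kj=950.0),
        PlanResult("tp4-pp2", p95_latency_s=2.8, energy_kj=780.0),
        PlanResult("tp2-pp4", p95_latency_s=4.5, energy_kj=640.0),
    ]
    print(pick_energy_optimal(results, slo_p95_s=3.0))
```

With a 3-second SLO the sketch picks the middle plan: the lowest-energy plan overall violates the SLO, and the fastest plan spends more energy than necessary.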

Optimization Features

Infra Optimization

  • GPU frequency tuning
  • multi-node cluster topologies

Model Optimization

  • quantization (W8A8, FP8, KV quantization)

System Optimization

  • device mapping aware of interconnect bandwidth
  • energy-vs-latency trade-off search

Inference Optimization

  • hybrid parallelism search (DP, PP, TP, EP)
  • iteration-level batching modeling
  • batch-size constraint tuning

Reproducibility

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Operation-level profiling omits smaller ops; APEX can under-estimate absolute latency compared to full deployment.
  • Some optimal plans require parallelism features not yet supported in production serving systems.
  • Profiling a new device cluster needs ~40 GPU hours (one-time) and several hours to run scripts.

When Not To Use

  • When you need cycle-accurate, per-operation microbenchmarking on unprofiled hardware
  • If your deployment uses custom ops that are not captured by the Transformer IR without adding templates
  • For multimodal encoder/decoder setups (not yet supported; future work)

Failure Modes

  • Prediction gaps when a parallelism's real implementation is poorly optimized (example: EP vs TP mismatch).
  • Underestimation of absolute latencies because non-key operation overheads are omitted.
  • Incorrect device mapping if cluster interconnects are mischaracterized in profiling.

Core Entities

Models

  • Llama-3.1-70B
  • Llama-3.1-405B
  • Mistral-Large-123B
  • Mixtral-8x22B (MoE)
  • Qwen2.5-32B
  • synthetic trillion-scale (scaled Llama config)

Metrics

  • Time to first token (TTFT)
  • Time per output token (TPOT)
  • P95 latency
  • End-to-end latency
  • Energy consumption (kJ)
  • Model FLOPs Utilization
  • Model Bandwidth Utilization
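The latency metrics above can be derived from per-request timestamps. The sketch below uses the usual definitions (TTFT is the delay from arrival to the first output token; TPOT is the average gap between subsequent output tokens) and a nearest-rank P95. The function names and the percentile method are assumptions, since the summary does not say how the paper computes them.

```python
def ttft(arrival_s, token_times_s):
    """Time to first token: first output token time minus arrival time."""
    return token_times_s[0] - arrival_s


def tpot(token_times_s):
    """Time per output token: mean gap between tokens after the first."""
    n = len(token_times_s)
    if n < 2:
        return 0.0
    return (token_times_s[-1] - token_times_s[0]) / (n - 1)


def end_to_end_latency(arrival_s, token_times_s):
    """Arrival to last output token."""
    return token_times_s[-1] - arrival_s


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical request: arrives at t=0 s, tokens emitted at these times.
    times = [0.5, 0.6, 0.7]
    print(ttft(0.0, times), tpot(times), end_to_end_latency(0.0, times))
```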

Datasets

  • paper summarization trace (from [9])
  • modified news-abstract Creation trace
  • LMSYS-Chat-1M (Chat trace)