Overview
Production Readiness: 0.7
Novelty Score: 0.6
Cost Impact Score: 0.8
Citation Count: 0
Why It Matters For Business
APEX lets ops teams find faster or cheaper LLM serving configurations without burning expensive GPU hours, enabling targeted trade-offs between latency, throughput, and energy while meeting SLOs.
Summary TLDR
APEX is a simulator that searches and evaluates parallel execution plans for serving large language models. It models iteration-level batching (requests are added dynamically as memory frees up), operation-level profiled cost, and hybrid parallelisms (data/pipeline/tensor/expert). On the evaluated traces and clusters, APEX predicts performance with 10.7% average relative error, finds plans up to 3.37× faster than common heuristics, and can surface energy-optimal plans that cut energy by up to 45% in exchange for higher latency. APEX runs on a CPU, finds plans within ~15 minutes, and reduces time and monetary cost compared to live GPU testing (71× faster and ~1234× cheaper on the reported setup). Code is available.
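For readers who want to check a simulator against their own measurements, average relative error is commonly computed as the mean of |predicted - measured| / measured. The sketch below assumes that definition; the paper's exact formula may differ. The function name and the sample numbers are illustrative, not taken from APEX.

```python
def mean_relative_error(predicted, measured):
    """Average of |predicted - measured| / measured over paired samples."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)


# Illustrative numbers only, not results from the paper.
simulated_latency_s = [1.10, 0.90, 2.00]
measured_latency_s = [1.00, 1.00, 2.00]
print(f"{mean_relative_error(simulated_latency_s, measured_latency_s):.1%}")
```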
Problem Statement
Choosing how to parallelize an LLM across many devices is hard. Iteration-level batching makes batch sizes dynamic and interleaves prefill/decode work. The design space explodes with model size, cluster topology, quantization, and hybrid parallelisms. Exhaustive deployment testing is prohibitively slow and expensive, and static heuristics can be far from optimal in practice.
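To give a feel for how the design space grows, the sketch below enumerates only the (DP, PP, TP) degree combinations that exactly use a given device count. It is an illustration under simplifying assumptions, not APEX's plan generator: the function names are hypothetical, and expert parallelism, quantization, batch-size limits, and device mapping are left out, each of which multiplies the count further.

```python
def divisors(n):
    """All positive divisors of n, in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_plans(num_devices):
    """List every (dp, pp, tp) triple with dp * pp * tp == num_devices."""
    plans = []
    for dp in divisors(num_devices):
        remaining = num_devices // dp
        for pp in divisors(remaining):
            tp = remaining // pp  # whatever is left goes to tensor parallelism
            plans.append((dp, pp, tp))
    return plans


for n in (8, 16, 64):
    print(n, "devices:", len(enumerate_plans(n)), "degree combinations")
```

For 8 devices this yields 10 degree combinations before any other knob is considered.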
Main Contribution
APEX, a simulator that automatically generates and evaluates parallel execution plans for LLM serving.
Dynamism-aware simulation: models iteration-level batching and mixed prefill/decode stages.
Operation-level profiling + Transformer IR to scale simulation to billion- and trillion-scale models.
Supports hybrid parallelisms (DP, PP, TP, EP), quantizations, and diverse cluster topologies; modular for extension.
Demonstrated gains: accurate predictions (avg. 10.7% error), up to 3.37× latency speedup, and large energy/cost savings.
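The dynamism that iteration-level batching introduces can be seen in a toy model. The sketch below is not APEX's simulator: it assumes a single KV-cache budget counted in tokens, reserves space for each request's prompt plus full output on admission, and treats every iteration as producing one token per running request (so it ignores the prefill/decode cost difference). All names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    output_tokens: int
    generated: int = 0

    def reserved_tokens(self):
        # Assumption: reserve KV-cache space for the prompt and the full output.
        return self.prompt_tokens + self.output_tokens


def simulate(requests, kv_budget_tokens):
    """Return the batch size observed at each iteration."""
    waiting = deque(requests)
    running = []
    used = 0
    batch_sizes = []
    while waiting or running:
        # Admit queued requests (FIFO) while they fit in the remaining memory.
        while waiting and used + waiting[0].reserved_tokens() <= kv_budget_tokens:
            req = waiting.popleft()
            used += req.reserved_tokens()
            running.append(req)
        if not running:
            raise ValueError("head-of-queue request exceeds the KV budget")
        batch_sizes.append(len(running))
        # One iteration: every running request emits one token.
        for req in running:
            req.generated += 1
        # Finished requests leave the batch and free their memory.
        for req in running:
            if req.generated >= req.output_tokens:
                used -= req.reserved_tokens()
        running = [r for r in running if r.generated < r.output_tokens]
    return batch_sizes


trace = [Request(10, 2), Request(10, 4), Request(10, 1)]
print(simulate(trace, kv_budget_tokens=30))
```

The third request has to wait until the first one finishes and frees memory, so the batch composition changes from one iteration to the next even though no new requests arrive.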
Key Findings
APEX prediction fidelity is high.
Optimized parallel plans can greatly reduce latency vs common heuristics.
Energy-focused plans can cut energy significantly by trading latency or clock frequency.
Simulation is much faster and cheaper than live evaluation.
APEX simulation scales across model sizes, from billion- to trillion-parameter configurations.
Results
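Prediction error: 10.7% average relative error
Best latency speedup vs heuristic baseline: up to 3.37×
Energy reduction vs latency-optimal plan: up to 45%
Plan search time: ~15 minutes
Cost reduction estimate: ~1234× cheaper than live GPU testing (reported setup)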
Who Should Care
What To Try In 7 Days
Run APEX with a 1–2k request trace from your service to compare heuristic vs optimized plans
Collect operation-level profiling for your cluster (one-time) and plug into APEX
Use APEX to find an energy-optimal plan under your SLO and test lowered GPU clock settings in staging
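Once simulated results exist for several candidate plans, choosing an energy-optimal plan under an SLO reduces to a filter and a minimum. The sketch below assumes the results have already been exported into simple records; the `PlanResult` fields, plan names, and numbers are made up for illustration and do not reflect APEX's actual output format.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PlanResult:
    name: str
    p95_latency_s: float
    energy_kj: float


def pick_energy_optimal(results: Sequence[PlanResult],
                        slo_p95_s: float) -> Optional[PlanResult]:
    """Lowest-energy plan whose P95 latency meets the SLO, or None."""
    feasible = [r for r in results if r.p95_latency_s <= slo_p95_s]
    if not feasible:
        return None
    return min(feasible, key=lambda r: r.energy_kj)


# Placeholder values for illustration only.
candidates = [
    PlanResult("latency-optimal", p95_latency_s=2.0, energy_kj=100.0),
    PlanResult("balanced", p95_latency_s=3.0, energy_kj=60.0),
    PlanResult("low-clock", p95_latency_s=5.0, energy_kj=40.0),
]
print(pick_energy_optimal(candidates, slo_p95_s=3.5))
```

The selected plan is only a candidate: validate it in staging, since the simulator can underestimate absolute latency.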
Optimization Features
Infra Optimization
- GPU frequency tuning
- multi-node cluster topologies
Model Optimization
- quantization (W8A8, FP8, KV quantization)
System Optimization
- device mapping aware of interconnect bandwidth
- energy-vs-latency trade-off search
Inference Optimization
- hybrid parallelism search (DP, PP, TP, EP)
- iteration-level batching modeling
- batch-size constraint tuning
Reproducibility
Code Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Operation-level profiling omits smaller ops, so APEX can underestimate absolute latency compared to a full deployment.
- Some optimal plans require parallelism features not yet supported in production serving systems.
- Profiling a new device cluster is a one-time cost of ~40 GPU hours and takes several hours of running the profiling scripts.
When Not To Use
- When you need cycle-accurate, per-operation microbenchmarking on unprofiled hardware
- If your deployment uses custom ops that are not captured by the Transformer IR without adding templates
- For multimodal encoder/decoder setups (not yet supported; future work)
Failure Modes
- Prediction gaps when a parallelism's real implementation is poorly optimized (example: EP vs TP mismatch).
- Underestimation of absolute latencies because non-key operation overheads are omitted.
- Incorrect device mapping if cluster interconnects are mischaracterized in profiling.
Core Entities
Models
- Llama-3.1-70B
- Llama-3.1-405B
- Mistral-Large-123B
- Mixtral-8x22B (MoE)
- Qwen2.5-32B
- synthetic trillion-scale (scaled Llama config)
Metrics
- Time to first token (TTFT)
- Time per output token (TPOT)
- P95 latency
- End-to-end latency
- Energy consumption (kJ)
- Model FLOPs Utilization
- Model Bandwidth Utilization
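The latency metrics above can be computed from per-request timestamps. The sketch below uses common definitions (TTFT as first-token time minus arrival time, TPOT as the mean gap between output tokens after the first, and a nearest-rank percentile); the paper's exact definitions may differ, and the function names are hypothetical.

```python
def ttft(arrival_s, first_token_s):
    """Time to first token: delay between arrival and the first output token."""
    return first_token_s - arrival_s


def tpot(first_token_s, last_token_s, num_output_tokens):
    """Time per output token, averaged over tokens after the first."""
    if num_output_tokens < 2:
        return 0.0  # no inter-token gap to measure
    return (last_token_s - first_token_s) / (num_output_tokens - 1)


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


# Illustrative end-to-end latencies in seconds, not measured data.
latencies_s = [0.8, 1.1, 0.9, 4.2, 1.0, 1.3, 0.7, 1.2, 1.0, 0.9]
print("P95 latency:", percentile_nearest_rank(latencies_s, 95))
```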
Datasets
- paper summarization trace (from [9])
- modified news-abstract creation trace
- LMSYS-Chat-1M (Chat trace)

