Train the controller to shorten the critical execution path so parallel agent teams run much faster without losing accuracy

January 15, 2026 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

2

Authors

Xi Shi, Mengxin Zheng, Qian Lou

Why It Matters For Business

When you run multiple LLM-based agents in parallel, overall response time depends on the slowest chain of steps (the critical path). Training the orchestration policy to minimize that path cut a token-based latency proxy by 38–46% on the reported benchmarks while keeping accuracy comparable, which helps interactive products and time-sensitive workflows.

Summary TLDR

LAMaS is a learned controller for multi-agent systems that explicitly optimizes latency under layer-wise parallel execution. On standard benchmarks it cuts the inferred critical-path length by ~38–46% versus the MaAS baseline while keeping task accuracy comparable. The key idea: train with a latency penalty and assign that penalty only to the bottleneck operators that form the critical path.

Problem Statement

Multi-agent systems often assume sequential execution or minimize total cost, which does not guarantee low wall-clock latency when operators can run in parallel. Without explicit latency signals, learned orchestrators tend to produce deep, narrow graphs that hurt latency under parallel execution.
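To make the gap between total cost and latency concrete, here is a minimal sketch with made-up token counts. The layer layouts, the numbers, and the helper names `total_cost` and `critical_path_len` are illustrative assumptions, not taken from the paper. It compares a deep, narrow graph with a wide, shallow one that uses the same number of tokens.

```python
# Illustrative only: a workflow is a list of layers, and each layer is a list
# of per-operator output token counts (hypothetical numbers).

def total_cost(layers):
    """Sum of tokens over every operator (what cost-only training sees)."""
    return sum(sum(layer) for layer in layers)

def critical_path_len(layers):
    """Latency proxy under layer-wise parallelism: the slowest operator in
    each layer gates that layer, so add up the per-layer maxima."""
    return sum(max(layer) for layer in layers if layer)

deep_narrow = [[300], [300], [300], [300]]   # four operators, one after another
wide_shallow = [[300, 300, 300], [300]]      # three in parallel, then one

print(total_cost(deep_narrow), critical_path_len(deep_narrow))
print(total_cost(wide_shallow), critical_path_len(wide_shallow))
```

Both graphs spend the same tokens, yet the wide one has half the critical-path length, which is why a cost-only objective cannot tell them apart.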

Main Contribution

Formulation: under parallel execution, latency is driven by the critical execution path, not total cost, and should be optimized directly.

Method: LAMaS, a controller that enables layer-wise parallelism, removes intra-layer dependencies, and trains with a critical-path-aware latency penalty.

Evaluation: across GSM8K, HumanEval and MATH, latency-aware training shortens the critical path by 38–46% versus MaAS, with accuracy staying within about one point (slightly higher on GSM8K and MATH, slightly lower on HumanEval).
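The critical-path-aware penalty can be pictured with a small sketch. The reward form below (shared task score minus a weighted CP len, charged only to the slowest operator in each layer) is an assumption made for illustration; the summary does not give the paper's exact formula, and `bottleneck_mask`, `operator_rewards`, and `latency_weight` are hypothetical names.

```python
def bottleneck_mask(layers):
    """For each non-empty layer, mark the operator with the largest token
    count (the first one wins on ties)."""
    mask = []
    for layer in layers:
        slowest = max(range(len(layer)), key=lambda i: layer[i])
        mask.append([i == slowest for i in range(len(layer))])
    return mask

def operator_rewards(task_score, layers, latency_weight=1e-4):
    """Per-operator reward: every operator shares the task score, but only
    operators on the critical path pay the latency penalty (assumed form)."""
    cp_len = sum(max(layer) for layer in layers)
    penalty = latency_weight * cp_len
    return [
        [task_score - (penalty if on_cp else 0.0) for on_cp in layer_mask]
        for layer_mask in bottleneck_mask(layers)
    ]

# Hypothetical token counts for a two-layer workflow.
rewards = operator_rewards(task_score=1.0, layers=[[120, 400, 80], [250]],
                           latency_weight=0.001)
print(rewards)
```

In this toy run only the 400-token operator and the single second-layer operator are penalized; the two short operators in the first layer keep the full task score, so the controller is not discouraged from adding cheap parallel work.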

Key Findings

LAMaS reduced critical-path length substantially compared to MaAS on three benchmarks.

Numbers: CP len reduced by 38.0% (GSM8K), 42.4% (HumanEval), 46.1% (MATH)

Task performance stayed comparable or improved while reducing latency.

Numbers: GSM8K: 93.13→93.37%; HumanEval: 93.00→92.11%; MATH: 51.23→52.26%

Enabling parallel execution alone (removing intra-layer dependencies) does not produce the same latency gains.

Numbers: Ablation on GSM8K: CP len rose from 913.5 to 1215.9 when the latency weight was set to 0

Critical-path-aware credit assignment yields measurable benefits.

Numbers: HumanEval: CP len 1042.7 with CP credit vs 1197.5 without; score 92.11% vs 91.6%

Results

Benchmark    Metric                    MaAS (baseline)    LAMaS
GSM8K        Score (%)                 93.13              93.37
GSM8K        CP len (latency proxy)    1474.6             913.5
HumanEval    Score (%)                 93.00              92.11
HumanEval    CP len (latency proxy)    1810.8             1042.7
MATH         Score (%)                 51.23              52.26
MATH         CP len (latency proxy)    2218.5             1195.8
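The percentage reductions quoted under Key Findings can be rechecked from the CP len values in these results. The short script below does that arithmetic; the helper name `relative_reduction` is just an illustrative choice.

```python
# CP len values copied from the results (MaAS baseline, LAMaS).
cp_len = {
    "GSM8K": (1474.6, 913.5),
    "HumanEval": (1810.8, 1042.7),
    "MATH": (2218.5, 1195.8),
}

def relative_reduction(baseline, new):
    """Percentage by which `new` is smaller than `baseline`."""
    return 100.0 * (baseline - new) / baseline

for name, (maas, lamas) in cp_len.items():
    print(f"{name}: {relative_reduction(maas, lamas):.2f}% shorter critical path")
```

The recomputed values agree with the reported 38.0%, 42.4% and 46.1% to within about 0.1 percentage point; small differences are expected because the published CP len figures are themselves rounded.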

Who Should Care

What To Try In 7 Days

Measure a critical-path proxy for your multi-agent flows (for example, the sum over layers of the longest operator output in each layer, in tokens).

Remove unnecessary intra-layer dependencies so operators can run in parallel.

Add a small latency penalty to the controller reward and apply it only to bottleneck operators (critical-path credit assignment).
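As a starting point for the parallel-execution step, here is a minimal sketch of layer-wise execution with `asyncio`. The operator names and delays are hypothetical, and `run_operator` is a stand-in that only sleeps; in a real flow it would wrap an LLM or tool call.

```python
import asyncio

async def run_operator(name, delay_s):
    """Stand-in for an LLM or tool call; `delay_s` simulates its latency."""
    await asyncio.sleep(delay_s)
    return name

async def run_layers(layers):
    """Run layers in order; operators inside one layer run concurrently, so
    each layer takes roughly as long as its slowest operator."""
    outputs = []
    for layer in layers:
        results = await asyncio.gather(*(run_operator(n, d) for n, d in layer))
        outputs.append(list(results))
    return outputs

# Hypothetical two-layer workflow: three independent operators, then one.
layers = [[("cot", 0.02), ("debate", 0.05), ("refine", 0.01)], [("ensemble", 0.03)]]
outputs = asyncio.run(run_layers(layers))
print(outputs)
```

`asyncio.gather` returns results in the order the operators were submitted, so downstream layers can rely on a stable ordering even though completion order varies.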

Agent Features

Planning

  • query-aware topology sampling

Tool Use

  • LLM API calls
  • external tool execution (mapped to token proxy)

Frameworks

  • LAMaS
  • MaAS

Is Agentic

true

Architectures

  • probabilistic agentic supernet (DAG)
  • layer-wise execution graphs with parallel operators

Collaboration

  • learned multi-agent orchestration
  • parallel operator execution

Optimization Features

Token Efficiency

  • cost penalty preserved from MaAS

System Optimization

  • remove intra-layer operator dependencies to avoid artificial synchronization

Training Optimization

  • policy-gradient training of controller
  • critical-path-aware credit assignment
  • EMA normalization for reward variance reduction
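One common way to use an exponential moving average (EMA) for variance reduction in policy-gradient training is to subtract an EMA baseline from each reward before it scales the gradient. The sketch below shows that idea only; whether LAMaS subtracts a baseline, rescales rewards, or does both is not stated in this summary, and `EmaBaseline` and `decay` are hypothetical names.

```python
class EmaBaseline:
    """Exponential moving average of rewards, used as a baseline."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.value = None  # initialised from the first observed reward

    def advantage(self, reward):
        """Return reward minus the current baseline, then update the baseline."""
        if self.value is None:
            self.value = reward
        adv = reward - self.value
        self.value = self.decay * self.value + (1.0 - self.decay) * reward
        return adv

baseline = EmaBaseline(decay=0.5)
advantages = [baseline.advantage(r) for r in [1.0, 0.0, 1.0]]
print(advantages, baseline.value)
```

In a policy-gradient update, each advantage would multiply the gradient of the log-probability of the sampled architecture, so rewards near the running average contribute little and unusual rewards contribute more.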

Inference Optimization

  • layer-wise parallel execution
  • threshold-based sampling to adjust per-layer width
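A rough picture of how a threshold can control per-layer width: keep adding operators, most probable first, until their cumulative probability reaches the threshold. This deterministic variant is a simplification assumed for illustration, not the paper's sampler; `select_operators`, the operator names, and the probabilities are hypothetical.

```python
def select_operators(probs, threshold=0.5):
    """Pick operators in descending probability order until their cumulative
    probability reaches `threshold`; a higher threshold gives a wider layer."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, cumulative = [], 0.0
    for name, p in ranked:
        chosen.append(name)
        cumulative += p
        if cumulative >= threshold:
            break
    return chosen

# Hypothetical controller output for one layer.
probs = {"cot": 0.5, "debate": 0.25, "refine": 0.125, "ensemble": 0.125}
print(select_operators(probs, threshold=0.5))
print(select_operators(probs, threshold=0.75))
```

Raising the threshold admits more operators into the layer, which adds parallel work without lengthening the critical path unless one of the new operators becomes the slowest in its layer.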

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Latency metric is a token-based proxy (CP len), not measured wall-clock under real network/cloud conditions.
  • Real-world speedups depend on system-level factors (queues, rate limits, hardware) outside the scope of this study.
  • Experiments use a closed-source LLM (gpt-4o-mini-0718) via API, which may limit exact reproducibility for some users.

When Not To Use

  • If your deployment latency is dominated by network or infra (not model compute), algorithmic orchestration alone may not help.
  • For tiny workflows where sequential execution is cheaper and simpler.
  • When you cannot measure or approximate operator latency (no proxy available).

Failure Modes

  • Latency proxy mismatch: token-based CP len may not reflect real wall-clock bottlenecks.
  • Mis-tuned latency weight can reduce accuracy or produce under-explored architectures.
  • Incorrect credit assignment (uniform penalty) can worsen both latency and performance.

Core Entities

Models

  • gpt-4o-mini-0718

Metrics

  • Accuracy
  • pass@1
  • API cost (USD)
  • CP len (critical-path token-length proxy)

Datasets

  • GSM8K
  • HumanEval
  • MATH

Benchmarks

  • GSM8K
  • HumanEval
  • MATH