Train the controller to shorten the critical execution path so parallel agent teams run much faster without losing accuracy

January 15, 20266 min

Overview

Decision SnapshotReady For Pilot

Paper gives clear experimental numbers on standard benchmarks and ablations. Latency is measured via a token-based proxy rather than real wall-clock, which increases reproducibility but limits real-world guarantees.

Citations2

Evidence Strength0.80

Confidence0.80

Risk Signals9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Xi Shi, Mengxin Zheng, Qian Lou

Links

Abstract / PDF / Code

Why It Matters For Business

When you run multiple LLM-based agents in parallel, overall response time depends on the slowest chain of steps (the critical path). Training the orchestration policy to minimize that path reduces latency a lot without sacrificing accuracy, which helps interactive products and time-sensitive workflows.

Who Should Care

Summary TLDR

LAMaS is a learned controller for multi-agent systems that explicitly optimizes latency under layer-wise parallel execution. On standard benchmarks it cuts the inferred critical-path length by ~38–46% vs a leading baseline while keeping similar or better task accuracy. The key idea: train with a latency penalty and assign that penalty only to the bottleneck operators that form the critical path.

Problem Statement

Multi-agent systems often assume sequential execution or minimize total cost, which does not guarantee low wall-clock latency when operators can run in parallel. Without explicit latency signals, learned orchestrators tend to produce deep, narrow graphs that hurt latency under parallel execution.

Main Contribution

Formulation: point out that under parallel execution latency is driven by the critical execution path, not total cost, and should be optimized directly.

Method: LAMaS — a controller that enables layer-wise parallelism, removes intra-layer dependencies, and trains with a critical-path-aware latency penalty.

Key Findings

LAMaS reduced critical-path length substantially compared to MaAS on three benchmarks.

NumbersCP len reduced by 38.0% (GSM8K), 42.4% (HumanEval), 46.1% (MATH)

Practical UseIf you need lower wall-clock latency for parallel agent workflows, add latency supervision and critical-path credit assignment to the orchestration controller.

Evidence RefTable 2

Task performance stayed comparable or improved while reducing latency.

NumbersGSM8K: 93.1393.37% pass/acc; MATH: 51.2352.26% acc

Practical UseYou can shorten execution time without trading away accuracy on common reasoning and code benchmarks, so latency-aware training is low-risk for many tasks.

Evidence RefTable 2

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Score (%)MaAS 93.13 → LAMaS 93.37 (GSM8K)MaAS+0.24%GSM8KTable 2: GSM8K scoresTable 2
CP len (latency proxy)MaAS 1474.6 → LAMaS 913.5MaAS-38.0%GSM8KTable 2: GSM8K CP lenTable 2

What To Try In 7 Days

Measure a critical-path proxy (sum of longest outputs per layer) for your multi-agent flows.

Remove unnecessary intra-layer dependencies so operators can run in parallel.

Add a small latency penalty to the controller reward and apply it only to bottleneck operators (critical-path credit assignment).

Agent Features

Planning
query-aware topology sampling
Tool Use
LLM API callsexternal tool execution (mapped to token proxy)
Frameworks
LAMaSMaAS
Is Agentic

Yes

Architectures
probabilistic agentic supernet (DAG)layer-wise execution graphs with parallel operators
Collaboration
learned multi-agent orchestrationparallel operator execution

Optimization Features

Token Efficiency
cost penalty preserved from MaAS
System Optimization
remove intra-layer operator dependencies to avoid artificial synchronization
Training Optimization
policy-gradient training of controllercritical-path-aware credit assignmentEMA normalization for reward variance reduction
Inference Optimization
layer-wise parallel executionthreshold-based sampling to adjust per-layer width

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

Latency metric is a token-based proxy (CP len), not measured wall-clock under real network/cloud conditions.

Real-world speedups depend on system-level factors (queues, rate limits, hardware) outside the scope of this study.

When Not To Use

If your deployment latency is dominated by network or infra (not model compute), algorithmic orchestration alone may not help.

For tiny workflows where sequential execution is cheaper and simpler.

Failure Modes

Latency proxy mismatch: token-based CP len may not reflect real wall-clock bottlenecks.

Mis-tuned latency weight can reduce accuracy or produce under-explored architectures.

Core Entities

Models

gpt-4o-mini-0718

Metrics

Accuracypass@1API cost (USD)CP len (critical-path token-length proxy)

Datasets

GSM8KHumanEvalMATH

Benchmarks

GSM8KHumanEvalMATH