Flow: make multi-agent LLM workflows modular, run subtasks in parallel, and update the plan while running

January 14, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

Flow shows clear practical gains on small-to-medium coding tasks, but results are limited to the evaluated tasks and depend on LLM quality and API budget.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Boye Niu, Yiliao Song, Kai Lian, Yifan Shen, Yu Yao, Kun Zhang, Tongliang Liu

Why It Matters For Business

Flow raises automation reliability by making plans modular and fixable at runtime; that means fewer complete failures and higher deliverable quality, though updates add compute and API cost.

Who Should Care

Summary TLDR

Flow turns multi-agent LLM plans into editable Activity-on-Vertex (AOV) graphs, scores candidate graphs for parallelism and dependency complexity, and uses LLMs at runtime to re-generate and pick improved workflows. The system runs subtasks in parallel, clones agents to avoid waits, verifies subtask outputs, and updates only local modules when failures occur. On three coding tasks (game, LaTeX slides, website) Flow outperformed AutoGen, MetaGPT, and CAMEL in success rate and human ratings, at the cost of extra runtime when updates run.

Problem Statement

Existing LLM multi-agent systems use mostly static or sequential workflows. They struggle when subtasks fail or when the initial plan is inefficient. The paper addresses how to (1) design workflows that favor parallel, independent subtasks and (2) update the workflow during execution to fix failures or inefficiencies.

Main Contribution

Formulate multi-agent workflows as Activity-on-Vertex (AOV) directed acyclic graphs so subtasks are explicit nodes with status and logs.

Introduce simple, measurable modularity criteria (a parallelism metric, and dependency complexity measured as the standard deviation of node degrees) and select the candidate workflow that maximizes parallelism and minimizes dependency complexity.
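
The two criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes parallelism is the average number of subtasks per execution step and that "degree" counts both incoming and outgoing edges. The function names and the tie-breaking order in `pick_workflow` are hypothetical.

```python
from statistics import pstdev

def execution_steps(deps):
    """Group subtasks into steps; deps maps each subtask to its prerequisites."""
    remaining = {task: set(pre) for task, pre in deps.items()}
    done, steps = set(), []
    while remaining:
        # A subtask is ready once all of its prerequisites are done.
        ready = sorted(t for t, pre in remaining.items() if pre <= done)
        if not ready:
            raise ValueError("cycle detected: not a valid AOV graph")
        steps.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return steps

def parallelism(deps):
    """Assumed metric: average number of subtasks per execution step."""
    return len(deps) / len(execution_steps(deps))

def dependency_complexity(deps):
    """Population standard deviation of node degrees (in-degree + out-degree)."""
    degree = {t: len(pre) for t, pre in deps.items()}
    for pre in deps.values():
        for p in pre:
            degree[p] += 1
    return pstdev(list(degree.values()))

def pick_workflow(candidates):
    """Assumed rule: prefer higher parallelism, then lower complexity."""
    return min(candidates, key=lambda d: (-parallelism(d), dependency_complexity(d)))

# Two hypothetical candidate plans for the same four subtasks.
chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
fan = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
best = pick_workflow([chain, fan])
```

Under these assumptions the fan-shaped plan wins: it finishes in three steps instead of four, and every node has the same degree.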

Key Findings

Flow achieves much higher overall task success across three coding tasks compared to baselines.

Numbers: Flow avg success rate 93% vs AutoGen 66.7% / MetaGPT 71% / CAMEL 48.7% (Tables 1-3)

Practical Use: Use Flow-style AOV planning to boost end-to-end success for multi-step coding workflows in practice.

Evidence Ref: Tables 1-3, Section 4.1

Dynamic workflow updates dramatically improve recovery from broken or missing subtask outputs.

Numbers: Error-handling success: website 46% → 87%, gobang 0% → 93%, LaTeX 67% → 93% (Table 4)

Practical Use: Enable runtime updates when agents fail; this converts many terminal failures into recoverable cases.

Evidence Ref: Table 4, Section 5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average success rate (three tasks) | 93% | AutoGen 66.7% / MetaGPT 71% / CAMEL 48.7% | Flow +~26pp vs AutoGen | Website, LaTeX, Gobang aggregate | Section 4.1 summary; Tables 1-3 | Tables 1-3
Human rating (1-4) average | 3.54 / 4 | AutoGen 2.63 / MetaGPT 1.60 / CAMEL 2.12 | Flow +0.91 vs AutoGen | Website, LaTeX, Gobang aggregate | Section 4.1 summary; Tables 1-3 | Tables 1-3
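
The deltas in the results above can be rechecked with simple arithmetic. The snippet below only uses the figures quoted in this summary; the variable and function names are illustrative.

```python
# Aggregate figures as quoted in the results above.
flow = {"success_rate": 93.0, "human_rating": 3.54}
baselines = {
    "AutoGen": {"success_rate": 66.7, "human_rating": 2.63},
    "MetaGPT": {"success_rate": 71.0, "human_rating": 1.60},
    "CAMEL": {"success_rate": 48.7, "human_rating": 2.12},
}

def deltas(metric):
    """Flow minus each baseline, rounded to two decimals."""
    return {name: round(flow[metric] - vals[metric], 2)
            for name, vals in baselines.items()}

success_gain = deltas("success_rate")  # percentage points
rating_gain = deltas("human_rating")   # points on the 1-4 scale
```

The AutoGen entries reproduce the reported deltas (about 26 percentage points and 0.91 rating points); the gaps to MetaGPT and CAMEL follow from the same subtraction.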

What To Try In 7 Days

Model one internal multi-step job as an AOV graph and score candidate splits by parallelism and dependency std.

Run a small pilot comparing static vs Flow-style dynamic updates on one coding or document task and track success rate and runtime.

Enable lightweight verification steps ('did this subtask meet its requirements?') to reduce silent failures before expanding updates.
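
A lightweight verification step of the kind suggested above might look like the following sketch. It assumes the worker and the verifier are plain callables; in a Flow-style system the verifier would typically be an LLM call, which is replaced here by a simple string check. `run_with_verification`, `flaky_worker`, and `looks_like_html` are hypothetical names.

```python
def run_with_verification(worker, verifier, max_attempts=3):
    """Run a subtask, check its output, and retry only this subtask on failure."""
    log = []
    for attempt in range(1, max_attempts + 1):
        output = worker(attempt)
        passed = verifier(output)
        log.append({"attempt": attempt, "passed": passed})
        if passed:
            return {"status": "completed", "output": output, "log": log}
    # Report failure explicitly instead of silently marking the subtask done.
    return {"status": "failed", "output": None, "log": log}

def flaky_worker(attempt):
    """Stand-in for an agent call: the first attempt returns nothing useful."""
    return "" if attempt == 1 else "<html><body>Home</body></html>"

def looks_like_html(output):
    """Stand-in for 'did this subtask meet its requirements?'."""
    return output.startswith("<html>") and output.endswith("</html>")

result = run_with_verification(flaky_worker, looks_like_html)
```

Logging each attempt gives the pilot a direct way to count how many failures were caught before they reached downstream subtasks.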

Agent Features

Memory: Dictionary-based workflow state (short-term); no long-term retrieval memory reported
Planning: LLM-generated candidate AOV graphs; topological sort for execution steps; selection by parallelism and dependency complexity
Tool Use: GPT-4o-mini; GPT-3.5-Turbo; agent cloning to run same-agent subtasks concurrently
Frameworks: Activity-on-Vertex (AOV) graph; dictionary/JSON workflow structure
Is Agentic: Yes
Architectures: LLM-based multi-agent system; AOV graph workflow representation
Collaboration: Parallel subtask execution; global inspector LLM for monitoring and updates; agent reassignment and cloning for concurrency
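
Parallel execution with agent cloning, as listed above, can be sketched with a thread pool. This is an illustration under stated assumptions, not the paper's code: the `Agent` class, the `run_step` function, and the use of `copy.deepcopy` as the cloning mechanism are all hypothetical, and a real agent would call an LLM instead of returning a string.

```python
import copy
from concurrent.futures import ThreadPoolExecutor

class Agent:
    """Toy agent that records the subtasks it has handled."""
    def __init__(self, name):
        self.name = name
        self.history = []

    def run(self, subtask):
        self.history.append(subtask)
        return f"{self.name} finished {subtask}"

def run_step(assignments, agents):
    """Run one execution step; assignments is a list of (subtask, agent_name)."""
    used, jobs = set(), []
    for subtask, agent_name in assignments:
        agent = agents[agent_name]
        if agent_name in used:
            # Same agent needed twice in this step: clone it so neither subtask waits.
            agent = copy.deepcopy(agent)
        used.add(agent_name)
        jobs.append((subtask, agent))
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = {subtask: pool.submit(agent.run, subtask) for subtask, agent in jobs}
        return {subtask: future.result() for subtask, future in futures.items()}

agents = {"coder": Agent("coder"), "writer": Agent("writer")}
step = [("frontend", "coder"), ("backend", "coder"), ("docs", "writer")]
results = run_step(step, agents)
```

Each step from the topological sort would be passed to `run_step` in order, so subtasks inside a step run concurrently while dependencies between steps are still respected.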

Optimization Features

Token Efficiency: When updates are returned, omit 'data' fields to save tokens (Appendix D.3)
System Optimization: Select workflow maximizing parallelism to reduce steps; dependency std metric to avoid bottlenecks
Training Optimization: Notes: possible future RL fine-tuning for workflow generation
Inference Optimization: Clone agents to avoid wait time when same agent needed in parallel
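
The token-saving idea above (leaving 'data' fields out of workflow updates) can be illustrated as follows. The workflow schema, the field names other than 'data', and both helper functions are assumptions made for this sketch; the paper's actual format is described in its Appendix D.3.

```python
import json

def strip_data(workflow):
    """Return a copy of the workflow without the bulky 'data' field of each subtask."""
    return {tid: {k: v for k, v in task.items() if k != "data"}
            for tid, task in workflow.items()}

def restore_data(updated, previous):
    """Re-attach stored outputs to subtasks that survived the update."""
    merged = {}
    for tid, task in updated.items():
        merged[tid] = dict(task)
        if tid in previous and "data" in previous[tid]:
            merged[tid]["data"] = previous[tid]["data"]
    return merged

# Hypothetical workflow state: subtask id -> metadata plus generated output.
workflow = {
    "t1": {"agent": "coder", "status": "completed", "data": "<html>" + "x" * 500 + "</html>"},
    "t2": {"agent": "coder", "status": "failed", "data": ""},
}
compact = strip_data(workflow)

# Pretend the updater returned the compact plan with one new subtask added.
updated = dict(compact, t3={"agent": "tester", "status": "pending"})
merged = restore_data(updated, workflow)
saved_chars = len(json.dumps(workflow)) - len(json.dumps(compact))
```

Only the structural fields travel through the update, and the stored outputs are merged back locally afterwards.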

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluations concentrate on three coding-style tasks; generalization to other domains is untested.

Workflow updater needs global information; scaling to very large contexts can be problematic.

When Not To Use

If API cost or latency budget forbids extra update calls

When tasks require strict, deterministic outputs that cannot tolerate LLM variance

Failure Modes

LLM misreports a subtask as 'completed' causing downstream errors

Over-aggressive updates create redundant API calls and wasted compute

Core Entities

Models

GPT-4o-mini; GPT-3.5-Turbo

Metrics

Success Rate; Human Rating; Compilable / Interactable / Completeness per task

Context Entities

Models

GPT-4o-mini; GPT-3.5-Turbo

Metrics

Parallelism metric; Dependency complexity (degree std); Execution time