Flow: make multi-agent LLM workflows modular, run subtasks in parallel, and update the plan while running

January 14, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

1

Authors

Boye Niu, Yiliao Song, Kai Lian, Yifan Shen, Yu Yao, Kun Zhang, Tongliang Liu

Why It Matters For Business

Flow raises automation reliability by making plans modular and fixable at runtime; that means fewer complete failures and higher deliverable quality, though updates add compute and API cost.

Summary TLDR

Flow turns multi-agent LLM plans into editable Activity-on-Vertex (AOV) graphs, scores candidate graphs for parallelism and dependency complexity, and uses LLMs at runtime to re-generate and pick improved workflows. The system runs subtasks in parallel, clones agents to avoid waits, verifies subtask outputs, and updates only local modules when failures occur. On three coding tasks (game, LaTeX slides, website) Flow outperformed AutoGen, MetaGPT, and CAMEL in success rate and human ratings, at the cost of extra runtime when updates run.

Problem Statement

Existing LLM multi-agent systems use mostly static or sequential workflows. They struggle when subtasks fail or when the initial plan is inefficient. The paper addresses how to (1) design workflows that favor parallel, independent subtasks and (2) update the workflow during execution to fix failures or inefficiencies.

Main Contribution

Formulate multi-agent workflows as Activity-on-Vertex (AOV) directed acyclic graphs so subtasks are explicit nodes with status and logs.
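As a rough illustration of what "subtasks as explicit nodes with status and logs" could look like, here is a minimal sketch. The class name `Subtask`, its field names, the status values, and the example gobang-style workflow are all assumptions made for illustration; they are not the paper's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    """One vertex of the workflow graph (field names are illustrative)."""
    name: str
    agent: str
    depends_on: list[str] = field(default_factory=list)
    status: str = "pending"  # assumed values: pending | running | completed | failed
    logs: list[str] = field(default_factory=list)


def is_acyclic(tasks: dict[str, Subtask]) -> bool:
    """Return True if the dependency edges form a DAG (depth-first search)."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {name: WHITE for name in tasks}

    def visit(name: str) -> bool:
        color[name] = GRAY
        for dep in tasks[name].depends_on:
            if color[dep] == GRAY:  # back edge: a cycle exists
                return False
            if color[dep] == WHITE and not visit(dep):
                return False
        color[name] = BLACK
        return True

    return all(color[n] != WHITE or visit(n) for n in tasks)


# Hypothetical workflow: two independent subtasks feeding a third.
workflow = {
    "board": Subtask("board", agent="coder_a"),
    "rules": Subtask("rules", agent="coder_b"),
    "ui": Subtask("ui", agent="coder_a", depends_on=["board", "rules"]),
}
```

Keeping status and logs on the node itself means a later update step can inspect or rewrite a single subtask without touching the rest of the graph.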

Introduce simple, measurable modularity criteria (a parallelism metric and a dependency-complexity metric based on the standard deviation of node degree) and select candidate workflows that maximize parallelism and minimize dependency complexity.
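The two criteria can be sketched in a few lines. This summary does not give the exact formulas, so the definitions below are assumptions: parallelism is taken as the average number of subtasks per execution step, and dependency complexity as the population standard deviation of each node's total degree (in-degree plus out-degree). The input format (task name mapped to its set of prerequisites, with every prerequisite also present as a key) is likewise hypothetical.

```python
from statistics import pstdev


def levels(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into steps: a task runs once all its prerequisites are done."""
    remaining = {task: set(pre) for task, pre in deps.items()}
    steps = []
    while remaining:
        ready = sorted(t for t, pre in remaining.items() if not pre)
        if not ready:
            raise ValueError("cycle detected")
        steps.append(ready)
        for task in ready:
            del remaining[task]
        for pre in remaining.values():
            pre.difference_update(ready)
    return steps


def parallelism(deps: dict[str, set[str]]) -> float:
    """Assumed definition: average number of subtasks per step."""
    return len(deps) / len(levels(deps))


def dependency_std(deps: dict[str, set[str]]) -> float:
    """Population std of node degree (in-degree + out-degree)."""
    degree = {task: len(pre) for task, pre in deps.items()}
    for pre in deps.values():
        for parent in pre:
            degree[parent] += 1
    return pstdev(list(degree.values()))


chain = {"a": set(), "b": {"a"}, "c": {"b"}}      # a -> b -> c
fan_in = {"a": set(), "b": set(), "c": {"a", "b"}}  # a, b -> c
```

Under these assumed definitions the chain scores a parallelism of 1.0 and the fan-in graph 1.5, so the fan-in layout would be preferred even though both have the same degree spread.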

Build a runtime pipeline that uses LLMs to generate K candidate updated AOVs during execution, pick the best by the same metrics, and apply local updates to improve robustness and error recovery.
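Choosing among K candidates then reduces to a ranking step. The sketch below assumes the scores are already computed and combines them lexicographically (highest parallelism first, lower dependency standard deviation as the tie-breaker). That combination rule, the `Candidate` type, and the example scores are assumptions; the summary only states that parallelism is maximized and dependency complexity minimized.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """A scored candidate workflow (hypothetical container)."""
    label: str
    parallelism: float
    dependency_std: float


def pick_best(candidates: list[Candidate]) -> Candidate:
    """Highest parallelism first; lower dependency std breaks ties (assumed rule)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: (c.parallelism, -c.dependency_std))


# Made-up scores for three candidate graphs.
candidates = [
    Candidate("chain", 1.0, 0.47),
    Candidate("fan_in", 1.5, 0.47),
    Candidate("hub", 1.5, 0.94),
]
best = pick_best(candidates)
```

Here `fan_in` wins: it ties with `hub` on parallelism but has the lower degree spread. A weighted sum would be another reasonable choice if the two criteria should trade off against each other.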

Key Findings

Flow achieves much higher overall task success across three coding tasks compared to baselines.

Numbers: Flow avg success rate 93% vs AutoGen 66.7% / MetaGPT 71% / CAMEL 48.7% (Tables 1–3)

Dynamic workflow updates dramatically improve recovery from broken or missing subtask outputs.

Numbers: Error-handling success: website 46%→87%, gobang 0%→93%, LaTeX 67%→93% (Table 4)

Flow yields higher human satisfaction on evaluated tasks.

Numbers: Average human rating 3.54/4 for Flow vs 2.63 (AutoGen), 1.60 (MetaGPT), 2.12 (CAMEL)

Updates increase runtime, but in the reported example Flow stays faster than or competitive with the baselines.

Numbers: Example (GPT-3.5) gobang: Flow w/o update 26.1s → Flow w/ update 33.6s; other frameworks run 31–121s (Table 9)

Results

Average success rate (three tasks)

Value: 93%

Baseline: AutoGen 66.7% / MetaGPT 71% / CAMEL 48.7%

Human rating (1-4) average

Value: 3.54 / 4

Baseline: AutoGen 2.63 / MetaGPT 1.60 / CAMEL 2.12

Error-handling success improvement (with dynamic updates)

Value: Website 46%→87%, Gobang 0%→93%, LaTeX 67%→93%

Baseline: Flow without updates

Runtime trade-off (example)

Value: Flow (w/o update) 26.12s → Flow (w/ update) 33.57s (gobang, GPT-3.5)

Baseline: MetaGPT 34.00s; CAMEL 121.52s

Who Should Care

What To Try In 7 Days

Model one internal multi-step job as an AOV graph and score candidate splits by parallelism and by the standard deviation of node degree (dependency complexity).

Run a small pilot comparing static vs Flow-style dynamic updates on one coding or document task and track success rate and runtime.

Enable lightweight verification steps ('did this subtask meet its requirements?') to reduce silent failures before expanding updates.
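A lightweight verification step of the kind suggested above might be wired in as follows. This is only a sketch: `run_with_verification`, its parameters, and the status strings are hypothetical, and the requirement check is a plain string test standing in for what would more likely be a second LLM call.

```python
from typing import Callable, Optional, Tuple


def run_with_verification(
    run: Callable[[], str],
    meets_requirements: Callable[[str], bool],
    max_attempts: int = 2,
) -> Tuple[str, Optional[str]]:
    """Run a subtask, check its output, and retry up to max_attempts times."""
    for _ in range(max_attempts):
        output = run()
        if meets_requirements(output):
            return "completed", output
    # Report failure explicitly so a later update step can react to it.
    return "failed", None


# Stand-in subtask: the first draft is empty, the second is usable.
drafts = iter(["", "<html><body>Home</body></html>"])
status, output = run_with_verification(
    run=lambda: next(drafts),
    meets_requirements=lambda text: text.startswith("<html>"),
)
```

Returning an explicit `"failed"` status rather than passing a bad output downstream is what makes silent failures visible before any workflow update logic is added.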

Agent Features

Memory

  • Dictionary-based workflow state (short-term)
  • No long-term retrieval memory reported

Planning

  • LLM-generated candidate AOV graphs
  • Topological sort for execution steps
  • Selection by parallelism and dependency complexity

Tool Use

  • GPT-4o-mini
  • GPT-3.5-Turbo
  • Agent cloning to run same-agent subtasks concurrently

Frameworks

  • Activity-on-Vertex (AOV) graph
  • Dictionary/JSON workflow structure

Is Agentic

true

Architectures

  • LLM-based multi-agent system
  • AOV graph workflow representation

Collaboration

  • Parallel subtask execution
  • Global inspector LLM for monitoring and updates
  • Agent reassignment and cloning for concurrency
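Parallel execution with cloning can be sketched with Python's standard library. The example below uses `graphlib.TopologicalSorter` to release subtasks as soon as their prerequisites finish and a thread pool to run them concurrently. Treating an agent as a plain dictionary, "cloning" it with `copy.deepcopy`, and the `run_subtask` stand-in for an LLM call are all assumptions for illustration, not the paper's implementation.

```python
import copy
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_subtask(agent: dict, task: str) -> str:
    """Stand-in for an LLM call; a real agent would prompt a model here."""
    agent["history"].append(task)
    return f"{agent['name']} finished {task}"


def execute(deps: dict, assignment: dict, agents: dict) -> dict:
    """Run every ready subtask concurrently, giving each its own agent clone."""
    sorter = TopologicalSorter(deps)  # maps each task to its prerequisites
    sorter.prepare()
    results, in_flight = {}, {}
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            for task in sorter.get_ready():
                # A private copy lets two subtasks share one agent without waiting.
                clone = copy.deepcopy(agents[assignment[task]])
                in_flight[pool.submit(run_subtask, clone, task)] = task
            finished, _ = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in finished:
                task = in_flight.pop(future)
                results[task] = future.result()
                sorter.done(task)  # unblocks dependants
    return results


deps = {"board": set(), "rules": set(), "ui": {"board", "rules"}}
assignment = {"board": "coder", "rules": "coder", "ui": "designer"}
agents = {
    "coder": {"name": "coder", "history": []},
    "designer": {"name": "designer", "history": []},
}
results = execute(deps, assignment, agents)
```

In this example `board` and `rules` are both assigned to `coder` yet run side by side, because each works on its own clone; `ui` starts only after both have completed.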

Optimization Features

Token Efficiency

  • When updates are returned, omit 'data' fields to save tokens (Appendix D.3)
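The token-saving idea can be illustrated with a small helper that drops the `data` field from each subtask before the workflow is serialized. The workflow layout used here (subtask name mapped to a dictionary containing a `data` key) is an assumption; the paper's actual JSON structure may differ.

```python
import json


def strip_data(workflow: dict) -> dict:
    """Return a copy of the workflow with each subtask's 'data' field removed."""
    return {
        name: {key: value for key, value in task.items() if key != "data"}
        for name, task in workflow.items()
    }


# Hypothetical workflow state with a bulky output attached to one subtask.
workflow = {
    "board": {"status": "completed", "next": ["ui"], "data": "x" * 2000},
    "ui": {"status": "pending", "next": [], "data": ""},
}
compact = strip_data(workflow)
saved_chars = len(json.dumps(workflow)) - len(json.dumps(compact))
```

The structural fields (status, dependencies) stay intact, so the stripped copy still describes the whole graph while leaving out the large payloads.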

System Optimization

  • Select workflow maximizing parallelism to reduce steps
  • Dependency std metric to avoid bottlenecks

Training Optimization

  • Noted as possible future work: RL fine-tuning for workflow generation

Inference Optimization

  • Clone agents to avoid waiting when the same agent is needed by several parallel subtasks

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations concentrate on three coding-style tasks; generalization to other domains is untested.
  • Workflow updater needs global information; scaling to very large contexts can be problematic.
  • Selected candidate graphs come from the same LLM and may not always yield optimal workflows without specialized training.

When Not To Use

  • If API cost or latency budget forbids extra update calls
  • When tasks require strict, deterministic outputs that cannot tolerate LLM variance
  • If the workflow context is too large for the chosen LLM to summarize reliably

Failure Modes

  • The LLM misreports a subtask as 'completed', causing downstream errors
  • Over-aggressive updates create redundant API calls and wasted compute
  • Initial candidate graphs miss critical dependencies, leading to repeated repairs

Core Entities

Models

  • GPT-4o-mini
  • GPT-3.5-Turbo

Metrics

  • Success Rate
  • Human Rating
  • Compilable / Interactable / Completeness per task

Context Entities

Models

  • GPT-4o-mini
  • GPT-3.5-Turbo

Metrics

  • Parallelism metric
  • Dependency complexity (degree std)
  • Execution time