Flow: make multi-agent LLM workflows modular, run subtasks in parallel, and update the plan while running

Overview

Decision Snapshot: Needs Validation

Flow shows clear practical gains on small-to-medium coding tasks, but results are limited to the evaluated tasks and depend on LLM quality and API budget.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Boye Niu, Yiliao Song, Kai Lian, Yifan Shen, Yu Yao, Kun Zhang, Tongliang Liu

Links

Abstract / PDF / Code

Why It Matters For Business

Flow raises automation reliability by making plans modular and fixable at runtime; that means fewer complete failures and higher deliverable quality, though updates add compute and API cost.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Founder

Summary TLDR

Flow turns multi-agent LLM plans into editable Activity-on-Vertex (AOV) graphs, scores candidate graphs for parallelism and dependency complexity, and uses LLMs at runtime to re-generate and pick improved workflows. The system runs subtasks in parallel, clones agents to avoid waits, verifies subtask outputs, and updates only local modules when failures occur. On three coding tasks (game, LaTeX slides, website) Flow outperformed AutoGen, MetaGPT, and CAMEL in success rate and human ratings, at the cost of extra runtime when updates run.

Problem Statement

Existing LLM multi-agent systems use mostly static or sequential workflows. They struggle when subtasks fail or when the initial plan is inefficient. The paper addresses how to (1) design workflows that favor parallel, independent subtasks and (2) update the workflow during execution to fix failures or inefficiencies.

Main Contribution

Formulate multi-agent workflows as Activity-on-Vertex (AOV) directed acyclic graphs so subtasks are explicit nodes with status and logs.

Introduce simple, measurable modularity criteria (parallelism metric and dependency-complexity via degree std) and select candidate workflows that maximize parallelism and minimize dependency complexity.
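The two contributions above can be sketched together in a few lines. This is a minimal illustration, not the authors' implementation: the `Subtask` fields, the helper names, and the exact formulas are assumptions. Parallelism is taken here as the average number of subtasks per execution step, dependency complexity as the population standard deviation of node degree (in-degree plus out-degree), and ties in parallelism are broken by lower dependency complexity.

```python
from dataclasses import dataclass, field
from statistics import pstdev


@dataclass
class Subtask:
    """One node of the AOV graph (field names are illustrative)."""
    objective: str
    agent: str
    parents: list[str] = field(default_factory=list)  # prerequisite subtask ids
    status: str = "pending"                           # pending / completed / failed
    log: list[str] = field(default_factory=list)


def step_of(workflow, task_id, memo):
    """Earliest execution step of a subtask: one more than its latest parent."""
    if task_id not in memo:
        parents = workflow[task_id].parents
        memo[task_id] = 1 + max((step_of(workflow, p, memo) for p in parents), default=0)
    return memo[task_id]


def parallelism(workflow):
    """Average number of subtasks per execution step (assumed definition)."""
    memo = {}
    total_steps = max(step_of(workflow, t, memo) for t in workflow)
    return len(workflow) / total_steps


def dependency_complexity(workflow):
    """Population standard deviation of node degree (in-degree + out-degree)."""
    degree = {t: len(node.parents) for t, node in workflow.items()}
    for node in workflow.values():
        for parent in node.parents:
            degree[parent] += 1
    return pstdev(list(degree.values()))


def pick_workflow(candidates):
    """Prefer higher parallelism; break ties with lower dependency complexity."""
    return max(candidates, key=lambda w: (parallelism(w), -dependency_complexity(w)))


# Two candidate plans for the same job: a strict chain and a fan-in.
chain = {
    "a": Subtask("design", "planner"),
    "b": Subtask("code", "coder", parents=["a"]),
    "c": Subtask("test", "tester", parents=["b"]),
}
fan = {
    "a": Subtask("frontend", "coder"),
    "b": Subtask("backend", "coder"),
    "c": Subtask("integrate", "coder", parents=["a", "b"]),
}
best = pick_workflow([chain, fan])
```

With these definitions the fan-in plan needs two steps instead of three, so it scores higher on parallelism and is selected.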

Key Findings

Flow achieves much higher overall task success across three coding tasks compared to baselines.

Numbers: Flow average success rate 93% vs AutoGen 66.7% / MetaGPT 71% / CAMEL 48.7% (Tables 1-3)

Practical Use: Use Flow-style AOV planning to boost end-to-end success for multi-step coding workflows.

Evidence Ref: Tables 1-3, Section 4.1

Dynamic workflow updates dramatically improve recovery from broken or missing subtask outputs.

Numbers: Error-handling success: website 46%→87%, gobang 0%→93%, LaTeX 67%→93% (Table 4)

Practical Use: Enable runtime updates when agents fail; this converts many terminal failures into recoverable cases.

Evidence Ref: Table 4, Section 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average success rate (three tasks)	93%	AutoGen 66.7% / MetaGPT 71% / CAMEL 48.7%	Flow +~26pp vs AutoGen	Website, LaTeX, Gobang aggregate	Section 4.1 summary; Tables 1-3	Tables 1-3
Human rating (1-4) average	3.54 / 4	AutoGen 2.63 / MetaGPT 1.60 / CAMEL 2.12	Flow +0.91 vs AutoGen	Website, LaTeX, Gobang aggregate	Section 4.1 summary; Tables 1-3	Tables 1-3

What To Try In 7 Days

Model one internal multi-step job as an AOV graph and score candidate splits by parallelism and dependency std.

Run a small pilot comparing static vs Flow-style dynamic updates on one coding or document task and track success rate and runtime.

Enable lightweight verification steps ('did this subtask meet its requirements?') to reduce silent failures before expanding updates.
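For the pilot, the verification step and the local update can be prototyped as one small loop. The sketch below is hypothetical scaffolding rather than Flow's code: `execute`, `verify`, and `repair` are placeholder callables that would wrap LLM calls in practice, and the dictionary keys (`objective`, `status`, `data`) are assumed names.

```python
from typing import Callable


def run_with_verification(
    workflow: dict[str, dict],
    task_id: str,
    execute: Callable[[dict], str],
    verify: Callable[[dict, str], bool],
    repair: Callable[[dict, str], dict],
    max_updates: int = 2,
) -> bool:
    """Run one subtask, check its output, and repair only that node on failure."""
    for attempt in range(max_updates + 1):
        node = workflow[task_id]
        output = execute(node)
        if verify(node, output):
            node["status"], node["data"] = "completed", output
            return True
        node["status"] = "failed"
        if attempt < max_updates:
            # Local update: only this node is replaced, the rest of the plan is untouched.
            workflow[task_id] = repair(node, output)
    return False


# Stub callables stand in for LLM calls so the loop can be exercised offline.
workflow = {
    "api": {"objective": "build API", "status": "pending"},
    "docs": {"objective": "write docs", "status": "pending"},
}
ok = run_with_verification(
    workflow,
    "api",
    execute=lambda node: f"draft for {node['objective']}",
    verify=lambda node, out: "tests" in out,
    repair=lambda node, out: {**node, "objective": node["objective"] + " with tests"},
)
```

Setting `max_updates=0` gives the static baseline, so the same harness can compare success rate and runtime with and without updates.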

Agent Features

Memory

Dictionary-based workflow state (short-term)

No long-term retrieval memory reported

Planning

LLM-generated candidate AOV graphs

Topological sort for execution steps

Selection by parallelism and dependency complexity
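Turning an AOV graph into execution steps by topological sort can be illustrated with Python's standard `graphlib` module. This is a sketch under assumptions: the subtask names are invented, and the graph is given as a mapping from each subtask to its prerequisites.

```python
from graphlib import TopologicalSorter


def execution_steps(parents):
    """Group subtasks into steps; each subtask runs only after all its parents."""
    sorter = TopologicalSorter(parents)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    steps = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        steps.append(ready)
        sorter.done(*ready)
    return steps


# Hypothetical plan for a small game: mapping of subtask -> prerequisite subtasks.
parents = {"ui": [], "rules": [], "ai": ["rules"], "integrate": ["ui", "ai"]}
steps = execution_steps(parents)
# steps == [["rules", "ui"], ["ai"], ["integrate"]]
```

Every subtask inside one step is independent of the others in that step, so each step is a natural unit for parallel execution.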

Tool Use

GPT-4o-mini

GPT-3.5-Turbo

Agent cloning to run same-agent subtasks concurrently

Frameworks

Activity-on-Vertex (AOV) graph

Dictionary/JSON workflow structure

Is Agentic

Yes

Architectures

LLM-based multi-agent system

AOV graph workflow representation

Collaboration

Parallel subtask execution

Global inspector LLM for monitoring and updates

Agent reassignment and cloning for concurrency

Optimization Features

Token Efficiency

When updates are returned, omit 'data' fields to save tokens (Appendix D.3)
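One way to picture this saving is a helper that drops the bulky output field from every subtask before the workflow is serialized. The sketch is an assumption about the shape of the workflow dictionary (a `data` key holding each subtask's output); it is not taken from the Flow codebase, and re-attaching outputs afterwards is left out.

```python
import json


def strip_data(workflow):
    """Return a copy of the workflow with each subtask's 'data' field removed."""
    return {
        task_id: {key: value for key, value in node.items() if key != "data"}
        for task_id, node in workflow.items()
    }


workflow = {
    "t1": {"objective": "write board logic", "status": "completed", "data": "x = 1\n" * 200},
    "t2": {"objective": "write UI", "status": "pending", "parents": ["t1"]},
}
compact = strip_data(workflow)
full_size = len(json.dumps(workflow))
compact_size = len(json.dumps(compact))
```

The structural fields (objective, status, dependencies) survive, which is what a workflow update needs to express; the stored outputs stay in the original dictionary.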

System Optimization

Select workflow maximizing parallelism to reduce steps

Dependency std metric to avoid bottlenecks

Training Optimization

Notes: possible future RL fine-tuning for workflow generation

Inference Optimization

Clone agents to avoid wait time when same agent needed in parallel
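Cloning can be sketched with `asyncio`: when two subtasks in the same step need the same agent, the second one gets a copy instead of waiting. The `Agent` class, its `run` method, and the sleep that stands in for an LLM API call are all illustrative assumptions.

```python
import asyncio
import copy
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    history: list[str] = field(default_factory=list)

    async def run(self, objective: str) -> str:
        await asyncio.sleep(0.01)  # stand-in for an LLM API call
        self.history.append(objective)
        return f"{self.name}: {objective}"


async def run_step(step, agents):
    """Run all (agent_name, objective) pairs of one step concurrently."""
    busy = set()
    jobs = []
    for agent_name, objective in step:
        agent = agents[agent_name]
        if agent_name in busy:
            # The role is already taken in this step: work on a clone instead of waiting.
            agent = copy.deepcopy(agent)
        busy.add(agent_name)
        jobs.append(agent.run(objective))
    return await asyncio.gather(*jobs)  # results keep the order of `step`


agents = {"coder": Agent("coder")}
step = [("coder", "write board"), ("coder", "write rules")]
results = asyncio.run(run_step(step, agents))
```

In this sketch the clone's history is discarded after the step; a real system would have to decide whether and how to merge it back into the original agent.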

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/tmllab/2025_ICLR_FLOW

Risks & Boundaries

Limitations

Evaluations concentrate on three coding-style tasks; generalization to other domains is untested.

Workflow updater needs global information; scaling to very large contexts can be problematic.

When Not To Use

If API cost or latency budget forbids extra update calls

When tasks require strict, deterministic outputs that cannot tolerate LLM variance

Failure Modes

LLM misreports a subtask as 'completed' causing downstream errors

Over-aggressive updates create redundant API calls and wasted compute

Core Entities

Models

GPT-4o-mini

GPT-3.5-Turbo

Metrics

Success Rate

Human Rating

Compilable / Interactable / Completeness per task

Context Entities

Models

GPT-4o-mini

GPT-3.5-Turbo

Metrics

Parallelism metric

Dependency complexity (degree std)

Execution time
