CoAct: a two-tier global planner + local executor that improves long-horizon web task success

Overview

Decision Snapshot: Needs Validation

The method is easy to implement with prompt engineering but shows modest absolute gains; main limits are planner quality and missing memory.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 60%

Authors

Xinming Hou, Mingming Yang, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Wayne Xin Zhao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A simple global/local agent split can reduce looped interactions and improve automation success on multi-step web tasks, making web automation more robust with modest engineering effort.

Who Should Care

Product Manager, CTO, ML Engineer, Engineering Lead, Founder

Summary TLDR

CoAct is a simple two-agent framework: a global planner that makes phased high-level plans and a local executor that runs and checks subtasks. On the WebArena web-navigation benchmark, CoAct (with GPT-3.5) raises task success from 9.4% (ReAct baseline) to 13.8%, and to 16.0% when using a force-stop cap. Errors remain: ~40% stem from weak global plans and ~60% from repetitive actions and missing memory. Adding short web-page search snippets improves results further.

Problem Statement

Large language models still fail many multi-step web tasks because single-agent prompting hits attention and planning limits. Agents can loop on observations, repeat actions, and fail to replan globally. The paper asks: can a simple hierarchical multi-agent setup improve robustness on long-horizon web tasks?

Main Contribution

CoAct framework: a two-agent hierarchy (global planner + local executor) for phased task decomposition and replanning.

Empirical evaluation on the WebArena web-navigation benchmark showing consistent gains over ReAct using gpt-3.5-turbo-16k.
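The global/local split described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_task`, `PhaseResult`, the `max_replans` limit, the choice to restart from the first phase after replanning, and the toy stubs are all hypothetical, and in practice the planner and executor would wrap LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PhaseResult:
    done: bool      # did the local executor finish the phase?
    feedback: str   # executor's note, passed to the planner when replanning


# Hypothetical signatures; real versions would call an LLM.
Planner = Callable[[str, str], List[str]]   # (task, feedback) -> phases
Executor = Callable[[str], PhaseResult]     # phase -> result


def run_task(task: str, plan: Planner, execute: Executor, max_replans: int = 2) -> bool:
    """Run phases in order; ask the planner for a new plan when a phase fails."""
    phases = plan(task, "")
    replans = 0
    i = 0
    while i < len(phases):
        result = execute(phases[i])
        if result.done:
            i += 1
            continue
        if replans >= max_replans:
            return False            # give up: replanning budget exhausted
        replans += 1
        phases = plan(task, result.feedback)   # global replanning
        i = 0                       # assumption: restart from the new first phase
    return True


# Toy stubs standing in for the two agents.
def toy_plan(task: str, feedback: str) -> List[str]:
    if feedback:
        return ["open orders page", "read latest order"]
    return ["open account page", "read latest order"]


def toy_execute(phase: str) -> PhaseResult:
    if phase == "open account page":
        return PhaseResult(False, "account page has no order list")
    return PhaseResult(True, "")


print(run_task("find latest order", toy_plan, toy_execute))
```

Keeping the planner and executor behind plain callables makes it easy to swap in different prompts or models for each tier without touching the loop.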

Key Findings

CoAct raises average task success vs ReAct on WebArena.

Numbers: Avg SR: ReAct 9.4% → CoAct 13.8% (+4.4pp, +47%)

Practical Use: Use a lightweight global planner to decompose web tasks; expect modest absolute gains but substantial relative improvement over single-agent prompting.

Evidence Ref: Table 1; Section 3.2

Force-stop dialog limit further improves success.

Numbers: Avg SR: ReAct 9.4% → CoAct w/ FS 16.0% (+6.6pp, +70%)

Practical Use: Cap long dialogues or exchanges to avoid repetitive loops; a simple stop rule improved outcomes noticeably.

Evidence Ref: Table 1; Section 3.2
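The two findings quote both an absolute gain (percentage points) and a relative gain (percent of the baseline). The short check below reproduces those figures from the reported success rates; the helper name `deltas` is just an illustrative choice.

```python
from typing import Tuple


def deltas(baseline: float, value: float) -> Tuple[float, int]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = value - baseline
    relative_pct = 100.0 * absolute_pp / baseline
    return round(absolute_pp, 1), round(relative_pct)


print(deltas(9.4, 13.8))  # (4.4, 47)  ReAct -> CoAct
print(deltas(9.4, 16.0))  # (6.6, 70)  ReAct -> CoAct w/ force-stop
```

The gap between the two views is worth keeping in mind: a 47% relative gain still leaves the absolute success rate below 14%.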

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Success Rate (Shop)	ReAct 12.0%; CoAct 22.0%; CoAct w/ FS 24.0%; Human -	ReAct 12.0%	CoAct +10.0pp over ReAct	WebArena - Shop	Table 1: per-task SR	Table 1
Success Rate (CMS)	ReAct 11.0%; CoAct 14.0%; CoAct w/ FS 17.0%	ReAct 11.0%	CoAct +3.0pp over ReAct	WebArena - CMS	Table 1: per-task SR	Table 1

What To Try In 7 Days

Prototype a global-plan + local-executor prompt split for a web automation workflow.

Add a dialogue/exchange cap (force-stop) to stop repetitive action loops.

Plug short, page-specific text snippets (<=100 words) into your planner to reduce planning errors.
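For the snippet suggestion in the list above, one way to keep page notes within the 100-word budget before they reach the planner is shown below. This is a rough sketch: the function names and the prompt wording are hypothetical, and word counting by whitespace is a simplification.

```python
from typing import List


def truncate_words(text: str, max_words: int = 100) -> str:
    """Keep at most max_words whitespace-separated words."""
    return " ".join(text.split()[:max_words])


def build_planner_prompt(task: str, snippets: List[str], max_words: int = 100) -> str:
    """Assemble a planner prompt with short page notes (hypothetical wording)."""
    notes = [f"- {truncate_words(s, max_words)}" for s in snippets if s.strip()]
    parts = [f"Task: {task}"]
    if notes:
        parts.append("Page notes:")
        parts.extend(notes)
    parts.append("Write a phased high-level plan.")
    return "\n".join(parts)


prompt = build_planner_prompt(
    "Find the cheapest red mug",
    ["The shop search bar sits at the top of every page. " * 30],
)
print(prompt)
```

Truncating each snippet separately, rather than the combined text, keeps one long page description from crowding out the others.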

Agent Features

Memory

no efficient memory mechanism implemented

recommendation: add memory/experience to avoid repetition

Planning

macro-level global planning

local per-phase execution planning

replanning on agent request

Tool Use

web navigation actions (page operations)

search engine snippets for planning (optional)

Frameworks

CoAct

Is Agentic

Yes

Architectures

global-local (hierarchical) two-agent

phase-based task decomposition

Collaboration

request/revise/overrule interaction loop between agents

local agent validates and can request global replanning
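The request/revise/overrule exchange can be modeled with a couple of small types. The sketch below assumes one plausible reading of that loop, in which the local agent files a request and the global planner either revises the plan or overrules the request; the class names, fields, and `handle_request` helper are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Decision(Enum):
    REVISE = "revise"       # planner accepts the request and changes the plan
    OVERRULE = "overrule"   # planner rejects the request; the plan stays as is


@dataclass
class ReplanRequest:
    phase_index: int   # phase the local agent could not complete
    reason: str        # what the local agent observed


# Hypothetical reviewer: (current plan, request) -> (decision, proposed plan).
Reviewer = Callable[[List[str], ReplanRequest], Tuple[Decision, List[str]]]


def handle_request(plan: List[str], request: ReplanRequest, review: Reviewer) -> List[str]:
    """Return the plan to continue with after the planner reviews a request."""
    decision, proposed = review(plan, request)
    if decision is Decision.REVISE:
        return proposed
    return plan   # overruled: the local agent keeps the original plan
```

Making the decision an explicit enum, rather than free text, gives the executor an unambiguous signal about whether to continue or switch plans.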

Optimization Features

System Optimization

dialogue force-stop to reduce loops

context partitioning via phase prompts
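A force-stop rule amounts to a hard cap on the number of exchanges. The following is a minimal sketch; the cap value of 10 and the `step` callback are assumptions for illustration, since the summary does not state the limit used in the paper.

```python
from typing import Callable, Tuple


def run_with_force_stop(step: Callable[[int], bool], max_exchanges: int = 10) -> Tuple[bool, int]:
    """Call step(exchange_number) until it reports completion or the cap is hit.

    Returns (finished, exchanges_used).
    """
    for exchange in range(1, max_exchanges + 1):
        if step(exchange):
            return True, exchange
    return False, max_exchanges   # force-stop: cap reached without finishing


# A step that never finishes is cut off at the cap instead of looping forever.
print(run_with_force_stop(lambda n: False, max_exchanges=5))
```

Returning the number of exchanges used alongside the outcome makes it easy to log how often the cap, rather than task completion, ends a run.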

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/xmhou2002/CoAct

Data URLs

https://arxiv.org/abs/2307.13854 (WebArena dataset paper)

Risks & Boundaries

Limitations

≈40% of failures stem from a weak global planner and insufficient page-specific knowledge

≈60% of failures are caused by iterative/repetitive actions and lack of memory

When Not To Use

When you need near-human reliability on complex web tasks today

When you cannot run repeated LLM calls due to cost or latency

Failure Modes

Poor global plan leads to wrong decomposition and downstream errors

Local agent repeats actions and exhausts exchange limits

Core Entities

Models

gpt-3.5-turbo-16k-0613

ReAct (baseline)

Metrics

Success Rate (SR)

Datasets

WebArena

Benchmarks

WebArena

Context Entities

Datasets

WebArena (Zhou et al., 2023)
