CoAct: a two-tier global planner + local executor that improves long-horizon web task success

June 19, 2024 | 7 min

Overview

Decision Snapshot: Needs Validation

The method is easy to implement with prompt engineering but shows modest absolute gains; main limits are planner quality and missing memory.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 60%

Authors

Xinming Hou, Mingming Yang, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Wayne Xin Zhao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A simple global/local agent split can reduce looped interactions and improve automation success on multi-step web tasks, making web automation more robust with modest engineering effort.

Who Should Care

Summary TLDR

CoAct is a simple two-agent framework: a global planner that makes phased high-level plans and a local executor that runs and checks subtasks. On the WebArena web-navigation benchmark, CoAct (with GPT-3.5) raises task success from 9.4% (ReAct baseline) to 13.8%, and to 16.0% when using a force-stop cap. Errors remain: ~40% stem from weak global plans and ~60% from repetitive actions and missing memory. Adding short web-page search snippets improves results further.

Problem Statement

Large language models still fail many multi-step web tasks because single-agent prompting hits attention and planning limits. Agents can loop on observations, repeat actions, and fail to replan globally. The paper asks: can a simple hierarchical multi-agent setup improve robustness on long-horizon web tasks?

Main Contribution

CoAct framework: a two-agent hierarchy (global planner + local executor) for phased task decomposition and replanning.

Empirical evaluation on the WebArena web-navigation benchmark showing consistent gains over ReAct using gpt-3.5-turbo-16k.
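The global/local split described above can be sketched as a small control loop. This is only an illustration under stated assumptions: `run_coact_style`, `PhaseResult`, `toy_plan`, `make_toy_executor`, and the `max_replans` parameter are hypothetical names, and the toy planner and executor stand in for LLM calls whose prompts are not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PhaseResult:
    ok: bool      # did the local executor finish the phase?
    report: str   # short message passed back to the planner on failure


# Hypothetical signatures: a planner maps (task, feedback) to a list of phases,
# an executor maps one phase description to a PhaseResult.
Planner = Callable[[str, str], List[str]]
Executor = Callable[[str], PhaseResult]


def run_coact_style(task: str, plan: Planner, execute: Executor,
                    max_replans: int = 2) -> Tuple[bool, List[Tuple[str, bool]]]:
    """Run phases in order; on a failed phase, ask the planner for a new plan."""
    feedback = ""
    log: List[Tuple[str, bool]] = []
    for _ in range(max_replans + 1):
        for phase in plan(task, feedback):
            result = execute(phase)
            log.append((phase, result.ok))
            if not result.ok:
                # Local executor requests a replan and explains why.
                feedback = f"Phase failed: {phase!r}. Report: {result.report}"
                break
        else:
            return True, log  # every phase succeeded
    return False, log


def toy_plan(task: str, feedback: str) -> List[str]:
    # Stand-in for an LLM planner: adds a step once it has seen failure feedback.
    if feedback:
        return ["open orders page", "filter by date", "read total"]
    return ["open orders page", "read total"]


def make_toy_executor() -> Executor:
    done: List[str] = []

    def execute(phase: str) -> PhaseResult:
        # Stand-in for an LLM executor acting on a web page.
        if phase == "read total" and "filter by date" not in done:
            return PhaseResult(False, "too many rows; total not visible")
        done.append(phase)
        return PhaseResult(True, "done")

    return execute
```

Calling `run_coact_style("report last month's order total", toy_plan, make_toy_executor())` fails once on the first plan, triggers one replan, and then completes all three phases of the revised plan.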

Key Findings

CoAct raises average task success vs ReAct on WebArena.

Numbers: Avg SR: ReAct 9.4% → CoAct 13.8% (+4.4pp, +47%)

Practical Use: Use a lightweight global planner to decompose web tasks; expect modest absolute gains but substantial relative improvement over single-agent prompting.

Evidence Ref: Table 1; Section 3.2

Force-stop dialog limit further improves success.

Numbers: Avg SR: ReAct 9.4% → CoAct w/ FS 16.0% (+6.6pp, +70%)

Practical Use: Cap long dialogues or exchanges to avoid repetitive loops; a simple stop rule improved outcomes noticeably.

Evidence Ref: Table 1; Section 3.2
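The two kinds of delta quoted in these findings (percentage points and relative percent) are easy to confuse, so here is a short check of the arithmetic. The helper name `deltas` is an illustrative assumption; the input figures are the average success rates reported above.

```python
def deltas(baseline: float, new: float):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    pp = new - baseline              # absolute difference between two rates
    rel = pp / baseline * 100        # gain relative to the baseline rate
    return round(pp, 1), round(rel)


coact = deltas(9.4, 13.8)       # ReAct -> CoAct
coact_fs = deltas(9.4, 16.0)    # ReAct -> CoAct with force-stop
print(coact, coact_fs)          # prints (4.4, 47) (6.6, 70)
```

The relative figures look large mainly because the baseline is low: a gain of 4.4 points on a 9.4% baseline is a 47% relative improvement, yet more than 86% of tasks still fail.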

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Success Rate (Shop) | ReAct 12.0%; CoAct 22.0%; CoAct w/ FS 24.0%; HUMAN - | ReAct 12.0% | CoAct +10.0pp over ReAct | WebArena - Shop | Table 1: per-task SR | Table 1 |
| Success Rate (CMS) | ReAct 11.0%; CoAct 14.0%; CoAct w/ FS 17.0% | ReAct 11.0% | CoAct +3.0pp | WebArena - CMS | Table 1: per-task SR | Table 1 |

What To Try In 7 Days

Prototype a global-plan + local-executor prompt split for a web automation workflow.

Add a dialogue/exchange cap (force-stop) to stop repetitive action loops.

Plug short, page-specific text snippets (<=100 words) into your planner to reduce planning errors.
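For the snippet suggestion above, one possible way to enforce the 100-word limit before text reaches the planner is sketched below. The function names and the prompt wording are assumptions made for illustration; the summary does not describe how the paper formats its planner prompt.

```python
from typing import List


def trim_to_words(text: str, max_words: int = 100) -> str:
    """Collapse whitespace and keep at most max_words words."""
    return " ".join(text.split()[:max_words])


def build_planner_prompt(task: str, snippets: List[str], max_words: int = 100) -> str:
    """Assemble a planner prompt with trimmed page notes (hypothetical layout)."""
    lines = [f"Task: {task}", "Page notes:"]
    for snippet in snippets:
        lines.append("- " + trim_to_words(snippet, max_words))
    lines.append("Write a phased, high-level plan for this task.")
    return "\n".join(lines)
```

Trimming by whitespace-separated words is a rough proxy; a token-based limit would track model context more closely but requires a tokenizer.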

Agent Features

Memory
no efficient memory mechanism implemented
recommendation: add memory/experience to avoid repetition
Planning
macro-level global planning
local per-phase execution planning
replanning on agent request
Tool Use
web navigation actions (page operations)
search engine snippets for planning (optional)
Frameworks
CoAct
Is Agentic

Yes

Architectures
global-local (hierarchical) two-agent
phase-based task decomposition
Collaboration
request/revise/overrule interaction loop between agents
local agent validates and can request global replanning

Optimization Features

System Optimization
dialogue force-stop to reduce loops
context partitioning via phase prompts
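A force-stop rule can be as simple as a hard cap on the number of exchanges. The sketch below assumes a hypothetical `step` callable that performs one exchange and returns an answer or `None`; the default cap of 10 is an arbitrary placeholder, since this summary does not state the limit used in the paper.

```python
from typing import Callable, Optional, Tuple


def run_with_force_stop(step: Callable[[int], Optional[str]],
                        max_exchanges: int = 10) -> Tuple[Optional[str], int, bool]:
    """Call step(i) until it returns an answer or the exchange cap is reached.

    Returns (answer, exchanges_used, force_stopped).
    """
    for i in range(max_exchanges):
        answer = step(i)
        if answer is not None:
            return answer, i + 1, False
    # Cap reached without an answer: stop instead of looping further.
    return None, max_exchanges, True
```

What happens after a forced stop (returning a best-effort answer, or escalating to the global planner) is a separate design choice that this sketch leaves open.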

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

≈40% failures due to weak global planner and insufficient page-specific knowledge

≈60% failures caused by iterative/repetitive actions and lack of memory

When Not To Use

When you need near-human reliability on complex web tasks today

When you cannot run repeated LLM calls due to cost or latency

Failure Modes

Poor global plan leads to wrong decomposition and downstream errors

Local agent repeats actions and exhausts exchange limits
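The repetition failure mode suggests a lightweight guard. The paper, as summarized here, implements no memory mechanism and only recommends adding one, so the following is an independent sketch of one possible approach rather than anything from CoAct: a counter over (observation, action) pairs that flags when the same pair recurs too often. `ActionMemory` and `max_repeats` are hypothetical names.

```python
from collections import Counter


class ActionMemory:
    """Counts (observation, action) pairs and flags excessive repeats."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.counts: Counter = Counter()

    def record(self, observation: str, action: str) -> bool:
        """Store the pair; return True if it has now exceeded max_repeats."""
        key = (observation, action)
        self.counts[key] += 1
        return self.counts[key] > self.max_repeats


memory = ActionMemory(max_repeats=2)
flags = [memory.record("search page", "click [search]") for _ in range(3)]
# flags is [False, False, True]: the third identical attempt is flagged
```

A flagged repeat could be used to trigger a replan request before the exchange limit is exhausted.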

Core Entities

Models

gpt-3.5-turbo-16k-0613

ReAct (baseline)

Metrics

Success Rate (SR)

Datasets

WebArena

Benchmarks

WebArena

Context Entities

Datasets

WebArena (Zhou et al., 2023)