CoAct: a two-tier global planner + local executor that improves long-horizon web task success

June 19, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.3

Citation Count

1

Authors

Xinming Hou, Mingming Yang, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Wayne Xin Zhao

Why It Matters For Business

A simple global/local agent split can reduce looped interactions and improve automation success on multi-step web tasks, making web automation more robust with modest engineering effort.

Summary TLDR

CoAct is a simple two-agent framework: a global planner that writes phased, high-level plans and a local executor that runs and checks each subtask. On the WebArena web-navigation benchmark, CoAct (with gpt-3.5-turbo-16k) raises average task success from 9.4% (ReAct baseline) to 13.8%, and to 16.0% with a force-stop cap on dialogue exchanges. Errors remain: of the failures analyzed, ~40% stem from weak global plans and ~60% from repetitive actions and missing memory. Adding short web-page search snippets to the planner improves results further on the tasks tested.

Problem Statement

Large language models still fail many multi-step web tasks because single-agent prompting hits attention and planning limits. Agents can loop on observations, repeat actions, and fail to replan globally. The paper asks: can a simple hierarchical multi-agent setup improve robustness on long-horizon web tasks?

Main Contribution

CoAct framework: a two-agent hierarchy (global planner + local executor) for phased task decomposition and replanning.

Empirical evaluation on the WebArena web-navigation benchmark showing consistent gains over ReAct using gpt-3.5-turbo-16k.

Analysis of failure modes and preliminary improvement by injecting short, page-specific search snippets into global planning.
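
The planner/executor split described above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: `run_coact_style`, `PhaseResult`, the `max_replans` limit, and the stub planner and executor are all hypothetical names, and in a real system both callables would wrap LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PhaseResult:
    phase: str
    ok: bool


# Hypothetical signatures: a planner maps (task, history) to a list of phases,
# an executor runs one phase and reports whether it succeeded.
Planner = Callable[[str, List[PhaseResult]], List[str]]
Executor = Callable[[str], PhaseResult]


def run_coact_style(task: str, plan: Planner, execute: Executor,
                    max_replans: int = 2) -> List[PhaseResult]:
    """Run phases from a global plan; replan globally when a phase fails."""
    history: List[PhaseResult] = []
    phases = plan(task, history)
    replans = 0
    while phases:
        result = execute(phases.pop(0))
        history.append(result)
        if not result.ok:
            if replans >= max_replans:
                break  # stop instead of replanning indefinitely
            replans += 1
            phases = plan(task, history)  # planner sees the failure history
    return history


# Stub planner: switches strategy once any phase has failed.
def demo_plan(task: str, history: List[PhaseResult]) -> List[str]:
    if any(not r.ok for r in history):
        return ["use category menu", "read price"]
    return ["open search page", "find product", "read price"]


# Stub executor: pretends the "find product" phase always fails.
def demo_execute(phase: str) -> PhaseResult:
    return PhaseResult(phase, ok=(phase != "find product"))


history = run_coact_style("find the price of item X", demo_plan, demo_execute)
for r in history:
    print(r.phase, "ok" if r.ok else "failed")
```

In this toy run the second phase fails, the planner is asked for a new plan, and the remaining phases come from the revised plan rather than the original one.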

Key Findings

CoAct raises average task success vs ReAct on WebArena.

Numbers: Avg SR: ReAct 9.4% → CoAct 13.8% (+4.4 pp, +47% relative)

A force-stop cap on dialogue exchanges further improves success.

Numbers: Avg SR: ReAct 9.4% → CoAct w/ FS 16.0% (+6.6 pp, +70% relative)

Adding short web-page search snippets to the planner boosts success on tested tasks.

Numbers: Shop: CoAct w/ FS 24% → + search engine 31%; GitLab: 10% → 19%

Failure breakdown on medium-difficulty cases: planning vs repetition.

Numbers: ≈40% planning inadequacies; ≈60% iterative/repetitive actions

Tasks split by difficulty in Shop: many are multi-step.

Numbers: Shop difficulty: Easy 30%, Medium 50%, Hard 20%

Results

Success Rate (Shop)

Value: ReAct 12.0% | CoAct 22.0% | CoAct w/ FS 24.0% | HUMAN -

Baseline: ReAct 12.0%

Success Rate (CMS)

Value: ReAct 11.0% | CoAct 14.0% | CoAct w/ FS 17.0%

Baseline: ReAct 11.0%

Success Rate (Reddit)

Value: ReAct 9.0% | CoAct 12.0% | CoAct w/ FS 14.0%

Baseline: ReAct 9.0%

Success Rate (GitLab)

Value: ReAct 7.0% | CoAct 9.0% | CoAct w/ FS 10.0%

Baseline: ReAct 7.0%

Success Rate (Map)

Value: ReAct 8.0% | CoAct 12.0% | CoAct w/ FS 15.0%

Baseline: ReAct 8.0%

Average Success Rate (all tasks)

Value: ReAct 9.4% | CoAct 13.8% | CoAct w/ FS 16.0% | HUMAN 78.2%

Baseline: ReAct 9.4%
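
The averages and relative gains can be checked against the per-site numbers. The sketch below assumes the reported average is an unweighted mean over the five sites. That is an assumption, since the summary does not state the weighting, but it reproduces the reported 9.4%, 13.8%, and 16.0%. The helper name `mean_sr` is illustrative.

```python
# Per-site success rates (%) copied from the Results section.
results = {
    "ReAct":       {"Shop": 12.0, "CMS": 11.0, "Reddit": 9.0,  "GitLab": 7.0,  "Map": 8.0},
    "CoAct":       {"Shop": 22.0, "CMS": 14.0, "Reddit": 12.0, "GitLab": 9.0,  "Map": 12.0},
    "CoAct w/ FS": {"Shop": 24.0, "CMS": 17.0, "Reddit": 14.0, "GitLab": 10.0, "Map": 15.0},
}


def mean_sr(per_site: dict) -> float:
    """Unweighted mean success rate across sites (assumed weighting)."""
    return sum(per_site.values()) / len(per_site)


averages = {name: round(mean_sr(sites), 1) for name, sites in results.items()}
baseline = averages["ReAct"]

# For each method: (absolute gain in percentage points, relative gain in %).
gains = {
    name: (round(avg - baseline, 1), round(100 * (avg - baseline) / baseline))
    for name, avg in averages.items()
}

for name in results:
    pp, rel = gains[name]
    print(f"{name}: avg {averages[name]}%, +{pp} pp, +{rel}% vs ReAct")
```

The absolute gains are in percentage points, while the +47% and +70% figures are relative to the 9.4% baseline.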

Who Should Care

What To Try In 7 Days

Prototype a global-plan + local-executor prompt split for a web automation workflow.

Add a dialogue/exchange cap (force-stop) to stop repetitive action loops.

Plug short, page-specific text snippets (<=100 words) into your planner to reduce planning errors.
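
As one way to try the snippet idea, the sketch below clips each page note to at most 100 words before placing it in a planner prompt. The function names and the prompt wording are hypothetical and are not taken from the paper; only the 100-word bound comes from the suggestion above.

```python
from typing import List


def clip_snippet(text: str, max_words: int = 100) -> str:
    """Keep at most max_words whitespace-separated words of a snippet."""
    return " ".join(text.split()[:max_words])


def build_planner_prompt(task: str, snippets: List[str],
                         max_words: int = 100) -> str:
    """Assemble a planner prompt with clipped, page-specific notes."""
    lines = [f"Task: {task}", "Page notes:"]
    lines += [f"- {clip_snippet(s, max_words)}" for s in snippets]
    lines.append("Write a phased high-level plan.")
    return "\n".join(lines)


long_note = " ".join(f"w{i}" for i in range(150))  # 150-word dummy snippet
prompt = build_planner_prompt("find the cheapest blue mug", [long_note])
print(prompt.splitlines()[0])
```

Clipping by word count keeps the planner context small, which matters here because accumulated context is one of the failure modes listed for this setup.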

Agent Features

Memory

  • no efficient memory mechanism implemented
  • recommendation: add memory/experience to avoid repetition
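
CoAct as summarized here has no memory mechanism, so the following only illustrates the recommendation and does not come from the paper. It is a minimal sketch that counts repeated (observation, action) pairs so an agent could be told to change strategy; `ActionMemory` and its `max_repeats` threshold are hypothetical.

```python
from collections import Counter
from typing import Hashable


class ActionMemory:
    """Counts how often the same action is taken on the same observation."""

    def __init__(self, max_repeats: int = 2) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def record(self, observation: Hashable, action: Hashable) -> bool:
        """Record one step; return True once the pair exceeds max_repeats."""
        key = (observation, action)
        self._seen[key] += 1
        return self._seen[key] > self.max_repeats


memory = ActionMemory(max_repeats=2)
# The same click on the same page three times: the third one is flagged.
flags = [memory.record("search_page", "click[search]") for _ in range(3)]
other = memory.record("search_page", "type[query]")  # a different action
print(flags, other)
```

A flag from `record` could be used to trigger a request for global replanning instead of letting the local agent repeat itself.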

Planning

  • macro-level global planning
  • local per-phase execution planning
  • replanning on agent request

Tool Use

  • web navigation actions (page operations)
  • search engine snippets for planning (optional)

Frameworks

  • CoAct

Is Agentic

true

Architectures

  • global-local (hierarchical) two-agent
  • phase-based task decomposition

Collaboration

  • request/revise/overrule interaction loop between agents
  • local agent validates and can request global replanning

Optimization Features

System Optimization

  • dialogue force-stop to reduce loops
  • context partitioning via phase prompts
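
A dialogue force-stop can be as simple as a bounded loop around the agent exchange. The sketch below is an assumption-laden illustration: `run_with_force_stop`, the `step` callable, and the cap values are hypothetical, and the summary does not state the cap used in the paper.

```python
from typing import Callable, Optional, Tuple


def run_with_force_stop(
    step: Callable[[int], Optional[str]],
    max_exchanges: int = 10,
) -> Tuple[Optional[str], int, bool]:
    """Call step(i) until it returns an answer or the cap is reached.

    Returns (answer, exchanges_used, force_stopped).
    """
    for i in range(max_exchanges):
        answer = step(i)
        if answer is not None:
            return answer, i + 1, False
    return None, max_exchanges, True  # cap hit: stop the loop


# An exchange that never converges is cut off at the cap.
stuck = run_with_force_stop(lambda i: None, max_exchanges=5)

# An exchange that finishes on its third turn returns normally.
done = run_with_force_stop(lambda i: "done" if i == 2 else None, max_exchanges=5)

print(stuck, done)
```

Returning the `force_stopped` flag lets the caller distinguish a real answer from a cut-off run, for example to log it as a repetition failure.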

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • ≈40% of analyzed failures are due to a weak global planner and insufficient page-specific knowledge
  • ≈60% of analyzed failures are caused by iterative/repetitive actions and lack of memory
  • Performance is still far below humans (human avg SR 78.2% vs CoAct 13.8%, or 16.0% with force-stop)
  • Relies on a proprietary LLM (gpt-3.5) and on WebArena's synthetic web environments

When Not To Use

  • When you need near-human reliability on complex web tasks today
  • When you cannot run repeated LLM calls due to cost or latency
  • When a single-step API or deterministic automation already suffices

Failure Modes

  • Poor global plan leads to wrong decomposition and downstream errors
  • Local agent repeats actions and exhausts exchange limits
  • Accumulated context keeps the agent from recognizing failure and switching strategies

Core Entities

Models

  • gpt-3.5-turbo-16k-0613
  • ReAct (baseline)

Metrics

  • Success Rate (SR)

Datasets

  • WebArena

Benchmarks

  • WebArena

Context Entities

Datasets

  • WebArena (Zhou et al., 2023)