Teach agents reusable web workflows from past traces to boost web-navigation success

September 11, 2024 · 8 min read

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.45

Citation Count

1

Authors

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig

Links

Abstract / PDF

Why It Matters For Business

Inducing and reusing compact workflows turns past agent traces into practical, reusable skills that increase success rates and reduce execution steps on web automation tasks, saving time and API costs.

Summary TLDR

The paper introduces Agent Workflow Memory (AWM): a lightweight system that extracts reusable sub-routines (workflows) from past agent trajectories and injects them into an agent's memory. AWM works offline (from training examples) or online (from successful test-time runs judged by an evaluator). On two web-navigation benchmarks it raises success rates substantially (WebArena: +12.0 absolute / +51.1% relative over a strong baseline; Mind2Web: +8.9 absolute step SR / +24.6% relative) while reducing the number of action steps. The method is implementation-light: workflows are text/code snippets stored in memory, induced by prompting an LLM, and used either as context or as callable high-level actions.

Problem Statement

Current LM-based agents struggle on long, multi-step web tasks because they do not learn reusable routines between tasks and thus cannot adapt over time. The paper aims to extract common sub-routines from past trajectories and store them as workflows in agent memory so future tasks run faster and more reliably.

Main Contribution

Agent Workflow Memory (AWM): a method that induces reusable sub-routines (workflows) from past trajectories and adds them to agent memory.

Supports both offline induction from annotated examples and online, supervision-free induction from judged successful runs.

Shows large, practical gains on two web-navigation benchmarks (WebArena and Mind2Web) with modest implementation changes.

Open-source code and prompts released to reproduce induction and usage: https://github.com/zorazrw/agent-workflow-memory
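As a rough illustration of what "adding workflows to agent memory" can look like when workflows are plain text, here is a minimal sketch. It is not the released implementation: the names `Workflow`, `WorkflowMemory`, and `build_system_prompt`, and the rendering format, are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """One reusable sub-routine: a short description plus abstract steps."""
    name: str
    description: str
    steps: list[str]


@dataclass
class WorkflowMemory:
    """Hypothetical container that renders workflows as extra prompt context."""
    workflows: list[Workflow] = field(default_factory=list)

    def add(self, workflow: Workflow) -> None:
        # Skip same-name duplicates so repeated induction does not bloat memory.
        if all(w.name != workflow.name for w in self.workflows):
            self.workflows.append(workflow)

    def render(self) -> str:
        blocks = []
        for w in self.workflows:
            lines = [f"## Workflow: {w.name}", w.description]
            lines += [f"{i}. {step}" for i, step in enumerate(w.steps, start=1)]
            blocks.append("\n".join(lines))
        return "\n\n".join(blocks)


def build_system_prompt(base_prompt: str, memory: WorkflowMemory) -> str:
    """Append rendered workflows to the agent's base instructions."""
    rendered = memory.render()
    return base_prompt if not rendered else f"{base_prompt}\n\n{rendered}"
```

Keeping workflows as text means the agent loop itself does not change; only the prompt grows as memory fills.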

Key Findings

AWM raises overall success rate on WebArena versus a strong autonomous baseline.

Numbers: AWM 35.5 SR vs BrowserGym 23.5 SR; +12.0 abs (+51.1% rel)

AWM improves step-wise success on Mind2Web by extracting reusable sub-routines.

Numbers: Step SR AWM (gpt-4) 45.1 vs MindAct 36.2; +8.9 abs (+24.6% rel)

AWM reduces the number of actions needed to reach goals.

Numbers: AWM uses ~2.0 fewer steps per successful task vs BrowserGym; 40.8 fewer steps vs AutoEval

Online AWM generalizes better as train-test distribution gaps grow.

Numbers: AWM online beats baselines by 8.9–14.0 absolute points as gaps widen
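The absolute and relative gains quoted above follow directly from the reported scores. The short check below recomputes them; the helper name `gains` is an assumption for this example.

```python
def gains(baseline: float, value: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = value - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


# WebArena task success rate: BrowserGym 23.5 vs AWM 35.5
print(gains(23.5, 35.5))  # (12.0, 51.1)

# Mind2Web step success rate: MindAct 36.2 vs AWM 45.1
print(gains(36.2, 45.1))  # (8.9, 24.6)
```

Relative gain is measured against the baseline score, which is why a 12-point gain over 23.5 reads as roughly 51%.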

Results

Total task success rate (WebArena, gpt-4)

Value: 35.5% (AWM)

Baseline: 23.5% (BrowserGym)

Step success rate (Mind2Web cross-task, gpt-4)

Value: 45.1% (AWM)

Baseline: 36.2% (MindAct)

Element accuracy (Mind2Web cross-task, gpt-4)

Value: 50.6% (AWM)

Baseline: 41.6% (MindAct)

Average steps per successful task (WebArena, gpt-4)

Value: 5.9 steps (AWM)

Baseline: 7.9 steps (BrowserGym ax-tree)

What To Try In 7 Days

Log several successful agent runs per site and prompt an LLM to extract common sub-routines as workflows.

Add those workflows to agent system prompts and re-run a small test set to measure task- and step-level success.

Run AWM online (streaming) on production queries to adapt to site drift without collecting new annotated data.
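The three steps above can be sketched as a single streaming loop: attempt a task with the current workflows, keep the trajectory only if an evaluator judges it successful, and induce a new workflow from it. This is a minimal sketch, not the paper's code. The `agent`, `judge`, and `induce` callables are placeholders you would back with your own agent, evaluator, and LLM call, and the prompt wording is invented for illustration.

```python
from typing import Callable

Trajectory = list[str]  # one string per executed action


def build_induction_prompt(task: str, trajectory: Trajectory) -> str:
    """Assemble a hypothetical induction prompt from one judged-successful run."""
    actions = "\n".join(f"- {a}" for a in trajectory)
    return (
        f"Task: {task}\n"
        f"Actions:\n{actions}\n"
        "Extract the reusable sub-routine as a short workflow, "
        "replacing task-specific values with named placeholders."
    )


def run_online(
    tasks: list[str],
    agent: Callable[[str, list[str]], Trajectory],
    judge: Callable[[str, Trajectory], bool],
    induce: Callable[[str], str],
) -> list[str]:
    """Process tasks in a stream, growing workflow memory from successes only."""
    workflows: list[str] = []
    for task in tasks:
        # The agent sees every workflow induced so far.
        trajectory = agent(task, workflows)
        # Only runs the evaluator accepts are allowed to shape memory.
        if judge(task, trajectory):
            workflow = induce(build_induction_prompt(task, trajectory))
            if workflow not in workflows:
                workflows.append(workflow)
    return workflows
```

Because memory is gated only by `judge`, the quality of the evaluator bounds the quality of the induced workflows; a small manual audit of accepted runs is a cheap safeguard during a first trial.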

Agent Features

Memory

  • textual workflow memory (offline induced or streamed online)

Planning

  • observe-act loop
  • callable workflow actions (AWM AS)

Tool Use

  • primitive web actions (CLICK, TYPE, select, fill)
  • workflow actions as high-level tools

Frameworks

  • BrowserGym
  • AutoEval

Is Agentic

true

Architectures

  • LM-based agent (text backbone)

Optimization Features

Token Efficiency

  • Abstract workflows reduce example-specific context to save prompt space
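To make the idea of abstraction concrete, the sketch below replaces example-specific values in a trajectory with named placeholders. Plain string replacement stands in here for the LLM prompting the summary describes; the function name `abstract_steps` and the sample values are assumptions for this example.

```python
def abstract_steps(steps: list[str], values: dict[str, str]) -> list[str]:
    """Replace example-specific values with {placeholder} names."""
    abstracted = []
    # Replace longer values first so a short value never clobbers part of a longer one.
    ordered = sorted(values.items(), key=lambda kv: -len(kv[1]))
    for step in steps:
        for name, value in ordered:
            step = step.replace(value, "{" + name + "}")
        abstracted.append(step)
    return abstracted


concrete = ['TYPE "dry cat food" into search box', 'CLICK "Add to Cart"']
print(abstract_steps(concrete, {"product-name": "dry cat food"}))
```

The abstracted steps apply to any product, so one workflow can replace several near-duplicate examples in the prompt.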

Reproducibility

Data Urls

  • WebArena (Zhou et al., 2024)
  • Mind2Web (Deng et al., 2023)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Online induction depends on correct automatic evaluation; wrong judgments can create harmful workflows (§2.3, §3.2).
  • Workflows can bias actions and lower action F1 when the workflow is not perfectly relevant to the current state (§3.2.1).
  • Workflow actions are brittle under dynamic intermediate states (pop-up choices), limiting callable-action utility (§5).
  • Combining offline and online workflows can conflict and reduce utility; not strictly additive (§C).
  • Adding both NL descriptions and filtered HTML increases context length and can harm performance; filtered HTML misses correct elements 47% of the time (§4.3).

When Not To Use

  • When tasks require highly interactive decision-making at each step (dynamic pop-ups or unpredictable intermediate choices).
  • When you cannot obtain any successful trajectories or a reasonably accurate automatic evaluator.
  • When prompt context budget is extremely tight and induced workflows cannot be compressed.

Failure Modes

  • Incorrectly induced workflows (from mis-evaluated runs) degrade downstream performance (§3.2.2).
  • Workflow actions are rigid and may skip necessary branching, causing failures in dynamic environments (§5).
  • Long combined context (NL + HTML + many workflows) overwhelms the LLM and reduces accuracy (§4.3).
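One way to limit the long-context failure mode is to cap how many workflows enter the prompt. The sketch below is an assumption-laden illustration, not something the paper describes: it ranks workflows by word overlap with the task and packs them greedily under a budget, using a whitespace word count as a crude stand-in for a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def select_workflows(task: str, workflows: list[str], budget: int) -> list[str]:
    """Greedily pick task-relevant workflows that fit within the budget."""
    task_words = set(task.lower().split())

    def overlap(workflow: str) -> int:
        return len(task_words & set(workflow.lower().split()))

    chosen: list[str] = []
    used = 0
    # sorted() is stable, so equally relevant workflows keep their memory order.
    for wf in sorted(workflows, key=overlap, reverse=True):
        cost = estimate_tokens(wf)
        # Drop workflows that share no words with the task or would exceed the budget.
        if overlap(wf) > 0 and used + cost <= budget:
            chosen.append(wf)
            used += cost
    return chosen
```

Word overlap is a deliberately simple relevance signal; an embedding-based ranker could be swapped in behind the same interface.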

Core Entities

Models

  • gpt-4
  • gpt-3.5-turbo

Metrics

  • task success rate
  • step success rate
  • Accuracy
  • action F1
  • average # steps

Datasets

  • WebArena
  • Mind2Web

Benchmarks

  • WebArena
  • Mind2Web

Context Entities

Models

  • AutoEval (neural evaluator)

Metrics

  • coverage
  • workflow utility rate
  • function overlap

Datasets

  • WebArena cross-template subset
  • Mind2Web cross-task / cross-website / cross-domain splits

Benchmarks

  • WebArena (execution-based)
  • Mind2Web (step-level evaluation)