Teach agents reusable web workflows from past traces to boost web-navigation success

September 11, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The method is simple and demonstrated on two public benchmarks using GPT models; gains are clear but rely on LLM quality and evaluator correctness.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 65%

Authors

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Inducing and reusing compact workflows turns past agent traces into practical, reusable skills that increase success rates and reduce execution steps on web automation tasks, saving time and API costs.

Summary TLDR

The paper introduces Agent Workflow Memory (AWM), a lightweight system that extracts reusable sub-routines (workflows) from past agent trajectories and injects them into an agent's memory. AWM works offline (from training examples) or online (from test-time runs that an evaluator judges successful). On two web-navigation benchmarks it raises success rates substantially while reducing the number of action steps: on WebArena, +12.0 absolute (+51.1% relative) over a strong baseline; on Mind2Web, +8.9 absolute step SR (+24.6% relative). The method is light to implement: workflows are text or code snippets stored in memory, induced by prompting an LLM, and used either as context or as callable high-level actions.

Problem Statement

Current LM-based agents struggle on long, multi-step web tasks because they do not learn reusable routines between tasks and thus cannot adapt over time. The paper aims to extract common sub-routines from past trajectories and store them as workflows in agent memory so future tasks run faster and more reliably.

Main Contribution

Agent Workflow Memory (AWM): a method that induces reusable sub-routines (workflows) from past trajectories and adds them to agent memory.

Supports both offline induction from annotated examples and online, supervision-free induction from judged successful runs.
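The two induction modes can be sketched as follows. This is a minimal illustration, not the authors' code: `Trajectory`, `WorkflowMemory`, and `toy_induce` are hypothetical names, and `toy_induce` is a deterministic stand-in (shared action prefix) for the LLM prompt that AWM actually uses to induce workflows.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    task: str
    actions: List[str]
    success: bool  # annotated label offline; evaluator verdict online


# Hypothetical signature: in AWM this step is an LLM prompt, not a function.
Inducer = Callable[[List[Trajectory]], List[str]]


def toy_induce(trajs: List[Trajectory]) -> List[str]:
    """Stand-in inducer: the longest action prefix shared by all trajectories."""
    if not trajs:
        return []
    prefix = list(trajs[0].actions)
    for t in trajs[1:]:
        n = 0
        while n < min(len(prefix), len(t.actions)) and prefix[n] == t.actions[n]:
            n += 1
        prefix = prefix[:n]
    # Keep only prefixes of at least two steps; a single step is not a routine.
    return [" -> ".join(prefix)] if len(prefix) >= 2 else []


@dataclass
class WorkflowMemory:
    induce: Inducer
    workflows: List[str] = field(default_factory=list)

    def add_offline(self, examples: List[Trajectory]) -> None:
        """Induce once from annotated training examples."""
        self._extend(self.induce(examples))

    def add_online(self, traj: Trajectory) -> None:
        """Induce from one test-time run, only if it was judged successful."""
        if traj.success:
            self._extend(self.induce([traj]))

    def _extend(self, new: List[str]) -> None:
        for w in new:
            if w not in self.workflows:
                self.workflows.append(w)
```

In both modes the memory is just a growing list of text snippets; the only difference is where the success signal comes from.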

Key Findings

AWM raises overall success rate on WebArena versus a strong autonomous baseline.

Numbers: AWM 35.5 SR vs BrowserGym 23.5 SR; +12.0 abs (+51.1% rel)

Practical Use: If you run web agents, adding induced workflows to agent memory can raise task success by double-digit points on realistic web tasks; try inducing site-level workflows from a few solved runs.

Evidence Ref: Table 1; §3.1.1

AWM improves step-wise success on Mind2Web by extracting reusable sub-routines.

Numbers: Step SR AWM (gpt-4) 45.1 vs MindAct 36.2; +8.9 abs (+24.6% rel)

Practical Use: Replacing raw-example context with abstract workflows helps the agent pick correct page elements more often; use workflows instead of copying full trajectories for better element selection.

Evidence Ref: Table 3; §3.2.1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Total task success rate (WebArena, gpt-4) | 35.5% (AWM) | 23.5% (BrowserGym) | +12.0 abs (+51.1% rel) | WebArena (overall) | Table 1; §3.1.1 | Table 1 |
| Step success rate (Mind2Web cross-task, gpt-4) | 45.1% (AWM) | 36.2% (MindAct) | +8.9 abs (+24.6% rel) | Mind2Web cross-task | Table 3; §3.2.1 | Table 3 |
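The absolute and relative deltas reported here follow directly from the value and baseline columns. A short check, where `deltas` is an illustrative helper rather than anything from the paper:

```python
from typing import Tuple


def deltas(new: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    abs_gain = new - baseline
    rel_gain = 100.0 * abs_gain / baseline
    return round(abs_gain, 1), round(rel_gain, 1)


webarena = deltas(35.5, 23.5)   # AWM vs BrowserGym, task success rate
mind2web = deltas(45.1, 36.2)   # AWM vs MindAct, step success rate
print(webarena, mind2web)
```

The relative figure divides by the baseline, so the same absolute gain looks larger against a weaker baseline; compare absolute points when baselines differ.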

What To Try In 7 Days

Log several successful agent runs per site and prompt an LLM to extract common sub-routines as workflows.

Add those workflows to agent system prompts and re-run a small test set to measure task- and step-level success.

Run AWM online (streaming) on production queries to adapt to site drift without collecting new annotated data.
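The first two steps above can be sketched as plain string assembly. The prompt wording, the function names, and the `shop.example` site are assumptions for illustration; the paper's actual prompts are not reproduced here, and the call to an LLM is deliberately left out.

```python
from typing import List


def build_induction_prompt(site: str, runs: List[List[str]]) -> str:
    """Format logged successful runs into a workflow-extraction prompt.

    The instruction text is illustrative, not the paper's prompt.
    """
    lines = [
        f"Below are successful agent runs on {site}.",
        "Extract sub-routines that recur across runs as reusable workflows.",
        "Replace run-specific values with named placeholders.",
        "",
    ]
    for i, run in enumerate(runs, start=1):
        lines.append(f"Run {i}:")
        lines.extend(f"  {step}" for step in run)
    return "\n".join(lines)


def augment_system_prompt(base: str, workflows: List[str]) -> str:
    """Append induced workflows to the agent's system prompt."""
    if not workflows:
        return base
    listed = "\n".join(f"- {w}" for w in workflows)
    return f"{base}\n\nWorkflows that worked on this site before:\n{listed}"


prompt = build_induction_prompt(
    "shop.example",
    [["CLICK(search_box)", "TYPE(search_box, 'mug')", "CLICK(search_button)"]],
)
```

Send `prompt` to whichever LLM you already use, parse the returned workflows into a list, and pass that list to `augment_system_prompt` before re-running the test set.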

Agent Features

Memory
textual workflow memory (offline induced or streamed online)
Planning
observe-act loop; callable workflow actions (AWM AS)
Tool Use
primitive web actions (CLICK, TYPE, select, fill); workflow actions as high-level tools
Frameworks
BrowserGym, AutoEval
Is Agentic

Yes

Architectures
LM-based agent (text backbone)
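The idea of workflow actions as high-level tools can be sketched as a registry that expands one call into a fixed sequence of primitive actions. Everything named here (`find_product`, `WORKFLOW_ACTIONS`, `expand`, the element ids) is hypothetical and only illustrates the shape of the mechanism.

```python
from typing import Callable, Dict, List

Action = str  # one primitive action rendered as text, e.g. "CLICK(search_box)"


def find_product(name: str) -> List[Action]:
    """Hypothetical workflow: expands to a fixed primitive sequence."""
    return [
        "CLICK(search_box)",
        f"TYPE(search_box, {name!r})",
        "CLICK(search_button)",
    ]


# Registry of callable workflows exposed to the agent alongside primitives.
WORKFLOW_ACTIONS: Dict[str, Callable[..., List[Action]]] = {
    "find_product": find_product,
}


def expand(action_name: str, *args: str) -> List[Action]:
    """Expand a workflow call into primitives; pass primitive actions through."""
    if action_name in WORKFLOW_ACTIONS:
        return WORKFLOW_ACTIONS[action_name](*args)
    return [f"{action_name}({', '.join(args)})"]
```

Because the expanded sequence runs without re-observing the page between steps, this form trades flexibility for fewer agent turns, which is the rigidity noted under Failure Modes.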

Optimization Features

Token Efficiency
Abstract workflows reduce example-specific context to save prompt space
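One way to picture the saving: once run-specific values are replaced by placeholders, trajectories that differ only in those values collapse into a single workflow. The sketch below assumes the concrete values are already known per run; in AWM the abstraction is done by the LLM during induction, and `abstract_steps` and `dedupe` are illustrative names.

```python
from typing import Dict, List


def abstract_steps(steps: List[str], values: Dict[str, str]) -> List[str]:
    """Replace each concrete value with a {placeholder} named after its key."""
    out = []
    for step in steps:
        for name, value in values.items():
            step = step.replace(value, "{" + name + "}")
        out.append(step)
    return out


def dedupe(workflows: List[List[str]]) -> List[List[str]]:
    """Drop workflows that became identical after abstraction, keeping order."""
    seen, out = set(), []
    for w in workflows:
        key = tuple(w)
        if key not in seen:
            seen.add(key)
            out.append(w)
    return out


run_a = ["TYPE(search_box, 'dry cat food')", "CLICK(search_button)"]
run_b = ["TYPE(search_box, 'red mug')", "CLICK(search_button)"]
abstracted = dedupe([
    abstract_steps(run_a, {"product-name": "dry cat food"}),
    abstract_steps(run_b, {"product-name": "red mug"}),
])
```

Two concrete examples become one abstract entry in the prompt, so the context grows with the number of distinct routines rather than the number of logged runs.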

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

WebArena (Zhou et al., 2024); Mind2Web (Deng et al., 2023)

Risks & Boundaries

Limitations

Online induction depends on correct automatic evaluation; wrong judgments can create harmful workflows (§2.3, §3.2).

Workflows can bias actions and lower action F1 when the workflow is not perfectly relevant to the current state (§3.2.1).

When Not To Use

When tasks require highly interactive decision-making at each step (dynamic pop-ups or unpredictable intermediate choices).

When you cannot obtain any successful trajectories or a reasonably accurate automatic evaluator.

Failure Modes

Incorrectly induced workflows (from mis-evaluated runs) degrade downstream performance (§3.2.2).

Workflow actions are rigid and may skip necessary branching, causing failures in dynamic environments (§5).
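The evaluator dependence can be measured before going live. The sketch below assumes you have a small audited set where both the evaluator verdict and the true outcome are known; `JudgedRun` and `contamination_rate` are hypothetical names, and this audit is a suggestion rather than a procedure from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgedRun:
    workflow: str         # workflow that would be induced from this run
    judged_success: bool  # evaluator verdict (all that online induction sees)
    true_success: bool    # ground truth (available only in an offline audit)


def admitted(runs: List[JudgedRun]) -> List[str]:
    """Workflows that online induction would add to memory."""
    return [r.workflow for r in runs if r.judged_success]


def contaminated(runs: List[JudgedRun]) -> List[str]:
    """Admitted workflows that came from runs the evaluator misjudged."""
    return [r.workflow for r in runs if r.judged_success and not r.true_success]


def contamination_rate(runs: List[JudgedRun]) -> float:
    """Share of admitted workflows that stem from false-positive verdicts."""
    total = len(admitted(runs))
    return len(contaminated(runs)) / total if total else 0.0


audit = [
    JudgedRun("w1", True, True),
    JudgedRun("w2", True, False),   # false positive: bad workflow admitted
    JudgedRun("w3", False, True),   # false negative: good workflow lost
    JudgedRun("w4", True, True),
]
rate = contamination_rate(audit)
```

A high rate on the audited set suggests fixing the evaluator, or inducing offline from verified runs, before enabling online induction.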

Core Entities

Models

gpt-4, gpt-3.5-turbo

Metrics

task success rate, step success rate, accuracy, action F1, average # steps

Datasets

WebArena, Mind2Web

Benchmarks

WebArena, Mind2Web

Context Entities

Models

AutoEval (neural evaluator)

Metrics

coverage, workflow utility rate, function overlap

Datasets

WebArena cross-template subset; Mind2Web cross-task / cross-website / cross-domain splits

Benchmarks

WebArena (execution-based); Mind2Web (step-level evaluation)