Teach agents reusable web workflows from past traces to boost web-navigation success

Overview

Decision Snapshot: Needs Validation

The method is simple and demonstrated on two public benchmarks using GPT models; gains are clear but rely on LLM quality and evaluator correctness.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 65%

Authors

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Inducing and reusing compact workflows turns past agent traces into practical, reusable skills that increase success rates and reduce execution steps on web automation tasks, saving time and API costs.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

The paper introduces Agent Workflow Memory (AWM): a lightweight system that extracts reusable sub-routines (workflows) from past agent trajectories and injects them into an agent's memory. AWM works offline (from training examples) or online (from successful test-time runs judged by an evaluator). On two web-navigation benchmarks it raises success rates substantially (WebArena: +12.0 absolute / +51.1% relative over a strong baseline; Mind2Web: +8.9 absolute step SR / +24.6% relative) while reducing the number of action steps. The method is implementation-light: workflows are text/code snippets stored in memory, induced by prompting an LLM, and used either as context or as callable high-level actions.

Problem Statement

Current LM-based agents struggle on long, multi-step web tasks because they do not learn reusable routines between tasks and thus cannot adapt over time. The paper aims to extract common sub-routines from past trajectories and store them as workflows in agent memory so future tasks run faster and more reliably.

Main Contribution

Agent Workflow Memory (AWM): a method that induces reusable sub-routines (workflows) from past trajectories and adds them to agent memory.

Supports both offline induction from annotated examples and online, supervision-free induction from judged successful runs.
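The two induction modes can be sketched as a small memory class. This is a minimal illustration, not the authors' implementation: `Trajectory`, `WorkflowMemory`, and `toy_induce` are hypothetical names, and `toy_induce` stands in for the LLM prompt that the paper uses to induce workflows.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    task: str          # natural-language instruction
    steps: List[str]   # actions the agent took, in order


@dataclass
class WorkflowMemory:
    # `induce` is a hypothetical stand-in for the LLM prompt that turns
    # trajectories into workflow text; any callable with this shape works.
    induce: Callable[[List[Trajectory]], List[str]]
    workflows: List[str] = field(default_factory=list)

    def add_offline(self, examples: List[Trajectory]) -> None:
        """Offline mode: induce from annotated examples before deployment."""
        self._merge(self.induce(examples))

    def add_online(self, run: Trajectory, judged_success: bool) -> None:
        """Online mode: induce only from runs the evaluator judged successful."""
        if judged_success:
            self._merge(self.induce([run]))

    def _merge(self, new_workflows: List[str]) -> None:
        # Skip exact duplicates so memory does not grow with repeats.
        for wf in new_workflows:
            if wf not in self.workflows:
                self.workflows.append(wf)


def toy_induce(trajectories: List[Trajectory]) -> List[str]:
    # Toy inducer (assumption): one workflow per trajectory, joining its steps.
    return [" -> ".join(t.steps) for t in trajectories]


memory = WorkflowMemory(induce=toy_induce)
memory.add_offline([Trajectory("find a product", ["TYPE(search)", "CLICK(go)"])])
memory.add_online(Trajectory("log in", ["TYPE(user)", "CLICK(login)"]), judged_success=True)
memory.add_online(Trajectory("bad run", ["CLICK(wrong)"]), judged_success=False)
print(memory.workflows)
```

The `judged_success` flag is where evaluator quality enters: a run that is wrongly judged successful would be induced into memory just like a correct one.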

Key Findings

AWM raises overall success rate on WebArena versus a strong autonomous baseline.

Numbers: AWM 35.5 SR vs BrowserGym 23.5 SR; +12.0 abs (+51.1% rel)

Practical Use: If you run web agents, adding induced workflows to agent memory can improve task success by double-digit absolute points on realistic web tasks; try inducing site-level workflows from a few solved runs.

Evidence Ref: Table 1; §3.1.1

AWM improves step-wise success on Mind2Web by extracting reusable sub-routines.

Numbers: Step SR AWM (gpt-4) 45.1 vs MindAct 36.2; +8.9 abs (+24.6% rel)

Practical Use: Replacing raw-example context with abstract workflows helps the agent pick correct page elements more often; use workflows instead of copying full trajectories for better element selection.

Evidence Ref: Table 3; §3.2.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Total task success rate (WebArena, gpt-4)	35.5% (AWM)	23.5% (BrowserGym)	+12.0 abs (+51.1% rel)	WebArena (overall)	Table 1; §3.1.1	Table 1
Step success rate (Mind2Web cross-task, gpt-4)	45.1% (AWM)	36.2% (MindAct)	+8.9 abs (+24.6% rel)	Mind2Web cross-task	Table 3; §3.2.1	Table 3
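
The absolute and relative deltas in the table follow directly from the reported values and baselines. A short check of the arithmetic (the helper name `deltas` is illustrative only):

```python
def deltas(value: float, baseline: float) -> tuple:
    """Return (absolute delta in points, relative delta in percent), rounded to 1 decimal."""
    abs_delta = value - baseline
    rel_delta = 100.0 * abs_delta / baseline
    return round(abs_delta, 1), round(rel_delta, 1)


print(deltas(35.5, 23.5))  # WebArena task success: (12.0, 51.1)
print(deltas(45.1, 36.2))  # Mind2Web step success: (8.9, 24.6)
```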

What To Try In 7 Days

Log several successful agent runs per site and prompt an LLM to extract common sub-routines as workflows.

Add those workflows to agent system prompts and re-run a small test set to measure task- and step-level success.

Run AWM online (streaming) on production queries to adapt to site drift without collecting new annotated data.
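
The first two steps above could look roughly like the sketch below. The prompt wording, the run format (dicts with `task` and `steps`), and all function names are assumptions for illustration; the actual LLM call is left out, so the sketch only covers building the induction prompt, injecting workflows into a system prompt, and scoring a small test set.

```python
from typing import Dict, List


def build_induction_prompt(site: str, runs: List[Dict]) -> str:
    """Format logged successful runs into a prompt asking for common sub-routines."""
    lines = [f"Website: {site}",
             "Extract sub-routines that recur across these successful runs."]
    for i, run in enumerate(runs, start=1):
        lines.append(f"Run {i}: {run['task']}")
        lines.extend(f"  {step}" for step in run["steps"])
    return "\n".join(lines)


def build_system_prompt(base_prompt: str, workflows: List[str]) -> str:
    """Append induced workflows to the agent's system prompt."""
    if not workflows:
        return base_prompt
    listed = "\n".join(f"- {wf}" for wf in workflows)
    return f"{base_prompt}\n\nReusable workflows:\n{listed}"


def success_rate(outcomes: List[bool]) -> float:
    """Share of test tasks (or steps) that succeeded, as a percentage."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0


runs = [
    {"task": "find a laptop", "steps": ["TYPE(search_box, 'laptop')", "CLICK(search_button)"]},
    {"task": "find a desk", "steps": ["TYPE(search_box, 'desk')", "CLICK(search_button)"]},
]
induction_prompt = build_induction_prompt("shop.example", runs)
system_prompt = build_system_prompt("You are a web agent.", ["search for {item}"])
```

Comparing `success_rate` on the same test set with and without the injected workflows gives the before/after measurement the second step asks for.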

Agent Features

Memory

textual workflow memory (offline induced or streamed online)

Planning

observe-act loop; callable workflow actions (AWM AS)

Tool Use

primitive web actions (CLICK, TYPE, select, fill); workflow actions as high-level tools
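
To make "workflow actions as high-level tools" concrete, here is a minimal sketch in which a workflow is an ordinary function that chains primitive actions. `RecordingBrowser`, `search_and_open`, and the element ids are hypothetical; the stub browser only records calls and is not the BrowserGym API.

```python
from typing import List


class RecordingBrowser:
    """Hypothetical stand-in for a browser environment; it only records actions."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def click(self, element_id: str) -> None:
        self.log.append(f"CLICK({element_id})")

    def type_text(self, element_id: str, text: str) -> None:
        self.log.append(f"TYPE({element_id}, {text!r})")


def search_and_open(browser: RecordingBrowser, query: str,
                    search_box: str = "search_box",
                    submit: str = "search_button",
                    first_result: str = "result_0") -> None:
    """Workflow action: one high-level call that expands to three primitive actions."""
    browser.type_text(search_box, query)
    browser.click(submit)
    browser.click(first_result)


browser = RecordingBrowser()
search_and_open(browser, "laptop")
print(browser.log)
```

The agent issues one action instead of three, which is how workflow actions can cut step counts. The fixed sequence is also what makes them rigid: nothing in the function looks at the page between steps.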

Frameworks

BrowserGym, AutoEval

Is Agentic

Yes

Architectures

LM-based agent (text backbone)

Optimization Features

Token Efficiency

Abstract workflows reduce example-specific context to save prompt space
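
A small sketch of why abstraction saves prompt space: two concrete runs collapse into one workflow once example-specific values become named placeholders. Plain string replacement is used here as a stand-in; the paper induces workflows by prompting an LLM, and `abstract_steps` and the `{product_name}` placeholder are illustrative assumptions.

```python
from typing import Dict, List


def abstract_steps(steps: List[str], bindings: Dict[str, str]) -> List[str]:
    """Replace each concrete value with a {placeholder} named by its binding key."""
    abstracted = []
    for step in steps:
        for name, value in bindings.items():
            step = step.replace(value, "{" + name + "}")
        abstracted.append(step)
    return abstracted


run_a = ["TYPE(search_box, 'dry cat food')", "CLICK(search_button)"]
run_b = ["TYPE(search_box, 'usb-c cable')", "CLICK(search_button)"]

wf_a = abstract_steps(run_a, {"product_name": "dry cat food"})
wf_b = abstract_steps(run_b, {"product_name": "usb-c cable"})

# Both runs reduce to the same workflow, so only one copy goes in the prompt.
raw_chars = sum(len(s) for s in run_a + run_b)
abstract_chars = sum(len(s) for s in wf_a)
print(wf_a, raw_chars, abstract_chars)
```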

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/zorazrw/agent-workflow-memory

Data URLs

WebArena (Zhou et al., 2024); Mind2Web (Deng et al., 2023)

Risks & Boundaries

Limitations

Online induction depends on correct automatic evaluation; wrong judgments can create harmful workflows (§2.3, §3.2).

Workflows can bias actions and lower action F1 when the workflow is not perfectly relevant to the current state (§3.2.1).

When Not To Use

When tasks require highly interactive decision-making at each step (dynamic pop-ups or unpredictable intermediate choices).

When you cannot obtain any successful trajectories or a reasonably accurate automatic evaluator.

Failure Modes

Incorrectly induced workflows (from mis-evaluated runs) degrade downstream performance (§3.2.2).

Workflow actions are rigid and may skip necessary branching, causing failures in dynamic environments (§5).

Core Entities

Models

gpt-4, gpt-3.5-turbo

Metrics

task success rate, step success rate, accuracy, action F1, average # steps

Datasets

WebArena, Mind2Web

Benchmarks

WebArena, Mind2Web

Context Entities

Models

AutoEval (neural evaluator)

Metrics

coverage, workflow utility rate, function overlap

Datasets

WebArena cross-template subset; Mind2Web cross-task / cross-website / cross-domain splits

Benchmarks

WebArena (execution-based); Mind2Web (step-level evaluation)

