Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.45
Citation Count
1
Why It Matters For Business
Inducing and reusing compact workflows turns past agent traces into practical, reusable skills that increase success rates and reduce execution steps on web automation tasks, saving time and API costs.
Summary TLDR
The paper introduces Agent Workflow Memory (AWM): a lightweight system that extracts reusable sub-routines (workflows) from past agent trajectories and injects them into an agent's memory. AWM works offline (from training examples) or online (from successful test-time runs judged by an evaluator). On two web-navigation benchmarks it raises success rates substantially (WebArena: +12.0 absolute / +51.1% relative over a strong baseline; Mind2Web: +8.9 absolute step success rate / +24.6% relative) while reducing the number of action steps. The method is implementation-light: workflows are text/code snippets stored in memory, induced by prompting an LLM, and used either as context or as callable high-level actions.
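The absolute and relative gains quoted in the summary together imply the underlying scores. The short sketch below back-solves them, assuming the relative gain is computed as absolute gain divided by baseline; the function name is illustrative and the results are rounded estimates, not figures copied from the paper.

```python
def implied_scores(abs_gain: float, rel_gain_pct: float) -> tuple[float, float]:
    """Recover (baseline, new score) from an absolute gain in points
    and a relative gain in percent, assuming rel = abs / baseline."""
    baseline = abs_gain / (rel_gain_pct / 100.0)
    return baseline, baseline + abs_gain


# Gains as quoted in the summary above.
webarena = implied_scores(12.0, 51.1)
mind2web = implied_scores(8.9, 24.6)

print(f"WebArena task success: baseline ~{webarena[0]:.1f}, with AWM ~{webarena[1]:.1f}")
print(f"Mind2Web step success: baseline ~{mind2web[0]:.1f}, with AWM ~{mind2web[1]:.1f}")
```

Under that assumption the baselines come out at roughly 23.5 (WebArena) and 36.2 (Mind2Web); small differences from the reported tables are possible because the quoted gains are themselves rounded.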
Problem Statement
Current LM-based agents struggle on long, multi-step web tasks because they do not learn reusable routines between tasks and thus cannot adapt over time. The paper aims to extract common sub-routines from past trajectories and store them as workflows in agent memory so future tasks run faster and more reliably.
Main Contribution
Agent Workflow Memory (AWM): a method that induces reusable sub-routines (workflows) from past trajectories and adds them to agent memory.
Supports both offline induction from annotated examples and online, supervision-free induction from judged successful runs.
Shows large, practical gains on two web-navigation benchmarks (WebArena and Mind2Web) with modest implementation changes.
Open-source code and prompts released to reproduce induction and usage: https://github.com/zorazrw/agent-workflow-memory
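As a rough illustration of the idea, a workflow memory can be modeled as a list of named, abstracted step sequences that is rendered into prompt text. This is a minimal sketch, not the released implementation: the class names, the rendering format, and the induction prompt wording are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Workflow:
    """A reusable sub-routine: a short description plus abstracted steps."""
    name: str
    description: str
    steps: List[str]

    def render(self) -> str:
        lines = [f"## {self.name}: {self.description}"]
        lines += [f"{i}. {step}" for i, step in enumerate(self.steps, start=1)]
        return "\n".join(lines)


@dataclass
class WorkflowMemory:
    workflows: List[Workflow] = field(default_factory=list)

    def add(self, workflow: Workflow) -> bool:
        """Store a workflow unless one with the same name already exists."""
        if any(w.name == workflow.name for w in self.workflows):
            return False
        self.workflows.append(workflow)
        return True

    def as_context(self) -> str:
        """Text block that could be appended to the agent's prompt."""
        return "\n\n".join(w.render() for w in self.workflows)


def build_induction_prompt(trajectories: List[str]) -> str:
    """Hypothetical prompt wording; not the prompt released with the paper."""
    joined = "\n\n".join(
        f"Trajectory {i}:\n{t}" for i, t in enumerate(trajectories, start=1)
    )
    return (
        "Extract sub-routines that recur across these trajectories. "
        "Replace example-specific values with placeholders such as {query}.\n\n"
        + joined
    )
```

The prompt returned by `build_induction_prompt` would be sent to whichever LLM is in use; parsing its reply into `Workflow` objects is left out because the reply format depends on the prompt actually used.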
Key Findings
AWM raises overall task success rate on WebArena by 12.0 absolute points (51.1% relative) over a strong autonomous baseline.
AWM improves step success rate on Mind2Web by 8.9 absolute points (24.6% relative) by extracting reusable sub-routines.
AWM reduces the number of actions needed to reach goals.
Online AWM generalizes better as train-test distribution gaps grow.
Results
Total task success rate (WebArena, gpt-4)
Step success rate (Mind2Web cross-task, gpt-4)
Accuracy
Average steps per successful task (WebArena, gpt-4)
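For readers reproducing these metrics on their own logs, the sketch below computes a task success rate and the average number of steps over successful tasks only. The run format (a success flag plus a step count) is an assumption for illustration; the benchmarks define their own logging and scoring.

```python
from typing import List, Tuple


def summarize(runs: List[Tuple[bool, int]]) -> Tuple[float, float]:
    """Return (success rate in percent, mean steps over successful runs).

    Each run is (succeeded, number_of_steps). Failed runs count toward the
    success rate but are excluded from the step average.
    """
    successful_steps = [steps for ok, steps in runs if ok]
    success_rate = 100.0 * len(successful_steps) / len(runs) if runs else 0.0
    if successful_steps:
        avg_steps = sum(successful_steps) / len(successful_steps)
    else:
        avg_steps = float("nan")
    return success_rate, avg_steps


# Made-up example runs, not benchmark data.
rate, steps = summarize([(True, 4), (False, 10), (True, 6), (False, 3)])
```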
What To Try In 7 Days
Log several successful agent runs per site and prompt an LLM to extract common sub-routines as workflows.
Add those workflows to agent system prompts and re-run a small test set to measure task- and step-level success.
Run AWM online (streaming) on production queries to adapt to site drift without collecting new annotated data.
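The online (streaming) variant described in the steps above can be sketched as a loop that only learns from runs an evaluator judges successful. The three callables are placeholders for the agent, the evaluator, and the LLM-based induction step; their names and signatures are assumptions for this example rather than the released API.

```python
from typing import Callable, List

Trajectory = List[str]


def run_online(
    queries: List[str],
    run_agent: Callable[[str, List[str]], Trajectory],
    judge_success: Callable[[str, Trajectory], bool],
    induce: Callable[[str, Trajectory], List[str]],
) -> List[str]:
    """Process queries in order, growing workflow memory from judged successes."""
    memory: List[str] = []
    for query in queries:
        trajectory = run_agent(query, memory)  # agent sees the current memory
        if not judge_success(query, trajectory):
            continue  # never induce workflows from runs judged as failures
        for workflow in induce(query, trajectory):
            if workflow not in memory:  # skip exact duplicates
                memory.append(workflow)
    return memory


# Stub components so the loop can be exercised without an LLM or browser.
def demo_agent(query: str, memory: List[str]) -> Trajectory:
    return [f"step for {query}"]


def demo_judge(query: str, trajectory: Trajectory) -> bool:
    return query.startswith("ok")


def demo_induce(query: str, trajectory: Trajectory) -> List[str]:
    return [f"workflow from: {query}"]


memory = run_online(
    ["ok search", "bad checkout", "ok search"], demo_agent, demo_judge, demo_induce
)
```

Because memory only grows from runs the evaluator accepts, the quality of `judge_success` bounds the quality of everything stored.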
Agent Features
Memory
- textual workflow memory (offline induced or streamed online)
Planning
- observe-act loop
- callable workflow actions (AWM AS)
Tool Use
- primitive web actions (CLICK, TYPE, select, fill)
- workflow actions as high-level tools
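To make the idea of a workflow action concrete, the sketch below wraps a fixed sequence of primitive actions in a single callable that fills in placeholders. The tuple layout, the element ids, and the `make_workflow_action` helper are hypothetical and do not mirror any specific framework's API.

```python
from typing import Callable, List, Tuple

# (primitive name, target element id, text argument); layout is an assumption.
Action = Tuple[str, str, str]


def make_workflow_action(template: List[Action]) -> Callable[..., List[Action]]:
    """Turn a step template into a callable that expands to primitive actions."""
    def workflow_action(**args: str) -> List[Action]:
        # Fill {placeholders} in the text field of every step.
        return [(op, target, text.format(**args)) for op, target, text in template]
    return workflow_action


search_product = make_workflow_action([
    ("CLICK", "search-box", ""),
    ("TYPE", "search-box", "{query}"),
    ("CLICK", "search-button", ""),
])

actions = search_product(query="usb-c cable")
for op, target, text in actions:
    print(op, target, text)
```

The expansion is a fixed list decided before execution, so a workflow built this way cannot react to what the page shows between steps.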
Frameworks
- BrowserGym
- AutoEval
Is Agentic
true
Architectures
- LM-based agent (text backbone)
Optimization Features
Token Efficiency
- Abstract workflows reduce example-specific context to save prompt space
Reproducibility
Data Urls
- WebArena (Zhou et al., 2024)
- Mind2Web (Deng et al., 2023)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Online induction depends on correct automatic evaluation; wrong judgments can create harmful workflows (§2.3, §3.2).
- Workflows can bias actions and lower action F1 when the workflow is not perfectly relevant to the current state (§3.2.1).
- Workflow actions are brittle under dynamic intermediate states (pop-up choices), limiting callable-action utility (§5).
- Combining offline and online workflows can conflict and reduce utility; not strictly additive (§C).
- Adding both NL descriptions and filtered HTML increases context length and can harm performance; filtered HTML misses correct elements 47% of the time (§4.3).
When Not To Use
- When tasks require highly interactive decision-making at each step (dynamic pop-ups or unpredictable intermediate choices).
- When you cannot obtain any successful trajectories or a reasonably accurate automatic evaluator.
- When prompt context budget is extremely tight and induced workflows cannot be compressed.
Failure Modes
- Incorrectly induced workflows (from mis-evaluated runs) degrade downstream performance (§3.2.2).
- Workflow actions are rigid and may skip necessary branching, causing failures in dynamic environments (§5).
- Long combined context (NL + HTML + many workflows) overwhelms the LLM and reduces accuracy (§4.3).
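One simple guard against the long-context failure mode is to cap how much workflow text goes into the prompt. The sketch below is an illustrative assumption, not something the paper prescribes: it uses a crude whitespace word count in place of a real tokenizer and keeps workflows greedily in their given order.

```python
from typing import List


def select_within_budget(workflows: List[str], max_tokens: int) -> List[str]:
    """Greedily keep workflows in order, skipping any that would exceed the budget."""
    selected: List[str] = []
    used = 0
    for workflow in workflows:
        cost = len(workflow.split())  # rough stand-in for a tokenizer count
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; try the next one
        selected.append(workflow)
        used += cost
    return selected


kept = select_within_budget(["a b c", "d e f g h", "i j"], max_tokens=5)
```

Ordering the input by relevance to the current task before calling this function would make the greedy choice more useful; how to score relevance is left open here.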
Core Entities
Models
- gpt-4
- gpt-3.5-turbo
Metrics
- task success rate
- step success rate
- Accuracy
- action F1
- average # steps
Datasets
- WebArena
- Mind2Web
Benchmarks
- WebArena
- Mind2Web
Context Entities
Models
- AutoEval (neural evaluator)
Metrics
- coverage
- workflow utility rate
- function overlap
Datasets
- WebArena cross-template subset
- Mind2Web cross-task / cross-website / cross-domain splits
Benchmarks
- WebArena (execution-based)
- Mind2Web (step-level evaluation)

