Overview
The method is simple and demonstrated on two public benchmarks using GPT models; gains are clear but rely on LLM quality and evaluator correctness.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 45%
Production readiness: 60%
Novelty: 65%
Why It Matters For Business
Inducing and reusing compact workflows turns past agent traces into practical, reusable skills that increase success rates and reduce execution steps on web automation tasks, saving time and API costs.
Summary TLDR
The paper introduces Agent Workflow Memory (AWM): a lightweight system that extracts reusable sub-routines (workflows) from past agent trajectories and injects them into an agent's memory. AWM works offline (from training examples) or online (from successful test-time runs, as judged by an evaluator). On two web-navigation benchmarks it raises success rates substantially while reducing the number of action steps: on WebArena, task success rises by 12.0 points absolute (+51.1% relative) over a strong baseline; on Mind2Web, step success rate rises by 8.9 points absolute (+24.6% relative). The method is implementation-light: workflows are text/code snippets stored in memory, induced by prompting an LLM, and used either as context or as callable high-level
Problem Statement
Current LM-based agents struggle on long, multi-step web tasks because they do not learn reusable routines between tasks and thus cannot adapt over time. The paper aims to extract common sub-routines from past trajectories and store them as workflows in agent memory so future tasks run faster and more reliably.
Main Contribution
Agent Workflow Memory (AWM): a method that induces reusable sub-routines (workflows) from past trajectories and adds them to agent memory.
Supports both offline induction from annotated examples and online, supervision-free induction from judged successful runs.
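The two induction modes can be pictured with a minimal sketch. Everything here is illustrative rather than the paper's implementation: `Trajectory`, `WorkflowMemory`, and `toy_induce` are hypothetical names, and `toy_induce` is a stand-in for the LLM prompt that would actually induce workflows.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    task: str
    actions: List[str]


def toy_induce(trajectories: List[Trajectory]) -> List[str]:
    # Stand-in for an LLM call (assumption): treat the first two actions
    # of each trajectory as one reusable sub-routine.
    return [" -> ".join(t.actions[:2]) for t in trajectories]


@dataclass
class WorkflowMemory:
    induce: Callable[[List[Trajectory]], List[str]]
    workflows: List[str] = field(default_factory=list)

    def add_offline(self, annotated: List[Trajectory]) -> None:
        # Offline mode: induce from annotated training examples.
        self._merge(self.induce(annotated))

    def add_online(self, run: Trajectory, judged_success: bool) -> None:
        # Online mode: induce only from runs the evaluator judged successful.
        if judged_success:
            self._merge(self.induce([run]))

    def _merge(self, new_workflows: List[str]) -> None:
        # Skip exact duplicates so memory stays compact.
        for workflow in new_workflows:
            if workflow not in self.workflows:
                self.workflows.append(workflow)

    def as_context(self) -> str:
        # Text block that could be placed in the agent's prompt.
        return "\n".join(self.workflows)
```

In this sketch the only difference between the two modes is where the trajectories come from and whether an evaluator verdict gates them.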
Key Findings
AWM raises overall success rate on WebArena versus a strong autonomous baseline.
AWM improves step-wise success on Mind2Web by extracting reusable sub-routines.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Total task success rate (WebArena, gpt-4) | 35.5% (AWM) | 23.5% (BrowserGym) | +12.0 abs (+51.1% rel) | WebArena (overall) | Table 1; §3.1.1 | Table 1 |
| Step success rate (Mind2Web cross-task, gpt-4) | 45.1% (AWM) | 36.2% (MindAct) | +8.9 abs (+24.6% rel) | Mind2Web cross-task | Table 3; §3.2.1 | Table 3 |
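The Delta column follows from the two percentages in each row. A small check, using only the values in the table (the helper name `gains` is ours):

```python
from typing import Tuple


def gains(method: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = method - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


webarena = gains(35.5, 23.5)   # AWM vs. BrowserGym
mind2web = gains(45.1, 36.2)   # AWM vs. MindAct
print(webarena, mind2web)
```

The relative figure is the absolute gain divided by the baseline, which is why a 12.0-point gain on a 23.5% baseline reads as roughly +51%.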
What To Try In 7 Days
Log several successful agent runs per site and prompt an LLM to extract common sub-routines as workflows.
Add those workflows to agent system prompts and re-run a small test set to measure task- and step-level success.
Run AWM online (streaming) on production queries to adapt to site drift without collecting new annotated data.
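The first two steps could start from something like the sketch below. It is a sketch under assumptions: the prompt wording, the `task`/`actions` log fields, and all function names are hypothetical, and the LLM call itself is left out so the example stays self-contained.

```python
from typing import Dict, List


def build_induction_prompt(site: str, runs: List[Dict]) -> str:
    """Format logged successful runs for one site into an induction prompt."""
    lines = [
        f"Website: {site}",
        "Below are successful agent runs. Extract the sub-routines they share",
        "and write each one as a short, reusable workflow.",
    ]
    for i, run in enumerate(runs, start=1):
        lines.append(f"Run {i}: {run['task']}")
        lines.extend(f"  {action}" for action in run["actions"])
    return "\n".join(lines)


def augment_system_prompt(base: str, workflows: List[str]) -> str:
    """Append induced workflows to the agent's system prompt."""
    if not workflows:
        return base
    return base + "\n\nWorkflows:\n" + "\n".join(f"- {w}" for w in workflows)


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of successful tasks (or steps) in a small test set."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Comparing `success_rate` on the same test set with and without the augmented prompt gives a first read on whether the induced workflows help.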
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Token Efficiency
Risks & Boundaries
Limitations
Online induction depends on correct automatic evaluation; wrong judgments can create harmful workflows (§2.3, §3.2).
Workflows can bias actions and lower action F1 when the workflow is not perfectly relevant to the current state (§3.2.1).
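One way to reduce exposure to the first limitation is to gate online induction on more than a bare success verdict. The sketch below is a hypothetical guard, not something the paper describes: it assumes the evaluator also reports a confidence score, and `min_confidence` is an arbitrary threshold.

```python
from typing import List, Tuple

# (trajectory id, judged success, evaluator confidence in [0, 1]);
# the confidence field is an assumption about the evaluator's output.
JudgedRun = Tuple[str, bool, float]


def select_runs_for_induction(
    runs: List[JudgedRun], min_confidence: float = 0.9
) -> List[str]:
    """Keep only runs judged successful with high evaluator confidence."""
    return [
        run_id
        for run_id, judged_success, confidence in runs
        if judged_success and confidence >= min_confidence
    ]
```

A stricter threshold trades fewer induced workflows for a lower chance of inducing one from a mis-evaluated run.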
When Not To Use
When tasks require highly interactive decision-making at each step (dynamic pop-ups or unpredictable intermediate choices).
When you cannot obtain any successful trajectories or a reasonably accurate automatic evaluator.
Failure Modes
Incorrectly induced workflows (from mis-evaluated runs) degrade downstream performance (§3.2.2).
Workflow actions are rigid and may skip necessary branching, causing failures in dynamic environments (§5).
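The rigidity failure mode can be illustrated with a toy executor that checks the page state before each workflow step and hands control back when the state is not what the workflow expects. This is an illustrative mitigation under assumed names (`State`, `Step`, `run_workflow`), not the paper's mechanism.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]   # which page elements are currently present
Step = Tuple[str, str]    # (element that must be present, action to take)


def run_workflow(
    steps: List[Step],
    state: State,
    act: Callable[[State, str], State],
) -> Tuple[bool, List[str]]:
    """Run steps in order; stop early if a required element is missing.

    Returns (completed, actions executed so far) so the caller can fall
    back to step-by-step decisions from the point of failure.
    """
    executed: List[str] = []
    for required, action in steps:
        if not state.get(required, False):
            return False, executed
        state = act(state, action)
        executed.append(action)
    return True, executed
```

Without the per-step check, the workflow would keep issuing actions on a page that no longer matches its assumptions, which is the branching failure described above.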

