Overview
The approach is practical: it uses an LLM retriever and a replay graph to reduce live interactions and boost reliability, but results come from one benchmark and one LLM, so expect variable gains when porting to different domains or languages.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Caching past page states and corrective traces cuts live web interactions and increases task reliability, lowering latency and operational cost for automated customer-service or data-extraction agents.
Summary TLDR
R2D2 is a system for web agents that builds a directed replay graph of past page observations (Remember) and stores corrected partial trajectories with explanations (Reflect). At inference time, it retrieves relevant corrected traces as in-context demos and runs an LLM-guided A* search over the replay graph, which reduces unnecessary online steps. On the WebArena benchmark, R2D2 achieves a 27.3% overall success rate and cuts average online steps to 13.1, while substantially reducing navigation failures. The method improves reliability when agents face repeated or similar web tasks, but it relies on offline memory construction and an LLM retriever.
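The search step described above can be illustrated with a minimal A* sketch over a small replay graph. This is not the paper's implementation: the graph layout, the `a_star` function, the page names, and the `estimated_distance` table are all hypothetical. In particular, the heuristic here is a hand-written lookup standing in for whatever LLM-based relevance estimate the system actually uses.

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical replay graph: page id -> list of (action, next page id).
Graph = Dict[str, List[Tuple[str, str]]]


def a_star(
    graph: Graph,
    start: str,
    is_goal: Callable[[str], bool],
    heuristic: Callable[[str], float],
) -> Optional[List[str]]:
    """Return the actions leading from `start` to a goal page, or None."""
    # Frontier entries: (priority, tie_breaker, page, actions_so_far).
    frontier = [(heuristic(start), 0, start, [])]
    best_cost = {start: 0}
    counter = 1  # unique tie-breaker so pages/actions are never compared
    while frontier:
        _, _, page, actions = heapq.heappop(frontier)
        if is_goal(page):
            return actions
        for action, nxt in graph.get(page, []):
            cost = len(actions) + 1  # unit cost per action
            if cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = cost
                priority = cost + heuristic(nxt)
                heapq.heappush(frontier, (priority, counter, nxt, actions + [action]))
                counter += 1
    return None


# Toy graph with made-up pages and actions.
graph: Graph = {
    "home": [("click:Catalog", "catalog"), ("click:Orders", "orders")],
    "catalog": [("click:Products", "products")],
    "orders": [("click:Order #1", "order_detail")],
    "products": [],
    "order_detail": [],
}

# Assumption: a fixed table stands in for an LLM's estimate of how far
# each page is from satisfying the task.
estimated_distance = {"home": 2, "catalog": 3, "orders": 1, "products": 4, "order_detail": 0}

plan = a_star(
    graph,
    "home",
    is_goal=lambda page: page == "order_detail",
    heuristic=lambda page: estimated_distance.get(page, 1),
)
print(plan)
```

Because the search runs entirely over stored pages, no live browser step is needed until the plan is executed, which is the intuition behind the reduced online step count.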
Problem Statement
Current LLM-driven web agents forget trajectories and treat web navigation as an unknown process. This causes frequent navigation failures and repeated exploration. The paper asks: can we store and reuse past experiences to turn web navigation into a known search problem and then reflect on execution mistakes to improve future actions?
Main Contribution
A Remember paradigm: build a directed replay graph of observed web pages and actions to reconstruct environment structure.
A Reflect paradigm: truncate failed trajectories at the first error, generate corrective rationales, and store the corrected partial trajectories in a reflective key-value memory.
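The two paradigms can be pictured as two small data structures. The sketch below is illustrative only: `Step`, `Reflection`, `build_replay_graph`, `truncate_at_first_error`, and `reflect` are hypothetical names, the `ok` flag assumes some external judge has already marked the first erroneous step, and the rationale is passed in by hand rather than generated by an LLM.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Step:
    page: str        # identifier of the observed page
    action: str      # action taken on that page
    ok: bool = True  # hypothetical flag: False marks a step judged erroneous


def build_replay_graph(trajectories: List[List[Step]]) -> Dict[str, List[Tuple[str, str]]]:
    """Remember: merge logged trajectories into page -> [(action, next_page)]."""
    graph: Dict[str, List[Tuple[str, str]]] = {}
    for traj in trajectories:
        for cur, nxt in zip(traj, traj[1:]):
            edges = graph.setdefault(cur.page, [])
            edge = (cur.action, nxt.page)
            if edge not in edges:  # avoid duplicate edges across trajectories
                edges.append(edge)
        if traj:
            graph.setdefault(traj[-1].page, [])  # keep terminal pages as nodes
    return graph


def truncate_at_first_error(traj: List[Step]) -> List[Step]:
    """Keep only the steps that precede the first erroneous one."""
    for i, step in enumerate(traj):
        if not step.ok:
            return traj[:i]
    return traj


@dataclass
class Reflection:
    prefix: List[Step]     # correct part of the trajectory
    rationale: str         # why the next step was wrong
    corrected_action: str  # what to do instead


# Reflect: key-value memory keyed by task description (an assumed key choice).
reflective_memory: Dict[str, Reflection] = {}


def reflect(task: str, traj: List[Step], rationale: str, corrected_action: str) -> Reflection:
    entry = Reflection(truncate_at_first_error(traj), rationale, corrected_action)
    reflective_memory[task] = entry
    return entry


# Made-up failed trajectory: the second step is the first error.
failed = [
    Step("home", "click:Orders"),
    Step("orders", "click:Invoices", ok=False),
    Step("invoices", "stop"),
]
graph = build_replay_graph([failed])
entry = reflect(
    "check the status of order 1",
    failed,
    rationale="Order status is on the order detail page, not under Invoices.",
    corrected_action="click:Order #1",
)
print(graph)
print(entry)
```

Note that the graph keeps every observed transition, including the erroneous one, since it records the structure of the site rather than the quality of the decision; only the reflective memory distinguishes good steps from bad ones.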
Key Findings
R2D2 improves overall task success on WebArena compared to baselines.
R2D2 reduces the average number of online steps required to complete tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Total success rate (SR) | 27.3% (R2D2) | Tree-Search 19.0%, ReAct 13.1% | ↑ 8.3 pp vs Tree-Search | WebArena (all domains) | Table 1 reports SR across domains. | Table 1 |
| Domain SR (example: CMS) | 30% (R2D2) | Tree-Search 17% | ↑ 13 pp vs Tree-Search | WebArena - CMS subset | Table 1 reports per-domain SR. | Table 1 |
What To Try In 7 Days
Log and index agent trajectories as a directed graph for repeated domains.
Store truncated failed trajectories with short corrective notes as key-value entries.
Use a dense retriever (MiniLM) to fetch relevant past traces as in-context demos before taking online steps.
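As a starting point for the retrieval step, the sketch below fetches stored traces by similarity to the current task and prepends them as demos. To stay self-contained it uses a toy bag-of-words cosine similarity, which is an assumption standing in for a real dense encoder such as MiniLM; `embed`, `retrieve`, `build_prompt`, and the memory contents are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a dense encoder such as MiniLM."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, memory: Dict[str, str], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k memory keys most similar to the query, with scores."""
    q = embed(query)
    scored = [(key, cosine(q, embed(key))) for key in memory]
    scored.sort(key=lambda kv: kv[1], reverse=True)
    return scored[:k]


def build_prompt(query: str, memory: Dict[str, str], k: int = 2) -> str:
    """Prepend retrieved traces as in-context demos, skipping zero-score hits."""
    demos = [memory[key] for key, score in retrieve(query, memory, k) if score > 0]
    return "\n\n".join(demos + [f"Task: {query}"])


# Made-up memory: task description -> corrected trace with a short note.
memory = {
    "check the status of an order": "Demo: open Orders, then the order detail page.",
    "update a product price": "Demo: open Catalog > Products, edit the price field.",
    "post a comment on a forum thread": "Demo: open the thread, use the comment box.",
}

query = "what is the status of order 1"
top = retrieve(query, memory, k=1)
prompt = build_prompt(query, memory)
print(top)
print(prompt)
```

Swapping `embed` and `cosine` for real sentence embeddings should leave `retrieve` and `build_prompt` unchanged, so the toy version is enough to test the logging and prompt-assembly plumbing first.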
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Experiments only in English; multilingual behavior unknown (§Limitations).
Single benchmark (WebArena) and a single main LLM (gpt-4o) limit generality (§Limitations).
When Not To Use
Environments with one-off pages and no repeated queries (no replay benefit).
Very low-resource settings where LLM costs make offline indexing expensive.
Failure Modes
Pessimistic reflection: the reflection step can cause the agent to give up prematurely (30.3% of execution errors, Appx. A).
GUI misunderstanding: agent misidentifies or mis-clicks UI elements (24.2%, Appx. A).

