Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Caching past page states and corrective traces cuts live web interactions and increases task reliability, lowering latency and operational cost for automated customer-service or data-extraction agents.
Summary TLDR
R2D2 is a system for web agents that builds a directed replay graph of past page observations (Remember) and stores corrected partial trajectories with explanations (Reflect). At inference it retrieves relevant corrected traces as in-context demos and runs an LLM-guided A* search over the replay graph, which reduces unnecessary online steps. On the WebArena benchmark R2D2 achieves a 27.3% overall success rate and cuts average online steps to 13.1, while substantially reducing navigation failures. The method improves reliability when agents face repeated or similar web tasks but relies on offline memory construction and an LLM retriever.
Problem Statement
Current LLM-driven web agents forget trajectories and treat web navigation as an unknown process. This causes frequent navigation failures and repeated exploration. The paper asks: can we store and reuse past experiences to turn web navigation into a known search problem and then reflect on execution mistakes to improve future actions?
Main Contribution
A Remember paradigm: build a directed replay graph of observed web pages and actions to reconstruct environment structure.
A Reflect paradigm: truncate failed trajectories at first error, generate corrective rationales, and store corrected partial trajectories in a reflective key-value memory.
An inference pipeline that retrieves corrected trajectories as in-context demonstrations and runs LLM-guided A* search on the replay graph to produce efficient, higher-quality trajectories.
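The Remember and Reflect steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the observation ids, the `(observation, action, next_observation)` step format, and the `is_error` callback are all assumptions made for the example, and the choice to keep only the steps before the first error is one plausible reading of "truncate at first error".

```python
def build_replay_graph(trajectories):
    """Merge trajectories into a directed graph: node -> {action: next_node}.

    Each trajectory is a list of (observation_id, action, next_observation_id)
    steps. Observation ids are hypothetical stand-ins for hashed page states.
    """
    graph = {}
    for trajectory in trajectories:
        for obs, action, next_obs in trajectory:
            graph.setdefault(obs, {})[action] = next_obs
            graph.setdefault(next_obs, {})  # make sure sink pages are nodes too
    return graph


def truncate_at_first_error(trajectory, is_error):
    """Return the prefix of a trajectory before the first step flagged as an error.

    `is_error` is a hypothetical callback; in practice the judgment would
    likely come from an LLM or a task-specific checker.
    """
    for index, step in enumerate(trajectory):
        if is_error(step):
            return list(trajectory[:index])
    return list(trajectory)


# Example usage with made-up page names.
good = [("home", "click_orders", "orders"), ("orders", "open_latest", "order_detail")]
bad = [("home", "click_search", "search"), ("search", "click_wrong_tab", "settings")]

graph = build_replay_graph([good, bad])
prefix = truncate_at_first_error(bad, lambda step: step[1] == "click_wrong_tab")
print(graph["home"])  # both outgoing actions observed from the home page
print(prefix)         # only the step taken before the mistake
```

Merging every trajectory, successful or not, into one graph keeps the structural knowledge from failed episodes, while the truncated prefix is what a corrective rationale would be attached to.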
Key Findings
R2D2 improves overall task success on WebArena compared to baselines.
R2D2 reduces the average number of online steps required to complete tasks.
Navigation errors are the dominant source of failures for vanilla agents, and R2D2 cuts them substantially.
Learning only from successful trajectories hurts performance.
Results
Total success rate (SR)
Domain SR (example: CMS)
Average online steps per successful task
Effect of using only successful trajectories
Who Should Care
What To Try In 7 Days
Log and index agent trajectories as a directed graph for repeated domains.
Store truncated failed trajectories with short corrective notes as key-value entries.
Use a dense retriever (MiniLM) to fetch relevant past traces as in-context demos before taking online steps.
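A rough starting point for the second and third steps is a small key-value store that maps task descriptions to corrected partial trajectories and returns the closest matches for a new task. The sketch below is self-contained and uses a toy bag-of-words vector with cosine similarity purely as a stand-in for a dense encoder such as MiniLM; the class name, field names, and example tasks are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a dense encoder such as MiniLM."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ReflectiveMemory:
    """Key-value memory: task-description vector -> corrected trajectory plus note."""

    def __init__(self):
        self.entries = []

    def add(self, task, corrected_steps, note):
        self.entries.append(
            {"key": embed(task), "task": task, "steps": corrected_steps, "note": note}
        )

    def retrieve(self, query, k=2):
        """Return the k entries whose task description is most similar to the query."""
        query_vector = embed(query)
        ranked = sorted(
            self.entries, key=lambda e: cosine(query_vector, e["key"]), reverse=True
        )
        return ranked[:k]


# Example usage with made-up tasks and notes.
memory = ReflectiveMemory()
memory.add(
    "find the latest order in the shopping admin",
    ["click_orders", "sort_by_date"],
    "Open the Orders page first; the dashboard search does not list orders.",
)
memory.add(
    "post a comment on a reddit thread",
    ["open_thread", "type_comment", "click_post"],
    "Scroll to the comment box before typing.",
)

demos = memory.retrieve("find the most recent order", k=1)
print(demos[0]["task"], "->", demos[0]["note"])
```

Swapping `embed` for a real sentence encoder should only require changing that one function and the similarity computation; the retrieved `steps` and `note` fields are what would be formatted into the prompt as in-context demonstrations.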
Agent Features
Memory
- Replay buffer graph (nodes=observations, edges=actions)
- Reflective key-value memory (query vectors → corrected trajectories)
Planning
- A* best-first search with LLM heuristic
- Ranked trajectory selection by LLM
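To make the planning component concrete, here is a minimal A* sketch over a replay graph in which the heuristic is a pluggable function. In the described system that heuristic comes from an LLM; here a hand-written lambda stands in for it, and the graph, page names, and uniform step cost are assumptions for illustration. Because an LLM-derived heuristic is not guaranteed to be admissible, the returned path should be treated as good rather than provably shortest.

```python
import heapq
from itertools import count


def a_star(graph, start, is_goal, heuristic, step_cost=1.0):
    """Best-first search over graph (node -> {action: next_node}).

    Returns the list of actions leading from start to a goal node,
    or None if no goal node is reachable.
    """
    tie_breaker = count()  # keeps heap comparisons away from nodes and lists
    frontier = [(heuristic(start), next(tie_breaker), 0.0, start, [])]
    best_cost = {start: 0.0}
    while frontier:
        _, _, cost, node, actions = heapq.heappop(frontier)
        if is_goal(node):
            return actions
        if cost > best_cost.get(node, float("inf")):
            continue  # stale entry; a cheaper path to this node was found
        for action, next_node in graph.get(node, {}).items():
            new_cost = cost + step_cost
            if new_cost < best_cost.get(next_node, float("inf")):
                best_cost[next_node] = new_cost
                priority = new_cost + heuristic(next_node)
                heapq.heappush(
                    frontier,
                    (priority, next(tie_breaker), new_cost, next_node, actions + [action]),
                )
    return None


# Example usage with a made-up replay graph.
replay_graph = {
    "home": {"click_orders": "orders", "click_search": "search"},
    "orders": {"open_latest": "order_detail"},
    "search": {},
    "order_detail": {},
}

# Stand-in for an LLM score: lower means "looks closer to the task goal".
def toy_heuristic(node):
    return 0.0 if node == "order_detail" else 1.0

plan = a_star(replay_graph, "home", lambda n: n == "order_detail", toy_heuristic)
print(plan)
```

Since the search runs entirely over cached nodes, no live page loads are needed until the chosen action sequence is executed, which is where the reduction in online steps would come from.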
Tool Use
- Dense retrieval for memory lookup
- In-context demonstrations (retrieved trajectories)
- Replay buffer for cached navigation
Frameworks
- ReAct (used as the base agent)
- retriv (indexing/retrieval)
Is Agentic
true
Architectures
- LLM-guided A* search
- Directed replay graph + key-value reflective store
Collaboration
- Single-agent system using past episodes
Optimization Features
System Optimization
- Eviction policy to bound replay buffer size
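The summary does not say which eviction policy is used, so the following is only one plausible option: a least-recently-used (LRU) bound on the number of cached page nodes. The class and method names are hypothetical.

```python
from collections import OrderedDict


class BoundedReplayBuffer:
    """Caches page nodes (observation id -> outgoing edges) with LRU eviction.

    LRU is an assumption made for this sketch, not the documented policy.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._pages = OrderedDict()

    def put(self, obs_id, edges):
        """Insert or update a node, then evict the least recently used if over capacity."""
        self._pages[obs_id] = edges
        self._pages.move_to_end(obs_id)
        while len(self._pages) > self.capacity:
            self._pages.popitem(last=False)

    def get(self, obs_id):
        """Return the cached edges and mark the node as recently used."""
        if obs_id not in self._pages:
            return None
        self._pages.move_to_end(obs_id)
        return self._pages[obs_id]

    def __contains__(self, obs_id):
        return obs_id in self._pages

    def __len__(self):
        return len(self._pages)


# Example usage with made-up page names.
buffer = BoundedReplayBuffer(capacity=2)
buffer.put("home", {"click_orders": "orders"})
buffer.put("orders", {"open_latest": "order_detail"})
buffer.get("home")                    # touch "home" so "orders" becomes the oldest
buffer.put("search", {})              # exceeds capacity and evicts "orders"
print("orders" in buffer, "home" in buffer)
```

One caveat of evicting nodes independently is that remaining edges may point to pages that are no longer cached, so a search over the graph needs to tolerate missing nodes, for example by treating them as requiring a live page load.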
Inference Optimization
- Reduce online web steps via cached replay buffer
- LLM used sparingly for heuristics and ranking
Reproducibility
Data Urls
- WebArena (Zhou et al., 2024b)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments only in English; multilingual behavior unknown (§Limitations).
- Single benchmark (WebArena) and a single main LLM (gpt-4o) limit generality (§Limitations).
- Memory relies on relatively stable page structure; highly dynamic sites may invalidate cached graphs.
When Not To Use
- Environments with one-off pages and no repeated queries (no replay benefit).
- Very low-resource settings where LLM costs make offline indexing expensive.
- Highly dynamic websites where prior page states quickly become stale.
Failure Modes
- Pessimistic reflection: reflection can prematurely give up (30.3% of execution errors, Appx. A).
- GUI misunderstanding: agent misidentifies or mis-clicks UI elements (24.2%, Appx. A).
- Difficulty executing complex multi-step plans even after reaching target pages (20.2%, Appx. A).
Core Entities
Models
- gpt-4o
Metrics
- Success Rate (SR)
- Navigation error rate
- Average online steps
Datasets
- WebArena
Benchmarks
- WebArena
Context Entities
Models
- ReAct (baseline)
- GPT-4o
Metrics
- SR by domain (CMS, Reddit, Shopping, Map, GitLab)
- Steps per successful task
Datasets
- WebArena (Zhou et al., 2024b)
Benchmarks
- WebArena

