Use a replay graph + reflective memory to turn web navigation from guesswork into a searchable map.

January 21, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Tenghao Huang, Kinjal Basu, Ibrahim Abdelaziz, Pavan Kapanipathi, Jonathan May, Muhao Chen

Why It Matters For Business

Caching past page states and corrective traces cuts live web interactions and increases task reliability, lowering latency and operational cost for automated customer-service or data-extraction agents.

Summary TLDR

R2D2 is a system for web agents that builds a directed replay graph of past page observations (Remember) and stores corrected partial trajectories with explanations (Reflect). At inference it retrieves relevant corrected traces as in-context demos, runs an LLM-guided A* search over the replay graph, and avoids unnecessary online steps. On the WebArena benchmark, R2D2 achieves a 27.3% overall success rate and cuts average online steps to 13.1, while substantially reducing navigation failures. The method improves reliability when agents face repeated or similar web tasks, but it relies on offline memory construction and an LLM retriever.

Problem Statement

Current LLM-driven web agents forget trajectories and treat web navigation as an unknown process. This causes frequent navigation failures and repeated exploration. The paper asks: can we store and reuse past experiences to turn web navigation into a known search problem and then reflect on execution mistakes to improve future actions?

Main Contribution

A Remember paradigm: build a directed replay graph of observed web pages and actions to reconstruct environment structure.

A Reflect paradigm: truncate failed trajectories at first error, generate corrective rationales, and store corrected partial trajectories in a reflective key-value memory.

An inference pipeline that retrieves corrected trajectories as in-context demonstrations and runs LLM-guided A* search on the replay graph to produce efficient, higher-quality trajectories.
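
The Reflect step listed above (truncate at the first error, attach a corrective rationale, store the result under a key) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `ReflectiveMemory`, `add_failure`, `is_error`, and `explain` are hypothetical names, and the two callables stand in for whatever LLM calls detect the first error and write the rationale. Keeping only the steps before the erroneous one is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str]  # (observation, action)


@dataclass
class ReflectiveMemory:
    """Key-value store mapping a task query to a corrected partial trajectory."""

    entries: Dict[str, Tuple[List[Step], str]] = field(default_factory=dict)

    def add_failure(
        self,
        query: str,
        trajectory: List[Step],
        is_error: Callable[[Step], bool],
        explain: Callable[[Step], str],
    ) -> Optional[int]:
        """Truncate at the first erroneous step and store prefix + rationale.

        Returns the index of the first error, or None if no error was found.
        """
        for i, step in enumerate(trajectory):
            if is_error(step):
                prefix = trajectory[:i]    # steps before the first error
                rationale = explain(step)  # hypothetical LLM-written note
                self.entries[query] = (prefix, rationale)
                return i
        return None
```

In use, a failed run would be passed in with its task description as the key, so that a later, similar task can retrieve both the safe prefix and the note on what went wrong.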

Key Findings

R2D2 improves overall task success on WebArena compared to baselines.

Numbers: Total SR R2D2 27.3% vs Tree-Search 19.0% (Table 1)

R2D2 reduces the average number of online steps required to complete tasks.

Numbers: Avg steps R2D2 13.1 vs Tree-Search 33.8 (Table 2)

Navigation errors are the dominant source of failures for vanilla agents and R2D2 cuts them substantially.

Numbers: About 60% of the vanilla agent's errors are navigation failures; about 75% of R2D2's early gain is attributed to fixing navigation (Fig. 6, §4.3)

Learning only from successful trajectories hurts performance.

Numbers: The variant trained only on successes drops 6.8 percentage points in SR, to 20.5% (§4.3)

Results

Total success rate (SR)

Value: R2D2 27.3%

Baseline: Tree-Search 19.0%, ReACT 13.1%

Domain SR (example: CMS)

Value: CMS 30% (R2D2)

Baseline: Tree-Search 17%

Average online steps per successful task

Value: R2D2 13.1 steps

Baseline: Tree-Search 33.8 steps

Effect of using only successful trajectories

Value: SR drops to 20.5%

Baseline: Full R2D2 27.3%
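
To put the reported figures on a common scale, the short calculation below derives the absolute and relative changes from the numbers quoted in this section. Only the input values come from the results above; the helper name `relative_change` is illustrative.

```python
def relative_change(new: float, base: float) -> float:
    """Fractional change of `new` relative to `base`."""
    return (new - base) / base


# Figures as reported in the Results section.
sr_r2d2, sr_tree, sr_success_only = 27.3, 19.0, 20.5
steps_r2d2, steps_tree = 13.1, 33.8

abs_gain = sr_r2d2 - sr_tree                         # about 8.3 percentage points
rel_gain = relative_change(sr_r2d2, sr_tree)         # about 0.44, i.e. 44% relative
step_cut = -relative_change(steps_r2d2, steps_tree)  # about 0.61, i.e. 61% fewer steps
ablation_drop = sr_r2d2 - sr_success_only            # about 6.8 percentage points

print(f"SR gain: {abs_gain:.1f} pts ({rel_gain:.0%} relative)")
print(f"Online steps reduced by {step_cut:.0%}")
print(f"Success-only variant loses {ablation_drop:.1f} pts")
```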

Who Should Care

What To Try In 7 Days

Log and index agent trajectories as a directed graph for repeated domains.

Store truncated failed trajectories with short corrective notes as key-value entries.

Use a dense retriever (MiniLM) to fetch relevant past traces as in-context demos before taking online steps.
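
A first pass at the retrieval step might look like the sketch below. To stay dependency-free it uses a toy bag-of-words `embed` function as a stand-in for a real dense encoder such as MiniLM, so the similarity scores are only illustrative. The names `embed`, `cosine`, `retrieve`, and `build_prompt`, and the prompt layout, are assumptions rather than anything specified by the paper.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    # Toy stand-in for a dense encoder (e.g. MiniLM): bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, memory: Dict[str, str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k stored traces whose keys are most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(key)), trace) for key, trace in memory.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(query: str, memory: Dict[str, str], k: int = 2) -> str:
    """Prepend retrieved traces as in-context demos before the new task."""
    demos = [trace for score, trace in retrieve(query, memory, k) if score > 0]
    return "\n\n".join(demos + [f"Task: {query}"])
```

Swapping `embed` for a real sentence encoder and `memory` for an indexed store should not change the shape of the pipeline: score keys, take the top k, and place the traces ahead of the task in the prompt.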

Agent Features

Memory

  • Replay buffer graph (nodes=observations, edges=actions)
  • Reflective key-value memory (query vectors → corrected trajectories)
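
As a rough picture of the replay buffer graph described above, the sketch below stores observations as nodes and actions as labelled directed edges. It is an assumed minimal structure for illustration: the class name `ReplayGraph`, its methods, and the use of plain strings for observations and actions are all hypothetical.

```python
from typing import Dict, List, Tuple

Transition = Tuple[str, str, str]  # (observation, action, next observation)


class ReplayGraph:
    """Directed graph: nodes are observations, edges are actions."""

    def __init__(self) -> None:
        # observation -> {action: next observation}
        self.edges: Dict[str, Dict[str, str]] = {}

    def add_trajectory(self, steps: List[Transition]) -> None:
        """Merge one logged trajectory into the graph."""
        for obs, action, next_obs in steps:
            self.edges.setdefault(obs, {})[action] = next_obs
            self.edges.setdefault(next_obs, {})  # make sure the target node exists

    def successors(self, obs: str) -> List[Tuple[str, str]]:
        """Known (action, next observation) pairs from a given observation."""
        return list(self.edges.get(obs, {}).items())
```

Merging repeated trajectories is idempotent here, so replaying the same route twice does not grow the graph.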

Planning

  • A* best-first search with LLM heuristic
  • Ranked trajectory selection by LLM
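
The planning items above can be illustrated with a standard A* search in which the heuristic is a pluggable callable. In the sketch, `heuristic` stands in for an LLM call that estimates how far an observation is from the goal; that function, the unit edge cost, and the dictionary graph format are assumptions made for the example, not details taken from the paper.

```python
import heapq
from typing import Callable, Dict, List, Optional

Graph = Dict[str, Dict[str, str]]  # observation -> {action: next observation}


def a_star(
    graph: Graph,
    start: str,
    is_goal: Callable[[str], bool],
    heuristic: Callable[[str], float],
) -> Optional[List[str]]:
    """Return a list of actions from start to a goal node, or None."""
    # Frontier entries: (f = g + h, g, node, actions taken so far).
    frontier = [(heuristic(start), 0, start, [])]
    best_g = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if is_goal(node):
            return path
        if g > best_g.get(node, float("inf")):
            continue  # stale entry; a cheaper route to this node was found
        for action, nxt in graph.get(node, {}).items():
            new_g = g + 1  # assumed unit cost per action
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(
                    frontier, (new_g + heuristic(nxt), new_g, nxt, path + [action])
                )
    return None
```

Because the search runs over the cached graph, each heuristic call costs an LLM query but no live page load, which is where the saving in online steps would come from.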

Tool Use

  • Dense retrieval for memory lookup
  • In-context demonstrations (retrieved trajectories)
  • Replay buffer for cached navigation

Frameworks

  • ReACT (used as base agent)
  • retriv (indexing/retrieval)

Is Agentic

true

Architectures

  • LLM-guided A* search
  • Directed replay graph + key-value reflective store

Collaboration

  • Single-agent system using past episodes

Optimization Features

System Optimization

  • Eviction policy to bound replay buffer size
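
The summary does not say which eviction policy is used, so the sketch below assumes a least-recently-used (LRU) rule purely for illustration. `BoundedReplayBuffer` and its methods are hypothetical names.

```python
from collections import OrderedDict
from typing import Optional


class BoundedReplayBuffer:
    """Page cache with an assumed LRU eviction rule."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._pages: "OrderedDict[str, str]" = OrderedDict()

    def put(self, url: str, observation: str) -> None:
        self._pages[url] = observation
        self._pages.move_to_end(url)  # mark as most recently used
        while len(self._pages) > self.capacity:
            self._pages.popitem(last=False)  # drop least recently used

    def get(self, url: str) -> Optional[str]:
        if url not in self._pages:
            return None
        self._pages.move_to_end(url)  # a read also counts as a use
        return self._pages[url]

    def __len__(self) -> int:
        return len(self._pages)
```

Other policies (for example, age-based expiry for pages that change often) would fit the same interface.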

Inference Optimization

  • Reduce online web steps via cached replay buffer
  • LLM used sparingly for heuristics and ranking

Reproducibility

Data Urls

  • WebArena (Zhou et al., 2024b)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments only in English; multilingual behavior unknown (§Limitations).
  • Single benchmark (WebArena) and a single main LLM (gpt-4o) limit generality (§Limitations).
  • Memory relies on relatively stable page structure; highly dynamic sites may invalidate cached graphs.

When Not To Use

  • Environments with one-off pages and no repeated queries (no replay benefit).
  • Very low-resource settings where LLM costs make offline indexing expensive.
  • Highly dynamic websites where prior page states quickly become stale.

Failure Modes

  • Pessimistic reflection: the agent gives up prematurely after reflecting (30.3% of execution errors, Appx. A).
  • GUI misunderstanding: agent misidentifies or mis-clicks UI elements (24.2%, Appx. A).
  • Difficulty executing complex multi-step plans even after reaching target pages (20.2%, Appx. A).

Core Entities

Models

  • gpt-4o

Metrics

  • Success Rate (SR)
  • Navigation error rate
  • Average online steps

Datasets

  • WebArena

Benchmarks

  • WebArena

Context Entities

Models

  • ReACT (baseline)
  • GPT-4o

Metrics

  • SR by domain (CMS, Reddit, Shopping, Map, GitLab)
  • Steps per successful task

Datasets

  • WebArena (Zhou et al., 2024b)

Benchmarks

  • WebArena