Overview
The method is conceptually simple and integrates with existing LLM/VLM agents. Results are strong on simulators, but the approach requires episode logs, embedding infrastructure, and some environment-specific tuning.
Citations: 5
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 6/6
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
RAP turns past successful runs into reusable context that raises accuracy for multi-step text and embodied agents, reducing trial-and-error and speeding up deployment in web automation and robotic workflows.
Who Should Care
Summary TLDR
RAP stores short episode logs (plans, actions, observations) and retrieves the most relevant snippets to feed as in-context examples to an LLM agent. Retrieval scores combine similarity on the task, the plan, and a generated retrieval key; each retrieved window is centered on the most similar action. RAP improves multi-step text agents and multimodal embodied agents across ALFWorld, WebShop, Franka Kitchen and Meta-World, often substantially outperforming ReAct and other baselines. Memory can be built with one model and used by another.
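The retrieval step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Episode` class, the function names, the unweighted mean of the three similarities, and the `radius` parameter are all assumptions made for the example, and the toy vectors stand in for real sentence or image embeddings.

```python
import math
from dataclasses import dataclass


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Episode:  # hypothetical memory record
    task_vec: list   # embedding of the task description
    plan_vec: list   # embedding of the overall plan
    step_vecs: list  # one embedding per logged step
    steps: list      # text of each logged step


def score_episode(q_task, q_plan, q_key, ep):
    """Return (score, index of the step most similar to the retrieval key)."""
    key_sims = [cosine(q_key, v) for v in ep.step_vecs]
    best = max(range(len(key_sims)), key=key_sims.__getitem__)
    # Assumption: an unweighted mean; the source only says the scores are combined.
    score = (cosine(q_task, ep.task_vec) + cosine(q_plan, ep.plan_vec) + key_sims[best]) / 3
    return score, best


def window(ep, center, radius=1):
    """Steps within `radius` of `center`, clipped to the episode bounds."""
    return ep.steps[max(0, center - radius):min(len(ep.steps), center + radius + 1)]


def retrieve(q_task, q_plan, q_key, memory, k=2, radius=1):
    """Top-k trajectory windows, each centered on its episode's best-matching step."""
    scored = []
    for ep in memory:
        score, center = score_episode(q_task, q_plan, q_key, ep)
        scored.append((score, window(ep, center, radius)))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [w for _, w in scored[:k]]
```

The returned windows would then be formatted into the prompt as in-context examples. Centering the window on the best-matching step keeps the prompt short while still showing what happened just before and after the relevant action.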
Problem Statement
LLM agents struggle to reuse relevant past experiences when planning multi-step actions, especially in multimodal and embodied tasks. There is no unified way to store, retrieve, and integrate multimodal episode memories to improve sequential decision-making.
Main Contribution
RAP: a modular retrieval-augmented planning pipeline (Memory, Reasoner, Retriever, Executor) that stores episodes and retrieves context-aware examples for the agent.
Multimodal memory and retrieval: supports text and image observations via sentence embeddings and CLIP/Vision Transformer features.
Key Findings
RAP raises ALFWorld success from 52.2% (ReAct, GPT-3.5) to 85.8% with GPT-3.5.
RAP with memory built from training tasks reaches 91.0% success on ALFWorld (GPT-3.5).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ALFWorld success rate (All) | 85.8% (RAP, GPT-3.5) | 52.2% (ReAct, GPT-3.5) | +33.6 pp | ALFWorld (134 unseen games) | Table 1 reports RAP 85.8% vs ReAct 52.2% with GPT-3.5 | Table 1 |
| ALFWorld success rate (RAP_train) | 91.0% (RAP_train, GPT-3.5) | 52.2% (ReAct, GPT-3.5) | +38.8 pp | ALFWorld (with memory from training set) | Table 1 shows RAP_train 91.0% | Table 1 |
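
The Delta column is a plain difference in percentage points between each reported value and the ReAct baseline. A small check of that arithmetic, using only the numbers in the table above (the helper name `delta_pp` is illustrative):

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(value_pct - baseline_pct, 1)


baseline = 52.2  # ReAct, GPT-3.5
reported = {"RAP": 85.8, "RAP_train": 91.0}
deltas = {name: delta_pp(value, baseline) for name, value in reported.items()}
print(deltas)
```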
What To Try In 7 Days
Log successful episodes (task, plan, actions, observations) from your agent.
Add a retrieval step that computes similarity across task, plan, and a retrieval key.
Feed top-k nearby trajectory windows as in-context examples and compare success rate vs baseline.
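A minimal sketch of the logging and comparison steps above, assuming a JSON Lines file as the episode store. The `EpisodeLog` fields mirror the list above (task, plan, actions, observations); the class, the helper names, and the `success` flag are hypothetical choices for this example rather than anything specified by the paper.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EpisodeLog:  # hypothetical record layout
    task: str
    plan: str
    actions: list
    observations: list
    success: bool


def to_jsonl_line(ep: EpisodeLog) -> str:
    """Serialize one episode as a single JSON line."""
    return json.dumps(asdict(ep))


def from_jsonl_line(line: str) -> EpisodeLog:
    """Rebuild an episode from a line produced by to_jsonl_line."""
    return EpisodeLog(**json.loads(line))


def keep_successful(episodes):
    """Only successful runs go into memory."""
    return [ep for ep in episodes if ep.success]


def success_rate(outcomes) -> float:
    """Percentage of True values in a list of per-task outcomes (0.0 if empty)."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0
```

To compare against the baseline, run the same held-out tasks with and without retrieval, collect one boolean per task for each condition, and report `success_rate(with_retrieval) - success_rate(baseline)` in percentage points.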
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Token Efficiency
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Requires a corpus of successful episodes; performance drops if similar past examples are missing.
Retrieval quality depends on embedding models and crafted retrieval keys.
When Not To Use
When you have no past successful episodes to store.
When privacy or compliance forbids storing action/observation logs.
Failure Modes
Retrieving irrelevant or misleading episodes that prompt incorrect actions.
Overfitting to common trajectories in memory and failing novel cases.

