Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
5
Why It Matters For Business
RAP turns past successful runs into reusable context that raises accuracy for multi-step text and embodied agents, reducing trial-and-error and speeding up deployment in web automation and robotic workflows.
Summary TLDR
RAP stores short episode logs (plans, actions, observations) and retrieves the most relevant snippets to feed as in-context examples to an LLM agent. Retrieval scores combine task, plan, and a generated retrieval key; retrieved windows center on the most similar action. RAP improves multi-step text agents and multimodal embodied agents across ALFWorld, WebShop, Franka Kitchen and Meta-World, often substantially outperforming ReAct and other baselines. Memory can be built from one model and used by another.
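The scoring idea in the summary can be sketched as follows. This is a minimal illustration, not the paper's implementation: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, and the function names, dictionary layout, and equal weights are all assumptions made for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (assumption: stands in for sentence embeddings)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def score_episode(query, episode, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of task, plan, and best key-vs-action similarity.

    Returns (score, index of the most similar action). The episode must
    contain at least one action. Equal weights are a hypothetical default.
    """
    w_task, w_plan, w_key = weights
    s_task = cosine(embed(query["task"]), embed(episode["task"]))
    s_plan = cosine(embed(query["plan"]), embed(episode["plan"]))
    key_vec = embed(query["key"])
    sims = [cosine(key_vec, embed(a)) for a in episode["actions"]]
    best = max(range(len(sims)), key=sims.__getitem__)
    total = w_task * s_task + w_plan * s_plan + w_key * sims[best]
    return total / (w_task + w_plan + w_key), best

memory = [
    {"task": "put a clean mug in the coffee machine",
     "plan": "find mug, clean mug, place mug",
     "actions": ["go to cabinet 1", "take mug 1",
                 "clean mug 1 with sink 1", "put mug 1 in coffeemachine 1"]},
    {"task": "heat an egg and put it in the fridge",
     "plan": "find egg, heat egg, place egg",
     "actions": ["take egg 1 from countertop 1",
                 "heat egg 1 with microwave 1", "put egg 1 in fridge 1"]},
]
query = {"task": "put a clean mug in the coffee machine",
         "plan": "find mug, clean mug, place mug",
         "key": "clean mug"}

# Rank stored episodes by combined score, highest first.
ranked = sorted(memory, key=lambda ep: score_episode(query, ep)[0], reverse=True)
```

The index returned alongside the score is what a retrieved window would be centered on.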
Problem Statement
LLM agents struggle to reuse relevant past experiences when planning multi-step actions, especially in multimodal and embodied tasks. There is no unified way to store, retrieve, and integrate multimodal episode memories to improve sequential decision-making.
Main Contribution
RAP: a modular retrieval-augmented planning pipeline (Memory, Reasoner, Retriever, Executor) that stores episodes and retrieves context-aware examples for the agent.
Multimodal memory and retrieval: supports text and image observations via sentence embeddings and CLIP/Vision Transformer features.
Empirical gains across four benchmarks (ALFWorld, WebShop, Franka Kitchen, Meta-World) and multiple LLM/VLM backbones; memory is transferable across models.
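One way to picture how the four modules could fit together is the loop below. It is a sketch under stated assumptions: the summary names the modules but not their interfaces, so the callable signatures, the `Episode` and `Memory` classes, and the division of labor (Reasoner produces a plan and retrieval key, Executor picks the action) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    task: str
    plan: str = ""
    steps: list = field(default_factory=list)  # (action, observation) pairs

@dataclass
class Memory:
    episodes: list = field(default_factory=list)

    def add(self, episode):
        self.episodes.append(episode)

def run_episode(task, memory, reasoner, retriever, executor, env_step, max_steps=10):
    """Run one task; store the episode in memory only if it succeeds.

    Assumed (hypothetical) interfaces:
      reasoner(task, steps)                 -> (plan, retrieval_key)
      retriever(memory, task, plan, key)    -> list of examples
      executor(task, plan, steps, examples) -> action
      env_step(action)                      -> (observation, success)
    """
    episode = Episode(task)
    for _ in range(max_steps):
        episode.plan, key = reasoner(task, episode.steps)
        examples = retriever(memory, task, episode.plan, key)
        action = executor(task, episode.plan, episode.steps, examples)
        observation, success = env_step(action)
        episode.steps.append((action, observation))
        if success:
            memory.add(episode)  # only successful runs become reusable context
            return True
    return False
```

Keeping each module a plain callable makes it easy to swap the LLM or VLM backbone while reusing the same memory.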
Key Findings
With GPT-3.5, RAP raises ALFWorld success from 52.2% (ReAct) to 85.8%.
RAP with memory built from training tasks reaches 91.0% success on ALFWorld (GPT-3.5).
On WebShop, RAP improves success rate from 35.0% to 48.0% and overall reward from 61.8 to 76.1 (GPT-3.5).
In embodied vision tasks, RAP raises Franka Kitchen success for LLaVA from 43.4% to 61.6% and Meta-World from 65.4% to 79.2%.
Memory transfers across models: Llama2-13b success rises from 20.9% to 27.6% using memory constructed by GPT-3.5.
Ablations show that the retrieval choice matters: ALFWorld success rises from 82.1% with action-only retrieval to 86.6% with observation-based retrieval using CLIP image features.
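To compare the reported gains on a common footing, the short script below computes the absolute gain (percentage points) and the relative gain for each baseline/RAP pair quoted above. The figures are copied from the findings; the labels and the helper name `gains` are only for illustration.

```python
# (baseline %, with RAP %) pairs taken from the findings above.
results = {
    "ALFWorld, GPT-3.5 (ReAct vs RAP)": (52.2, 85.8),
    "WebShop success rate, GPT-3.5": (35.0, 48.0),
    "Franka Kitchen, LLaVA": (43.4, 61.6),
    "Meta-World": (65.4, 79.2),
    "Llama2-13b with GPT-3.5 memory": (20.9, 27.6),
}

def gains(baseline, rap):
    """Return (absolute gain in points, relative gain in percent), 1 decimal."""
    absolute = round(rap - baseline, 1)
    relative = round(100 * (rap - baseline) / baseline, 1)
    return absolute, relative

for name, (baseline, rap) in results.items():
    absolute, relative = gains(baseline, rap)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

Absolute points are the safer headline number here, because the baselines differ widely across benchmarks.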
Results
ALFWorld success rate (All)
ALFWorld success rate (RAP_train)
WebShop success rate
WebShop overall score
Franka Kitchen success rate
Meta-World success rate
Who Should Care
What To Try In 7 Days
Log successful episodes (task, plan, actions, observations) from your agent.
Add a retrieval step that computes similarity across task, plan, and a retrieval key.
Feed top-k nearby trajectory windows as in-context examples and compare success rate vs baseline.
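The three steps above could start from something as small as the sketch below, which logs successful episodes as JSON lines, formats retrieved windows as in-context examples, and computes a success rate for the baseline comparison. The file layout, field names, and prompt format are assumptions chosen for the example, not a format prescribed by RAP.

```python
import json

def log_episode(path, task, plan, steps, success):
    """Append one successful episode as a JSON line; failed runs are skipped."""
    if not success:
        return False
    record = {
        "task": task,
        "plan": plan,
        "steps": [{"action": a, "observation": o} for a, o in steps],
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return True

def load_episodes(path):
    """Read back every logged episode."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def build_prompt(task, windows):
    """Format retrieved trajectory windows as in-context examples.

    Each window is a list of {"action", "observation"} dicts.
    """
    parts = []
    for i, window in enumerate(windows, 1):
        lines = [f"> {s['action']}\n{s['observation']}" for s in window]
        parts.append(f"Example {i}:\n" + "\n".join(lines))
    parts.append(f"Your task: {task}")
    return "\n\n".join(parts)

def success_rate(outcomes):
    """Percentage of True values in a list of per-task outcomes."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Running the same task set once with an empty `windows` list and once with retrieved windows gives the two `success_rate` values to compare.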
Agent Features
Memory
- episodic memory (stored episodes of successful runs)
- multimodal memory (text + images)
- retrieval keys (short phrases derived from action plans)
Planning
- in-context planning
- ReAct-style interleaved reasoning and acting
Tool Use
- environment actions (text commands / simulated controls)
- policy network for low-level action mapping
Frameworks
- ReAct
- RAG
- Reflexion
- ADaPT
- ExpeL
Is Agentic
true
Architectures
- LLM agent
- VLM + policy network (multimodal agent)
Optimization Features
Token Efficiency
- retrieve small windows around most-similar action to reduce context
Training Optimization
- few-shot policy training for mapping plans to actions (25 demos)
Inference Optimization
- limit retrieved window size to reduce prompt noise
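Both optimizations above come down to slicing a bounded window out of a stored trajectory. A minimal sketch, assuming the index of the most similar action is already known and using a hypothetical `radius` parameter for the window half-width:

```python
def window_around(steps, center, radius=2):
    """Return up to 2 * radius + 1 steps centered on `center`.

    The window is clipped at the trajectory boundaries, so it is shorter
    when the center lies near the start or end of the episode.
    """
    start = max(0, center - radius)
    end = min(len(steps), center + radius + 1)
    return steps[start:end]

trajectory = ["go to cabinet 1", "open cabinet 1", "take mug 1",
              "go to sink 1", "clean mug 1 with sink 1",
              "go to coffeemachine 1", "put mug 1 in coffeemachine 1"]

# Window of radius 1 around the cleaning step (index 4).
snippet = window_around(trajectory, center=4, radius=1)
```

A smaller `radius` trades surrounding context for a shorter prompt, so it is worth tuning per benchmark.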
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Requires a corpus of successful episodes; performance drops if similar past examples are missing.
- Retrieval quality depends on embedding models and crafted retrieval keys.
- Experiments were run in simulation; transfer to real robots and live web systems may pose additional challenges.
When Not To Use
- When you have no past successful episodes to store.
- When privacy or compliance forbids storing action/observation logs.
- When inference cost prevents extra embedding and retrieval steps.
Failure Modes
- Retrieving irrelevant or misleading episodes that prompt incorrect actions.
- Overfitting to common trajectories in memory and failing novel cases.
- Embedding mismatch between models producing memory and models consuming it.
Core Entities
Models
- GPT-3.5
- GPT-4
- Llama2-13b
- LLaVA-13B
- CogVLM-17B
- Vicuna-v1.5
Metrics
- success rate
- overall reward/score
Datasets
- ALFWorld
- WebShop
- Franka Kitchen
- Meta-World
Benchmarks
- ALFWorld
- WebShop
- Franka Kitchen
- Meta-World

