Use past successful episodes as memory to boost LLM agent planning in text and vision tasks

February 6, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The method is conceptually simple and integrates with existing LLM/VLM agents. Results are strong on simulators, but adoption requires episode logs, embedding infrastructure, and some environment-specific tuning.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Tomoyuki Kagaya, Thong Jing Yuan, Yuxuan Lou, Jayashree Karlekar, Sugiri Pranata, Akira Kinose, Koki Oguri, Felix Wick, Yang You

Links

Abstract / PDF

Why It Matters For Business

RAP turns past successful runs into reusable context that raises accuracy for multi-step text and embodied agents, reducing trial-and-error and speeding up deployment in web automation and robotic workflows.

Summary TLDR

RAP stores short episode logs (plans, actions, observations) and retrieves the most relevant snippets to feed to an LLM agent as in-context examples. The retrieval score combines similarity over the task, the plan, and a generated retrieval key, and each retrieved window is centered on the most similar action. RAP improves multi-step text agents and multimodal embodied agents across ALFWorld, WebShop, Franka Kitchen and Meta-World, often substantially outperforming ReAct and other baselines. Memory built with one model can be reused by another.
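The scoring idea can be illustrated with a minimal sketch. It assumes a toy bag-of-words embedding in place of a real sentence-embedding model, and equal weights across the three fields; the function names (`embed`, `retrieval_score`, `top_k`) and the sample episodes are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence-embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieval_score(query, episode, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of task, plan, and retrieval-key similarity.

    Equal weights are an assumption made for this sketch.
    """
    fields = ("task", "plan", "key")
    sims = [cosine(embed(query[f]), embed(episode[f])) for f in fields]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


def top_k(query, memory, k=1):
    """Return the k stored episodes with the highest combined score."""
    return sorted(memory, key=lambda ep: retrieval_score(query, ep), reverse=True)[:k]


# Hypothetical memory entries, invented for illustration.
memory = [
    {"task": "put a clean mug on the shelf",
     "plan": "find mug, clean it, place on shelf",
     "key": "search for mug"},
    {"task": "heat an egg and put it in the fridge",
     "plan": "find egg, heat it, place in fridge",
     "key": "search for egg"},
]
query = {"task": "put a clean cup on the shelf",
         "plan": "find cup, clean it, place on shelf",
         "key": "search for cup"}

best = top_k(query, memory, k=1)[0]
print(best["task"])
```

In practice the toy `embed` would be swapped for whatever embedding model the deployment already uses; the combination step stays the same.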

Problem Statement

LLM agents struggle to reuse relevant past experiences when planning multi-step actions, especially in multimodal and embodied tasks. There is no unified way to store, retrieve, and integrate multimodal episode memories to improve sequential decision-making.

Main Contribution

RAP: a modular retrieval-augmented planning pipeline (Memory, Reasoner, Retriever, Executor) that stores episodes and retrieves context-aware examples for the agent.

Multimodal memory and retrieval: supports text and image observations via sentence embeddings and CLIP/Vision Transformer features.
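A rough sketch of how the four modules might fit together is shown here. The data classes, the callable signatures, and the `env_step` hook are all assumptions made for illustration; the source only names the modules and does not specify their interfaces.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Step:
    action: str
    observation: str                                   # text observation
    image_features: Optional[Sequence[float]] = None   # e.g. a CLIP/ViT vector, if any


@dataclass
class Episode:
    task: str
    plan: str = ""
    steps: List[Step] = field(default_factory=list)


@dataclass
class Memory:
    episodes: List[Episode] = field(default_factory=list)

    def add(self, episode, success):
        """Keep only successful runs, as the summary describes."""
        if success:
            self.episodes.append(episode)


def run_episode(task, memory, reasoner, retriever, executor, env_step, max_steps=10):
    """One episode: reason, retrieve, act, then store the run if it succeeded."""
    episode = Episode(task=task)
    success = False
    for _ in range(max_steps):
        plan, key = reasoner(task, episode.steps)          # plan text plus a retrieval key
        episode.plan = plan
        examples = retriever(memory, task, plan, key)      # relevant past snippets
        action = executor(task, examples, episode.steps)   # next action, given the examples
        observation, success = env_step(action)
        episode.steps.append(Step(action, observation))
        if success:
            break
    memory.add(episode, success)
    return success


# Stub components, invented for illustration; real ones would call an LLM and an environment.
def reasoner(task, steps):
    return "open the drawer", "drawer handle"


def retriever(memory, task, plan, key):
    return memory.episodes[:1]


def executor(task, examples, steps):
    return "open drawer"


def env_step(action):
    return "the drawer is open", action == "open drawer"


memory = Memory()
succeeded = run_episode("open the drawer", memory, reasoner, retriever, executor, env_step)
```

Keeping the modules as plain callables makes it easy to swap the retriever or the underlying model without touching the loop.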

Key Findings

RAP raises ALFWorld success from 52.2% (ReAct, GPT-3.5) to 85.8% with GPT-3.5.

Numbers: 52.2% → 85.8% (ALFWorld, Table 1)

Practical Use: Add episode memory and retrieval to multi-step text agents; on ALFWorld this lifted success by 33.6 percentage points (about 1.6x the ReAct baseline).

Evidence Ref: Table 1

RAP with memory built from training tasks reaches 91.0% success on ALFWorld (GPT-3.5).

Numbers: RAP_train 91.0% (Table 1)

Practical Use: Constructing a curated memory from successful runs can yield further accuracy gains over on-the-fly memory.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ALFWorld success rate (All) | 85.8% (RAP, GPT-3.5) | 52.2% (ReAct, GPT-3.5) | +33.6 pp | ALFWorld (134 unseen games) | Table 1 reports RAP 85.8% vs ReAct 52.2% with GPT-3.5 | Table 1
ALFWorld success rate (RAP_train) | 91.0% (RAP_train, GPT-3.5) | 52.2% (ReAct, GPT-3.5) | +38.8 pp | ALFWorld (with memory from training set) | Table 1 shows RAP_train 91.0% | Table 1
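The deltas above can be checked with a few lines of arithmetic. The snippet uses only the success rates quoted from Table 1; the helper name `compare` is hypothetical.

```python
def compare(value, baseline):
    """Return (absolute gain in percentage points, ratio to baseline)."""
    return round(value - baseline, 1), round(value / baseline, 2)


baseline = 52.2  # ReAct with GPT-3.5, as quoted from Table 1
for name, value in {"RAP": 85.8, "RAP_train": 91.0}.items():
    gain_pp, ratio = compare(value, baseline)
    print(f"{name}: +{gain_pp} pp, {ratio}x the baseline")
# RAP: +33.6 pp, 1.64x the baseline
# RAP_train: +38.8 pp, 1.74x the baseline
```

Reporting both figures is useful because the absolute gain in percentage points and the relative ratio answer different questions.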

What To Try In 7 Days

Log successful episodes (task, plan, actions, observations) from your agent.

Add a retrieval step that computes similarity across task, plan, and a retrieval key.

Feed top-k nearby trajectory windows as in-context examples and compare success rate vs baseline.
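The logging, prompting, and comparison steps above could be prototyped along these lines. This is a sketch under stated assumptions: the JSON Lines layout, the prompt template, and every function name are hypothetical, and the outcome lists are placeholders rather than measured results.

```python
import io
import json


def log_episode(fp, task, plan, steps, success):
    """Append one episode as a JSON line; failed runs are skipped."""
    if not success:
        return False
    fp.write(json.dumps({"task": task, "plan": plan, "steps": steps}) + "\n")
    return True


def load_memory(fp):
    """Read every logged episode back into a list of dicts."""
    return [json.loads(line) for line in fp if line.strip()]


def build_prompt(task, windows):
    """Prepend retrieved trajectory windows as in-context examples."""
    parts = []
    for i, window in enumerate(windows, 1):
        lines = [f"> {s['action']}\n{s['observation']}" for s in window]
        parts.append(f"Example {i}:\n" + "\n".join(lines))
    parts.append(f"Task: {task}\n>")
    return "\n\n".join(parts)


def success_rate(outcomes):
    """Fraction of episodes that succeeded (0.0 for an empty list)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# An in-memory buffer stands in for a log file on disk.
buffer = io.StringIO()
steps = [{"action": "go to desk 1", "observation": "You see a mug 1."}]
log_episode(buffer, "find a mug", "look on the desk", steps, success=True)
log_episode(buffer, "find a pen", "look on the shelf", [], success=False)  # not stored
buffer.seek(0)
memory = load_memory(buffer)

prompt = build_prompt("find a cup", [memory[0]["steps"]])

# Placeholder outcomes only; replace with real runs of the baseline and the memory-augmented agent.
baseline_rate = success_rate([True, False, False, True])
```

Running the same task set with and without the retrieved examples gives the two success rates to compare.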

Agent Features

Memory
episodic memory (stored episodes of successful runs), multimodal memory (text + images), retrieval keys (short phrases derived from action plans)
Planning
in-context planning, ReAct-style interleaved reasoning and acting
Tool Use
environment actions (text commands / simulated controls), policy network for low-level action mapping
Frameworks
ReAct, RAG, Reflexion, ADaPT, ExpeL
Is Agentic

Yes

Architectures
LLM agent, VLM + policy network (multimodal agent)

Optimization Features

Token Efficiency
retrieve small windows around most-similar action to reduce context
Training Optimization
few-shot policy training for mapping plans to actions (25 demos)
Inference Optimization
limit retrieved window size to reduce prompt noise
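The windowing idea can be sketched as follows: find the stored action most similar to the retrieval key, then keep only a few steps on either side. The token-overlap similarity is a toy stand-in for embedding similarity, and the `radius` parameter and function names are assumptions made for this example.

```python
def token_overlap(a, b):
    """Jaccard overlap of lowercase tokens; a toy stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve_window(actions, key, radius=1, similarity=token_overlap):
    """Return the actions within `radius` steps of the one most similar to `key`."""
    if not actions:
        return []
    center = max(range(len(actions)), key=lambda i: similarity(key, actions[i]))
    start = max(0, center - radius)  # clip at the start of the episode
    return actions[start:center + radius + 1]


# Hypothetical stored trajectory, invented for illustration.
actions = [
    "go to desk 1",
    "take mug 1 from desk 1",
    "go to sinkbasin 1",
    "clean mug 1 with sinkbasin 1",
    "go to shelf 1",
    "put mug 1 on shelf 1",
]

window = retrieve_window(actions, "clean mug", radius=1)
print(window)
# ['go to sinkbasin 1', 'clean mug 1 with sinkbasin 1', 'go to shelf 1']
```

A smaller `radius` keeps the prompt short; a larger one gives the agent more surrounding context at the cost of more tokens and possibly more noise.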

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Requires a corpus of successful episodes; performance drops if similar past examples are missing.

Retrieval quality depends on embedding models and crafted retrieval keys.

When Not To Use

When you have no past successful episodes to store.

When privacy or compliance forbids storing action/observation logs.

Failure Modes

Retrieving irrelevant or misleading episodes that prompt incorrect actions.

Overfitting to common trajectories in memory and failing novel cases.

Core Entities

Models

GPT-3.5, GPT-4, Llama2-13b, LLaVA-13B, CogVLM-17B, Vicuna-v1.5

Metrics

success rate, overall reward/score

Datasets

ALFWorld, WebShop, Franka Kitchen, Meta-World

Benchmarks

ALFWorld, WebShop, Franka Kitchen, Meta-World