Use past successful episodes as memory to boost LLM agent planning in text and vision tasks

Overview

Decision Snapshot: Ready For Pilot

The method is conceptually simple and integrates with existing LLM/VLM agents; results are strong on simulators but require episode logs, embedding infrastructure, and some environment-specific tuning.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Tomoyuki Kagaya, Thong Jing Yuan, Yuxuan Lou, Jayashree Karlekar, Sugiri Pranata, Akira Kinose, Koki Oguri, Felix Wick, Yang You

Why It Matters For Business

RAP turns past successful runs into reusable context that raises accuracy for multi-step text and embodied agents, reducing trial-and-error and speeding up deployment in web automation and robotic workflows.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Data Scientist, CTO, Founder

Summary TLDR

RAP stores short episode logs (plans, actions, observations) and retrieves the most relevant snippets to feed as in-context examples to an LLM agent. Retrieval scores combine similarity over the task, the plan, and a generated retrieval key; each retrieved window is centered on the most similar action. RAP improves multi-step text agents and multimodal embodied agents across ALFWorld, WebShop, Franka Kitchen, and Meta-World, often substantially outperforming ReAct and other baselines. Memory built with one model can be reused by another.
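The scoring step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words `embed` function stands in for a real sentence-embedding model, the equal weighting of the three similarities is an assumption, and the field names `task`, `plan`, and `key` are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stand-in for a sentence-embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieval_score(query: dict, episode: dict) -> float:
    """Average similarity over task, plan, and retrieval key (equal weights are an assumption)."""
    fields = ("task", "plan", "key")
    return sum(cosine(embed(query[f]), embed(episode[f])) for f in fields) / len(fields)

# Two hypothetical stored episodes and one new query.
memory = [
    {"task": "put a clean mug on the shelf",
     "plan": "find mug then clean it then place it",
     "key": "search for mug"},
    {"task": "heat an egg and put it in the fridge",
     "plan": "find egg then heat it then store it",
     "key": "search for egg"},
]
query = {"task": "put a clean plate on the shelf",
         "plan": "find plate then clean it then place it",
         "key": "search for plate"}

best = max(memory, key=lambda ep: retrieval_score(query, ep))
print(best["task"])  # the mug episode scores highest
```

Swapping `embed` for a real embedding model changes only that one function; the combination logic stays the same.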

Problem Statement

LLM agents struggle to reuse relevant past experiences when planning multi-step actions, especially in multimodal and embodied tasks. There is no unified way to store, retrieve, and integrate multimodal episode memories to improve sequential decision-making.

Main Contribution

RAP: a modular retrieval-augmented planning pipeline (Memory, Reasoner, Retriever, Executor) that stores episodes and retrieves context-aware examples for the agent.

Multimodal memory and retrieval: supports text and image observations via sentence embeddings and CLIP/Vision Transformer features.
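One way to picture how the four modules fit together is the loop below. It is a sketch under stated assumptions rather than the paper's code: the `Episode` and `Memory` classes, the function `run_episode`, and the four callables (`make_plan`, `choose_action`, `retrieve`, `execute`) are hypothetical names, and the stubs in the usage section replace the LLM and the simulator so the example runs on its own.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    task: str
    plan: str
    steps: list  # list of (action, observation) pairs

@dataclass
class Memory:
    episodes: list = field(default_factory=list)

    def add(self, episode: Episode) -> None:
        self.episodes.append(episode)

def run_episode(task, memory, make_plan, choose_action, retrieve, execute, max_steps=10):
    """Plan, retrieve, act; store the episode only if the task succeeds."""
    plan = make_plan(task)                                         # Reasoner: overall plan
    observation, steps = "start", []
    for _ in range(max_steps):
        examples = retrieve(memory, task, plan, observation)       # Retriever
        action = choose_action(task, plan, observation, examples)  # Reasoner
        observation, done = execute(action)                        # Executor
        steps.append((action, observation))
        if done:
            memory.add(Episode(task, plan, steps))                 # Memory keeps successes
            return True
    return False

# Stub components so the sketch runs without an LLM or a simulator.
memory = Memory()
actions = iter(["go to shelf", "take mug"])
success = run_episode(
    "fetch the mug", memory,
    make_plan=lambda task: "locate the mug, then pick it up",
    choose_action=lambda task, plan, obs, examples: next(actions),
    retrieve=lambda mem, task, plan, obs: mem.episodes[:1],
    execute=lambda action: (f"did: {action}", action == "take mug"),
)
print(success, len(memory.episodes))
```

Keeping each module behind a plain callable makes it straightforward to swap a text Reasoner for a VLM, or a text Retriever for one that uses image features.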

Key Findings

RAP raises ALFWorld success from 52.2% (ReAct, GPT-3.5) to 85.8% with GPT-3.5.

Numbers: 52.2% → 85.8% (ALFWorld, Table 1)

Practical Use: Add episode memory and retrieval to multi-step text agents to raise success on ALFWorld-like tasks by roughly 34 percentage points (about 1.6× the ReAct baseline).

Evidence Ref: Table 1

RAP with memory built from training tasks reaches 91.0% success on ALFWorld (GPT-3.5).

Numbers: RAP_train 91.0% (Table 1)

Practical Use: Constructing a curated memory from successful runs can yield further accuracy gains over on-the-fly memory.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ALFWorld success rate (All)	85.8% (RAP, GPT-3.5)	52.2% (ReAct, GPT-3.5)	+33.6 pp	ALFWorld (134 unseen games)	Table 1 reports RAP 85.8% vs ReAct 52.2% with GPT-3.5	Table 1
ALFWorld success rate (RAP_train)	91.0% (RAP_train, GPT-3.5)	52.2% (ReAct, GPT-3.5)	+38.8 pp	ALFWorld (with memory from training set)	Table 1 shows RAP_train 91.0%	Table 1
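The Delta column reports absolute gains in percentage points. For readers who also want the relative gain over the baseline, the arithmetic can be checked with a short helper; the function name `gains` is hypothetical, and the inputs are the Table 1 figures quoted above.

```python
def gains(value: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    delta_pp = value - baseline
    return round(delta_pp, 1), round(100 * delta_pp / baseline, 1)

print(gains(85.8, 52.2))  # RAP vs ReAct       -> (33.6, 64.4)
print(gains(91.0, 52.2))  # RAP_train vs ReAct -> (38.8, 74.3)
```

The relative gains of about 64% and 74% show that the improvement is large but short of a doubling of the baseline success rate.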

What To Try In 7 Days

Log successful episodes (task, plan, actions, observations) from your agent.

Add a retrieval step that computes similarity across task, plan, and a retrieval key.

Feed top-k nearby trajectory windows as in-context examples and compare success rate vs baseline.
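The three steps above can be prototyped with very little code. The sketch below is one possible layout, not a prescribed one: the JSON Lines schema (`task`, `plan`, `steps`), the helper names, and the prompt format are all assumptions, and an in-memory `io.StringIO` buffer stands in for an append-mode log file.

```python
import io
import json

def log_episode(fh, task, plan, steps, success):
    """Append one episode as a JSON line; failed runs are skipped."""
    if not success:
        return False
    fh.write(json.dumps({"task": task, "plan": plan, "steps": steps}) + "\n")
    return True

def load_episodes(fh):
    """Read every non-empty JSON line back into a list of episode dicts."""
    return [json.loads(line) for line in fh if line.strip()]

def build_prompt(task, windows):
    """Prepend retrieved trajectory windows as in-context examples."""
    shots = "\n\n".join(
        "\n".join(f"> {action}\n{observation}" for action, observation in window)
        for window in windows
    )
    return f"{shots}\n\nTask: {task}\n>"

def success_rate(outcomes):
    """Fraction of successful runs, for comparing against a baseline."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

log = io.StringIO()  # stands in for open("episodes.jsonl", "a")
log_episode(log, "fetch the mug", "locate, then pick up",
            [["take mug", "you hold the mug"]], success=True)
log_episode(log, "fetch the key", "locate, then pick up",
            [["open door", "the door is locked"]], success=False)
log.seek(0)
episodes = load_episodes(log)

prompt = build_prompt("fetch the plate", [ep["steps"] for ep in episodes])
print(len(episodes))
print(prompt)
print(success_rate([True, False, True, True]))
```

Running the same task set with and without the retrieved examples, and comparing the two `success_rate` values, gives the baseline comparison suggested in the last step.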

Agent Features

Memory

episodic memory (stored episodes of successful runs)

multimodal memory (text + images)

retrieval keys (short phrases derived from action plans)

Planning

in-context planning

ReAct-style interleaved reasoning and acting

Tool Use

environment actions (text commands / simulated controls)

policy network for low-level action mapping

Frameworks

ReAct, RAG, Reflexion, ADaPT, ExpeL

Is Agentic

Yes

Architectures

LLM agent

VLM + policy network (multimodal agent)

Optimization Features

Token Efficiency

retrieve small windows around most-similar action to reduce context

Training Optimization

few-shot policy training for mapping plans to actions (25 demos)

Inference Optimization

limit retrieved window size to reduce prompt noise
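The windowing idea behind both the token-efficiency and inference points can be illustrated with a short sketch. It assumes a fixed `radius` on each side of the best-matching action and uses `difflib.SequenceMatcher` from the standard library as a stand-in string similarity; the source does not specify the window size or the similarity function, so both are illustrative choices, as are the function names.

```python
from difflib import SequenceMatcher

def most_similar_index(actions, query):
    """Index of the stored action most similar to the query (toy string similarity)."""
    return max(range(len(actions)),
               key=lambda i: SequenceMatcher(None, actions[i], query).ratio())

def window_around(steps, center, radius=2):
    """Return steps within `radius` positions of `center`, clamped to the episode bounds."""
    if not 0 <= center < len(steps):
        raise IndexError("center is outside the episode")
    start = max(0, center - radius)
    return steps[start:center + radius + 1]

actions = [
    "go to shelf 1",
    "take mug 1 from shelf 1",
    "go to sinkbasin 1",
    "clean mug 1 with sinkbasin 1",
    "put mug 1 in shelf 2",
]
center = most_similar_index(actions, "clean plate 1 with sinkbasin 1")
print(window_around(actions, center, radius=1))
```

Only the few steps around the match reach the prompt, so the context grows with the window radius rather than with the full episode length.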

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Requires a corpus of successful episodes; performance drops if similar past examples are missing.

Retrieval quality depends on embedding models and crafted retrieval keys.

When Not To Use

When you have no past successful episodes to store.

When privacy or compliance forbids storing action/observation logs.

Failure Modes

Retrieving irrelevant or misleading episodes that prompt incorrect actions.

Overfitting to common trajectories in memory and failing novel cases.
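One possible guard against the first failure mode is to drop retrieved episodes whose similarity score falls below a floor, so that the agent receives no examples rather than misleading ones. This mitigation is not described in the source; the function name, the `min_score` threshold of 0.5, and the `top_k` of 3 are hypothetical values that would need tuning per environment.

```python
def filter_by_similarity(scored, min_score=0.5, top_k=3):
    """Keep at most top_k (score, episode) pairs whose score clears min_score."""
    kept = [pair for pair in scored if pair[0] >= min_score]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)[:top_k]

scored = [(0.64, "plate episode"), (0.32, "egg episode"), (0.81, "mug episode")]
print(filter_by_similarity(scored))                 # weak match is dropped
print(filter_by_similarity(scored, min_score=0.9))  # nothing qualifies
```

An empty result is a signal to fall back to the agent's baseline prompt without retrieved examples.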

Core Entities

Models

GPT-3.5, GPT-4, Llama2-13b, LLaVA-13B, CogVLM-17B, Vicuna-v1.5

Metrics

success rate, overall reward/score

Datasets

ALFWorld, WebShop, Franka Kitchen, Meta-World

Benchmarks

ALFWorld, WebShop, Franka Kitchen, Meta-World
