Use past successful episodes as memory to boost LLM agent planning in text and vision tasks

February 6, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

5

Authors

Tomoyuki Kagaya, Thong Jing Yuan, Yuxuan Lou, Jayashree Karlekar, Sugiri Pranata, Akira Kinose, Koki Oguri, Felix Wick, Yang You

Why It Matters For Business

RAP turns past successful runs into reusable context that raises accuracy for multi-step text and embodied agents, reducing trial-and-error and speeding up deployment in web automation and robotic workflows.

Summary TLDR

RAP stores short episode logs (plans, actions, observations) and retrieves the most relevant snippets to feed as in-context examples to an LLM agent. Retrieval scores combine task, plan, and a generated retrieval key; retrieved windows center on the most similar action. RAP improves multi-step text agents and multimodal embodied agents across ALFWorld, WebShop, Franka Kitchen and Meta-World, often substantially outperforming ReAct and other baselines. Memory can be built from one model and used by another.
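The scoring idea in the summary can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words `embed` function is a stand-in for a real sentence encoder, and the equal weighting of the task, plan, and retrieval-key similarities is an assumption. The dictionary field names (`task`, `plan`, `key`) are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieval_score(query: dict, episode: dict) -> float:
    """Average task, plan, and key similarity (equal weights are an assumption)."""
    fields = ("task", "plan", "key")
    return sum(cosine(embed(query[f]), embed(episode[f])) for f in fields) / len(fields)


# Hypothetical memory of two successful episodes.
memory = [
    {"task": "put a clean mug on the shelf",
     "plan": "find mug, clean it, place on shelf",
     "key": "mug"},
    {"task": "heat an egg in the microwave",
     "plan": "find egg, heat it, put it down",
     "key": "egg"},
]
query = {"task": "put a clean cup on the shelf",
         "plan": "find cup, clean it, place on shelf",
         "key": "cup"}

# Pick the stored episode with the highest combined score.
best = max(memory, key=lambda ep: retrieval_score(query, ep))
print(best["task"])
```

With a real encoder, near-synonyms such as "cup" and "mug" would also contribute to the score; the toy embedding only rewards exact token overlap.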

Problem Statement

LLM agents struggle to reuse relevant past experiences when planning multi-step actions, especially in multimodal and embodied tasks. There is no unified way to store, retrieve, and integrate multimodal episode memories to improve sequential decision-making.

Main Contribution

RAP: a modular retrieval-augmented planning pipeline (Memory, Reasoner, Retriever, Executor) that stores episodes and retrieves context-aware examples for the agent.
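One way to picture the four modules is as plain callables wired together in a single step function. The sketch below is an assumption about how the pieces might fit, not the paper's code: the names `Episode`, `Memory`, `rap_step`, and the toy components are all hypothetical, and a real Reasoner and Executor would be LLM calls.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    task: str
    plan: str
    steps: list[tuple[str, str]]  # (action, observation) pairs


@dataclass
class Memory:
    episodes: list[Episode] = field(default_factory=list)

    def add(self, episode: Episode) -> None:
        self.episodes.append(episode)


def rap_step(task, memory, reasoner, retriever, executor) -> str:
    """One decision step: reason, retrieve examples, then choose an action."""
    plan, key = reasoner(task)
    examples = retriever(memory, task, plan, key)
    return executor(task, plan, examples)


# Toy components standing in for LLM calls (hypothetical behavior).
def toy_reasoner(task):
    return f"plan for: {task}", task.split()[-1]


def toy_retriever(memory, task, plan, key):
    return [ep for ep in memory.episodes if key in ep.task]


def toy_executor(task, plan, examples):
    # Reuse the first action of the first retrieved episode, else explore.
    return examples[0].steps[0][0] if examples else "explore"


memory = Memory()
memory.add(Episode("find the mug", "search cabinets",
                   [("go to cabinet 1", "you see a mug")]))

action = rap_step("clean the mug", memory, toy_reasoner, toy_retriever, toy_executor)
print(action)
```

Keeping each module behind a simple function boundary is what would let one component (for example, the embedding model in the Retriever) be swapped without touching the others.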

Multimodal memory and retrieval: supports text and image observations via sentence embeddings and CLIP/Vision Transformer features.
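Whatever encoder produces the features, retrieval over stored steps reduces to a nearest-neighbor search in embedding space. The sketch below assumes the feature vectors have already been computed (by a sentence-embedding model for text or a CLIP/ViT encoder for images); the three-dimensional vectors are hand-written toy values, and `most_similar_step` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar_step(query_vec, step_vecs):
    """Return (index, similarity) of the stored step closest to the query."""
    sims = [cosine(query_vec, v) for v in step_vecs]
    idx = max(range(len(sims)), key=sims.__getitem__)
    return idx, sims[idx]


# Toy stand-ins for precomputed observation features of three stored steps.
stored_features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
current_observation = [0.5, 0.9, 0.0]

idx, sim = most_similar_step(current_observation, stored_features)
print(idx, round(sim, 3))
```

The same function works for text and image observations as long as the query and the stored steps were embedded by the same model.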

Empirical gains across four benchmarks (ALFWorld, WebShop, Franka Kitchen, Meta-World) and multiple LLM/VLM backbones; memory is transferable across models.

Key Findings

RAP raises ALFWorld success from 52.2% (ReAct, GPT-3.5) to 85.8% with GPT-3.5.

Numbers: 52.2% → 85.8% (ALFWorld, Table 1)

RAP with memory built from training tasks reaches 91.0% success on ALFWorld (GPT-3.5).

Numbers: RAP_train 91.0% (Table 1)

On WebShop, RAP improves success rate from 35.0% to 48.0% and overall reward from 61.8 to 76.1 (GPT-3.5).

Numbers: Success 35.0% → 48.0%; Score 61.8 → 76.1 (Table 2)

In embodied vision tasks, RAP raises Franka Kitchen success for LLaVA from 43.4% to 61.6% and Meta-World from 65.4% to 79.2%.

Numbers: Franka Kitchen 43.4% → 61.6%; Meta-World 65.4% → 79.2% (Table 3)

Memory transfers across models: Llama2-13b success rises from 20.9% to 27.6% using memory constructed by GPT-3.5.

Numbers: 20.9% → 27.6% (Table 6)

Ablations show that the retrieval choice matters: ALFWorld success rises from 82.1% with action-only retrieval to 84.3% with observation-based retrieval and 86.6% with CLIP image retrieval.

Numbers: RAP_act 82.1%, RAP_obs 84.3%, RAP_clip 86.6% (Table 4)

Results

ALFWorld success rate (All)

Value: 85.8% (RAP, GPT-3.5)

Baseline: 52.2% (ReAct, GPT-3.5)

ALFWorld success rate (RAP_train)

Value: 91.0% (RAP_train, GPT-3.5)

Baseline: 52.2% (ReAct, GPT-3.5)

WebShop success rate

Value: 48.0% (RAP, GPT-3.5)

Baseline: 35.0% (ReAct, GPT-3.5)

WebShop overall score

Value: 76.1 (RAP, GPT-3.5)

Baseline: 61.8 (ReAct, GPT-3.5)

Franka Kitchen success rate

Value: 61.6% (LLaVA with RAP)

Baseline: 43.4% (LLaVA)

Meta-World success rate

Value: 79.2% (LLaVA with RAP)

Baseline: 65.4% (LLaVA)
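To compare these results on a common footing, it can help to compute both the absolute gain (in percentage points, or score points for WebShop overall score) and the relative gain over the baseline. The short script below does that arithmetic using only the value/baseline pairs listed above; the `gains` helper is a hypothetical name introduced for illustration.

```python
# (baseline, value) pairs copied from the results listed above.
results = {
    "ALFWorld success rate": (52.2, 85.8),
    "ALFWorld success rate (RAP_train)": (52.2, 91.0),
    "WebShop success rate": (35.0, 48.0),
    "WebShop overall score": (61.8, 76.1),
    "Franka Kitchen success rate": (43.4, 61.6),
    "Meta-World success rate": (65.4, 79.2),
}


def gains(baseline: float, value: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent of the baseline)."""
    absolute = value - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


for name, (baseline, value) in results.items():
    absolute, relative = gains(baseline, value)
    print(f"{name}: +{absolute:.1f} absolute, +{relative:.1f}% relative")
```

Relative gains are sensitive to how low the baseline is, so the absolute figures are usually the safer ones to quote.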

Who Should Care

What To Try In 7 Days

Log successful episodes (task, plan, actions, observations) from your agent.

Add a retrieval step that computes similarity across task, plan, and a retrieval key.

Feed top-k nearby trajectory windows as in-context examples and compare success rate vs baseline.
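The first and last steps above (logging successful episodes and comparing against a baseline) can be prototyped with very little code. This is a minimal sketch under stated assumptions: the JSON Lines record layout and the helper names `log_episode`, `load_episodes`, and `success_rate` are hypothetical, and an in-memory buffer stands in for a real log file.

```python
import io
import json


def log_episode(sink, task, plan, steps, success):
    """Append one episode as a JSON line; failed runs are skipped."""
    if not success:
        return False
    record = {
        "task": task,
        "plan": plan,
        "steps": [{"action": a, "observation": o} for a, o in steps],
    }
    sink.write(json.dumps(record) + "\n")
    return True


def load_episodes(source):
    """Read every non-empty JSON line back into a list of dicts."""
    return [json.loads(line) for line in source if line.strip()]


def success_rate(outcomes):
    """Fraction of runs that succeeded."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# In-memory buffer standing in for a log file opened in append mode.
buffer = io.StringIO()
log_episode(buffer, "find the mug", "search cabinets",
            [("go to cabinet 1", "you see a mug")], success=True)
log_episode(buffer, "heat the egg", "use microwave",
            [("go to fridge 1", "nothing happens")], success=False)

buffer.seek(0)
episodes = load_episodes(buffer)
print(len(episodes), success_rate([True, False, True, True]))
```

Running the same task set once with retrieval disabled and once with it enabled, then comparing the two `success_rate` values, gives the baseline comparison the third step asks for.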

Agent Features

Memory

  • episodic memory (stored episodes of successful runs)
  • multimodal memory (text + images)
  • retrieval keys (short phrases derived from action plans)

Planning

  • in-context planning
  • ReAct-style interleaved reasoning and acting

Tool Use

  • environment actions (text commands / simulated controls)
  • policy network for low-level action mapping

Frameworks

  • ReAct
  • RAG
  • Reflexion
  • ADaPT
  • ExpeL

Is Agentic

true

Architectures

  • LLM agent
  • VLM + policy network (multimodal agent)

Optimization Features

Token Efficiency

  • retrieve small windows around most-similar action to reduce context
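The windowing idea can be sketched as a slice around the index of the most similar stored action. The symmetric `radius` parameter and the `window_around` name are assumptions made for illustration; the source does not specify the window shape.

```python
def window_around(steps, center, radius=2):
    """Return the steps within `radius` positions of `center`, clipped to bounds."""
    start = max(0, center - radius)
    end = min(len(steps), center + radius + 1)
    return steps[start:end]


# Hypothetical trajectory of seven actions from one stored episode.
trajectory = [
    "go to cabinet 1", "open cabinet 1", "take mug 1",
    "go to sink 1", "clean mug 1", "go to shelf 1", "put mug 1",
]

# Suppose index 3 was found to be the action most similar to the current step.
snippet = window_around(trajectory, center=3, radius=1)
print(snippet)
```

A smaller radius keeps the prompt short; a larger one gives the agent more of the surrounding trajectory at the cost of extra tokens.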

Training Optimization

  • few-shot policy training for mapping plans to actions (25 demos)

Inference Optimization

  • limit retrieved window size to reduce prompt noise

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires a corpus of successful episodes; performance drops if similar past examples are missing.
  • Retrieval quality depends on embedding models and crafted retrieval keys.
  • Experiments run in simulation; transfer to real robots and real web systems may pose additional challenges.

When Not To Use

  • When you have no past successful episodes to store.
  • When privacy or compliance forbids storing action/observation logs.
  • When inference cost prevents extra embedding and retrieval steps.

Failure Modes

  • Retrieving irrelevant or misleading episodes that prompt incorrect actions.
  • Overfitting to common trajectories in memory and failing novel cases.
  • Embedding mismatch between models producing memory and models consuming it.

Core Entities

Models

  • GPT-3.5
  • GPT-4
  • Llama2-13b
  • LLaVA-13B
  • CogVLM-17B
  • Vicuna-v1.5

Metrics

  • success rate
  • overall reward/score

Datasets

  • ALFWorld
  • WebShop
  • Franka Kitchen
  • Meta-World

Benchmarks

  • ALFWorld
  • WebShop
  • Franka Kitchen
  • Meta-World