RAFA: plan several steps with an LLM, execute only the first, replan — provable √T regret and strong sample efficiency

September 29, 202310 min

Overview

Production Readiness

0.6

Novelty Score

0.8

Cost Impact Score

0.6

Citation Count

2

Authors

Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, Zhaoran Wang

Links

Abstract / PDF

Why It Matters For Business

RAFA reduces costly environment trials by using LLMs as in-context model estimators and planning ahead, so you can ship agents that learn faster without fine-tuning models.

Summary TLDR

RAFA is a prompting-based framework that makes an LLM alternate between (1) reasoning from a memory buffer to estimate the environment and plan a multi-step trajectory and (2) executing only the first planned action, storing feedback, and replanning. This 'reason for future, act for now' loop uses in-context learning (no weight updates) and is proven to achieve √T Bayesian regret under reasonable assumptions. Empirically RAFA improves sample efficiency and success rates on Game of 24, ALFWorld, BlocksWorld, and Tic-Tac-Toe versus ReAct, Reflexion, AdaPlanner and open-loop planners.

Problem Statement

LLM agents can reason but are stateless and ungrounded; we need a practical, sample-efficient protocol that turns LLM reasoning into actions while minimizing costly environment interactions and giving theoretical guarantees.

Main Contribution

A practical closed-loop prompting framework (RAFA) that alternates multi-step planning by an LLM with executing only the first action and storing feedback.

A formal mapping from LLM in-context learning to Bayesian adaptive MDPs, letting LLMs act as model/value estimators without parameter updates.

Theoretical guarantees: RAFA variants attain provable √T Bayesian regret under assumptions; two exploration variants (optimistic bonus and posterior sampling) drop a concentrability assumption.

Empirical wins: RAFA raises success rates and sample efficiency on Game of 24, ALFWorld, BlocksWorld, and Tic-Tac-Toe using GPT-family and open models.

Key Findings

RAFA achieves state-of-the-art success on ALFWorld.

Numbers99.25% total success rate (ALFWorld tasks)

RAFA boosts Game of 24 solving with GPT-4.

Numbers89% (B=1) and 93% (B=2) success vs ToT 73%/81% (GPT-4)

RAFA attains √T Bayesian regret in theory.

NumbersRegret bound scales as Õ((1-γ)^{-1} · √d^3 T) in linear-kernel MDPs

RAFA can convert LLM in-context gains into better environment estimates.

NumbersLLMs improved prediction accuracy with as few as 6 in-context examples in an information-bandit test

RAFA turns a losing Tic-Tac-Toe side into a strong opponent after a few episodes.

NumbersRAFA (O) wins 100% at T=7 vs GPT-4 baseline X wins 90%

Results

Game of 24 success rate

ValueGPT-4: 89% (B=1), 93% (B=2); GPT-3.5: 29% (B=1), 46% (B=2)

BaselineTree-of-Thoughts (ToT) GPT-4: 73%/81%; Reflexion GPT-4: 21%

ALFWorld overall success rate

ValueRAFA: 99.25%

BaselineAdaPlanner: 91.79%; Reflexion: 92.54%; ReAct: 61.94%

Tic-Tac-Toe win/tie/loss (RAFA as O)

ValueRAFA T=7: 0% X win, 0% tie, 100% O win

Baselinegpt-4 vs gpt-4 baseline: 90% X win, 0% tie, 10% O win

Theoretical Bayesian regret

ValueR(T)=Õ((1-γ)^{-1} · √d^3 T) in linear-kernel MDPs (with suitable choices)

BaselineNo prior closed-loop LLM method with such regret bound

BlocksWorld sample efficiency

ValueVicuna-13B >50% within 8 steps (4-step tasks); Vicuna-33B >80% within 8 steps

BaselineCoT and RAP underperform due to missing feedback loop

Who Should Care

What To Try In 7 Days

Implement a small memory buffer of trajectories and prompt your LLM to both simulate next states (Model) and score rollouts (Critic).

Plan multi-step trajectories and only execute the first action; collect the true next state and add a short summary to the buffer.

Use a simple switching rule (e.g., when prediction disagrees with observation) to trigger re-prompting and re-planning.

Agent Features

Memory

  • In-context memory buffer (stores state, action, reward, next state and linguistic summary)
  • Switching condition controls when buffer becomes active context
  • Summaries used to avoid token explosion

Planning

  • Multi-step trajectory planning (in-context)
  • Plan-and-execute-first-action (model predictive control style)
  • LoRA

Tool Use

  • Prompted LLM instances for Model, Critic, Elite
  • Memory buffer with summarized interaction history
  • No parameter updates; in-context learning only

Frameworks

  • RAFA
  • Tree of Thoughts (ToT)
  • ReAct
  • Reflexion
  • AdaPlanner

Is Agentic

true

Architectures

  • LLM-based planner (Model/Critic/Elite instances)
  • Tree-search / Beam-search / MCTS
  • Value-iteration emulation (truncated horizon)

Optimization Features

Token Efficiency

  • Store compressed linguistic summaries of failure trajectories to reduce prompt size
  • Switching condition avoids adding every step to the reasoning context

Training Optimization

  • No online parameter updates; rely on pretrained LLMs and in-context learning

Inference Optimization

  • Plan breadth/depth trade-offs (B, U) to control computation vs performance
  • Use Elite to propose candidate actions and Critic to prune

Reproducibility

Data Urls

  • ALFWorld (public)
  • Game of 24 subset (public)
  • BlocksWorld (from Hao et al. 2023)
  • Tic-Tac-Toe (constructed by authors)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Theory assumes LLM posterior alignment and MDP regularity; real LLMs may deviate and add an approximation error.
  • RAFA relies on multiple LLM calls per planning loop which raises latency and API cost.
  • Token limits constrain memory size; the method needs careful summarization for long horizons.
  • Empirical tasks are text-based; outcomes may differ in rich visual or robotics settings without additional perception grounding.

When Not To Use

  • When each environment interaction is essentially free and offline RL fine-tuning is feasible.
  • When you cannot afford repeated LLM calls due to cost or latency constraints.
  • When the agent must act in real time with strict millisecond requirements.

Failure Modes

  • Poor in-context examples or summaries can bias the LLM model and lead to systematic errors.
  • Planner suboptimality (small breadth/depth) may trap the agent in local optima despite replanning.
  • Concentrability assumption violations (distribution shift) can degrade theoretical guarantees unless using exploration variants.

Core Entities

Models

  • gpt-4
  • gpt-3.5-turbo
  • text-davinci-003
  • Llama 2-7B
  • Vicuna-13B
  • Vicuna-33B

Metrics

  • success rate (%)
  • Bayesian regret (theoretical)
  • sample efficiency (tasks solved per step)
  • win/tie/loss rates (games)

Datasets

  • Game of 24 (subset 901-1000)
  • ALFWorld
  • BlocksWorld
  • Tic-Tac-Toe (new benchmark)

Benchmarks

  • Game of 24
  • ALFWorld
  • BlocksWorld
  • Tic-Tac-Toe