RAFA: plan several steps with an LLM, execute only the first, replan — provable √T regret and strong sample efficiency

Overview

Decision Snapshot: Ready For Pilot

RAFA is a clear, implementable protocol: use LLMs as prompted Model/Critic/Elite, plan several steps, execute the first, record feedback, replan; theoretical √T regret supports expected learning gains under stated assumptions.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 80%

Authors

Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, Zhaoran Wang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RAFA reduces costly environment trials by using LLMs as in-context model estimators and planning ahead, so you can ship agents that learn faster without fine-tuning models.

Who Should Care

Product Manager, ML Engineer, Founder, Engineering Lead

Summary TLDR

RAFA is a prompting-based framework that makes an LLM alternate between (1) reasoning from a memory buffer to estimate the environment and plan a multi-step trajectory and (2) executing only the first planned action, storing the feedback, and replanning from the new state. This 'reason for future, act for now' loop uses in-context learning (no weight updates) and is proven to achieve √T Bayesian regret, where T is the number of online interactions, under the paper's stated assumptions. Empirically, RAFA improves sample efficiency and success rates on Game of 24, ALFWorld, BlocksWorld, and Tic-Tac-Toe compared with ReAct, Reflexion, AdaPlanner, and open-loop planners.
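The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the Model and Critic, which RAFA realizes as prompted LLM calls, are replaced here by plain Python stand-ins (`toy_model`, `toy_critic`), the environment is a hypothetical number line, and every function name is invented for this example.

```python
from typing import Callable, List, Sequence, Tuple

Transition = Tuple[int, int, int]  # (state, action, next_state)


def plan(state: int, memory: List[Transition], model: Callable, critic: Callable,
         actions: Sequence[int], horizon: int) -> List[int]:
    """Enumerate every action sequence of length `horizon` and return the best one."""
    frontier = [([], state)]
    for _ in range(horizon):
        frontier = [
            (seq + [a], model(s, a, memory))  # imagined next state, not a real step
            for seq, s in frontier
            for a in actions
        ]
    # Score the imagined end states; max() keeps the first of equally good plans.
    best_seq, _ = max(frontier, key=lambda item: critic(item[1]))
    return best_seq


def rafa_loop(env_step: Callable, start: int, goal: int, model: Callable,
              critic: Callable, actions: Sequence[int],
              horizon: int = 3, max_steps: int = 20):
    """Plan several steps, execute only the first, store feedback, replan."""
    memory: List[Transition] = []
    executed: List[int] = []
    state = start
    for _ in range(max_steps):
        if state == goal:
            break
        action = plan(state, memory, model, critic, actions, horizon)[0]
        next_state = env_step(state, action)        # the only real interaction
        memory.append((state, action, next_state))  # feedback for later planning
        executed.append(action)
        state = next_state
    return state, executed, memory


# --- Hypothetical toy problem: walk from 0 to 3 on a line clamped to [0, 5]. ---
GOAL = 3


def toy_env_step(state: int, action: int) -> int:
    return max(0, min(5, state + action))


def toy_model(state: int, action: int, memory: List[Transition]) -> int:
    """Stand-in for the LLM Model: reuse an observed transition, else guess."""
    for s, a, nxt in reversed(memory):
        if (s, a) == (state, action):
            return nxt
    return state + action


def toy_critic(state: int) -> int:
    """Stand-in for the LLM Critic: closer to the goal is better."""
    return -abs(GOAL - state)


final_state, executed, memory = rafa_loop(
    toy_env_step, 0, GOAL, toy_model, toy_critic, actions=(1, -1)
)
print(final_state, executed)  # 3 [1, 1, 1]
```

Exhaustive enumeration is used only to keep the sketch short. In practice the planner would be a bounded search, because every imagined transition and every score costs an LLM call.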

Problem Statement

LLM agents can reason but are stateless and ungrounded; we need a practical, sample-efficient protocol that turns LLM reasoning into actions while minimizing costly environment interactions and giving theoretical guarantees.

Main Contribution

A practical closed-loop prompting framework (RAFA) that alternates multi-step planning by an LLM with executing only the first action and storing feedback.

A formal mapping from LLM in-context learning to Bayesian adaptive MDPs, letting LLMs act as model/value estimators without parameter updates.
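For orientation, Bayesian regret is commonly written along the following lines. The symbols are assumed notation for illustration and may differ from the paper's: $\theta$ is the unknown environment parameter drawn from a prior, $\pi^{*}$ is the optimal policy under $\theta$, $\pi_t$ is the policy the agent follows at step $t$, and $V$ is the value function. Constants and logarithmic factors are left out. The second line spells out why a √T rate matters: the average per-step regret shrinks toward zero as T grows.

```latex
\mathfrak{R}(T) \;=\; \mathbb{E}\!\left[\,\sum_{t=0}^{T-1}
  \Big( V_{\theta}^{\pi^{*}}(s_t) - V_{\theta}^{\pi_t}(s_t) \Big)\right]

\mathfrak{R}(T) = O\big(\sqrt{T}\big)
\quad\Longrightarrow\quad
\frac{\mathfrak{R}(T)}{T} = O\!\left(\frac{1}{\sqrt{T}}\right) \;\longrightarrow\; 0
\quad \text{as } T \to \infty
```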

Key Findings

RAFA achieves state-of-the-art success on ALFWorld.

Numbers: 99.25% total success rate (ALFWorld tasks)

Practical Use: Use RAFA-style planning plus short execution for embodied text environments to get near-perfect task success with fewer trials.

Evidence Ref: Table 3

RAFA boosts Game of 24 solving with GPT-4.

Numbers: 89% (B=1) and 93% (B=2) success vs ToT 73%/81% (GPT-4)

Practical Use: Prompt LLMs to plan multiple steps and only execute the first to reduce hallucination and increase solved puzzles per interaction.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Game of 24 success rate	GPT-4: 89% (B=1), 93% (B=2); GPT-3.5: 29% (B=1), 46% (B=2)	Tree-of-Thoughts (ToT) GPT-4: 73% (B=1), 81% (B=2); Reflexion GPT-4: 21%	GPT-4: +16 pts (B=1) and +12 pts (B=2) over ToT	Game of 24 (100-task subset)	Table 2: RAFA vs ToT and Reflexion	Table 2
ALFWorld overall success rate	RAFA: 99.25%	AdaPlanner: 91.79%; Reflexion: 92.54%; ReAct: 61.94%	+6.7 pts (vs Reflexion) to +37.3 pts (vs ReAct)	ALFWorld (134 tasks across 6 categories)	Table 3: per-category success rates and total	Table 3
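The Delta column can be recomputed directly from the success rates reported in the table. The short script below does that arithmetic; the variable names are chosen for this example only.

```python
# Success rates (%) as listed in the Results table.
rafa_alfworld = 99.25
alfworld_baselines = {"AdaPlanner": 91.79, "Reflexion": 92.54, "ReAct": 61.94}

# Percentage-point gap between RAFA and each ALFWorld baseline.
alfworld_deltas = {
    name: round(rafa_alfworld - score, 2)
    for name, score in alfworld_baselines.items()
}
print(alfworld_deltas)
# {'AdaPlanner': 7.46, 'Reflexion': 6.71, 'ReAct': 37.31}

# Game of 24 with GPT-4, keyed by breadth B.
rafa_game24 = {1: 89, 2: 93}
tot_game24 = {1: 73, 2: 81}
game24_deltas = {b: rafa_game24[b] - tot_game24[b] for b in rafa_game24}
print(game24_deltas)
# {1: 16, 2: 12}
```

On ALFWorld the smallest gap is against Reflexion (about 6.7 points) and the largest against ReAct (about 37.3 points).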

What To Try In 7 Days

Implement a small memory buffer of trajectories and prompt your LLM to both simulate next states (Model) and score rollouts (Critic).

Plan multi-step trajectories and only execute the first action; collect the true next state and add a short summary to the buffer.

Use a simple switching rule (e.g., when prediction disagrees with observation) to trigger re-prompting and re-planning.
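As a starting point for the three steps above, the following sketch shows one possible shape for the memory buffer and the switching rule. It is an assumption-laden illustration: the class and function names (`Step`, `MemoryBuffer`, `should_replan`) are hypothetical, the summaries are written by hand rather than by an LLM, and the disagreement test is a plain string comparison.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    state: str
    action: str
    reward: float
    next_state: str
    summary: str  # short linguistic summary kept in place of the raw trace


@dataclass
class MemoryBuffer:
    steps: List[Step] = field(default_factory=list)    # everything observed
    context: List[Step] = field(default_factory=list)  # subset shown to the LLM

    def add(self, step: Step) -> None:
        """Record a real interaction without changing the active context yet."""
        self.steps.append(step)

    def refresh_context(self) -> None:
        """Promote all recorded steps to the active context."""
        self.context = list(self.steps)

    def render(self, max_items: int = 5) -> str:
        """Build the memory part of a prompt from the most recent summaries."""
        return "\n".join(s.summary for s in self.context[-max_items:])


def should_replan(predicted_next: str, observed_next: str) -> bool:
    """Simple switching rule: trigger when prediction and observation disagree."""
    return predicted_next.strip().lower() != observed_next.strip().lower()


# Usage with a made-up household task.
buffer = MemoryBuffer()
predicted = "drawer 1 is open"
observed = "drawer 1 is locked"
buffer.add(Step("at drawer 1", "open drawer 1", 0.0, observed,
                "Opening drawer 1 failed: it is locked."))
if should_replan(predicted, observed):
    buffer.refresh_context()  # only now does the new evidence enter the prompt
prompt_memory = buffer.render()
print(prompt_memory)  # Opening drawer 1 failed: it is locked.
```

Keeping `steps` and `context` separate lets the switching rule decide when new evidence reaches the prompt, so the context does not grow on every step.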

Agent Features

Memory

In-context memory buffer (stores state, action, reward, next state, and a linguistic summary)

Switching condition controls when the buffer becomes active context

Summaries are used to avoid token explosion

Planning

Multi-step trajectory planning (in-context)

Plan-and-execute-first-action (model predictive control style)

Tool Use

Prompted LLM instances for Model, Critic, Elite

Memory buffer with summarized interaction history

No parameter updates; in-context learning only

Frameworks

RAFA, Tree of Thoughts (ToT), ReAct, Reflexion, AdaPlanner

Is Agentic

Yes

Architectures

LLM-based planner (Model/Critic/Elite instances)

Tree-search / Beam-search / MCTS

Value-iteration emulation (truncated horizon)

Optimization Features

Token Efficiency

Store compressed linguistic summaries of failure trajectories to reduce prompt size

Switching condition avoids adding every step to the reasoning context

Training Optimization

No online parameter updates; rely on pretrained LLMs and in-context learning

Inference Optimization

Plan breadth/depth trade-offs (B, U) to control computation vs performance

Use Elite to propose candidate actions and Critic to prune
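The breadth/depth trade-off can be illustrated with a small beam-search planner. This is a sketch under assumptions rather than the paper's planner: `elite`, `model`, and `critic` are plain Python stand-ins for what would be prompted LLM calls, `breadth` and `depth` play the roles of B and U, and the arithmetic toy task is invented for this example.

```python
from typing import Callable, List, Tuple

Beam = Tuple[float, List[str], int]  # (critic score, actions so far, imagined state)


def beam_plan(state: int, elite: Callable, model: Callable, critic: Callable,
              breadth: int, depth: int) -> List[str]:
    """Expand with Elite, simulate with Model, prune to `breadth` with Critic."""
    beams: List[Beam] = [(critic(state), [], state)]
    for _ in range(depth):
        candidates: List[Beam] = []
        for _, actions, s in beams:
            for a in elite(s):        # Elite proposes candidate actions
                nxt = model(s, a)     # Model predicts the next state
                candidates.append((critic(nxt), actions + [a], nxt))
        if not candidates:
            break
        # Critic prunes: keep only the `breadth` highest-scoring partial plans.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:breadth]
    return beams[0][1]


# --- Hypothetical toy task: reach 20 from 7 using three arithmetic moves. ---
TARGET = 20


def toy_elite(state: int) -> List[str]:
    return ["*2", "+3", "-1"]


def toy_model(state: int, action: str) -> int:
    if action == "*2":
        return state * 2
    if action == "+3":
        return state + 3
    return state - 1


def toy_critic(state: int) -> float:
    return -abs(TARGET - state)


narrow = beam_plan(7, toy_elite, toy_model, toy_critic, breadth=1, depth=2)
wide = beam_plan(7, toy_elite, toy_model, toy_critic, breadth=2, depth=2)
print(narrow)  # ['*2', '+3']  -> ends at 17
print(wide)    # ['+3', '*2']  -> ends at 20
```

With breadth 1 the search commits to the locally best first move and misses the target. Breadth 2 keeps a second candidate alive and finds it, at the price of roughly twice as many Model and Critic calls per level. In a RAFA-style loop only the first action of the returned plan would be executed before replanning.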

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://agentification.github.io/RAFA

Data URLs

ALFWorld (public)

Game of 24 subset (public)

BlocksWorld (from Hao et al. 2023)

Tic-Tac-Toe (constructed by authors)

Risks & Boundaries

Limitations

Theory assumes LLM posterior alignment and MDP regularity; real LLMs may deviate and add an approximation error.

RAFA relies on multiple LLM calls per planning loop, which raises latency and API cost.

When Not To Use

When each environment interaction is essentially free and offline RL fine-tuning is feasible.

When you cannot afford repeated LLM calls due to cost or latency constraints.

Failure Modes

Poor in-context examples or summaries can bias the LLM's estimate of the environment and lead to systematic errors.

Planner suboptimality (small breadth/depth) may trap the agent in local optima despite replanning.

Core Entities

Models

gpt-4, gpt-3.5-turbo, text-davinci-003, Llama 2-7B, Vicuna-13B, Vicuna-33B

Metrics

success rate (%)

Bayesian regret (theoretical)

sample efficiency (tasks solved per step)

win/tie/loss rates (games)

Datasets

Game of 24 (subset 901-1000)

ALFWorld

BlocksWorld

Tic-Tac-Toe (new benchmark)

Benchmarks

Game of 24, ALFWorld, BlocksWorld, Tic-Tac-Toe
