Overview
Production Readiness
0.6
Novelty Score
0.8
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
RAFA reduces costly environment trials by using LLMs as in-context model estimators and planning ahead, so you can ship agents that learn faster without fine-tuning models.
Summary TLDR
RAFA is a prompting-based framework that makes an LLM alternate between (1) reasoning from a memory buffer to estimate the environment and plan a multi-step trajectory and (2) executing only the first planned action, storing feedback, and replanning. This 'reason for future, act for now' loop uses in-context learning (no weight updates) and is proven to achieve √T Bayesian regret under reasonable assumptions. Empirically RAFA improves sample efficiency and success rates on Game of 24, ALFWorld, BlocksWorld, and Tic-Tac-Toe versus ReAct, Reflexion, AdaPlanner and open-loop planners.
Problem Statement
LLM agents can reason but are stateless and ungrounded; we need a practical, sample-efficient protocol that turns LLM reasoning into actions while minimizing costly environment interactions and giving theoretical guarantees.
Main Contribution
A practical closed-loop prompting framework (RAFA) that alternates multi-step planning by an LLM with executing only the first action and storing feedback.
A formal mapping from LLM in-context learning to Bayesian adaptive MDPs, letting LLMs act as model/value estimators without parameter updates.
Theoretical guarantees: RAFA variants attain provable √T Bayesian regret under assumptions; two exploration variants (optimistic bonus and posterior sampling) drop a concentrability assumption.
Empirical wins: RAFA raises success rates and sample efficiency on Game of 24, ALFWorld, BlocksWorld, and Tic-Tac-Toe using GPT-family and open models.
Key Findings
RAFA achieves state-of-the-art success on ALFWorld.
RAFA boosts Game of 24 solving with GPT-4.
RAFA attains √T Bayesian regret in theory.
RAFA can convert LLM in-context gains into better environment estimates.
RAFA turns a losing Tic-Tac-Toe side into a strong opponent after a few episodes.
Results
Game of 24 success rate
ALFWorld overall success rate
Tic-Tac-Toe win/tie/loss (RAFA as O)
Theoretical Bayesian regret
BlocksWorld sample efficiency
Who Should Care
What To Try In 7 Days
Implement a small memory buffer of trajectories and prompt your LLM to both simulate next states (Model) and score rollouts (Critic).
Plan multi-step trajectories and only execute the first action; collect the true next state and add a short summary to the buffer.
Use a simple switching rule (e.g., when prediction disagrees with observation) to trigger re-prompting and re-planning.
Agent Features
Memory
- In-context memory buffer (stores state, action, reward, next state and linguistic summary)
- Switching condition controls when buffer becomes active context
- Summaries used to avoid token explosion
Planning
- Multi-step trajectory planning (in-context)
- Plan-and-execute-first-action (model predictive control style)
- LoRA
Tool Use
- Prompted LLM instances for Model, Critic, Elite
- Memory buffer with summarized interaction history
- No parameter updates; in-context learning only
Frameworks
- RAFA
- Tree of Thoughts (ToT)
- ReAct
- Reflexion
- AdaPlanner
Is Agentic
true
Architectures
- LLM-based planner (Model/Critic/Elite instances)
- Tree-search / Beam-search / MCTS
- Value-iteration emulation (truncated horizon)
Optimization Features
Token Efficiency
- Store compressed linguistic summaries of failure trajectories to reduce prompt size
- Switching condition avoids adding every step to the reasoning context
Training Optimization
- No online parameter updates; rely on pretrained LLMs and in-context learning
Inference Optimization
- Plan breadth/depth trade-offs (B, U) to control computation vs performance
- Use Elite to propose candidate actions and Critic to prune
Reproducibility
Data Urls
- ALFWorld (public)
- Game of 24 subset (public)
- BlocksWorld (from Hao et al. 2023)
- Tic-Tac-Toe (constructed by authors)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Theory assumes LLM posterior alignment and MDP regularity; real LLMs may deviate and add an approximation error.
- RAFA relies on multiple LLM calls per planning loop which raises latency and API cost.
- Token limits constrain memory size; the method needs careful summarization for long horizons.
- Empirical tasks are text-based; outcomes may differ in rich visual or robotics settings without additional perception grounding.
When Not To Use
- When each environment interaction is essentially free and offline RL fine-tuning is feasible.
- When you cannot afford repeated LLM calls due to cost or latency constraints.
- When the agent must act in real time with strict millisecond requirements.
Failure Modes
- Poor in-context examples or summaries can bias the LLM model and lead to systematic errors.
- Planner suboptimality (small breadth/depth) may trap the agent in local optima despite replanning.
- Concentrability assumption violations (distribution shift) can degrade theoretical guarantees unless using exploration variants.
Core Entities
Models
- gpt-4
- gpt-3.5-turbo
- text-davinci-003
- Llama 2-7B
- Vicuna-13B
- Vicuna-33B
Metrics
- success rate (%)
- Bayesian regret (theoretical)
- sample efficiency (tasks solved per step)
- win/tie/loss rates (games)
Datasets
- Game of 24 (subset 901-1000)
- ALFWorld
- BlocksWorld
- Tic-Tac-Toe (new benchmark)
Benchmarks
- Game of 24
- ALFWorld
- BlocksWorld
- Tic-Tac-Toe

