Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
13
Why It Matters For Business
AdaPlanner cuts dependence on large labeled datasets and repeated LLM calls by adaptively revising code-style plans, saving annotation and API cost while improving performance on long-horizon text tasks.
Summary TLDR
AdaPlanner is a closed-loop method that lets a single LLM act as both planner and refiner. It writes Python-like plans, checks assertions during execution, and reacts to two kinds of feedback: in-plan feedback, where it extracts information from an observation and keeps the plan, and out-of-plan feedback, where it revises the whole plan. It also stores successful plans as ‘skills’ for few-shot prompting. On text simulators (ALFWorld, MiniWoB++), AdaPlanner reaches ~91% success and improves over prior prompting baselines. It also needs far fewer demonstrations (2x fewer on ALFWorld, ~600x fewer than a strong supervised method on MiniWoB++), and its code-style prompts reduce hallucination. Code is available on GitHub.
Problem Statement
Current LLM-agent methods either execute fixed, open-loop plans or only tweak the next action. As a result, they either cannot revise an entire plan when feedback is unexpected, or they require heavy task-specific training to do so. The field needs a lightweight, general way to adapt whole plans online without training a plan selector.
Main Contribution
Explicit closed-loop planning that lets an LLM both generate and revise entire plans during execution.
Two refinement modes: in-plan (extract info) and out-of-plan (revise full plan and resume at a checkpoint).
Code-style (Pythonic) prompting to reduce hallucination and make plans executable.
Skill discovery: save successful plan trajectories as few-shot exemplars to boost sample efficiency.
Empirical results on ALFWorld and MiniWoB++ showing higher success and much better sample efficiency versus baselines.
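The closed-loop idea in the list above can be sketched as a small driver loop. This is a minimal illustration under stated assumptions, not the authors' implementation: ToyEnv, initial_plan, revised_plan, stub_refiner, and run_closed_loop are hypothetical names, and the refiner is a hard-coded stub standing in for the LLM call that would normally produce the revised plan and the checkpoint to resume from.

```python
from typing import Callable, List, Optional, Tuple


class ToyEnv:
    """Hypothetical text environment: the mug is in the drawer, not on the desk."""

    def __init__(self) -> None:
        self.holding: Optional[str] = None
        self.log: List[str] = []

    def step(self, action: str) -> str:
        self.log.append(action)
        if action == "take mug from drawer":
            self.holding = "mug"
            return "You pick up the mug."
        if action == "put mug in sink" and self.holding == "mug":
            self.holding = None
            return "You put the mug in the sink."
        return "Nothing happens."


Plan = Callable[[ToyEnv, int], None]


def initial_plan(env: ToyEnv, start_from: int = 1) -> None:
    # The plan wrongly assumes the mug is on the desk.
    if start_from <= 1:
        obs = env.step("take mug from desk")
        assert "pick up" in obs, f"step 1 failed: {obs}"
    if start_from <= 2:
        obs = env.step("put mug in sink")
        assert "put the mug" in obs, f"step 2 failed: {obs}"


def revised_plan(env: ToyEnv, start_from: int = 1) -> None:
    # Revised whole plan: look in the drawer instead.
    if start_from <= 1:
        obs = env.step("take mug from drawer")
        assert "pick up" in obs, f"step 1 failed: {obs}"
    if start_from <= 2:
        obs = env.step("put mug in sink")
        assert "put the mug" in obs, f"step 2 failed: {obs}"


def stub_refiner(error: str) -> Tuple[Plan, int]:
    """Stand-in for the LLM refiner (assumption): returns a new plan and a checkpoint."""
    return revised_plan, 1


def run_closed_loop(env: ToyEnv, plan: Plan,
                    refiner: Callable[[str], Tuple[Plan, int]],
                    max_refinements: int = 3) -> bool:
    start_from = 1
    for _ in range(max_refinements + 1):
        try:
            plan(env, start_from)
            return True  # every assertion in the plan passed
        except AssertionError as err:
            # Out-of-plan feedback: replace the whole plan, resume at the checkpoint.
            plan, start_from = refiner(str(err))
    return False


env = ToyEnv()
print(run_closed_loop(env, initial_plan, stub_refiner), env.log)
```

In this toy run the first plan fails its step 1 assertion, the stub returns a revised plan, and the episode continues in the same environment rather than restarting it.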
Key Findings
AdaPlanner achieves 91.79% overall success on 134 ALFWorld tasks.
AdaPlanner achieves 91.11% success on MiniWoB++ tasks with feedback.
AdaPlanner reduces demonstration needs: 2x fewer samples on ALFWorld and ~600x fewer than CC-Net on MiniWoB++ to reach comparable performance.
Removing the code interface drops performance sharply (ALFWorld 81%→46%, MiniWoB++ 93%→66%).
Skill discovery increases success notably: the ALFWorld success rate nearly doubles, and the MiniWoB++ success rate rises by about 15%.
Results
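Success rate (ALFWorld): 91.79% overall on 134 tasks
Success rate (MiniWoB++ with feedback): 91.11%
Sample efficiency: 2x fewer demonstrations on ALFWorld; ~600x fewer than CC-Net on MiniWoB++
Ablation: code interface: removing it drops ALFWorld 81%→46% and MiniWoB++ 93%→66%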
Who Should Care
What To Try In 7 Days
Replace free-text action plans with small code-style templates to constrain LLM outputs.
Implement an ask_LLM() step to extract structured info from environment feedback.
Add simple checkpointing (start_from) so your system can refine a plan and resume mid-episode instead of restarting.
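The three suggestions above fit together in one small code-style plan. The sketch below is illustrative only: ToyAgent, its goto/take methods, the memory dict, and the regex-based ask_llm are hypothetical stand-ins, and a real ask_LLM() step would call a model rather than a regular expression.

```python
import re
from typing import Dict, List


def ask_llm(question: str, observation: str) -> List[str]:
    """Stand-in for ask_LLM(): a regex replaces the real model call (assumption)."""
    return re.findall(r"\ba ([a-z]+ \d+)", observation)


class ToyAgent:
    """Hypothetical agent wrapper around a text environment."""

    def __init__(self) -> None:
        self.actions: List[str] = []
        self.memory: Dict[str, str] = {}  # values that must survive a resume

    def goto(self, receptacle: str) -> str:
        self.actions.append(f"go to {receptacle}")
        return f"On the {receptacle}, you see a mug 1, a pen 2, and a book 3."

    def take(self, obj: str, receptacle: str) -> str:
        self.actions.append(f"take {obj} from {receptacle}")
        return f"You pick up the {obj} from the {receptacle}."


def solution(agent: ToyAgent, start_from: int = 1) -> None:
    # Step 1: go to the desk and extract structured info from the observation.
    if start_from <= 1:
        observation = agent.goto("desk 1")
        objects = ask_llm("Which objects are visible?", observation)
        mugs = [o for o in objects if o.startswith("mug")]
        assert mugs, f"Step 1 failed, no mug in: {observation}"
        agent.memory["target"] = mugs[0]
    # Step 2: pick up the mug found in step 1.
    if start_from <= 2:
        observation = agent.take(agent.memory["target"], "desk 1")
        assert "pick up" in observation, f"Step 2 failed: {observation}"


agent = ToyAgent()
solution(agent)
print(agent.actions)
```

Each step is guarded by start_from, so a refined plan can skip steps that already succeeded. Anything a later step needs from an earlier one is kept in agent.memory here so that it is still available after a resume; that storage choice is a design assumption of this sketch.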
Agent Features
Memory
- skill discovery: store successful plans as few-shot exemplars
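A skill memory of this kind can be approximated with a small store that keeps only successful plans and prepends them to the next prompt. The sketch below is an assumption-laden illustration: SkillMemory, record, build_prompt, and the prompt wording are hypothetical and are not taken from the AdaPlanner code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SkillMemory:
    """Hypothetical store mapping a task type to plans that succeeded on it."""

    skills: Dict[str, List[str]] = field(default_factory=dict)

    def record(self, task_type: str, plan_code: str, success: bool) -> None:
        # Only successful plans become skills; failed plans are discarded.
        if success:
            self.skills.setdefault(task_type, []).append(plan_code)

    def build_prompt(self, task_type: str, task: str, k: int = 2) -> str:
        # Use the k most recent skills for this task type as few-shot exemplars.
        exemplars = self.skills.get(task_type, [])[-k:]
        parts = [f"# Example solution\n{code}" for code in exemplars]
        parts.append(f"# Task: {task}\n# Write solution(agent, start_from=1):")
        return "\n\n".join(parts)


memory = SkillMemory()
memory.record("pick_and_place", "def solution(agent, start_from=1): ...", success=True)
memory.record("pick_and_place", "FAILED_PLAN", success=False)
print(memory.build_prompt("pick_and_place", "put a mug in the sink"))
```

Filtering what gets stored is one simple guard against the overfitting risk of keeping plans that only worked because of episode-specific details, though it does not remove that risk.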
Planning
- explicit closed-loop plan refinement
- in-plan refinement (ask_LLM())
- out-of-plan refinement (revise whole plan and resume)
Tool Use
- ask_LLM() action for parsing observations
- Pythonic plan as executable code
Frameworks
- code-style (Pythonic) prompting
- refine-then-resume checkpoint mechanism
Is Agentic
true
Architectures
- LLM-based planner/refiner
Optimization Features
Token Efficiency
- reduces API calls by checking only key timestamps
System Optimization
- refine-then-resume reduces repeated whole-episode calls
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires some few-shot demonstrations for complex tasks; not fully zero-shot for hardest cases.
- Evaluated only on text-based simulated environments (ALFWorld, MiniWoB++), not on robotics or visual sensors.
- Performance depends on LLM quality; smaller or noisier models can still hallucinate.
When Not To Use
- Safety-critical real-world systems without extensive validation.
- Perception-heavy domains requiring raw visual or sensor grounding not supported by text prompts.
- Tasks where environment feedback is not available or not expressible in text/HTML.
Failure Modes
- LLM hallucination when prompts are ambiguous or model is lower-capacity.
- Overfitting of discovered skills to episode-specific details leading to poor generalization.
- Incorrect assumptions in initial plan that require multiple refinements or human intervention.
Core Entities
Models
- text-davinci-002 (GPT-3)
- text-davinci-003 (GPT-3.5)
- gpt-3.5-turbo
Metrics
- success rate (%)
- number of expert demonstrations (sample efficiency)
Datasets
- ALFWorld
- MiniWoB++
Benchmarks
- ALFWorld (134 tasks)
- MiniWoB++ (53-task subset, including 9 tasks with feedback)

