AdaPlanner: LLM planner that adaptively refines code-style plans from environment feedback

May 26, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

AdaPlanner structures plans as executable code, distinguishes two feedback types, and refines/resumes without training, improving sample and call efficiency on text simulators.

Citations: 13

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, Chao Zhang

Links

Abstract / PDF / Code

Why It Matters For Business

AdaPlanner cuts dependence on large labeled datasets and repeated LLM calls by adaptively revising code-style plans, saving annotation and API cost while improving performance on long-horizon text tasks.

Who Should Care

Summary TLDR

AdaPlanner is a closed-loop method that lets a single LLM act as both planner and refiner. It writes Python-like plans, checks assertions during execution, and reacts to two kinds of feedback: in-plan (extract info) and out-of-plan (revise whole plan). It also stores successful plans as ‘skills’ for few-shot prompting. On text simulators (ALFWorld, MiniWoB++), AdaPlanner reaches ~91% success, improves over prior prompting baselines, and cuts the need for demonstrations (2x fewer on ALFWorld, ~600x fewer vs a strong supervised method on MiniWoB++) while reducing hallucination via code prompts. Code is available on GitHub.

Problem Statement

Current LLM-agent methods either execute fixed, open-loop plans or only tweak the next action. They fail to revise entire plans in response to unexpected feedback or need heavy task-specific training. The field needs a lightweight, general way to adapt whole plans online without training a plan selector.

Main Contribution

Explicit closed-loop planning that lets an LLM both generate and revise entire plans during execution.

Two refinement modes: in-plan (extract info) and out-of-plan (revise full plan and resume at a checkpoint).
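
The out-of-plan mode described above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: `ToyEnv`, `PlanAssertionError`, `check`, `stub_refiner`, and `run_closed_loop` are hypothetical names, and the refiner is a hand-written stand-in for what would be an LLM call. Only the idea of a `start_from` checkpoint comes from the summary.

```python
from typing import Callable, List


class PlanAssertionError(Exception):
    """Raised when a plan's expectation about the environment fails."""

    def __init__(self, step: int, message: str):
        super().__init__(message)
        self.step = step


class ToyEnv:
    """Hypothetical text environment: the mug is in the drawer, not on the shelf."""

    def __init__(self) -> None:
        self.holding = None
        self.log: List[str] = []

    def step(self, action: str) -> str:
        self.log.append(action)
        if action == "take mug from drawer":
            self.holding = "mug"
            return "You pick up the mug."
        if action.startswith("take mug"):
            return "Nothing happens."
        return "OK."


def check(condition: bool, step: int, message: str) -> None:
    """Assertion helper that records which plan step failed."""
    if not condition:
        raise PlanAssertionError(step, message)


def initial_plan(env: ToyEnv, start_from: int = 1) -> None:
    if start_from <= 1:
        env.step("go to shelf")
    if start_from <= 2:
        obs = env.step("take mug from shelf")
        check(env.holding == "mug", 2, f"Expected to hold the mug, got: {obs}")
    if start_from <= 3:
        env.step("put mug in sink")


def revised_plan(env: ToyEnv, start_from: int = 1) -> None:
    if start_from <= 1:
        env.step("go to shelf")
    if start_from <= 2:
        env.step("go to drawer")
        obs = env.step("take mug from drawer")
        check(env.holding == "mug", 2, f"Expected to hold the mug, got: {obs}")
    if start_from <= 3:
        env.step("put mug in sink")


Plan = Callable[..., None]


def stub_refiner(failed_plan: Plan, error: PlanAssertionError) -> Plan:
    """Stand-in for the LLM refiner (assumption): returns a fixed revised plan."""
    return revised_plan


def run_closed_loop(env: ToyEnv, plan: Plan, refiner, max_refinements: int = 3) -> bool:
    start_from = 1
    for _ in range(max_refinements + 1):
        try:
            plan(env, start_from=start_from)
            return True
        except PlanAssertionError as err:
            # Out-of-plan feedback: replace the whole plan, resume at the failed step.
            plan = refiner(plan, err)
            start_from = err.step
    return False


env = ToyEnv()
success = run_closed_loop(env, initial_plan, stub_refiner)
```

In this toy run the first step is executed only once: after the assertion at step 2 fails, the revised plan resumes from step 2 rather than restarting the episode.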

Key Findings

AdaPlanner achieves 91.79% overall success on 134 ALFWorld tasks.

Numbers: Success rate 91.79% (ALFWorld, Table 2).

Practical Use: Expect high success on text-based household tasks without training; use AdaPlanner to avoid large supervised datasets.

Evidence Ref: Table 2

AdaPlanner achieves 91.11% success on MiniWoB++ tasks with feedback.

Numbers: Success rate 91.11% (MiniWoB++, Table 3).

Practical Use: AdaPlanner works well on web-like HTML tasks where environment feedback is available.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Success rate (ALFWorld) | 91.79% | Reflexion / ReAct / BUTLER | up to a few percentage points over prompting baselines | 134 ALFWorld tasks | Table 2 shows AdaPlanner at 91.79% overall. | Table 2
Success rate (MiniWoB++ with feedback) | 91.11% | RCI, CC-Net, WGE | better than RCI and comparable to CC-Net with far fewer samples | MiniWoB++ (9 feedback tasks; 53-task subset) | Table 3 reports 91.11% with AdaPlanner. | Table 3

What To Try In 7 Days

Replace free-text action plans with small code-style templates to constrain LLM outputs.

Implement an ask_LLM() step to extract structured info from environment feedback.

Add simple checkpointing (start_from) so your system can refine a plan and resume mid-episode instead of restarting.
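
As a starting point for the ask_LLM() item above, the following sketch shows one way an in-plan extraction step might be wrapped. It is an assumption-laden illustration: `ask_llm` and `fake_llm` are hypothetical names, the prompt wording is invented, and `fake_llm` is a regex stand-in so the example runs without an API key. In practice the `llm` argument would be replaced by a real model call.

```python
import re
from typing import Callable, List


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): extracts object names
    from the observation text embedded in the prompt."""
    observation = prompt.split("Observation:", 1)[1]
    names = re.findall(r"\ba ([a-z]+ \d+)", observation)
    return ", ".join(names)


def ask_llm(question: str, observation: str,
            llm: Callable[[str], str] = fake_llm) -> List[str]:
    """Ask the model to turn free-text feedback into a structured list."""
    prompt = (
        f"{question}\n"
        "Answer as a comma-separated list.\n"
        f"Observation: {observation}"
    )
    answer = llm(prompt)
    # Normalize the reply so the plan can use it as ordinary Python data.
    return [item.strip() for item in answer.split(",") if item.strip()]


observation = "On the countertop 1, you see a mug 2, a knife 1, and a apple 3."
objects = ask_llm("Which objects are visible?", observation)
# A code-style plan can then branch on the parsed result.
target = next((name for name in objects if name.startswith("mug")), None)
```

Returning a plain list keeps the rest of the plan deterministic: only the parsing step depends on the model, and the plan can assert on `target` before acting.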

Agent Features

Memory
skill discovery: store successful plans as few-shot exemplars
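
A minimal sketch of such a skill memory follows. The class name `SkillMemory`, its methods, the per-task cap, and the prompt layout are all assumptions made for illustration; the summary only states that successful plans are stored and reused as few-shot exemplars.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SkillMemory:
    """Hypothetical store of successful plan source code, keyed by task type."""
    max_per_task: int = 2
    skills: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, task_type: str, plan_source: str, success: bool) -> None:
        # Only plans that solved their task are kept as exemplars.
        if not success:
            return
        bucket = self.skills.setdefault(task_type, [])
        if plan_source in bucket:
            return
        bucket.append(plan_source)
        if len(bucket) > self.max_per_task:
            bucket.pop(0)  # drop the oldest exemplar

    def build_prompt(self, task_type: str, task_description: str) -> str:
        """Assemble a few-shot prompt from stored plans plus the new task."""
        examples = self.skills.get(task_type, [])
        parts = [f"# Example solution\n{source}" for source in examples]
        parts.append(
            f"# New task: {task_description}\n"
            "def solution(env, start_from=1):"
        )
        return "\n\n".join(parts)


memory = SkillMemory()
memory.add("pick_and_place", "def solution(env, start_from=1):\n    env.step('take mug')", success=True)
memory.add("pick_and_place", "def solution(env, start_from=1):\n    env.step('take sofa')", success=False)
prompt = memory.build_prompt("pick_and_place", "put a clean plate on the table")
```

Capping the number of exemplars per task type is one simple way to keep prompts short; it does not address the overfitting risk of episode-specific details, which would need a separate filtering step.
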
Planning
explicit closed-loop plan refinement
in-plan refinement (ask_LLM())
out-of-plan refinement (revise whole plan and resume)
Tool Use
ask_LLM() action for parsing observations
Pythonic plan as executable code
Frameworks
code-style (Pythonic) prompting
refine-then-resume checkpoint mechanism
Is Agentic

Yes

Architectures
LLM-based planner/refiner

Optimization Features

Token Efficiency
reduces API calls by checking only key timestamps
System Optimization
refine-then-resume reduces repeated whole-episode calls

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires some few-shot demonstrations for complex tasks; not fully zero-shot for hardest cases.

Evaluated only on text-based simulated environments (ALFWorld, MiniWoB++), not on robotics or visual sensors.

When Not To Use

Safety-critical real-world systems without extensive validation.

Perception-heavy domains requiring raw visual or sensor grounding not supported by text prompts.

Failure Modes

LLM hallucination when prompts are ambiguous or model is lower-capacity.

Overfitting of discovered skills to episode-specific details leading to poor generalization.

Core Entities

Models

text-davinci-002 (GPT-3)
text-davinci-003 (GPT-3.5)
gpt-3.5-turbo

Metrics

success rate (%)
number of expert demonstrations (sample efficiency)

Datasets

ALFWorld
MiniWoB++

Benchmarks

ALFWorld (134 tasks)
MiniWoB++ (9 feedback tasks; 53-task subset)