AdaPlanner: LLM planner that adaptively refines code-style plans from environment feedback

Overview

Decision Snapshot: Needs Validation

AdaPlanner structures plans as executable code, distinguishes two feedback types (in-plan and out-of-plan), and refines and resumes plans without training, improving sample and call efficiency on text simulators.

Citations: 13

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, Chao Zhang

Why It Matters For Business

AdaPlanner cuts dependence on large labeled datasets and repeated LLM calls by adaptively revising code-style plans, saving annotation and API cost while improving performance on long-horizon text tasks.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

AdaPlanner is a closed-loop method in which a single LLM acts as both planner and refiner. It writes Python-like plans, checks assertions during execution, and reacts to two kinds of feedback: in-plan feedback, where it extracts information from an observation and continues, and out-of-plan feedback, where it revises the whole plan. It also stores successful plans as ‘skills’ for few-shot prompting. On text simulators (ALFWorld, MiniWoB++), AdaPlanner reaches ~91% success, improves over prior prompting baselines, and cuts the need for demonstrations (2x fewer on ALFWorld, ~600x fewer than a strong supervised method on MiniWoB++), while code-style prompts reduce hallucination. Code is available on GitHub.

Problem Statement

Current LLM-agent methods either execute fixed, open-loop plans or only tweak the next action. They fail to revise entire plans in response to unexpected feedback or need heavy task-specific training. The field needs a lightweight, general way to adapt whole plans online without training a plan selector.

Main Contribution

Explicit closed-loop planning that lets an LLM both generate and revise entire plans during execution.

Two refinement modes: in-plan (extract info) and out-of-plan (revise full plan and resume at a checkpoint).
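The two refinement modes can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyEnv`, `PlanFailure`, `ask_llm_stub`, and `revise_plan_stub` are hypothetical names, and both LLM calls are replaced by hard-coded functions so the example runs offline.

```python
class PlanFailure(Exception):
    """Hypothetical error raised when the plan's final check fails."""
    def __init__(self, step, message):
        super().__init__(message)
        self.step = step
        self.message = message


class ToyEnv:
    """Hypothetical text environment: the mug is in the cabinet, not on the desk."""
    def __init__(self):
        self.contents = {"desk": ["pen"], "cabinet": ["mug"]}
        self.holding = None

    def look(self, place):
        return f"On the {place} you see: {', '.join(self.contents[place])}"

    def take(self, obj, place):
        if obj in self.contents[place]:
            self.contents[place].remove(obj)
            self.holding = obj


def ask_llm_stub(question, observation):
    # Stand-in for an LLM call that extracts info from an observation.
    return "mug" in observation


def run_plan(env, search_order):
    """Code-style plan: search places in order, then check the goal."""
    step = 0
    for step, place in enumerate(search_order):
        observation = env.look(place)
        # In-plan refinement: parse the observation and keep going.
        if ask_llm_stub("Is the mug here?", observation):
            env.take("mug", place)
            break
    if env.holding != "mug":
        raise PlanFailure(step, f"mug not found in {search_order}")
    return "success"


def revise_plan_stub(failed_plan, error_message):
    # Stand-in for the LLM rewriting the whole plan from the error message.
    return failed_plan + ["cabinet"]


def closed_loop(env, plan, max_revisions=3):
    for revision in range(max_revisions + 1):
        try:
            return run_plan(env, plan), plan, revision
        except PlanFailure as err:
            # Out-of-plan refinement: the plan itself was wrong, so revise it.
            plan = revise_plan_stub(plan, err.message)
    return "failure", plan, max_revisions


result, final_plan, revisions = closed_loop(ToyEnv(), ["desk"])
print(result, final_plan, revisions)
```

In this toy run the first plan only searches the desk, fails its goal check, and is revised once to include the cabinet. A real system would replace the two stubs with actual LLM calls.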

Key Findings

AdaPlanner achieves 91.79% overall success on 134 ALFWorld tasks.

Numbers: Success rate 91.79% (ALFWorld, Table 2).

Practical Use: Expect high success on text-based household tasks without training; use AdaPlanner to avoid large supervised datasets.

Evidence Ref: Table 2

AdaPlanner achieves 91.11% success on MiniWoB++ tasks with feedback.

Numbers: Success rate 91.11% (MiniWoB++, Table 3).

Practical Use: AdaPlanner works well on web-like HTML tasks where environment feedback is available.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Success rate (ALFWorld)	91.79%	Reflexion, ReAct, BUTLER	up to a few percentage points over prompting baselines	134 ALFWorld tasks	Table 2 shows AdaPlanner at 91.79% overall.	Table 2
Success rate (MiniWoB++ with feedback)	91.11%	RCI, CC-Net, WGE	better than RCI and comparable to CC-Net with far fewer samples	MiniWoB++ (9 feedback tasks; 53-task subset)	Table 3 reports 91.11% for AdaPlanner.	Table 3

What To Try In 7 Days

Replace free-text action plans with small code-style templates to constrain LLM outputs.

Implement an ask_LLM() step to extract structured info from environment feedback.

Add simple checkpointing (start_from) so your system can refine a plan and resume mid-episode instead of restarting.
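As one way to picture the checkpointing item, the sketch below gates each plan step on a `start_from` argument so a revised plan can resume at the failed step. Only the name `start_from` comes from the text above; `StepFailed`, `LoggingAgent`, `check`, and both plan functions are hypothetical, and the revised plan is written by hand where a real system would ask the LLM for it.

```python
class StepFailed(Exception):
    """Hypothetical error carrying the index of the plan step that failed."""
    def __init__(self, step, reason):
        super().__init__(f"step {step}: {reason}")
        self.step = step


class LoggingAgent:
    """Hypothetical agent that records every action it performs."""
    def __init__(self):
        self.log = []
        self.location = None
        self.holding = None

    def act(self, action):
        self.log.append(action)
        if action.startswith("go to "):
            self.location = action[len("go to "):]
        elif action == "take apple" and self.location == "fridge":
            self.holding = "apple"


def check(condition, step, reason):
    # Plays the role of an assertion inside the plan.
    if not condition:
        raise StepFailed(step, reason)


def first_plan(agent, start_from=1):
    if start_from <= 1:
        agent.act("go to countertop")
    if start_from <= 2:
        agent.act("take apple")
        check(agent.holding == "apple", 2, "apple not picked up")
    if start_from <= 3:
        agent.act("go to sink")


def revised_plan(agent, start_from=1):
    # Same steps, but step 2 now looks in the fridge.
    if start_from <= 1:
        agent.act("go to countertop")
    if start_from <= 2:
        agent.act("go to fridge")
        agent.act("take apple")
        check(agent.holding == "apple", 2, "apple not picked up")
    if start_from <= 3:
        agent.act("go to sink")


agent = LoggingAgent()
try:
    first_plan(agent)
except StepFailed as err:
    # Resume at the failed step instead of restarting the episode.
    revised_plan(agent, start_from=err.step)

print(agent.log)
```

Step 1 runs only once: after the failure at step 2, the revised plan skips straight to its own step 2. The design assumes steps before the checkpoint left the environment in a state the revised plan can build on.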

Agent Features

Memory

skill discovery: store successful plans as few-shot exemplars
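A skill store of this kind might look like the sketch below. It is an assumption-laden illustration rather than the paper's code: `SkillMemory`, its methods, and the prompt layout are all hypothetical, and it keeps at most one exemplar per task type.

```python
from dataclasses import dataclass, field


@dataclass
class SkillMemory:
    """Hypothetical store mapping a task type to one successful plan."""
    skills: dict = field(default_factory=dict)

    def record(self, task_type, plan_source, succeeded):
        # Only plans that actually solved the task become exemplars.
        if succeeded:
            self.skills[task_type] = plan_source

    def build_prompt(self, task_type, task_description):
        parts = []
        if task_type in self.skills:
            parts.append("# Example of a plan that solved a similar task:")
            parts.append(self.skills[task_type])
        parts.append(f"# New task: {task_description}")
        parts.append("def solution(agent, start_from=1):")
        return "\n".join(parts)


memory = SkillMemory()
memory.record(
    "pick_and_place",
    "def solution(agent, start_from=1):\n    agent.act('take mug')",
    succeeded=True,
)
memory.record(
    "heat",
    "def solution(agent, start_from=1):\n    agent.act('guess')",
    succeeded=False,
)

prompt = memory.build_prompt("pick_and_place", "put a pen on the desk")
print(prompt)
```

Stored plans may carry episode-specific details such as object names, so a real system would likely need some filtering before reusing them as exemplars.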

Planning

explicit closed-loop plan refinement

in-plan refinement (ask_LLM())

out-of-plan refinement (revise whole plan and resume)

Tool Use

ask_LLM() action for parsing observations

Pythonic plan as executable code
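An `ask_LLM()`-style helper for parsing observations could be sketched as follows. The signature, the `llm` callable, the prompt wording, and the `fallback` argument are assumptions made for illustration; `fake_llm` stands in for a real model so the example runs offline.

```python
import ast


def ask_LLM(question, observation, llm, fallback=None):
    """Ask the model a question about an observation and parse the reply.

    `llm` is any callable mapping a prompt string to a reply string.
    The reply is expected to be a Python literal such as a list or a bool.
    """
    prompt = (
        f"Observation: {observation}\n"
        f"Question: {question}\n"
        "Answer with a Python literal only."
    )
    reply = llm(prompt)
    try:
        return ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError):
        # The model answered in free text; let the caller decide what to do.
        return fallback


def fake_llm(prompt):
    # Hard-coded stand-in for a real model call.
    if "countertop" in prompt:
        return "['apple 1', 'mug 2']"
    return "no idea"


visible = ask_LLM(
    "Which objects are visible?",
    "On the countertop 1, you see an apple 1 and a mug 2.",
    fake_llm,
)
nothing = ask_LLM("Which objects are visible?", "You see nothing.", fake_llm, fallback=[])
print(visible, nothing)
```

Parsing the reply with `ast.literal_eval` rather than `eval` keeps the helper from executing arbitrary model output, and the fallback gives the plan a defined value when the reply is not a literal.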

Frameworks

code-style (Pythonic) prompting

refine-then-resume checkpoint mechanism

Is Agentic

Yes

Architectures

LLM-based planner/refiner

Optimization Features

Token Efficiency

reduces API calls by checking only key timestamps

System Optimization

refine-then-resume reduces repeated whole-episode calls

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/haotiansun14/AdaPlanner

Risks & Boundaries

Limitations

Requires some few-shot demonstrations for complex tasks; not fully zero-shot for the hardest cases.

Evaluated only on text-based simulated environments (ALFWorld, MiniWoB++), not on robotics or visual sensors.

When Not To Use

Safety-critical real-world systems without extensive validation.

Perception-heavy domains requiring raw visual or sensor grounding not supported by text prompts.

Failure Modes

LLM hallucination when prompts are ambiguous or the model has lower capacity.

Overfitting of discovered skills to episode-specific details, leading to poor generalization.

Core Entities

Models

text-davinci-002 (GPT-3)

text-davinci-003 (GPT-3.5)

gpt-3.5-turbo

Metrics

success rate (%)

number of expert demonstrations (sample efficiency)

Datasets

ALFWorld

MiniWoB++

Benchmarks

ALFWorld (134 tasks)

MiniWoB++ (9 feedback tasks; 53-task subset)
