AdaPlanner: LLM planner that adaptively refines code-style plans from environment feedback

May 26, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

13

Authors

Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, Chao Zhang

Why It Matters For Business

AdaPlanner cuts dependence on large labeled datasets and repeated LLM calls by adaptively revising code-style plans, saving annotation and API cost while improving performance on long-horizon text tasks.

Summary TLDR

AdaPlanner is a closed-loop method in which a single LLM acts as both planner and refiner. It writes Python-like plans, checks assertions during execution, and reacts to two kinds of feedback: in-plan feedback, where it extracts information from an observation and continues, and out-of-plan feedback, where it revises the whole plan. It also stores successful plans as ‘skills’ for few-shot prompting. On text simulators (ALFWorld, MiniWoB++), AdaPlanner reaches ~91% success, improves over prior prompting baselines, and needs far fewer demonstrations (2x fewer on ALFWorld, ~600x fewer than a strong supervised method on MiniWoB++). Code-style prompts also reduce hallucination. Code is available on GitHub.

Problem Statement

Current LLM-agent methods either execute fixed, open-loop plans or only tweak the next action. They fail to revise entire plans in response to unexpected feedback or need heavy task-specific training. The field needs a lightweight, general way to adapt whole plans online without training a plan selector.

Main Contribution

Explicit closed-loop planning that lets an LLM both generate and revise entire plans during execution.

Two refinement modes: in-plan (extract information from an observation via ask_LLM()) and out-of-plan (revise the full plan and resume at a checkpoint).

Code-style (Pythonic) prompting to reduce hallucination and make plans executable.

Skill discovery: save successful plan trajectories as few-shot exemplars to boost sample efficiency.

Empirical results on ALFWorld and MiniWoB++ showing higher success and much better sample efficiency versus baselines.
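The out-of-plan refinement loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `PlanAssertionError`, `run_closed_loop`, `fake_generate_plan`, and the dictionary-based toy environment are all hypothetical names, and the "LLM" is a stub that returns a different plan once it has received feedback.

```python
from typing import Callable, Optional


class PlanAssertionError(Exception):
    """Raised by a plan when an expectation about the environment fails."""

    def __init__(self, step: int, message: str):
        super().__init__(message)
        self.step = step
        self.message = message


# Hypothetical plan signature: plan(env, start_from) -> None
Plan = Callable[[dict, int], None]


def run_closed_loop(
    env: dict,
    generate_plan: Callable[[Optional[str]], Plan],
    max_refinements: int = 3,
) -> bool:
    """Run a plan; on a failed assertion, request a revised plan and resume."""
    plan = generate_plan(None)  # initial plan, no feedback yet
    start_from = 1
    for _ in range(max_refinements + 1):
        try:
            plan(env, start_from)
            return True  # plan completed without a failed assertion
        except PlanAssertionError as err:
            # Out-of-plan feedback: describe the failure and ask for a new plan.
            feedback = f"Step {err.step} failed: {err.message}"
            plan = generate_plan(feedback)
            start_from = err.step  # resume at the failed step, not from scratch
    return False


def fake_generate_plan(feedback: Optional[str]) -> Plan:
    """Stub planner (assumption): guesses 'drawer' first, 'cabinet' after feedback."""
    location = "drawer" if feedback is None else "cabinet"

    def plan(env: dict, start_from: int) -> None:
        if start_from <= 1:
            env["log"].append("go to kitchen")
        if start_from <= 2:
            env["log"].append(f"open {location}")
            if env["mug_in"] != location:
                raise PlanAssertionError(2, f"no mug in {location}")
        if start_from <= 3:
            env["log"].append("take mug")

    return plan


env = {"mug_in": "cabinet", "log": []}
ok = run_closed_loop(env, fake_generate_plan)
print(ok, env["log"])
# True ['go to kitchen', 'open drawer', 'open cabinet', 'take mug']
```

Note that "go to kitchen" appears only once in the log: the revised plan resumes at step 2 rather than replaying the episode.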

Key Findings

AdaPlanner achieves 91.79% overall success on 134 ALFWorld tasks.

Numbers: Success rate 91.79% (ALFWorld, Table 2).
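As a quick sanity check on this figure, the snippet below converts the reported rate into a task count. It assumes the rate is a simple fraction of the 134 tasks from a single run; the summary does not state the raw count, so 123 is an inference, not a reported number.

```python
total_tasks = 134
reported_rate = 0.9179  # 91.79% from the summary

# Nearest whole number of solved tasks consistent with the reported rate
successes = round(reported_rate * total_tasks)
rate = 100 * successes / total_tasks

print(successes, round(rate, 2))  # 123 91.79
```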

AdaPlanner achieves 91.11% success on MiniWoB++ tasks with feedback.

Numbers: Success rate 91.11% (MiniWoB++, Table 3).

AdaPlanner reduces demonstration needs: 2x fewer samples on ALFWorld and ~600x fewer than CC-Net on MiniWoB++ to reach comparable performance.

Numbers: 2x and 600x fewer samples (Fig. 3, Table 3).

Removing the code interface drops performance sharply (ALFWorld 81%→46%, MiniWoB++ 93%→66%).

Numbers: ALFWorld 81%→46%; MiniWoB++ 93%→66% (Fig. 4c).

Skill discovery raises success notably (ALFWorld nearly doubles; MiniWoB++ gains about 15%).

Numbers: ALFWorld ~2x; MiniWoB++ about +15% (Fig. 4d).

Results

Success rate (ALFWorld)

Value: 91.79%

Baseline: Reflexion, ReAct, BUTLER

Success rate (MiniWoB++ with feedback)

Value: 91.11%

Baseline: RCI, CC-Net, WGE

Sample efficiency

Value: 2x fewer (ALFWorld); ~600x fewer (vs CC-Net on MiniWoB++)

Baseline: CC-Net and other supervised baselines

Ablation: code interface

Value: ALFWorld drop 81%→46%; MiniWoB++ drop 93%→66%

Baseline: AdaPlanner with code interface

What To Try In 7 Days

Replace free-text action plans with small code-style templates to constrain LLM outputs.

Implement an ask_LLM() step to extract structured info from environment feedback.

Add simple checkpointing (start_from) so your system can refine a plan and resume mid-episode instead of restarting.
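The three suggestions above can be combined in one small code-style plan. The sketch below is illustrative only: `ToyAgent`, `solution`, and the `found` parameter are hypothetical, and `ask_llm` is a regex stand-in for a real model call so that the example runs offline.

```python
import re


def ask_llm(question: str) -> str:
    """Stand-in for an LLM call that extracts structured info (assumption).

    A real system would send `question` to a model and parse the reply.
    """
    match = re.search(r"\bmug \d+\b", question)
    return match.group(0) if match else "none"


class ToyAgent:
    """Tiny text environment used only for this illustration."""

    def __init__(self):
        self.holding = None
        self.actions = []

    def goto(self, place: str) -> str:
        self.actions.append(f"goto {place}")
        return f"On the {place}, you see a mug 2 and a plate 1."

    def take(self, obj: str, place: str) -> str:
        self.actions.append(f"take {obj} from {place}")
        self.holding = obj
        return f"You pick up the {obj} from the {place}."


def solution(agent: ToyAgent, start_from: int = 1, found: str = "") -> str:
    """Code-style plan with numbered steps that can resume at a checkpoint."""
    # [Step 1] Go to the shelf and extract the target object from the text
    if start_from <= 1:
        observation = agent.goto("shelf")
        found = ask_llm(f"Which mug is mentioned here? {observation}")
        assert found != "none", f"Error in [Step 1]: no mug in: {observation}"
    # [Step 2] Pick up the object identified in step 1
    if start_from <= 2:
        agent.take(found, "shelf")
        assert agent.holding == found, f"Error in [Step 2]: not holding {found}"
    return found


# Full run from the first step
agent = ToyAgent()
result = solution(agent)

# Resume mid-episode: skip step 1 and reuse what was already learned
resumed = ToyAgent()
solution(resumed, start_from=2, found="mug 2")
```

The assertion messages name the failing step, which gives a refiner enough context to decide where to resume. Passing state such as `found` explicitly is one design choice among several; a real system might instead keep it in the agent or environment.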

Agent Features

Memory

  • skill discovery: store successful plans as few-shot exemplars
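One possible shape for such a skill store is sketched below. The class name `SkillMemory`, its methods, and the prompt layout are assumptions made for illustration; the summary does not describe how AdaPlanner filters or formats stored skills.

```python
from collections import defaultdict


class SkillMemory:
    """Keeps recent successful plans per task type for few-shot prompting."""

    def __init__(self, max_per_task_type: int = 2):
        # Assumes max_per_task_type >= 1
        self.max_per_task_type = max_per_task_type
        self._skills = defaultdict(list)

    def add(self, task_type: str, task: str, plan_code: str, success: bool) -> None:
        """Store a plan only if the episode succeeded."""
        if not success:
            return
        bucket = self._skills[task_type]
        bucket.append((task, plan_code))
        # Drop the oldest entries so only the most recent ones remain
        del bucket[:-self.max_per_task_type]

    def build_prompt(self, task_type: str, new_task: str) -> str:
        """Concatenate stored exemplars, then the new task for the LLM to complete."""
        parts = [
            f"# Task: {task}\n{code}"
            for task, code in self._skills.get(task_type, [])
        ]
        parts.append(f"# Task: {new_task}\n")
        return "\n\n".join(parts)


memory = SkillMemory(max_per_task_type=1)
memory.add("pick", "put a mug on the desk", "def solution(agent): ...", success=True)
memory.add("pick", "put a pen on the bed", "def solution(agent): pass", success=False)
prompt = memory.build_prompt("pick", "put a book on the sofa")
print(prompt)
```

Capping the number of exemplars per task type keeps prompts short, which may also limit the overfitting risk listed under Failure Modes.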

Planning

  • explicit closed-loop plan refinement
  • in-plan refinement (ask_LLM())
  • out-of-plan refinement (revise whole plan and resume)

Tool Use

  • ask_LLM() action for parsing observations
  • Pythonic plan as executable code

Frameworks

  • code-style (Pythonic) prompting
  • refine-then-resume checkpoint mechanism

Is Agentic

true

Architectures

  • LLM-based planner/refiner

Optimization Features

Token Efficiency

  • reduces API calls by checking assertions only at key timesteps rather than after every action

System Optimization

  • refine-then-resume avoids restarting the whole episode after each plan revision

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires some few-shot demonstrations for complex tasks; not fully zero-shot for hardest cases.
  • Evaluated only on text-based simulated environments (ALFWorld, MiniWoB++), not on robotics or visual sensors.
  • Performance depends on LLM quality; smaller or noisier models can still hallucinate.

When Not To Use

  • Safety-critical real-world systems without extensive validation.
  • Perception-heavy domains requiring raw visual or sensor grounding not supported by text prompts.
  • Tasks where environment feedback is not available or not expressible in text/HTML.

Failure Modes

  • LLM hallucination when prompts are ambiguous or the model is lower-capacity.
  • Overfitting of discovered skills to episode-specific details leading to poor generalization.
  • Incorrect assumptions in initial plan that require multiple refinements or human intervention.

Core Entities

Models

  • text-davinci-002 (GPT-3)
  • text-davinci-003 (GPT-3.5)
  • gpt-3.5-turbo

Metrics

  • success rate (%)
  • number of expert demonstrations (sample efficiency)

Datasets

  • ALFWorld
  • MiniWoB++

Benchmarks

  • ALFWorld (134 tasks)
  • MiniWoB++ (9 feedback tasks; 53-task subset)