Survey: five practical ways LLMs are used to plan agent behavior

Overview

Decision Snapshot: Needs Validation

The survey aggregates recent methods and small experiments to show practical trade-offs; its evidence is useful for design guidance but not definitive for every domain.

Citations: 29

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 60%

Authors

Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, Enhong Chen

Why It Matters For Business

LLM-driven planning can automate complex multi-step tasks, but higher success usually requires more model calls and tokens. Weigh the accuracy you need against token cost and latency.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

This short survey organizes research on using large language models (LLMs) as the planning core of autonomous agents. It proposes a five-way taxonomy (Task Decomposition, Multi-plan Selection, External Planner-Aided, Reflection & Refinement, and Memory-Augmented), analyzes representative methods in each group, and reports small experiments on four benchmarks. Key takeaways: (1) generating more text (tokens) tends to improve success but raises cost; (2) few-shot examples beat zero-shot for complex tasks; (3) reflection/self-correction helps on hard interactive tasks; (4) RAG and fine-tuning trade off update cost vs capacity. It highlights recurring failure modes such as hallucination and infeasible plans.

Problem Statement

How can we use LLMs to generate reliable, feasible, and efficient multi-step plans for autonomous agents? The paper collects and organizes methods, compares representative approaches, and points out practical limits such as hallucinations, infeasible plans, high token costs, and weak fine-grained evaluation.

Main Contribution

A compact taxonomy of LLM-based planning methods into five directions: Task Decomposition, Multi-plan Selection, External Planner-Aided, Reflection & Refinement, and Memory-Augmented Planning.

A concise analysis of representative works, strengths, weaknesses, and common failure modes for each direction.

Key Findings

Spending more tokens (more generated ‘thinking’) tends to raise success.

Numbers: ALFWorld SR: ReAct 0.57 -> Reflexion 0.71; EX($): 152.18 -> 220.17 (Table 2).

Practical Use: If you need higher task success, allow more model calls / longer prompts, but budget for higher token costs.

Evidence Ref: Table 2

Few-shot Chain-of-Thought outperforms zero-shot on complex QA tasks.

Numbers: HotPotQA SR: Zero-shot CoT 0.01 vs Few-shot CoT 0.32 (Table 2).

Practical Use: Provide a few worked examples when prompting LLMs for multi-step or multi-hop QA to avoid severe performance drops.

Evidence Ref: Table 2
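
The zero-shot versus few-shot comparison in the finding above can be tried with a small harness. This is a minimal sketch, not the survey's evaluation code: `EXAMPLES`, `build_prompt`, and `success_rate` are hypothetical names, the `llm` argument stands in for whatever model client you use, and substring matching is an assumed (and crude) success criterion.

```python
from typing import Callable, Sequence

# Hypothetical worked example; replace with examples from your own task.
EXAMPLES = [
    ("Which is larger, 3 * 4 or 10?", "3 * 4 = 12, and 12 > 10.", "3 * 4"),
]


def build_prompt(question: str, examples: Sequence[tuple] = ()) -> str:
    """Zero-shot CoT when `examples` is empty, few-shot CoT otherwise."""
    parts = []
    for q, reasoning, answer in examples:
        parts.append(f"Question: {q}\nReasoning: {reasoning}\nAnswer: {answer}\n")
    # The final question is left open for the model to complete.
    parts.append(f"Question: {question}\nReasoning: Let's think step by step.")
    return "\n".join(parts)


def success_rate(llm: Callable[[str], str], dataset, examples=()) -> float:
    """Fraction of (question, gold) pairs whose gold answer appears in the output."""
    hits = sum(
        gold.lower() in llm(build_prompt(q, examples)).lower()
        for q, gold in dataset
    )
    return hits / len(dataset)
```

Calling `success_rate(llm, dataset)` and `success_rate(llm, dataset, EXAMPLES)` on the same dataset gives the two numbers to compare.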

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Success Rate (SR)	Reflexion 0.71 vs ReAct 0.57 on ALFWorld	ReAct 0.57	+0.14	ALFWorld	Reflexion shows higher SR than ReAct on ALFWorld in Table 2	Table 2
Success Rate (SR)	Zero-shot CoT 0.01 vs Few-shot CoT 0.32 on HotPotQA	Zero-shot CoT 0.01	+0.31	HotPotQA	Zero-shot CoT severely underperforms compared to few-shot CoT on HotPotQA (Table 2)	Table 2
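
To put the ALFWorld row next to its token expense, the arithmetic below uses only the SR and EX($) figures quoted above from Table 2. The "extra EX per SR point" ratio is an illustrative derived quantity, not a metric reported by the survey, and this summary does not say exactly what the EX($) total covers.

```python
# Values copied from the ALFWorld figures quoted above (Table 2).
react = {"sr": 0.57, "ex": 152.18}
reflexion = {"sr": 0.71, "ex": 220.17}


def relative_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`."""
    return (new - old) / old


sr_gain = relative_change(reflexion["sr"], react["sr"])
ex_gain = relative_change(reflexion["ex"], react["ex"])

# Extra expense per percentage point of success rate gained (illustrative).
extra_cost_per_point = (reflexion["ex"] - react["ex"]) / (
    (reflexion["sr"] - react["sr"]) * 100
)

print(f"SR: {sr_gain:+.1%}, EX: {ex_gain:+.1%}")                  # SR: +24.6%, EX: +44.7%
print(f"Extra EX per SR point: {extra_cost_per_point:.2f}")       # 4.86
```

On these numbers, expense grows faster in relative terms than success rate, which is the trade-off the first finding describes.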

What To Try In 7 Days

Run a few-shot CoT prompt on one task and compare with zero-shot to measure gain.

Add a single-shot reflection loop (generate feedback + one retry) to an agent and measure success uplift.

Prototype a small RAG memory using vector embeddings and FAISS for one agent flow.
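
The single-shot reflection loop suggested above (generate feedback, then retry once) could look roughly like this. It is a sketch under stated assumptions: `act`, `check`, and `reflect` are hypothetical callables you would back with your own agent, environment check, and feedback prompt, and the way feedback is appended to the task text is one arbitrary choice among many.

```python
from typing import Callable


def run_with_reflection(
    task: str,
    act: Callable[[str], str],           # hypothetical: prompt -> attempt
    check: Callable[[str], bool],        # hypothetical: did the attempt succeed?
    reflect: Callable[[str, str], str],  # hypothetical: (task, failed attempt) -> feedback
    max_retries: int = 1,
) -> tuple:
    """Return (final attempt, number of retries used)."""
    attempt = act(task)
    retries = 0
    while not check(attempt) and retries < max_retries:
        feedback = reflect(task, attempt)
        # Retry with the failed attempt and the feedback added to the prompt.
        attempt = act(f"{task}\nPrevious attempt: {attempt}\nFeedback: {feedback}")
        retries += 1
    return attempt, retries
```

Running the same task set with `max_retries=0` and `max_retries=1` gives a direct measure of the success uplift, and counting calls to `act` and `reflect` gives the matching cost increase.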

Agent Features

Memory

RAG (external vector store), Embodied memory via PEFT fine-tuning
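
The RAG-style memory listed above reduces to "embed, store, retrieve the nearest entries". The sketch below shows that shape with the standard library only. It deliberately swaps in a toy bag-of-words vector and a brute-force scan: in a real prototype an embedding model would replace `embed` and a vector index such as FAISS would replace the linear search. All names here are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class Memory:
    """Brute-force vector store; a vector index would replace the linear scan."""

    def __init__(self):
        self.items = []  # list of (vector, original text)

    def add(self, text: str) -> None:
        self.items.append((embed(text), text))

    def search(self, query: str, k: int = 1) -> list:
        qv = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(qv, it[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```

An agent would call `search` with the current task description and prepend the returned entries to its planning prompt.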

Planning

Task Decomposition, Multi-plan Selection, External Planner-Aided, Reflection & Refinement, Memory-Augmented Planning

Tool Use

Symbolic planners (PDDL, ASP), Search algorithms (MCTS, A*, BFS), Vector DB retrieval (FAISS)

Frameworks

Chain-of-Thought, ReAct, Tree-of-Thought, Program-of-Thought, Decision Transformer

Is Agentic

Yes

Architectures

Large Language Models (prompt-driven), Hybrid LLM + symbolic planner, LLM + lightweight neural planner

Collaboration

LLM coordinates specialized models (HuggingGPT style)

Optimization Features

Token Efficiency

Multi-plan and reflection strategies increase token use and cost

Training Optimization

LoRA

Inference Optimization

Use a smaller neural planner as a fast-thinking model to reduce LLM calls (the SwiftSage idea).
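
The fast/slow split described above can be pictured as a simple router. This is a sketch of the general idea only: this summary does not describe how SwiftSage decides when to switch, so the confidence threshold, the `fast_planner` signature, and the name `plan_step` are all assumptions made for illustration.

```python
from typing import Callable


def plan_step(
    observation: str,
    fast_planner: Callable[[str], tuple],  # hypothetical: returns (action, confidence)
    slow_llm: Callable[[str], str],        # hypothetical: full LLM planning call
    threshold: float = 0.8,                # assumed switching rule
) -> tuple:
    """Return (action, used_llm). The LLM is called only on low confidence."""
    action, confidence = fast_planner(observation)
    if confidence >= threshold:
        return action, False
    return slow_llm(observation), True
```

Logging the `used_llm` flag over an episode shows how many LLM calls the fast planner actually saves at a given threshold.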

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Small-scale experiments (prompt-based methods only, run on a limited budget) limit the generality of the numerical claims.

Benchmarks used often have single gold paths, so success rates underrepresent valid alternative solutions.

When Not To Use

When you need formal guarantees of plan feasibility or safety (use symbolic planners instead).

When token budget or latency is strictly constrained.

Failure Modes

LLM hallucination: inventing non-existent actions or objects.

Generated plans violating environment constraints (infeasible plans).
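
Both failure modes above can be partly caught before execution with a cheap static check of the generated plan. The sketch below assumes a plan is a list of `(action, object)` pairs and that the environment exposes its action vocabulary and visible objects. `validate_plan` is a hypothetical helper: it only flags invented names and does not check preconditions or ordering, which is where a symbolic planner would be needed.

```python
def validate_plan(plan, allowed_actions, known_objects):
    """Return (step index, problem) for every step that cannot be executed."""
    problems = []
    for i, (action, obj) in enumerate(plan):
        if action not in allowed_actions:
            # Hallucinated action: not in the environment's vocabulary.
            problems.append((i, f"unknown action: {action}"))
        elif obj not in known_objects:
            # Hallucinated object: not present in the environment.
            problems.append((i, f"unknown object: {obj}"))
    return problems
```

The returned problem list can be fed back to the model as feedback for a retry rather than simply rejecting the plan.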

Core Entities

Models

LLMs (general), text-davinci-003, GPT-4, LLaMA, GPT-2, Decision Transformer, DRRN

Metrics

Success Rate (SR), Average Reward (AR), Token Expense (EX)

Datasets

ALFWorld, ScienceWorld, HotPotQA, FEVER, Minecraft

Benchmarks

ALFWorld, ScienceWorld, HotPotQA, FEVER

Context Entities

Models

Codex, bge-small-en-v1.5 (embedding model)

Metrics

Number of created tools (Minecraft)

Datasets

WebShop, Mind2Web, WebArena, AgentBench, MiniWoB++
