Survey: five practical ways LLMs are used to plan agent behavior

February 5, 2024 | 7 min read

Overview

Decision Snapshot: Needs Validation

The survey aggregates recent methods and small experiments to show practical trade-offs; its evidence is useful for design guidance but not definitive for every domain.

Citations: 29

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 60%

Authors

Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, Enhong Chen

Why It Matters For Business

LLM-driven planning can automate complex multi-step tasks, but higher success usually requires more model calls and tokens, so balance accuracy needs with token cost and latency.

Summary TLDR

This short survey organizes research on using large language models (LLMs) as the planning core of autonomous agents. It proposes a five-way taxonomy (Task Decomposition, Multi-plan Selection, External Planner-Aided, Reflection & Refinement, and Memory-Augmented), analyzes representative methods in each group, and reports small experiments on four benchmarks. Key takeaways: (1) generating more text (tokens) tends to improve success but raises cost; (2) few-shot examples beat zero-shot prompting on complex tasks; (3) reflection and self-correction help on hard interactive tasks; (4) retrieval-augmented generation (RAG) and fine-tuning trade off update cost against capacity. It also highlights recurring failure modes such as hallucination and infeasible plans.

Problem Statement

How can we use LLMs to generate reliable, feasible, and efficient multi-step plans for autonomous agents? The paper collects and organizes methods, compares representative approaches, and points out practical limits such as hallucinations, infeasible plans, high token costs, and weak fine-grained evaluation.

Main Contribution

A compact taxonomy of LLM-based planning methods into five directions: Task Decomposition, Multi-plan Selection, External Planner-Aided, Reflection & Refinement, and Memory-Augmented Planning.

A concise analysis of representative works, strengths, weaknesses, and common failure modes for each direction.

Key Findings

Spending more tokens (more generated ‘thinking’) tends to raise success.

Numbers: ALFWorld SR: ReAct 0.57 -> Reflexion 0.71; EX($): 152.18 -> 220.17 (Table 2).

Practical Use: If you need higher task success, allow more model calls / longer prompts, but budget for higher token costs.

Evidence Ref: Table 2
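To make the trade-off concrete, the following minimal sketch works through the arithmetic on the ALFWorld figures quoted above. The "expense per unit of success rate" measure is an illustrative assumption, not a metric defined in the survey.

```python
# Figures quoted above from Table 2 (ALFWorld).
react = {"sr": 0.57, "ex": 152.18}
reflexion = {"sr": 0.71, "ex": 220.17}


def relative_change(old: float, new: float) -> float:
    """Fractional change when moving from `old` to `new`."""
    return (new - old) / old


# How much success rate and token expense grow when adding reflection.
sr_gain = relative_change(react["sr"], reflexion["sr"])
ex_gain = relative_change(react["ex"], reflexion["ex"])

# Hypothetical efficiency measure: token expense per unit of success rate.
react_cost_per_sr = react["ex"] / react["sr"]
reflexion_cost_per_sr = reflexion["ex"] / reflexion["sr"]

print(f"SR change: {sr_gain:+.1%}")
print(f"EX change: {ex_gain:+.1%}")
print(f"EX per unit SR: ReAct {react_cost_per_sr:.2f}, "
      f"Reflexion {reflexion_cost_per_sr:.2f}")
```

On these numbers the expense grows proportionally faster than the success rate, which is the budgeting concern the practical-use note raises.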

Few-shot Chain-of-Thought outperforms zero-shot on complex QA tasks.

Numbers: HotPotQA SR: Zero-shot CoT 0.01 vs Few-shot CoT 0.32 (Table 2).

Practical Use: Provide a few worked examples when prompting LLMs for multi-step or multi-hop QA to avoid severe performance drops.

Evidence Ref: Table 2
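A comparison like the one above can be set up with very little code. The sketch below is one possible harness, not the survey's evaluation script: the prompt templates, the substring-match scoring, and the `llm` callable (any function mapping a prompt string to a completion string) are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def zero_shot_prompt(question: str) -> str:
    """Zero-shot CoT: no worked examples, only a reasoning trigger."""
    return f"Q: {question}\nA: Let's think step by step."


def few_shot_prompt(question: str, examples: Sequence[tuple[str, str]]) -> str:
    """Few-shot CoT: prepend worked (question, reasoned answer) pairs."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"


def success_rate(
    llm: Callable[[str], str],
    build_prompt: Callable[[str], str],
    dataset: Sequence[tuple[str, str]],
) -> float:
    """Fraction of questions whose completion contains the gold answer.

    Substring matching is a deliberately crude scoring rule (assumption).
    """
    hits = sum(
        gold.lower() in llm(build_prompt(question)).lower()
        for question, gold in dataset
    )
    return hits / len(dataset)


if __name__ == "__main__":
    # Placeholder model: replace with a call to a real LLM client.
    def placeholder_llm(prompt: str) -> str:
        return "unknown"

    examples = [("What is the capital of Italy?",
                 "Italy's capital city is Rome. The answer is Rome.")]
    dataset = [("What is the capital of France?", "Paris")]

    print(success_rate(placeholder_llm, zero_shot_prompt, dataset))
    print(success_rate(placeholder_llm,
                       lambda q: few_shot_prompt(q, examples), dataset))
```

Keeping the dataset and scoring rule fixed while swapping only the prompt builder isolates the effect of the worked examples.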

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Success Rate (SR) | Reflexion 0.71 vs ReAct 0.57 on ALFWorld | ReAct 0.57 | +0.14 | ALFWorld | Reflexion shows higher SR than ReAct on ALFWorld in Table 2 | Table 2
Success Rate (SR) | Zero-shot CoT 0.01 vs Few-shot CoT 0.32 on HotPotQA | Zero-shot CoT 0.01 | +0.31 | HotPotQA | Zero-shot CoT severely underperforms compared to few-shot CoT on HotPotQA (Table 2) | Table 2

What To Try In 7 Days

Run a few-shot CoT prompt on one task and compare with zero-shot to measure gain.

Add a single-shot reflection loop (generate feedback + one retry) to an agent and measure success uplift.

Prototype a small RAG memory using vector embeddings and FAISS for one agent flow.
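As a starting point for the memory prototype, the sketch below shows the store-retrieve-prompt flow in pure Python. It is a stand-in, not a FAISS implementation: the bag-of-words `embed` function replaces a real embedding model and the brute-force cosine search replaces a FAISS index, so that the example runs without extra dependencies. `MemoryStore` and `build_prompt` are hypothetical names.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryStore:
    """Brute-force vector memory; a vector index would replace the sort."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self._items,
                        key=lambda item: cosine(q, item[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]


def build_prompt(task: str, memory: MemoryStore, k: int = 2) -> str:
    """Prepend the k most similar memories to the planning prompt."""
    recalled = "\n".join(f"- {m}" for m in memory.search(task, k))
    return f"Relevant past experience:\n{recalled}\n\nTask: {task}\nPlan:"


if __name__ == "__main__":
    memory = MemoryStore()
    memory.add("the mug is usually in the kitchen cabinet")
    memory.add("the keys are on the hallway table")
    memory.add("heating food requires the microwave")
    print(build_prompt("find the mug", memory, k=1))
```

Swapping in a real embedding model and a vector index changes only `embed` and the body of `search`; the prompt-building step stays the same.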

Agent Features

Memory
RAG (external vector store); Embodied memory via PEFT fine-tuning
Planning
Task Decomposition; Multi-plan Selection; External Planner-Aided; Reflection & Refinement; Memory-Augmented Planning
Tool Use
Symbolic planners (PDDL, ASP); Search algorithms (MCTS, A*, BFS); Vector DB retrieval (FAISS)
Frameworks
Chain-of-Thought; ReAct; Tree-of-Thought; Program-of-Thought; Decision Transformer
Is Agentic

Yes

Architectures
Large Language Models (prompt-driven); Hybrid LLM + symbolic planner; LLM + lightweight neural planner
Collaboration
LLM coordinates specialized models (HuggingGPT style)

Optimization Features

Token Efficiency
Multi-plan and reflection strategies increase token use and cost
Training Optimization
LoRA
Inference Optimization
Use smaller neural planner as fast-thinking model to reduce LLM calls (SwiftSage idea)
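The fast-thinking idea can be pictured as a simple router: try the cheap planner first and call the LLM only when the cheap planner is unsure. The sketch below is a loose illustration under assumptions, not SwiftSage's actual mechanism; the confidence score, the `threshold` value, and both planner callables are hypothetical.

```python
from typing import Callable


def route(
    observation: str,
    fast_planner: Callable[[str], tuple[str, float]],
    slow_planner: Callable[[str], str],
    threshold: float = 0.8,
) -> tuple[str, str]:
    """Return (action, source), where source is 'fast' or 'slow'.

    fast_planner returns an action plus a confidence in [0, 1] (assumption);
    slow_planner stands for an LLM call and is used only as a fallback.
    """
    action, confidence = fast_planner(observation)
    if confidence >= threshold:
        return action, "fast"
    return slow_planner(observation), "slow"


if __name__ == "__main__":
    # Stub planners for demonstration only.
    def small_model(obs: str) -> tuple[str, float]:
        return ("open door", 0.9) if "door" in obs else ("wait", 0.2)

    def llm(obs: str) -> str:
        return "inspect surroundings"

    print(route("you see a door", small_model, llm))
    print(route("the room is dark", small_model, llm))
```

Counting how often the router returns `"slow"` gives a direct estimate of the LLM calls saved.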

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Small-scale experiments (only prompt-based methods, limited budget) limit generality of numerical claims.

Benchmarks used often have single gold paths, so success rates underrepresent valid alternative solutions.

When Not To Use

When you need formal guarantees of plan feasibility or safety (use symbolic planners instead).

When token budget or latency is strictly constrained.

Failure Modes

LLM hallucination: inventing non-existent actions or objects.

Generated plans violating environment constraints (infeasible plans).
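Both failure modes can be caught cheaply before execution when the environment's actions and objects are known. The sketch below checks each plan step against those sets and, if anything is wrong, feeds the problems back to the planner for a single retry. The `"<action> <object>"` step format, the function names, and the `planner(task, feedback)` signature are assumptions made for illustration.

```python
from typing import Callable


def find_violations(
    plan: list[str], allowed_actions: set[str], known_objects: set[str]
) -> list[str]:
    """List problems in a plan whose steps look like '<action> <object>'."""
    problems = []
    for i, step in enumerate(plan, start=1):
        action, _, obj = step.partition(" ")
        if action not in allowed_actions:
            problems.append(f"step {i}: unknown action '{action}'")
        if obj not in known_objects:
            problems.append(f"step {i}: unknown object '{obj}'")
    return problems


def plan_with_one_retry(
    planner: Callable[[str, str], list[str]],
    task: str,
    allowed_actions: set[str],
    known_objects: set[str],
) -> tuple[list[str], list[str]]:
    """Generate a plan, validate it, and retry once with feedback if needed.

    Returns the final plan and any violations that remain after the retry.
    """
    plan = planner(task, "")
    problems = find_violations(plan, allowed_actions, known_objects)
    if problems:
        feedback = "Fix these problems: " + "; ".join(problems)
        plan = planner(task, feedback)
        problems = find_violations(plan, allowed_actions, known_objects)
    return plan, problems


if __name__ == "__main__":
    # Stub planner: hallucinates an object until it receives feedback.
    def stub_planner(task: str, feedback: str) -> list[str]:
        return ["take mug", "heat mug"] if feedback else ["take teapot", "heat mug"]

    print(plan_with_one_retry(stub_planner, "heat the mug",
                              {"take", "heat"}, {"mug", "microwave"}))
```

A check like this only flags unknown names; it does not prove that the plan is executable, so it complements rather than replaces a symbolic planner.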

Core Entities

Models

LLMs (general); text-davinci-003; GPT-4; LLaMA; GPT-2; Decision Transformer; DRRN

Metrics

Success Rate (SR); Average Reward (AR); Token Expense (EX)

Datasets

ALFWorld; ScienceWorld; HotPotQA; FEVER; Minecraft

Benchmarks

ALFWorld; ScienceWorld; HotPotQA; FEVER

Context Entities

Models

CodeX; bge-small-en-v1.5 (embedding model)

Metrics

Number of created tools (Minecraft)

Datasets

WebShop; Mind2Web; WebArena; AgentBench; MiniWoB++