Overview
The survey aggregates recent methods and small experiments to show practical trade-offs; its evidence is useful for design guidance but not definitive for every domain.
Citations: 29
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
LLM-driven planning can automate complex multi-step tasks, but higher success usually requires more model calls and tokens, so balance accuracy needs with token cost and latency.
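The trade-off above can be made concrete with a back-of-the-envelope estimate. The following is a minimal sketch, not a pricing model from the survey: the `PlannerProfile` fields, the per-token prices, and every number passed in are hypothetical placeholders to be replaced with your own measurements.

```python
from dataclasses import dataclass


@dataclass
class PlannerProfile:
    """Hypothetical usage profile for one planning strategy (illustrative only)."""
    name: str
    calls_per_task: int       # model calls needed to finish one task
    prompt_tokens: int        # average prompt tokens per call
    completion_tokens: int    # average generated tokens per call
    success_rate: float       # measured fraction of tasks solved, in [0, 1]


def cost_per_task(p: PlannerProfile, usd_per_1k_prompt: float,
                  usd_per_1k_completion: float) -> float:
    """Expected spend on one task, successful or not."""
    per_call = (p.prompt_tokens / 1000) * usd_per_1k_prompt \
        + (p.completion_tokens / 1000) * usd_per_1k_completion
    return p.calls_per_task * per_call


def cost_per_success(p: PlannerProfile, usd_per_1k_prompt: float,
                     usd_per_1k_completion: float) -> float:
    """Spend divided by success rate: what one solved task costs on average."""
    if p.success_rate <= 0:
        return float("inf")  # a strategy that never succeeds has unbounded cost
    return cost_per_task(p, usd_per_1k_prompt, usd_per_1k_completion) / p.success_rate
```

Comparing `cost_per_success` across strategies, rather than raw cost per task, shows whether the extra calls and tokens of a more elaborate planner pay for themselves in solved tasks.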
Who Should Care
Summary TLDR
This short survey organizes research on using large language models (LLMs) as the planning core of autonomous agents. It proposes a five-way taxonomy (Task Decomposition, Multi-plan Selection, External Planner-Aided, Reflection & Refinement, and Memory-Augmented), analyzes representative methods in each group, and reports small experiments on four benchmarks. Key takeaways: (1) more generated text (tokens) tends to improve success but raises cost; (2) few-shot examples beat zero-shot for complex tasks; (3) reflection/self-correction helps on hard interactive tasks; (4) RAG and fine-tuning trade off update cost vs capacity. It highlights recurring failure modes such as hallucination and infeasible plans.
Problem Statement
How can we use LLMs to generate reliable, feasible, and efficient multi-step plans for autonomous agents? The paper collects and organizes methods, compares representative approaches, and points out practical limits such as hallucinations, infeasible plans, high token costs, and weak fine-grained evaluation.
Main Contribution
A compact taxonomy of LLM-based planning methods into five directions: Task Decomposition, Multi-plan Selection, External Planner-Aided, Reflection & Refinement, and Memory-Augmented Planning.
A concise analysis of representative works, strengths, weaknesses, and common failure modes for each direction.
Key Findings
Spending more tokens (more generated ‘thinking’) tends to raise success.
Few-shot Chain-of-Thought outperforms zero-shot on complex QA tasks.
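The difference between the two prompting styles in the second finding can be sketched as follows. This is an illustrative sketch under assumptions: the prompt layout, the function names, and the demonstration tuples are hypothetical and are not the prompts used in the survey's experiments. The trigger phrase "Let's think step by step." is the one commonly used for zero-shot Chain-of-Thought.

```python
from typing import List, Sequence, Tuple

ZERO_SHOT_TRIGGER = "Let's think step by step."

# A demonstration is (question, worked reasoning, final answer).
Demo = Tuple[str, str, str]


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: no examples, only a reasoning trigger after the question."""
    return f"Question: {question}\nAnswer: {ZERO_SHOT_TRIGGER}"


def few_shot_cot_prompt(question: str, demos: Sequence[Demo]) -> str:
    """Few-shot CoT: worked demonstrations precede the new question."""
    blocks: List[str] = [
        f"Question: {q}\nAnswer: {reasoning} So the answer is {answer}."
        for q, reasoning, answer in demos
    ]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


def success_rate(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions matching gold after trimming and lowercasing."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(predictions, gold))
    return hits / len(gold)
```

Running both prompt builders over the same question set and comparing `success_rate` gives the gain to report; exact-match scoring is a simplifying assumption and may understate performance when several phrasings of an answer are valid.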
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Success Rate (SR) | Reflexion 0.71 vs ReAct 0.57 on ALFWorld | ReAct 0.57 | +0.14 | ALFWorld | Reflexion shows higher SR than ReAct on ALFWorld in Table 2 | Table 2 |
| Success Rate (SR) | Few-shot CoT 0.32 vs Zero-shot CoT 0.01 on HotPotQA | Zero-shot CoT 0.01 | +0.31 | HotPotQA | Zero-shot CoT severely underperforms few-shot CoT on HotPotQA in Table 2 | Table 2 |
What To Try In 7 Days
Run a few-shot CoT prompt on one task and compare with zero-shot to measure gain.
Add a single-shot reflection loop (generate feedback + one retry) to an agent and measure success uplift.
Prototype a small RAG memory using vector embeddings and FAISS for one agent flow.
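The second experiment above (generate feedback, then retry once) can be sketched as a small wrapper around an existing agent. This is a simplified single-retry loop in the spirit of reflection-based methods, not the implementation of any specific method in the survey. The three callables are hypothetical hooks you would supply: `attempt` runs the agent, `is_success` checks the outcome, and `reflect` produces textual feedback from a failed attempt.

```python
from typing import Callable, Optional, Tuple


def run_with_one_reflection(
    task: str,
    attempt: Callable[[str, Optional[str]], str],
    is_success: Callable[[str], bool],
    reflect: Callable[[str, str], str],
) -> Tuple[str, int]:
    """Run the agent; on failure, generate feedback and retry exactly once.

    Returns the final result and the number of attempts used (1 or 2).
    """
    first = attempt(task, None)          # first try, no feedback available yet
    if is_success(first):
        return first, 1
    feedback = reflect(task, first)      # e.g. an LLM call critiquing the failed run
    second = attempt(task, feedback)     # retry with the feedback in context
    return second, 2
```

Logging the attempt count next to the outcome lets you report success uplift and the added model calls together, which matters because each retry roughly doubles the cost of a failed task.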
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Small-scale experiments (only prompt-based methods, limited budget) limit generality of numerical claims.
Benchmarks used often have single gold paths, so success rates underrepresent valid alternative solutions.
When Not To Use
When you need formal guarantees of plan feasibility or safety (use symbolic planners instead).
When token budget or latency is strictly constrained.
Failure Modes
LLM hallucination: inventing non-existent actions or objects.
Generated plans violating environment constraints (infeasible plans).
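A cheap guard against the first failure mode is to check every generated step against what the environment actually offers before executing anything. The sketch below assumes a plan is a list of hypothetical `(action, object)` pairs and that the sets of valid actions and objects are known in advance. It catches invented names only; it does not check preconditions or ordering, so it will not detect every infeasible plan.

```python
from typing import Iterable, List, Set, Tuple

# One plan step: (action name, object name). This shape is an assumption.
Step = Tuple[str, str]


def find_plan_violations(
    plan: Iterable[Step],
    allowed_actions: Set[str],
    known_objects: Set[str],
) -> List[str]:
    """Return one message per hallucinated action or object, in plan order."""
    problems: List[str] = []
    for i, (action, obj) in enumerate(plan, start=1):
        if action not in allowed_actions:
            problems.append(f"step {i}: unknown action '{action}'")
        if obj not in known_objects:
            problems.append(f"step {i}: unknown object '{obj}'")
    return problems
```

An empty result means no invented names were found. A non-empty result can be returned to the model as feedback for a retry or used to reject the plan outright.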

