Planning with LLMs Papers — Parsed & Scored for Practitioners

Let LLMs translate problems and a classical planner find correct, often optimal, plans

0.70

0.60

0.70

84

LLM+P turns LLMs into reliable natural-language front ends for proven symbolic planners. That reduces execution risk and often lowers real-world costs (e.g., fewer extra robot trips). It avoids expensive LLM fine-tuning by delegating correctness to existing planners.

Key finding

LLM+P produced correct or optimal plans in most evaluated domains while LLM-only methods usually failed.

Numbers: BLOCKSWORLD 90% (LLM 15–20%); GRIPPERS 95% (LLM 35%); STORAGE 85% (LLM 0%)
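The division of labor in this card (the LLM translates, a symbolic planner searches) can be sketched as below. This is a minimal illustration, not the paper's implementation: `translate_to_problem` is a hypothetical stand-in for the LLM call and is hard-coded here, the toy one-ball domain is invented for the example, and a small breadth-first search stands in for the classical planner that an LLM+P-style system would invoke on generated PDDL.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    pre: frozenset     # facts that must hold before the action
    add: frozenset     # facts the action makes true
    delete: frozenset  # facts the action makes false


def translate_to_problem(request: str):
    """Hypothetical stand-in for the LLM step (natural language -> formal problem).

    Hard-coded for illustration; a real system would prompt a model to emit
    a problem description for the planner instead.
    """
    init = frozenset({"at(robot,A)", "at(ball,A)", "hand_empty"})
    goal = frozenset({"at(ball,B)"})
    actions = [
        Action("pick(ball,A)",
               frozenset({"at(robot,A)", "at(ball,A)", "hand_empty"}),
               frozenset({"holding(ball)"}),
               frozenset({"at(ball,A)", "hand_empty"})),
        Action("move(A,B)",
               frozenset({"at(robot,A)"}),
               frozenset({"at(robot,B)"}),
               frozenset({"at(robot,A)"})),
        Action("move(B,A)",
               frozenset({"at(robot,B)"}),
               frozenset({"at(robot,A)"}),
               frozenset({"at(robot,B)"})),
        Action("drop(ball,B)",
               frozenset({"at(robot,B)", "holding(ball)"}),
               frozenset({"at(ball,B)", "hand_empty"}),
               frozenset({"holding(ball)"})),
    ]
    return init, goal, actions


def bfs_plan(init, goal, actions):
    """Breadth-first search over states; returns a shortest plan or None."""
    frontier = deque([(init, [])])
    seen = {init}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:  # every goal fact holds in this state
            return plan
        for action in actions:
            if action.pre <= state:
                nxt = (state - action.delete) | action.add
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [action.name]))
    return None


init, goal, actions = translate_to_problem("Move the ball from room A to room B.")
plan = bfs_plan(init, goal, actions)
print(plan)
```

Because the search is exhaustive and breadth-first, any plan it returns is valid and shortest in step count for the formal problem it was given. Correctness therefore depends only on the translation step, which is the part that still needs checking.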

LLMs fail at autonomous planning (~3% success) but their plans can be repaired and slightly help humans

1.00

0.60

0.40

31

If you plan to use LLMs for automated action sequencing or workflows, don't run them unsupervised — they rarely produce correct plans; use them as idea generators and pair with a certified planner or human review.

Key finding

LLMs rarely produce correct executable plans when used alone.

Numbers: GPT-3: 6/600 (1%); Instruct-GPT3: 41/600 (6.8%); BLOOM: 4/250 (1.6%); paper cites ≈3% average
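The percentages above follow directly from the raw counts. The short check below recomputes them and shows two ways of averaging. Which of these (if either) the paper uses for its ≈3% figure is not stated in this listing, so both are shown as assumptions.

```python
# Raw counts from the card: (correct plans, total instances).
results = {
    "GPT-3": (6, 600),
    "Instruct-GPT3": (41, 600),
    "BLOOM": (4, 250),
}

# Per-model success rate in percent.
rates = {model: 100 * ok / total for model, (ok, total) in results.items()}

# Pooled rate: all successes over all attempts (weights models by sample size).
pooled = 100 * sum(ok for ok, _ in results.values()) / sum(n for _, n in results.values())

# Unweighted mean of the three per-model rates.
unweighted = sum(rates.values()) / len(rates)

for model, rate in rates.items():
    print(f"{model}: {rate:.1f}%")
print(f"pooled: {pooled:.1f}%  unweighted mean: {unweighted:.1f}%")
```

Both averages land a little above 3%, consistent with the approximate figure quoted.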

Survey: five practical ways LLMs are used to plan agent behavior

0.40

0.60

29

LLM-driven planning can automate complex multi-step tasks, but higher success usually requires more model calls and tokens, so balance accuracy needs with token cost and latency.

Key finding

Spending more tokens (more generated ‘thinking’) tends to raise success.

Numbers: ALFWorld SR: ReAct 0.57 -> Reflexion 0.71; EX($): 152.18 -> 220.17 (Table 2).
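To make the accuracy-versus-cost balance concrete, the two data points above can be turned into a marginal cost. The sketch below assumes the EX($) column is a dollar expense for the whole evaluation run. That reading of the column is an assumption, and the `tradeoff` helper is illustrative rather than anything from the survey.

```python
# Data points from the card (ALFWorld, Table 2 of the survey).
react = {"success_rate": 0.57, "expense": 152.18}
reflexion = {"success_rate": 0.71, "expense": 220.17}


def tradeoff(base, alt):
    """Compare two methods: what does each extra point of success cost?"""
    gain_points = (alt["success_rate"] - base["success_rate"]) * 100
    if gain_points <= 0:
        raise ValueError("alt must have a higher success rate than base")
    extra = alt["expense"] - base["expense"]
    return {
        "gain_points": gain_points,                      # percentage points of SR
        "extra_expense": extra,                          # same unit as 'expense'
        "expense_per_point": extra / gain_points,
        "relative_gain_pct": 100 * gain_points / (base["success_rate"] * 100),
        "relative_expense_pct": 100 * extra / base["expense"],
    }


summary = tradeoff(react, reflexion)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

On these numbers, moving from ReAct to Reflexion buys 14 points of success rate for about 68 more in expense, so expense grows faster in relative terms than success does.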

CogEval: systematic tests show LLMs fail at cognitive maps and multi‑step planning

0.30

0.60

0.20

22

Do not assume LLMs can plan multi‑step tasks from text alone; failures scale with graph complexity and can cause incorrect or looping actions in planning applications.

Key finding

LLM, graph, domain, and condition strongly predict performance.

Numbers: LLM χ2=2357.87; graph χ2=3431.53; condition χ2=2080.04; domain χ2=458.74 (all p<.001)

Practical survey of single- vs. multi-agent designs, planning steps, and tool calling trade-offs

0.60

0.50

19

Choose single agents for narrow, tool-driven tasks and multi-agent teams for complex, parallel workflows; add clear leadership, role prompts, and message filtering to improve speed and reliability.

Key finding

ReAct reduces factual hallucination versus Chain-of-Thought on HotpotQA.

Numbers: 6% hallucination (ReAct) vs 14% (CoT) on HotpotQA

WebAgent: combine an HTML-specialist LLM and a code LLM to plan, summarize long pages, and act by generating Python for real websites

0.60

0.70

0.60

16

WebAgent shows a practical path to robust web automation: use a small specialist model to understand long HTML and a capable code-generating LLM to act. That reduces brittle failures on real sites and drastically raises task success in human-supervised runs.

Key finding

Modular WebAgent dramatically improves real-site success rates.

Numbers: Success: real-estate 65% vs 10%; social-media 70% vs 20%; map 80% vs 10%

AdaPlanner: LLM planner that adaptively refines code-style plans from environment feedback

0.60

0.70

13

AdaPlanner cuts dependence on large labeled datasets and repeated LLM calls by adaptively revising code-style plans, saving annotation and API cost while improving performance on long-horizon text tasks.

Key finding

AdaPlanner achieves 91.79% overall success on 134 ALFWorld tasks.

Numbers: Success rate 91.79% (ALFWorld Table 2).

TravelPlanner: a realistic travel-planning benchmark — GPT-4 reaches only 0.6% full success on test tasks

1.00

0.70

0.60

13

Current LLM agents are not yet reliable enough to fully automate complex multi-constraint planning, but they can draft plans quickly and cut human effort if paired with verification and robust data collection.

Key finding

State-of-the-art LLMs largely fail to produce fully feasible travel plans.

Numbers: GPT-4 final pass rate = 0.6% on test set (two-stage)

Use LLM agents and a fishbowl discussion to simulate participatory urban planning and improve resident satisfaction

0.30

0.60

0.50

11

Simulated multi-agent LLM planning can surface local needs early, reducing time and rehearsal costs before engaging humans; it helps test many “what-if” land-use options quickly while keeping service coverage competitive.

Key finding

Simulated participatory planning raised resident Satisfaction to 0.787 on HLG.

Numbers: Satisfaction 0.787 (HLG) vs 0.708 (best baseline DRL)

A manager–analyst LLM multi-agent that uses verbalized, episode-level belief updates (CVRF) plus daily CVaR alerts to improve trading and ...

0.45

0.60

0.50

11

FINCON shows that structuring LLMs like a small investment team plus two-tiered risk controls can raise backtested returns and Sharpe ratios while reducing chatter. This suggests a practical path for building LLM-based decision pipelines for small active portfolios and research prototypes.

Key finding

FINCON produces much higher cumulative returns on tested stocks than baselines.

Numbers: TSLA CR 82.871% vs buy-and-hold 6.425% (Table 2)
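For readers unfamiliar with the two quantities named in this card, the sketch below computes a historical CVaR (the mean of the worst tail of daily returns) and a cumulative return. It is a generic textbook-style estimate, not FINCON's code: the return series, the 80% level, and the alert threshold are all made-up illustrative values, and how FINCON actually triggers its alerts is not described in this listing.

```python
import math


def historical_cvar(returns, alpha=0.95):
    """Mean of the worst (1 - alpha) share of returns (at least one observation)."""
    if not returns:
        raise ValueError("returns must be non-empty")
    # round() guards against float noise such as 1.0000000000000009 before ceil.
    k = max(1, math.ceil(round((1 - alpha) * len(returns), 9)))
    tail = sorted(returns)[:k]  # the k most negative returns
    return sum(tail) / k


def cumulative_return(returns):
    """Compound simple per-period returns into one total return."""
    total = 1.0
    for r in returns:
        total *= 1 + r
    return total - 1


# Hypothetical daily returns, for illustration only.
daily = [0.01, -0.03, 0.02, -0.05, 0.00, 0.015, -0.01, 0.005, 0.03, -0.02]

cvar = historical_cvar(daily, alpha=0.8)  # averages the two worst days
alert = cvar < -0.035                     # hypothetical alert threshold
print(f"CVaR(80%) = {cvar:.3f}, alert = {alert}")
print(f"cumulative return = {cumulative_return(daily):.4f}")
```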

TextStarCraft II: a text-based StarCraft II benchmark and a Chain-of-Summarization (CoS) method that helps LLMs plan in real time

0.40

0.60

0.50

10

TextStarCraft II and CoS show that LLMs can handle high-level, time-sensitive strategy where visual micro-control is scripted; this enables low-cost experimentation with strategic agents and rapid prototyping of language-driven decision systems.

Key finding

Closed-source LLMs using full CoS beat the level-5 built-in AI in many trials.

Numbers: GPT-4: 12/20 wins; GPT-3.5: 11/20 wins (Table 1)

Use LLMs (LightGPT) to control traffic lights with human-like reasoning and lower deployment cost

0.70

0.85

10

LLMLight enables interpretable, generalizable traffic control with much lower deployment cost than closed LLM APIs, making city-scale experiments and phased rollouts affordable.

Key finding

LightGPT (Llama2-13B) yields low travel times on evaluated datasets.

Numbers: ATT ≈ 274.03 s on Jinan/Hangzhou (Table 2/8).

BOLAA: orchestrating specialist LLM agents with a controller improves web navigation and reasoning on standard benchmarks

0.60

0.65

0.70

9

Splitting complex agent work into small, specialist LLMs coordinated by a controller can match or beat large single LLM agents and reduce compute cost by enabling smaller models to specialize.

Key finding

Orchestrating specialist agents (BOLAA) gives the best WebShop performance across many LLMs.

Numbers: gpt-3.5-turbo BOLAA reward=0.6567 vs ZS=0.5061 (Table 1)

An LLM agent that plans CRISPR experiments, designs guides and protocols, and was validated in a wet‑lab knockout

0.60

0.70

0.60

9

Automating CRISPR design reduces expert time, speeds prototyping, and lowers error risk in early‑stage research; it can cut planning cycles and standardize lab protocols for teams without CRISPR specialists.

Key finding

Domain‑augmented agent scored higher than general ChatGPT on expert design ratings.

Numbers: 12 experts; 1–5 rating scale; CRISPR‑GPT > ChatGPT 3.5/4 across Accuracy, Reasoning, Completeness, Conciseness

A practical review of how LLMs build, extend, and are tested as autonomous agents

0.40

0.50

0.60

9

LLM agents can automate complex multi-step digital tasks but are currently brittle; invest in tool integration, retrieval, and realistic evaluation before production to avoid failures and user trust loss.

Key finding

Agents built for realistic web tasks still perform far below humans.

Numbers: GPT-4 agent task success 14.41% vs human 78.24%

Survey of LLM-based medical agents: architectures, applications, and safety gaps

0.40

0.60

8

LLM agents can cut clinician workload and improve documentation and training, but current models need workflow-style validation, bias checks, and human oversight before clinical deployment.

Key finding

Surveyed literature size and scope.

Numbers: 60 studies reviewed (from ~300 initial hits, 80 shortlisted).

AgentBoard: a 9-task, 1,013-environment benchmark + toolkit that tracks stepwise progress for multi-turn LLM agents

0.70

0.60

0.50

8

AgentBoard gives stepwise progress signals and diagnostic visualizations so teams can see partial improvements, debug grounding/formatting faults, and prioritize model upgrades or targeted fine-tuning instead of chasing binary success.

Key finding

Fine-grained progress rate exposes partial progress that success rate misses.

Numbers: Example: Llama2-13b progress 18.9% vs Mistral-7b 24.6% while both have ~2–3% success
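The gap between the two metrics is easy to see in a small example. The sketch below assumes progress is scored as the fraction of annotated subgoals an episode reaches and success as reaching all of them. AgentBoard's exact definition is not given in this listing, and the subgoal names and episodes are hypothetical.

```python
from statistics import mean


def progress_rate(reached, subgoals):
    """Fraction of the annotated subgoals that the episode reached."""
    goals = set(subgoals)
    if not goals:
        raise ValueError("subgoals must be non-empty")
    return len(set(reached) & goals) / len(goals)


def summarize(episodes):
    """Average progress and binary success over (reached, subgoals) pairs."""
    rates = [progress_rate(reached, subgoals) for reached, subgoals in episodes]
    return {
        "progress_rate": mean(rates),
        "success_rate": mean([1.0 if r == 1.0 else 0.0 for r in rates]),
    }


# Hypothetical task with four subgoals and four episodes.
subgoals = ["find_key", "open_door", "take_item", "exit"]
episodes = [
    (["find_key"], subgoals),
    (["find_key", "open_door"], subgoals),
    ([], subgoals),
    (["find_key", "open_door", "take_item", "exit"], subgoals),
]

stats = summarize(episodes)
print(stats)
```

Only one of the four episodes counts as a success, yet the progress rate shows that two of the failures got partway there. That is the kind of signal a binary metric hides.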

Aviary: train small open LLM agents to solve multi-step biology tasks and match frontier models at far lower inference cost

0.70

0.60

0.80

7

You can train modest open LLMs to match or beat larger closed models and humans on multi-step scientific workflows while cutting inference cost by orders of magnitude, enabling cheaper high-throughput automation.

Key finding

A trained Llama-3.1-8B-Instruct agent reached 0.89 test accuracy on SeqQA using large-sample majority voting.

Numbers: 0.89 accuracy (SeqQA, test; many-sample consensus)
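Majority voting over many samples, as mentioned in the key finding, amounts to picking the most frequent answer among repeated runs. The sketch below is a generic version of that idea, not Aviary's code: the sampled answers are hypothetical, and real answers would usually need normalizing before they are compared.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the share of samples that agree with it."""
    if not answers:
        raise ValueError("answers must be non-empty")
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)


# Hypothetical answers from five independent agent runs on one question.
samples = ["B", "A", "B", "C", "B"]
answer, agreement = majority_vote(samples)
print(answer, agreement)
```

The agreement share is a rough confidence signal. The cost is one full agent run per sample, which is why the approach suits small models with cheap inference.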

PHIA: an agent that uses code + web search to turn wearable time-series into personalized health insights

0.50

0.65

0.40

7

Agentic LLMs that run verified code and fetch trusted web facts can unlock personalized insights from wearable data—improving product value for health apps while reducing numeric errors and buggy analyses.

Key finding

PHIA answers objective numeric wearable queries with high accuracy.

Numbers: 84% exact-match accuracy on 4,000 objective queries

Use LLMs to patch rule-based driving planners and cut dangerous scenarios on nuPlan.

0.60

0.55

0.40

6

A language model can be used to patch edge-case failures of a strong rule-based planner and reduce dangerous scenarios without retraining the core planner, but latency, cost, and hallucination risk must be managed.

Key finding

Parameterizing the base planner with an LLM reduces dangerous driving scenarios.

Numbers: 11% fewer dangerous events vs PDM-Closed (nuPlan val14)

Use scene graphs + LLMs to split long robot goals into short sub-goals so classical planners solve them fast and reliably

0.60

0.45

0.55

6

DELTA turns large, slow planning problems into fast, reliable sub-problems so robots can plan long household workflows quickly and with higher success, cutting compute and time costs when paired with a strong LLM.

Key finding

DELTA with GPT-4o achieves highest success rates across evaluated domains.

Numbers: PC 98%, Dining 100%, Cleaning 80%, Office 74.67% (Table II)

Use Monte Carlo Tree Self-Refine (MCTSr) to boost LLaMA-3 8B on hard math problems

0.50

0.70

0.60

6

MCTSr lets smaller open LLMs solve many multi-step math tasks nearly as well as large closed models, reducing reliance on costly closed APIs for arithmetic and structured reasoning workloads.

Key finding

More MCTSr rollouts steadily improve accuracy on math benchmarks.

Numbers: MATH overall: 24.36% → 58.24% (Zero-Shot CoT → 8-rollouts)

PPNL: a controlled benchmark showing GPT-4 plans locally well but fails at long-term navigation

0.40

0.60

0.35

6

LLMs can handle short-range navigation when prompted interactively, but they are not yet reliable for long-distance or out-of-distribution path planning; use fine-tuned models for predictable, repeated environments and ReAct-like prompting for ad-hoc, locally-correct behavior.

Key finding

GPT-4 with ReAct achieved very high in-distribution success but often relies on short trials.

Numbers: Success = 96.1% (Table 3)

Automatic, pseudocode-based evaluation and a 100-protocol BIOPROT dataset to test LLM planning for lab protocols

0.40

0.60

0.40

6

BIOPROT and the pseudocode evaluation let teams measure and improve LLM planning for lab protocols quickly, reducing expert labeling and enabling reproducible protocol generation for automation workflows.

Key finding

BIOPROT contains 100 biology protocols translated into pseudocode.

Numbers: 100 protocols; avg steps 12.5; avg pseudofunctions per protocol 10.3