Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
84
Why It Matters For Business
LLM+P turns LLMs into reliable natural-language front ends for proven symbolic planners. That reduces execution risk and often lowers real-world costs (e.g., fewer extra robot trips). It avoids expensive LLM fine-tuning by delegating correctness to existing planners.
Summary TLDR
LLM+P uses a large language model (LLM) to translate a natural-language planning problem into PDDL (Planning Domain Definition Language, the input format for classical planners), runs a fast classical planner to find a correct or optimal plan, then translates that plan back into natural language. This pipeline solves long-horizon robot planning tasks far more reliably than using LLMs alone, provided that a domain PDDL file and a short worked example are supplied.
Problem Statement
LLMs often produce plausible but incorrect long-horizon plans because they lack reliable symbolic reasoning about actions and preconditions. The paper asks: can we keep LLMs for language work (translation) and rely on classical planners for correct, optimal planning?
Main Contribution
Introduce LLM+P, a pipeline that: (1) asks an LLM to convert a natural-language planning problem into PDDL, (2) runs a classical planner on that PDDL, and (3) translates the planner's plan back to natural language or robot actions.
Provide a benchmark suite of seven robot planning domains (20 tasks each) derived from standard PDDL generators to evaluate planning performance.
Empirically show LLM+P yields far higher correct/optimal plan rates than LLM-only methods and demonstrate a real-robot tidy-up task where LLM+P finds a lower-cost plan.
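The three pipeline steps listed above can be sketched as a small function. This is a minimal illustration, not the authors' implementation: the three callables (`translate_to_pddl`, `run_planner`, `describe_plan`) are hypothetical wrappers that you would back with an LLM client and a planner of your choice.

```python
from typing import Callable, List


def llm_plus_p(
    task_nl: str,
    domain_pddl: str,
    translate_to_pddl: Callable[[str], str],        # hypothetical LLM wrapper: NL task -> problem PDDL
    run_planner: Callable[[str, str], List[str]],   # hypothetical planner wrapper: (domain, problem) -> plan steps
    describe_plan: Callable[[List[str]], str],      # hypothetical LLM wrapper: plan steps -> NL description
) -> str:
    """Chain the three LLM+P stages; each stage is injected by the caller."""
    # Step 1: the LLM only translates; it does not plan.
    problem_pddl = translate_to_pddl(task_nl)
    # Step 2: the classical planner is responsible for plan correctness.
    plan = run_planner(domain_pddl, problem_pddl)
    # Step 3: turn the symbolic plan back into something a person or robot stack can read.
    return describe_plan(plan)
```

Keeping the stages as injected callables makes it easy to swap the planner or the model, and to log the intermediate problem PDDL for inspection before anything is executed.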
Key Findings
LLM+P produced correct or optimal plans in most evaluated domains while LLM-only methods usually failed.
Context (a single example problem + PDDL) is crucial for correct PDDL generation by the LLM.
LLM-only planning often fails because it cannot reliably track action preconditions and state predicates.
On a real-robot tidy-up task, LLM+P produced a lower-cost plan: cost 22, versus cost 31 for LLM-AS-P (the LLM-only planning baseline).
Some domains remain hard for LLM+P when the LLM generates malformed PDDL (missing initial facts or wrong predicates).
Results
optimal plan success rate
optimal vs sub-optimal coverage
real-robot plan cost
Who Should Care
What To Try In 7 Days
If you have a robotics task with defined actions, write a domain PDDL and try translating a few natural-language tasks with GPT-4 into problem PDDL; run FAST-DOWNWARD to compare pl
Create a 1-shot example (problem + PDDL) and include it in prompts; measure whether the planner now finds solutions.
Deploy the pipeline for a small field demo (e.g., pick-and-place or tidy-up) and compare execution cost and failure rate to LLM-only plans.
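A starting point for the first two experiments might look like the sketch below. The prompt wording, the choice to include the domain PDDL in the prompt, the `./fast-downward.py` script path, and the `seq-opt-lmcut` planner configuration are all assumptions made for illustration; the summary does not state which prompt or planner configuration the authors used.

```python
from typing import List


def build_one_shot_prompt(
    domain_pddl: str, example_nl: str, example_pddl: str, task_nl: str
) -> str:
    """Assemble a prompt with one (problem, PDDL) demonstration before the new task.

    The wording is an assumption, not the paper's prompt.
    """
    return (
        f"Domain PDDL:\n{domain_pddl}\n\n"
        f"Example problem:\n{example_nl}\n\n"
        f"Example problem PDDL:\n{example_pddl}\n\n"
        f"New problem:\n{task_nl}\n\n"
        "Write the problem PDDL for the new problem. Return only PDDL."
    )


def fast_downward_command(
    domain_path: str, problem_path: str, plan_path: str = "sas_plan"
) -> List[str]:
    """Build a Fast Downward command line.

    Assumes the driver script sits in the current directory and that an
    optimal configuration (the seq-opt-lmcut alias) is wanted.
    """
    return [
        "./fast-downward.py",
        "--alias", "seq-opt-lmcut",
        "--plan-file", plan_path,
        domain_path,
        problem_path,
    ]
```

The command list can be passed to `subprocess.run`. To measure the effect of the demonstration, run the same tasks with and without the example pair in the prompt and count how often the planner returns a plan.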
Agent Features
Memory
- In-context learning (single example demonstration)
Planning
- Translate NL to PDDL (LLM)
- Run classical PDDL planner to produce optimal plan
- Translate symbolic plan to natural language or robot actions
Tool Use
- FAST-DOWNWARD
- PDDL (domain + problem files)
Is Agentic
true
Architectures
- LLM (GPT-4) + classical planner (FAST-DOWNWARD)
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires a domain PDDL file for each domain; authors assume a human provides it (Section III-C).
- The LLM must be given a short (problem, PDDL) demonstration; without it, the generated PDDL is often incorrect.
- LLM+P does not auto-detect when a prompt should be routed to the planner; the trigger must be supplied externally.
- Translation errors (missing initial facts, wrong predicates) can make solvable tasks unsolvable.
When Not To Use
- When you cannot produce a reliable domain PDDL (open-world tasks without a fixed action set).
- When perception and low-level motion tightly couple to symbolic planning and no abstraction to PDDL exists.
- When you cannot validate or inspect the generated PDDL before execution.
Failure Modes
- LLM omits or mangles initial conditions or uses made-up predicates → planner finds no plan.
- LLM-only planning produces infeasible plans because it fails to track preconditions (e.g., ON, CLEAR).
- Tree-of-Thought style LLM search times out due to many LLM calls and large branching in long-horizon tasks.
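The first failure mode (made-up predicates) can be caught cheaply before the planner runs. The sketch below is one possible check, not something described in the source: it parses both files as S-expressions and reports predicate names used in `:init` that the domain never declares. It ignores numeric fluents and does not check arity, types, or the goal.

```python
import re
from typing import List, Set, Union

SExpr = Union[str, List["SExpr"]]


def parse_sexpr(text: str) -> SExpr:
    """Parse one PDDL S-expression into nested lists of lowercase tokens."""
    text = re.sub(r";[^\n]*", "", text).lower()  # drop ';' comments
    stack: List[list] = [[]]
    for tok in re.findall(r"[()]|[^\s()]+", text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(tok)
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("expected exactly one balanced expression")
    return stack[0][0]


def section(tree: SExpr, keyword: str) -> list:
    """Return the body of a top-level section such as :predicates or :init."""
    for item in tree:
        if isinstance(item, list) and item and item[0] == keyword:
            return item[1:]
    return []


def undeclared_init_predicates(domain_pddl: str, problem_pddl: str) -> Set[str]:
    """Predicate names used in :init but missing from the domain's :predicates."""
    declared = {p[0] for p in section(parse_sexpr(domain_pddl), ":predicates")
                if isinstance(p, list) and p}
    used = {f[0] for f in section(parse_sexpr(problem_pddl), ":init")
            if isinstance(f, list) and f}
    used.discard("=")  # numeric fluent assignments are out of scope for this sketch
    return used - declared
```

A non-empty result is a signal to re-prompt the LLM or flag the task for review. Missing initial facts are harder to detect automatically, because the checker has no ground truth for what the initial state should contain.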
Core Entities
Models
- GPT-4
Metrics
- Success rate (%) of producing (optimal) plans
- Plan cost (execution metric used in robot demo)
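Both metrics are simple ratios. The sketch below shows how they might be computed; the function names are illustrative, and the only figures taken from the summary are the two robot-demo plan costs (22 and 31).

```python
def success_rate(solved: int, total: int) -> float:
    """Percentage of tasks for which a valid (or optimal) plan was produced."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * solved / total


def relative_cost_reduction(baseline_cost: float, new_cost: float) -> float:
    """Fraction by which new_cost improves on baseline_cost."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (baseline_cost - new_cost) / baseline_cost


# Robot tidy-up demo from the summary: cost 22 (LLM+P) vs. cost 31 (LLM-AS-P).
print(f"{relative_cost_reduction(31, 22):.1%}")  # prints 29.0%
```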
Datasets
- 7 PDDL domains (BLOCKSWORLD, BARMAN, FLOORTILE, GRIPPERS, STORAGE, TERMES, TYREWORLD) with 20 tasks each
- PDDL generators (Seipp et al. 2022)
Benchmarks
- 7-domain robot planning benchmark (authors' suite)

