Overview
The pipeline is practical and relies on existing robust planners; experiments across 7 domains and a real-robot demo give moderate-to-strong evidence. Main weakness: dependency on correct PDDL generation and on human-provided domain files.
Citations: 84
Evidence Strength: 0.80
Confidence: 0.87
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
LLM+P turns LLMs into reliable natural-language front ends for proven symbolic planners. That reduces execution risk and often lowers real-world costs (e.g., fewer extra robot trips). It avoids expensive LLM fine-tuning by delegating correctness to existing planners.
Summary TLDR
LLM+P uses a large language model (LLM) to translate a natural-language planning problem into PDDL (the Planning Domain Definition Language, the input format of classical planners), runs a fast classical planner to find a correct or optimal plan, then translates that plan back to natural language. This pipeline solves long-horizon robot planning tasks far more reliably than using LLMs alone, provided a domain PDDL and a short example are supplied.
Problem Statement
LLMs often produce plausible but incorrect long-horizon plans because they lack reliable symbolic reasoning about actions and preconditions. The paper asks: can we keep LLMs for language work (translation) and rely on classical planners for correct, optimal planning?
Main Contribution
Introduce LLM+P, a pipeline that: (1) asks an LLM to convert a natural-language planning problem into PDDL, (2) runs a classical planner on that PDDL, and (3) translates the planner's plan back to natural language or robot actions.
Provide a benchmark suite of seven robot planning domains (20 tasks each) derived from standard PDDL generators to evaluate planning performance.
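The three pipeline steps can be sketched as a small function. This is a minimal illustration, not the authors' implementation: the names `llm_plus_p`, `PlanResult`, `to_pddl`, `solve`, and `to_text` are hypothetical, and the LLM and planner are passed in as plain callables so the control flow can be read without any external service.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PlanResult:
    problem_pddl: str
    plan: Optional[List[str]]      # None when the planner finds no plan
    description: Optional[str]     # natural-language rendering of the plan


def llm_plus_p(
    task_text: str,
    to_pddl: Callable[[str], str],                    # hypothetical LLM call: text -> problem PDDL
    solve: Callable[[str], Optional[List[str]]],      # hypothetical planner call: PDDL -> plan
    to_text: Callable[[List[str]], str],              # hypothetical LLM call: plan -> text
) -> PlanResult:
    # Step 1: the LLM only translates; it does not plan.
    problem_pddl = to_pddl(task_text)
    # Step 2: the classical planner is responsible for correctness.
    plan = solve(problem_pddl)
    if plan is None:
        # Typically a sign that the generated PDDL is wrong or unsolvable.
        return PlanResult(problem_pddl, None, None)
    # Step 3: translate the plan back for a human or a robot executor.
    return PlanResult(problem_pddl, plan, to_text(plan))
```

Keeping the three stages as separate callables makes it easy to swap the planner or the LLM and to log the intermediate PDDL when a run fails.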
Key Findings
LLM+P produced correct or optimal plans in most evaluated domains (e.g., a 90% optimal-plan success rate in BLOCKSWORLD), while LLM-only methods usually failed (15–20% in the same domain).
Context (a single example problem + PDDL) is crucial for correct PDDL generation by the LLM.
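One way to supply that context is to assemble a one-shot prompt from the domain file plus a single worked (problem, PDDL) pair. The sketch below is an assumption about prompt layout, not the paper's actual template; `build_one_shot_prompt` and the section labels are hypothetical.

```python
def build_one_shot_prompt(
    domain_pddl: str,
    example_task: str,
    example_pddl: str,
    new_task: str,
) -> str:
    """Assemble a one-shot prompt: domain, one worked example, then the new task."""
    parts = [
        "Domain PDDL:",
        domain_pddl.strip(),
        "",
        "Example problem:",
        example_task.strip(),
        "",
        "Example problem PDDL:",
        example_pddl.strip(),
        "",
        "New problem:",
        new_task.strip(),
        "",
        # Ask for PDDL only, so the reply can be passed to the planner directly.
        "Write the problem PDDL for the new problem. Output only PDDL.",
    ]
    return "\n".join(parts)
```

Placing the worked example before the new task lets the model copy the predicate names and object syntax instead of inventing them.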
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| optimal plan success rate | BLOCKSWORLD 90% | LLM only 15–20% | ≈ +70pp | BLOCKSWORLD (20 tasks) | Table I; Section V-C | Table I |
| optimal plan success rate | GRIPPERS 95% (100% with sub-optimal alias) | LLM only 35% (some sub-optimal plans) | ≈ +60pp | GRIPPERS (20 tasks) | Table I; Section V-C | Table I |
What To Try In 7 Days
If you have a robotics task with defined actions, write a domain PDDL and try translating a few natural-language tasks with GPT-4 into problem PDDL; run FAST-DOWNWARD to compare pl
Create a 1-shot example (problem + PDDL) and include it in prompts; measure whether the planner now finds solutions.
Deploy the pipeline for a small field demo (e.g., pick-and-place or tidy-up) and compare execution cost and failure rate to LLM-only plans.
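For the planner step in these experiments, a thin wrapper around the Fast Downward driver script may be enough. The sketch below assumes the driver is available at a path you supply, and uses its `--alias` and `--plan-file` options; the function names and the choice of the `lama-first` alias are illustrative, not taken from the paper. Anytime aliases can write several numbered plan files, which this sketch does not handle.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def fast_downward_cmd(driver: str, domain: str, problem: str,
                      plan_file: str, alias: str = "lama-first") -> List[str]:
    # Driver options come first, then the domain and problem files.
    return [driver, "--alias", alias, "--plan-file", plan_file, domain, problem]


def parse_plan(text: str) -> List[str]:
    # Plan files list one action per line; lines starting with ';' are comments.
    return [ln.strip() for ln in text.splitlines()
            if ln.strip() and not ln.lstrip().startswith(";")]


def run_planner(driver: str, domain: str, problem: str,
                plan_file: str = "sas_plan",
                alias: str = "lama-first") -> Optional[List[str]]:
    path = Path(plan_file)
    if path.exists():
        path.unlink()  # avoid reading a stale plan from an earlier run
    subprocess.run(fast_downward_cmd(driver, domain, problem, plan_file, alias),
                   check=False)
    # No plan file means the planner found no plan (or failed on the input).
    return parse_plan(path.read_text()) if path.exists() else None
```

Comparing `len(plan)` from this wrapper against the number of steps in an LLM-only plan gives a rough first measure of execution cost.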
Agent Features
Memory
Planning
Tool Use
Is Agentic: Yes
Risks & Boundaries
Limitations
Requires a domain PDDL file for each domain; authors assume a human provides it (Section III-C).
The LLM must be given a short (problem, PDDL) demonstration; without it, the generated PDDL is often incorrect.
When Not To Use
When you cannot produce a reliable domain PDDL (open-world tasks without a fixed action set).
When perception and low-level motion tightly couple to symbolic planning and no abstraction to PDDL exists.
Failure Modes
LLM omits or mangles initial conditions or uses made-up predicates → planner finds no plan.
LLM-only planning produces infeasible plans because it fails to track preconditions (e.g., ON, CLEAR).
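The made-up-predicate failure can be caught before the planner runs by checking the generated problem against the domain's declared predicates. The following is a rough sketch under simplifying assumptions: it uses regular expressions rather than a real PDDL parser, ignores PDDL comments and numeric fluents, and all function names are hypothetical.

```python
import re
from typing import Set

_HEAD = re.compile(r"\(\s*([^\s()]+)")     # first token after an opening paren
_LOGICAL = {"and", "or", "not", "="}       # connectives, not predicates


def _block(text: str, keyword: str) -> str:
    """Return the balanced '(keyword ...)' block, or '' if it is absent."""
    start = text.lower().find("(" + keyword)
    if start < 0:
        return ""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return text[start:]  # unbalanced input: return what is there


def _heads(block: str) -> Set[str]:
    # Drop section keywords such as ':init' and logical connectives.
    return {h.lower() for h in _HEAD.findall(block) if not h.startswith(":")} - _LOGICAL


def unknown_predicates(domain_pddl: str, problem_pddl: str) -> Set[str]:
    """Predicates used in :init or :goal that the domain never declares."""
    declared = _heads(_block(domain_pddl, ":predicates"))
    used = _heads(_block(problem_pddl, ":init")) | _heads(_block(problem_pddl, ":goal"))
    return used - declared
```

A non-empty result is a cheap signal to re-prompt the LLM instead of waiting for the planner to report that no plan exists.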

