Let LLMs translate problems and a classical planner find correct, often optimal, plans

April 22, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

84

Authors

Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, Peter Stone

Links

Abstract / PDF

Why It Matters For Business

LLM+P turns LLMs into reliable natural-language front ends for proven symbolic planners. That reduces execution risk and often lowers real-world costs (e.g., fewer extra robot trips). It avoids expensive LLM fine-tuning by delegating correctness to existing planners.

Summary TLDR

LLM+P uses a large language model (LLM) to translate a natural-language planning problem into PDDL (the Planning Domain Definition Language, the standard input format for classical planners), runs a fast classical planner to find a correct or optimal plan, then translates that plan back to natural language. This pipeline solves long-horizon robot planning tasks far more reliably than using LLMs alone, provided a domain PDDL and a short example are supplied.

Problem Statement

LLMs often produce plausible but incorrect long-horizon plans because they lack reliable symbolic reasoning about actions and preconditions. The paper asks: can we keep LLMs for language work (translation) and rely on classical planners for correct, optimal planning?

Main Contribution

Introduce LLM+P, a pipeline that: (1) asks an LLM to convert a natural-language planning problem into PDDL, (2) runs a classical planner on that PDDL, and (3) translates the planner's plan back to natural language or robot actions.

Provide a benchmark suite of seven robot planning domains (20 tasks each) derived from standard PDDL generators to evaluate planning performance.

Empirically show LLM+P yields far higher correct/optimal plan rates than LLM-only methods and demonstrate a real-robot tidy-up task where LLM+P finds a lower-cost plan.
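The three pipeline steps can be sketched as a small function. This is a minimal sketch, not the authors' code: `llm_plus_p` and its three injected callables (`nl_to_pddl`, `run_planner`, `plan_to_nl`) are hypothetical names chosen for illustration, and any LLM client or planner wrapper could sit behind them.

```python
from typing import Callable, List


def llm_plus_p(
    task_nl: str,
    domain_pddl: str,
    nl_to_pddl: Callable[[str], str],
    run_planner: Callable[[str, str], List[str]],
    plan_to_nl: Callable[[List[str]], str],
) -> str:
    """Hypothetical wiring of the three LLM+P steps.

    All three callables are assumptions: nl_to_pddl and plan_to_nl would
    wrap LLM calls, run_planner would wrap a classical planner.
    """
    # Step 1: the LLM only translates; it does not search for a plan.
    problem_pddl = nl_to_pddl(task_nl)

    # Step 2: the classical planner is responsible for plan correctness.
    plan = run_planner(domain_pddl, problem_pddl)

    # Step 3: turn the symbolic plan back into natural language.
    return plan_to_nl(plan)
```

Keeping the planner behind its own callable makes the division of labour explicit: the language model never decides which actions are legal.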

Key Findings

LLM+P produced correct or optimal plans in most evaluated domains while LLM-only methods usually failed.

Numbers: BLOCKSWORLD 90% (LLM 15–20%); GRIPPERS 95% (LLM 35%); STORAGE 85% (LLM 0%)

Context (a single example problem + PDDL) is crucial for correct PDDL generation by the LLM.

Numbers: Without context, many generated PDDL files are incorrect; experiments show a large drop in solver success without context (d

LLM-only planning often fails because it cannot reliably track action preconditions and state predicates.

Numbers: LLM-AS-P produced infeasible plans in most domains (0% success rate in many of them)

LLM+P produced a lower-cost plan on a real robot tidy-up task: cost 22 vs LLM-AS-P cost 31.

Numbers: Robot demo cost 22 (LLM-AS-P cost 31)

Some domains remain hard for LLM+P when the LLM generates malformed PDDL (missing initial facts or wrong predicates).

Numbers: FLOORTILE, LLM+P 0% (due to disconnected tiles / missing init facts)

Results

Optimal plan success rate

Value: BLOCKSWORLD 90%

Baseline: LLM only 15–20%

Optimal plan success rate

Value: GRIPPERS 95% (100% with sub-optimal alias)

Baseline: LLM only 35% (some sub-optimal plans)

Optimal plan success rate

Value: STORAGE 85%

Baseline: LLM only 0%

Optimal vs sub-optimal coverage

Value: BARMAN 20% optimal, 100% with sub-optimal alias

Baseline: LLM only 0%

Real-robot plan cost

Value: LLM+P plan cost 22

Baseline: LLM-AS-P plan cost 31

Who Should Care

What To Try In 7 Days

If you have a robotics task with defined actions, write a domain PDDL and try translating a few natural-language tasks with GPT-4 into problem PDDL; run FAST-DOWNWARD to compare pl

Create a 1-shot example (problem + PDDL) and include it in prompts; measure whether the planner now finds solutions.

Deploy the pipeline for a small field demo (e.g., pick-and-place or tidy-up) and compare execution cost and failure rate to LLM-only plans.
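For the planner step in the list above, one way to call FAST-DOWNWARD from Python is through its `fast-downward.py` driver script. The sketch below assumes a local checkout of the planner. The helper names, the default alias `lama-first` (a satisficing configuration; an optimal alias such as `seq-opt-lmcut` could be substituted), the timeout, and all file paths are illustrative choices, not settings taken from the paper.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional


def fast_downward_cmd(fd_script, domain, problem,
                      plan_file="sas_plan", alias="lama-first") -> List[str]:
    """Build the driver command; driver options go before the input files."""
    return [sys.executable, str(fd_script),
            "--alias", alias,
            "--plan-file", str(plan_file),
            str(domain), str(problem)]


def run_planner(fd_script, domain, problem, plan_file="sas_plan",
                alias="lama-first", timeout_s=300) -> Optional[List[str]]:
    """Return the plan as a list of action strings, or None if no plan was written.

    Assumption: a non-anytime alias is used, so the plan lands in exactly
    `plan_file`. Raises subprocess.TimeoutExpired if the planner overruns.
    """
    plan_path = Path(plan_file)
    plan_path.unlink(missing_ok=True)  # avoid reading a stale plan from an earlier run
    subprocess.run(fast_downward_cmd(fd_script, domain, problem, plan_file, alias),
                   capture_output=True, text=True, timeout=timeout_s)
    if not plan_path.exists():
        return None  # malformed PDDL or unsolvable task
    # Action lines start with "("; comment lines start with ";".
    return [ln.strip() for ln in plan_path.read_text().splitlines()
            if ln.strip().startswith("(")]
```

A `None` result is worth logging together with the generated problem file, since an unsolvable task and a mistranslated one look the same from the planner's side.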

Agent Features

Memory

  • In-context learning (single example demonstration)

Planning

  • Translate NL to PDDL (LLM)
  • Run classical PDDL planner to produce optimal plan
  • Translate symbolic plan to natural language or robot actions

Tool Use

  • FAST-DOWNWARD
  • PDDL (domain + problem files)

Is Agentic

true

Architectures

  • LLM (GPT-4) + classical planner (FAST-DOWNWARD)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires a domain PDDL file for each domain; authors assume a human provides it (Section III-C).
  • The LLM must be given a short (problem, PDDL) demonstration; without it, the generated PDDL is often incorrect.
  • LLM+P does not auto-detect when a prompt should be routed to the planner; the trigger must be supplied externally.
  • Translation errors (missing initial facts, wrong predicates) can make solvable tasks unsolvable.
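The demonstration requirement above amounts to placing one worked (problem, PDDL) pair in the prompt ahead of the new task. The following is a minimal sketch of such a prompt builder; the function name `build_translation_prompt` and the exact wording of the prompt are assumptions made for illustration and are not quoted from the paper.

```python
from typing import Optional, Tuple


def build_translation_prompt(task_nl: str,
                             example: Optional[Tuple[str, str]] = None) -> str:
    """Assemble an NL-to-PDDL prompt, optionally with one in-context example.

    `example` is a (natural-language problem, problem PDDL) pair from the
    same domain. The wording below is illustrative only.
    """
    parts = ["I want you to write problem PDDL for a planning task."]
    if example is not None:
        example_nl, example_pddl = example
        # The single demonstration shows the LLM the predicates and layout to reuse.
        parts.append("Example problem:\n" + example_nl)
        parts.append("Example problem PDDL:\n" + example_pddl)
    parts.append("New problem:\n" + task_nl)
    parts.append("Return only the problem PDDL for the new problem.")
    return "\n\n".join(parts)
```

Calling the function with and without `example` gives a cheap way to reproduce the with-context versus without-context comparison on your own domain.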

When Not To Use

  • When you cannot produce a reliable domain PDDL (open-world tasks without a fixed action set).
  • When perception and low-level motion tightly couple to symbolic planning and no abstraction to PDDL exists.
  • When you cannot validate or inspect the generated PDDL before execution.

Failure Modes

  • LLM omits or mangles initial conditions or uses made-up predicates → planner finds no plan.
  • LLM-only planning produces infeasible plans because it fails to track preconditions (e.g., ON, CLEAR).
  • Tree-of-Thought style LLM search times out due to many LLM calls and large branching in long-horizon tasks.
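The first failure mode can be partly caught before the planner runs by checking that every predicate used in the generated problem is declared in the domain. The sketch below is a rough text-level check, not a PDDL parser: the helper names are hypothetical, and it will also flag numeric functions (for example a cost fluent in `:init`) as undeclared because it only reads the `:predicates` block.

```python
import re
from typing import Set

# Connectives and quantifiers that may follow "(" but are not predicates.
LOGICAL = {"and", "or", "not", "imply", "forall", "exists", "when", "="}


def section(pddl: str, keyword: str) -> str:
    """Return the balanced '(keyword ...)' block in lowercase, or '' if absent."""
    text = pddl.lower()
    start = text.find("(" + keyword)
    if start < 0:
        return ""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return ""  # unbalanced parentheses


def head_symbols(block: str) -> Set[str]:
    """Collect every name that directly follows an opening parenthesis."""
    return set(re.findall(r"\(\s*([^\s()]+)", block))


def undeclared_predicates(domain_pddl: str, problem_pddl: str) -> Set[str]:
    """Predicates used in :init or :goal that the domain does not declare."""
    declared = head_symbols(section(domain_pddl, ":predicates")) - {":predicates"}
    used: Set[str] = set()
    for keyword in (":init", ":goal"):
        used |= head_symbols(section(problem_pddl, keyword))
    used -= LOGICAL | {":init", ":goal"}
    return used - declared
```

An empty result does not prove the translation is right (missing initial facts pass this check), so it complements rather than replaces a manual look at the generated file.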

Core Entities

Models

  • GPT-4

Metrics

  • Success rate (%): share of tasks for which a correct (or optimal) plan is produced
  • Plan cost (execution metric used in robot demo)
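Plan cost can be read off a planner's plan file. The sketch below assumes a FAST-DOWNWARD-style plan file: one parenthesised action per line, optionally followed by a comment line of the form `; cost = N`. If that comment is missing, it falls back to counting actions, which is only correct for unit-cost domains. The function name and the sample plan are illustrative; the 22 and 31 figures are the robot-demo costs reported above, and the relative saving is derived from them here rather than quoted from the paper.

```python
import re


def plan_cost(plan_text: str) -> int:
    """Return the plan cost from the '; cost = N' comment, else the action count."""
    match = re.search(r";\s*cost\s*=\s*(\d+)", plan_text)
    if match:
        return int(match.group(1))
    # Fallback: unit-cost assumption, one action per parenthesised line.
    return sum(1 for ln in plan_text.splitlines() if ln.strip().startswith("("))


sample_plan = "(pick-up a)\n(stack a b)\n; cost = 2 (unit cost)\n"
sample_cost = plan_cost(sample_plan)

# Relative saving of the LLM+P robot plan (cost 22) over LLM-AS-P (cost 31).
relative_saving = (31 - 22) / 31
```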

Datasets

  • 7 PDDL domains (BLOCKSWORLD, BARMAN, FLOORTILE, GRIPPERS, STORAGE, TERMES, TYREWORLD) with 20 tasks each
  • PDDL generators (Seipp et al. 2022)

Benchmarks

  • 7-domain robot planning benchmark (authors' suite)