Let LLMs translate problems and a classical planner find correct, often optimal, plans

Overview

Decision Snapshot: Ready For Pilot

The pipeline is practical and relies on existing robust planners; experiments across 7 domains and a real-robot demo give moderate-to-strong evidence. Main weakness: dependency on correct PDDL generation and on human-provided domain files.

Citations: 84

Evidence Strength: 0.80

Confidence: 0.87

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, Peter Stone

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLM+P turns LLMs into reliable natural-language front ends for proven symbolic planners. That reduces execution risk and often lowers real-world costs (e.g., fewer extra robot trips). It avoids expensive LLM fine-tuning by delegating correctness to existing planners.

Who Should Care

Product Manager, Engineering Lead, ML Engineer, Founder

Summary TLDR

LLM+P uses a large language model (LLM) to translate a natural-language planning problem into PDDL (Planning Domain Definition Language, the input format of classical planners), runs a fast classical planner to find a correct or optimal plan, then translates that plan back to natural language. This pipeline solves long-horizon robot planning tasks far more reliably than using LLMs alone, provided a domain PDDL and a short example are supplied.

Problem Statement

LLMs often produce plausible but incorrect long-horizon plans because they lack reliable symbolic reasoning about actions and preconditions. The paper asks: can we keep LLMs for language work (translation) and rely on classical planners for correct, optimal planning?

Main Contribution

Introduce LLM+P, a pipeline that: (1) asks an LLM to convert a natural-language planning problem into PDDL, (2) runs a classical planner on that PDDL, and (3) translates the planner's plan back to natural language or robot actions.

Provide a benchmark suite of seven robot planning domains (20 tasks each) derived from standard PDDL generators to evaluate planning performance.
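
The three pipeline stages can be sketched as a small orchestration function. This is a minimal illustration, not the authors' implementation: `llm_plus_p`, `PipelineResult`, and the three injected callables are hypothetical names, and the stubs in the usage section stand in for a real LLM call and a real planner.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PipelineResult:
    problem_pddl: str            # problem PDDL produced by the LLM (stage 1)
    plan: Optional[List[str]]    # symbolic actions from the planner (stage 2), None if unsolved
    description: Optional[str]   # natural-language rendering of the plan (stage 3)


def llm_plus_p(
    task_text: str,
    domain_pddl: str,
    translate_to_pddl: Callable[[str], str],
    solve: Callable[[str, str], Optional[List[str]]],
    describe_plan: Callable[[List[str]], str],
) -> PipelineResult:
    """Run the three stages in order; every callable is a hypothetical stand-in."""
    problem_pddl = translate_to_pddl(task_text)      # LLM: natural language -> problem PDDL
    plan = solve(domain_pddl, problem_pddl)          # classical planner on domain + problem
    if plan is None:
        # The planner found no plan, so there is nothing to translate back.
        return PipelineResult(problem_pddl, None, None)
    return PipelineResult(problem_pddl, plan, describe_plan(plan))


if __name__ == "__main__":
    # Stubs so the sketch runs without an LLM or a planner installed.
    result = llm_plus_p(
        "Put block a on block b.",
        "(define (domain blocksworld))",
        translate_to_pddl=lambda text: "(define (problem p1))",
        solve=lambda domain, problem: ["(pick-up a)", "(stack a b)"],
        describe_plan=lambda plan: " then ".join(plan),
    )
    print(result.description)
```

Injecting the stages as callables keeps the correctness-critical step (the planner) separate from the language steps, which mirrors the division of labor the summary describes.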

Key Findings

LLM+P produced correct or optimal plans in most evaluated domains while LLM-only methods usually failed.

Numbers: BLOCKSWORLD 90% (LLM 15–20%); GRIPPERS 95% (LLM 35%); STORAGE 85% (LLM 0%)

Practical Use: If you can supply a domain PDDL and a short example, use LLM+P to get reliable plans instead of asking the LLM to plan directly.

Evidence Ref: Table I; Section V-C

Context (a single example problem + PDDL) is crucial for correct PDDL generation by the LLM.

Numbers: Without context, many generated PDDL files are incorrect; experiments show a large drop in solver success without context (d

Practical Use: Always include a short (problem, PDDL) demonstration when prompting the LLM to translate natural language to PDDL.

Evidence Ref: Section III-A, Section V-C
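
A one-shot translation prompt along these lines could be assembled as follows. The function name and the prompt wording are assumptions made for illustration; they are not the exact prompt used in the paper.

```python
def build_translation_prompt(example_text: str, example_pddl: str, task_text: str) -> str:
    """Assemble a prompt with one (problem, PDDL) demonstration followed by the new task.

    The wording below is a hypothetical template, not the paper's prompt.
    """
    return (
        "An example planning problem is:\n"
        f"{example_text}\n\n"
        "The problem PDDL file for this example is:\n"
        f"{example_pddl}\n\n"
        "A new planning problem is:\n"
        f"{task_text}\n\n"
        "Return only the problem PDDL file for the new problem."
    )


if __name__ == "__main__":
    print(build_translation_prompt(
        "Block a is on the table. The goal is to hold block a.",
        "(define (problem demo) (:domain blocksworld))",
        "Block b is on the table. The goal is to hold block b.",
    ))
```

Placing the demonstration before the new task gives the LLM a concrete pattern for object names, `:init` facts, and the `:goal` form before it has to produce its own file.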

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
optimal plan success rate	BLOCKSWORLD 90%	LLM only 15–20%	≈ +70pp	BLOCKSWORLD (20 tasks)	Table I; Section V-C	Table I
optimal plan success rate	GRIPPERS 95% (100% with sub-optimal alias)	LLM only 35% (some sub-optimal plans)	≈ +60pp	GRIPPERS (20 tasks)	Table I; Section V-C	Table I
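
The Delta column is a difference in percentage points (pp) between LLM+P and the LLM-only baseline. A short check of the arithmetic, using only the figures in the table (the helper name `delta_pp` is illustrative):

```python
def delta_pp(method_pct: float, baseline_pct: float) -> float:
    """Difference between two success rates, in percentage points."""
    return method_pct - baseline_pct


# BLOCKSWORLD: the baseline is reported as a range (15-20%), so the delta is a range too.
blocksworld_low = delta_pp(90, 20)    # 70
blocksworld_high = delta_pp(90, 15)   # 75

# GRIPPERS: a single baseline figure (35%).
grippers = delta_pp(95, 35)           # 60

if __name__ == "__main__":
    print(f"BLOCKSWORLD: +{blocksworld_low} to +{blocksworld_high} pp")
    print(f"GRIPPERS: +{grippers} pp")
```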

What To Try In 7 Days

If you have a robotics task with defined actions, write a domain PDDL and try translating a few natural-language tasks with GPT-4 into problem PDDL; run FAST-DOWNWARD to compare pl

Create a 1-shot example (problem + PDDL) and include it in prompts; measure whether the planner now finds solutions.

Deploy the pipeline for a small field demo (e.g., pick-and-place or tidy-up) and compare execution cost and failure rate to LLM-only plans.
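
For the planner step, a thin wrapper around the Fast Downward driver script might look like the sketch below. The script path, file names, and the default `lama-first` alias are assumptions for illustration; the summary does not state which planner configuration the authors used, so check the options against your installed version.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_fd_command(fd_script: str, domain_file: str, problem_file: str,
                     plan_file: str, alias: str = "lama-first") -> List[str]:
    """Build the driver command: options first, then the domain and problem files."""
    return [fd_script, "--alias", alias, "--plan-file", plan_file,
            domain_file, problem_file]


def run_planner(fd_script: str, domain_file: str, problem_file: str,
                plan_file: str = "sas_plan", alias: str = "lama-first") -> Optional[str]:
    """Run the planner and return the plan text, or None if no plan file was written."""
    out = Path(plan_file)
    out.unlink(missing_ok=True)  # avoid reading a stale plan from an earlier run
    cmd = build_fd_command(fd_script, domain_file, problem_file, plan_file, alias)
    subprocess.run(cmd, check=False, capture_output=True, text=True)
    return out.read_text() if out.exists() else None


if __name__ == "__main__":
    # Hypothetical paths; printing the command avoids needing the planner installed.
    print(build_fd_command("./fast-downward.py", "domain.pddl", "problem.pddl", "sas_plan"))
```

Treating "no plan file" as the failure signal makes it easy to count, per task, whether adding the one-shot example changes the planner's outcome.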

Agent Features

Memory

In-context learning (single example demonstration)

Planning

Translate NL to PDDL (LLM)

Run classical PDDL planner to produce optimal plan

Translate symbolic plan to natural language or robot actions

Tool Use

FAST-DOWNWARD

PDDL (domain + problem files)

Is Agentic

Yes

Architectures

LLM (GPT-4) + classical planner (FAST-DOWNWARD)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Cranial-XIX/llm-pddl.git

Data URLs

https://github.com/Cranial-XIX/llm-pddl.git (benchmark PDDL files and generator usage)

Risks & Boundaries

Limitations

Requires a domain PDDL file for each domain; authors assume a human provides it (Section III-C).

The LLM must be given a short (problem, PDDL) demonstration; without it, the generated PDDL is often incorrect.

When Not To Use

When you cannot produce a reliable domain PDDL (open-world tasks without a fixed action set).

When perception and low-level motion tightly couple to symbolic planning and no abstraction to PDDL exists.

Failure Modes

LLM omits or mangles initial conditions or uses made-up predicates → planner finds no plan.

LLM-only planning produces infeasible plans because it fails to track preconditions (e.g., ON, CLEAR).
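
The made-up-predicate failure can be caught before the planner runs by comparing the predicates used in the generated problem against those declared in the domain. The check below is a simplified sketch that is not part of the paper: the function names are hypothetical, and it ignores PDDL comments, numeric functions, and typing details.

```python
import re
from typing import Set

# Connectives and operators that appear in head position but are not predicates.
LOGICAL = {"and", "or", "not", "imply", "forall", "exists", "="}


def extract_section(pddl: str, keyword: str) -> str:
    """Return the balanced '(:keyword ...)' block, or '' if it is absent."""
    text = pddl.lower()
    start = text.find("(" + keyword)
    if start == -1:
        return ""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return text[start:]  # unbalanced input: return what is there


def head_symbols(section: str) -> Set[str]:
    """Collect the symbol that follows each '(' (section keywords like ':init' are skipped)."""
    return set(re.findall(r"\(\s*([^\s():]+)", section)) - LOGICAL


def unknown_predicates(domain_pddl: str, problem_pddl: str) -> Set[str]:
    """Predicates used in :init or :goal that the domain does not declare."""
    declared = head_symbols(extract_section(domain_pddl, ":predicates"))
    used = head_symbols(extract_section(problem_pddl, ":init"))
    used |= head_symbols(extract_section(problem_pddl, ":goal"))
    return used - declared


if __name__ == "__main__":
    domain = ("(define (domain blocksworld) (:predicates (on ?x ?y) (clear ?x) "
              "(on-table ?x) (arm-empty) (holding ?x)))")
    problem = ("(define (problem p) (:domain blocksworld) (:objects a b) "
               "(:init (on-table a) (on-table b) (clear a) (clear b) (arm-empty) (is-red a)) "
               "(:goal (and (on a b))))")
    print(unknown_predicates(domain, problem))
```

A non-empty result is a cheap signal to re-prompt the LLM rather than spend planner time on a problem file that cannot be solved as written.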

Core Entities

Models

GPT-4

Metrics

Success rate % of producing (optimal) plans

Plan cost (execution metric used in robot demo)
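
Both metrics are simple to compute once each task has a pass/fail outcome and a plan. The helpers below are an illustrative sketch: they assume a plan file with one parenthesized action per line and `;` comment lines, and they treat every action as unit cost, which may differ from the execution cost measured in the robot demo.

```python
from typing import List


def success_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks for which a valid (optimal) plan was produced."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


def plan_cost(plan_text: str) -> int:
    """Count action lines, assuming unit cost; comment lines starting with ';' are ignored."""
    return sum(1 for line in plan_text.splitlines() if line.strip().startswith("("))


if __name__ == "__main__":
    # 18 solved tasks out of 20 corresponds to a 90% success rate.
    print(success_rate([True] * 18 + [False] * 2))
    print(plan_cost("(pick-up a)\n(stack a b)\n; cost = 2 (unit cost)\n"))
```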

Datasets

7 PDDL domains (BLOCKSWORLD, BARMAN, FLOORTILE, GRIPPERS, STORAGE, TERMES, TYREWORLD) with 20 tasks each

PDDL generators (Seipp et al. 2022)

Benchmarks

7-domain robot planning benchmark (authors' suite)

