LLMs fail at autonomous planning (~3% success) but their plans can be repaired and slightly help humans

February 13, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The evidence is solid for small symbolic domains: autonomous LLM planning fails often; heuristic seeding plus a sound planner works reliably; human-assist gains are small and not statistically proven.

Citations: 31

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 100%

Novelty: 60%

Authors

Karthik Valmeekam, Sarath Sreedharan, Matthew Marquez, Alberto Olmo, Subbarao Kambhampati

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you plan to use LLMs for automated action sequencing or workflows, don't run them unsupervised — they rarely produce correct plans; use them as idea generators and pair with a certified planner or human review.

Summary TLDR

The authors build a PDDL-grounded benchmark for commonsense planning (Blocksworld-style) and test GPT-3 variants and BLOOM in three modes: autonomous plan generation, heuristic seeding for classical planners, and human-in-the-loop. Autonomous LLM planning is very poor (overall ~1–7% depending on model; authors summarize ≈3%), fine-tuning raises success to ~16–22% on seen domain data, and hiding action names collapses performance. Feeding LLM plans as seeds to a sound planner (LPG) reliably produces correct plans (LPG repaired all seeds). Human subjects do much better than LLMs (78% valid), and LLM suggestions give a small non-significant lift (74% → 82%). The benchmark and tools are public.

Problem Statement

Do general-purpose LLMs (transformer language models) know how to generate and evaluate simple executable plans? And can they act as useful heuristic guides for sound planners or human planners? The paper tests LLMs on formal, symbolic planning problems where correctness can be checked automatically.

Main Contribution

A public, PDDL-backed benchmark and testbed for evaluating planning abilities of LLMs using Blocksworld-style tasks and automated validators.

A three-mode evaluation protocol: autonomous generation, heuristic seeding for a sound planner (LPG), and human-in-the-loop studies with controlled user experiments.

Key Findings

LLMs rarely produce correct executable plans when used alone.

Numbers: GPT-3: 6/600 (1%); Instruct-GPT3: 41/600 (6.8%); BLOOM: 4/250 (1.6%); paper cites ≈3% average

Practical Use: Do not rely on off-the-shelf LLMs to autonomously generate correct plans; use automated verification or alternative methods.

Evidence Ref: Table 1; Abstract
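The percentages above follow directly from the reported counts. The short sketch below recomputes them and also shows a pooled rate over all instances. The pooled figure is only an illustrative aggregate; it is an assumption that it resembles how the paper arrives at its ≈3% headline number.

```python
# Counts of correct plans per model, copied from the Numbers line above.
results = {
    "GPT-3": (6, 600),
    "Instruct-GPT3": (41, 600),
    "BLOOM": (4, 250),
}


def success_rate(correct: int, total: int) -> float:
    """Return the percentage of instances solved with a correct plan."""
    return 100.0 * correct / total


per_model = {name: success_rate(c, n) for name, (c, n) in results.items()}

# Pooled rate: all correct plans divided by all instances.
# Illustrative only; not necessarily the paper's own averaging method.
pooled = success_rate(
    sum(c for c, _ in results.values()),
    sum(n for _, n in results.values()),
)

for name, rate in per_model.items():
    print(f"{name}: {rate:.1f}%")
print(f"Pooled: {pooled:.1f}%")
```

Pooling weights each model by its number of instances, so BLOOM's smaller run of 250 counts for less than the two 600-instance runs.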

A classical planner (LPG) can reliably repair LLM-generated seed plans.

Numbers: All Instruct-GPT3 seeds tested (600) repaired to valid plans; avg Levenshtein edit distance = 7.22 vs final plan length

Practical Use: Use LLMs to produce rough plan sketches, then run a sound planner to fix and certify the plan.

Evidence Ref: Section 6.2
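To make the edit-distance metric concrete, here is a minimal sketch of Levenshtein distance computed over plan steps rather than characters. Treating each action as one token is an assumption made for illustration, and the function name and the example plans are hypothetical, not taken from the benchmark.

```python
from typing import Sequence


def plan_edit_distance(seed: Sequence[str], final: Sequence[str]) -> int:
    """Levenshtein distance where each plan step counts as one token."""
    # prev[j] holds the distance between the seed prefix seen so far
    # and the first j steps of the final plan.
    prev = list(range(len(final) + 1))
    for i, seed_step in enumerate(seed, start=1):
        curr = [i]
        for j, final_step in enumerate(final, start=1):
            cost = 0 if seed_step == final_step else 1
            curr.append(min(
                prev[j] + 1,         # delete a step from the seed
                curr[j - 1] + 1,     # insert a step from the final plan
                prev[j - 1] + cost,  # keep or substitute a step
            ))
        prev = curr
    return prev[-1]


# Hypothetical seed and repaired plans for a toy Blocksworld instance.
seed_plan = ["unstack b a", "put-down b", "pick-up a", "stack a c"]
repaired_plan = ["unstack b a", "put-down b", "pick-up a", "stack a b"]
print(plan_edit_distance(seed_plan, repaired_plan))
```

A small distance relative to the plan length suggests the planner kept most of the seed, while a distance close to the plan length suggests it effectively replanned from scratch.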

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
--- | --- | --- | --- | --- | --- | ---
Autonomous plan generation success | GPT-3 1%; Instruct-GPT3 6.8%; BLOOM 1.6%; paper average ≈3% | Human baseline 78% valid | LLMs far lower than humans | Blocksworld instances (600 for GPT-3/Instruct, 250 for BLOOM) | Table 1; Section 6.1 | Table 1
Optimal planning success | GPT-3 0.3%; Instruct-GPT3 5.8%; BLOOM 2% | Human baseline optimality 89.7% (of valid) | Very low optimal outputs from LLMs | Blocksworld optimal planning instances | Table 1; Section 6.1 | Table 1

What To Try In 7 Days

Run the authors' benchmark on your domain to measure LLM plan quality.

Use an LLM to draft a seed plan and feed it to a sound planner (LPG or similar) to repair and certify outputs.

If you need higher coverage, fine-tune an LLM on domain transition examples, then validate every output automatically.
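The draft, validate, repair loop suggested above could be wired together roughly as follows. This is a sketch under stated assumptions: `draft_plan`, `is_valid`, and `repair` are hypothetical wrappers you would write around your LLM, a validator such as VAL, and a planner such as LPG. None of them are real library calls, and the stubs exist only so the example runs.

```python
from typing import Callable, List, Optional

Plan = List[str]


def plan_with_llm_seed(
    problem: str,
    draft_plan: Callable[[str], Plan],              # hypothetical LLM wrapper
    is_valid: Callable[[str, Plan], bool],          # hypothetical validator wrapper
    repair: Callable[[str, Plan], Optional[Plan]],  # hypothetical planner wrapper
) -> Optional[Plan]:
    """Return a validated plan, or None if no certified plan was found."""
    seed = draft_plan(problem)
    if is_valid(problem, seed):
        return seed  # the rare case where the LLM draft is already correct
    repaired = repair(problem, seed)
    # Never trust the repaired plan blindly either: validate before returning.
    if repaired is not None and is_valid(problem, repaired):
        return repaired
    return None


# Stub components so the sketch runs end to end; replace with real wrappers.
KNOWN_GOOD: Plan = ["unstack b a", "put-down b", "pick-up a", "stack a c"]


def fake_llm(problem: str) -> Plan:
    return ["pick-up a", "stack a c"]  # plausible-looking but wrong draft


def fake_validator(problem: str, plan: Plan) -> bool:
    return plan == KNOWN_GOOD


def fake_planner(problem: str, seed: Plan) -> Optional[Plan]:
    return list(KNOWN_GOOD)


result = plan_with_llm_seed("toy-problem", fake_llm, fake_validator, fake_planner)
print(result)
```

Passing the three components as callables keeps the control flow testable without an LLM or planner installed, and makes it explicit that correctness comes from the validator rather than from the model.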

Agent Features

Planning
autonomous generation, heuristic seeding, human-in-the-loop
Tool Use
LPG (local-search planner), Fast-Downward (optimal planner), VAL (validator)
Frameworks
PDDL
Is Agentic

Yes

Architectures
Transformer LLM
Collaboration
human-in-the-loop assistance, LLM → classical planner pipeline

Optimization Features

System Optimization
Use the LLM only to generate seed plans and delegate correctness to a symbolic planner
Training Optimization
Fine-tuning on domain examples (improves but limited)

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Benchmark is grounded mainly in Blocksworld — a small, symbolic, synthetic domain.

Evaluations use a limited set of LLMs (GPT-3 variants and BLOOM) and specific prompt templates.

When Not To Use

Do not use an LLM alone for mission-critical planning or automation that requires guaranteed executability.

Avoid deploying LLM-generated plans without automatic validation in environments where errors are costly.
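As an illustration of what automatic validation means here, the sketch below checks whether each step of a Blocksworld plan is executable in the state it is applied to. It is a toy stand-in, not the VAL validator used in the paper. The state encoding, the four action names, and the function name are assumptions, and the check covers executability only, not goal satisfaction.

```python
from typing import Dict, List, Optional, Tuple


def first_violation(on: Dict[str, str], plan: List[Tuple[str, ...]]) -> Optional[int]:
    """Return the index of the first non-executable step, or None if all apply.

    `on` maps each block to what it rests on: "table" or another block.
    """
    on = dict(on)  # work on a copy so the caller's state is untouched
    holding: Optional[str] = None

    def clear(block: str) -> bool:
        # A block is clear if it is placed somewhere and nothing rests on it.
        return block in on and block not in on.values()

    for i, (name, *args) in enumerate(plan):
        if name == "pick-up" and len(args) == 1:
            x = args[0]
            ok = holding is None and clear(x) and on[x] == "table"
            if ok:
                del on[x]
                holding = x
        elif name == "put-down" and len(args) == 1:
            x = args[0]
            ok = holding == x
            if ok:
                on[x] = "table"
                holding = None
        elif name == "unstack" and len(args) == 2:
            x, y = args
            ok = holding is None and clear(x) and on[x] == y
            if ok:
                del on[x]
                holding = x
        elif name == "stack" and len(args) == 2:
            x, y = args
            ok = holding == x and clear(y)
            if ok:
                on[x] = y
                holding = None
        else:
            ok = False  # unknown action name or wrong number of arguments
        if not ok:
            return i
    return None


# Toy instance: b sits on a; a and c are on the table.
initial = {"a": "table", "b": "a", "c": "table"}
good = [("unstack", "b", "a"), ("put-down", "b"), ("pick-up", "a"), ("stack", "a", "c")]
bad = [("pick-up", "a"), ("stack", "a", "c")]  # a is not clear at step 0
print(first_violation(initial, good), first_violation(initial, bad))
```

Returning the index of the failing step, rather than a bare boolean, makes it easier to log where a generated plan breaks down.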

Failure Modes

Generates actions that violate preconditions or use wrong objects (non-executable plans).

Relies on surface names and pattern matching; fails when action/predicate names are disguised.
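One way to probe the second failure mode on your own prompts is to rename action names consistently before sending them to the model, then compare success rates. The sketch below does a whole-word substitution. The replacement tokens are made up for illustration and are not the benchmark's actual disguised vocabulary.

```python
import re
from typing import Dict


def disguise(text: str, mapping: Dict[str, str]) -> str:
    """Replace whole-word action or predicate names using `mapping`."""
    # Longest names first so a short name never shadows a longer alternative.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(n) for n in names) + r")(?![\w-])"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], text)


# Hypothetical renaming; any consistent, meaning-free vocabulary would do.
mapping = {
    "pick-up": "zorp",
    "put-down": "blick",
    "stack": "quen",
    "unstack": "mafe",
}

prompt = "(unstack b a) (put-down b) (pick-up a) (stack a c)"
print(disguise(prompt, mapping))
```

The lookarounds treat hyphens as part of a name, so `stack` inside `unstack` or a hyphenated identifier is left alone.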

Core Entities

Models

GPT-3 (davinci), Instruct-GPT3 (text-davinci-002), BLOOM (176B)

Metrics

Correct plan count / instance, Optimality (cost match), Levenshtein edit distance (plan edit), Accuracy

Datasets

Blocksworld synthetic instances (600/500 instance splits), Mystery (disguised) Blocksworld variants

Benchmarks

GPT-Plan-Benchmark (PDDL-backed Blocksworld suite)

Context Entities

Models

Fine-tuned GPT-3 (on blocksworld examples)

Metrics

p-values from t-tests (time and cognitive load)

Datasets

Human study set (50 participants; Prolific)