LLMs fail at autonomous planning (~3% success) but their plans can be repaired and slightly help humans

Overview

Decision Snapshot: Ready For Pilot

The evidence is solid for small symbolic domains: autonomous LLM planning fails on the vast majority of instances; heuristic seeding plus a sound planner works reliably; human-assist gains are small and not statistically significant.

Citations: 31

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 100%

Novelty: 60%

Authors

Karthik Valmeekam, Sarath Sreedharan, Matthew Marquez, Alberto Olmo, Subbarao Kambhampati

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you plan to use LLMs for automated action sequencing or workflows, don't run them unsupervised: they rarely produce correct plans. Use them as idea generators and pair them with a sound planner or human review.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

The authors build a PDDL-grounded benchmark for commonsense planning (Blocksworld-style) and test GPT-3 variants and BLOOM in three modes: autonomous plan generation, heuristic seeding for classical planners, and human-in-the-loop. Autonomous LLM planning is very poor (about 1–7% success depending on the model; the authors summarize it as ≈3%). Fine-tuning raises success to ~16–22% on seen domain data, and hiding action names collapses performance. Feeding LLM plans as seeds to a sound planner (LPG) reliably produces correct plans: LPG repaired every seed tested. Human subjects do much better than LLMs (78% valid), and LLM suggestions give a small, non-significant lift (74% → 82%). The benchmark and tools are public.

Problem Statement

Do general-purpose LLMs (transformer language models) know how to generate and evaluate simple executable plans? And can they act as useful heuristic guides for sound planners or human planners? The paper tests LLMs on formal, symbolic planning problems where correctness can be checked automatically.

Main Contribution

A public, PDDL-backed benchmark and testbed for evaluating planning abilities of LLMs using Blocksworld-style tasks and automated validators.

A three-mode evaluation protocol: autonomous generation, heuristic seeding for a sound planner (LPG), and human-in-the-loop studies with controlled user experiments.

Key Findings

LLMs rarely produce correct executable plans when used alone.

Numbers: GPT-3: 6/600 (1%); Instruct-GPT3: 41/600 (6.8%); BLOOM: 4/250 (1.6%); paper cites ≈3% average

Practical Use: Do not rely on off-the-shelf LLMs to autonomously generate correct plans; use automated verification or alternative methods.

Evidence Ref: Table 1; Abstract
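The per-model rates can be re-derived from the raw counts. The sketch below does that arithmetic and shows two possible ways to aggregate the three models. How the paper arrived at its ≈3% figure is not stated here, so both aggregates are assumptions shown only for comparison.

```python
# Counts taken from the Numbers line: (valid plans, instances attempted).
results = {
    "GPT-3": (6, 600),
    "Instruct-GPT3": (41, 600),
    "BLOOM": (4, 250),
}

# Per-model success rate in percent.
rates = {name: 100 * ok / total for name, (ok, total) in results.items()}

# Two candidate aggregates; which one (if either) the paper used is an assumption.
unweighted_mean = sum(rates.values()) / len(rates)
total_ok = sum(ok for ok, _ in results.values())
total_instances = sum(total for _, total in results.values())
pooled = 100 * total_ok / total_instances

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
print(f"unweighted mean of rates: {unweighted_mean:.1f}%")
print(f"pooled over all instances: {pooled:.1f}%")
```

Both aggregates land between 3% and 4%, which is in line with the rounded headline figure.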

A classical planner (LPG) can reliably repair LLM-generated seed plans.

Numbers: All 600 Instruct-GPT3 seeds tested were repaired to valid plans; avg Levenshtein edit distance = 7.22 (seed plan vs final plan)

Practical Use: Use LLMs to produce rough plan sketches, then run a sound planner to fix and certify the plan.

Evidence Ref: Section 6.2
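To make the edit-distance metric concrete, here is a minimal sketch of Levenshtein distance computed over plans. It assumes each action is one token, so one edit means inserting, deleting, or replacing a whole action; whether the paper measures edits at exactly this granularity is an assumption. The function name and the example plans are illustrative, not taken from the benchmark.

```python
def plan_edit_distance(seed, final):
    """Levenshtein distance between two plans, treating each action as one token."""
    # prev[j] holds the distance between the seed prefix seen so far and final[:j].
    prev = list(range(len(final) + 1))
    for i, seed_action in enumerate(seed, start=1):
        curr = [i]
        for j, final_action in enumerate(final, start=1):
            cost = 0 if seed_action == final_action else 1
            curr.append(min(
                prev[j] + 1,         # delete an action from the seed
                curr[j - 1] + 1,     # insert an action into the seed
                prev[j - 1] + cost,  # keep or substitute an action
            ))
        prev = curr
    return prev[-1]


# Illustrative plans (hypothetical, not from the paper's data).
seed_plan = ["unstack b a", "put-down b", "pick-up c", "stack c b"]
final_plan = ["unstack b a", "put-down b", "pick-up a", "stack a b"]

print(plan_edit_distance(seed_plan, final_plan))
print(plan_edit_distance([], final_plan))
```

With an empty seed the distance equals the length of the final plan, which gives a natural reference point for judging how much work a seed saves the planner.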

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Autonomous plan generation success	GPT-3 1%; Instruct-GPT3 6.8%; BLOOM 1.6%; paper average ≈3%	Human baseline 78% valid	LLMs far lower than humans	Blocksworld instances (600 for GPT-3/Instruct, 250 for BLOOM)	Table 1; Section 6.1	Table 1
Optimal planning success	GPT-3 0.3%; Instruct-GPT3 5.8%; BLOOM 2%	Human baseline optimality 89.7% (of valid)	Very low optimal outputs from LLMs	Blocksworld optimal planning instances	Table 1; Section 6.1	Table 1

What To Try In 7 Days

Run the authors' benchmark on your domain to measure LLM plan quality.

Use an LLM to draft a seed plan and feed it to a sound planner (LPG or similar) to repair and certify outputs.

If you need higher coverage, fine-tune an LLM on domain transition examples, then validate every output automatically.
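The "validate every output automatically" step can be illustrated with a toy checker. The sketch below simulates a four-action Blocksworld (pick-up, put-down, unstack, stack) and rejects any plan with a non-executable step or an unmet goal. It is a simplified stand-in written for this summary, not the VAL validator or the benchmark's code, and the state encoding (a dict from block to its support) is an assumption.

```python
def is_clear(on, block):
    """A block is clear when no other block rests on it."""
    return block not in on.values()


def validate(initial_on, goal_on, plan):
    """Simulate a plan and return (ok, message).

    initial_on and goal_on map each block to its support: "table" or another block.
    """
    on = dict(initial_on)
    holding = None
    for step, action in enumerate(plan, start=1):
        name, *args = action.split() or [""]
        ok = False
        if name == "pick-up" and len(args) == 1:
            x = args[0]
            ok = holding is None and on.get(x) == "table" and is_clear(on, x)
            if ok:
                del on[x]
                holding = x
        elif name == "put-down" and len(args) == 1:
            x = args[0]
            ok = holding == x
            if ok:
                on[x] = "table"
                holding = None
        elif name == "unstack" and len(args) == 2:
            x, y = args
            ok = (holding is None and y != "table"
                  and on.get(x) == y and is_clear(on, x))
            if ok:
                del on[x]
                holding = x
        elif name == "stack" and len(args) == 2:
            x, y = args
            ok = holding == x and y in on and is_clear(on, y)
            if ok:
                on[x] = y
                holding = None
        if not ok:
            return False, f"step {step} not executable: {action}"
    unmet = sorted(b for b, support in goal_on.items() if on.get(b) != support)
    if unmet:
        return False, "goal not reached for: " + ", ".join(unmet)
    return True, "valid"


# Hypothetical instance: b sits on a; the goal is a on b.
initial = {"a": "table", "b": "a", "c": "table"}
goal = {"a": "b"}
good_plan = ["unstack b a", "put-down b", "pick-up a", "stack a b"]
bad_plan = ["pick-up a", "stack a b"]  # a is not clear, so step 1 fails

print(validate(initial, goal, good_plan))
print(validate(initial, goal, bad_plan))
```

In a real pipeline the same gate would sit after the LLM draft and after the planner's repair, with a proper PDDL validator in place of this toy simulator.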

Agent Features

Planning

autonomous generation, heuristic seeding, human-in-the-loop

Tool Use

LPG (local-search planner), Fast-Downward (optimal planner), VAL (validator)

Frameworks

PDDL

Is Agentic

Yes

Architectures

Transformer LLM

Collaboration

human-in-the-loop assistance, LLM → classical planner pipeline

Optimization Features

System Optimization

Use the LLM only to produce seed plans and delegate correctness to a symbolic planner

Training Optimization

Fine-tuning on domain examples (improves but limited)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/karthikv792/gpt-plan-benchmark

Data URLs

https://github.com/karthikv792/gpt-plan-benchmark

Risks & Boundaries

Limitations

Benchmark is grounded mainly in Blocksworld — a small, symbolic, synthetic domain.

Evaluations use a limited set of LLMs (GPT-3 variants and BLOOM) and specific prompt templates.

When Not To Use

Do not use an LLM alone for mission-critical planning or automation that requires guaranteed executability.

Avoid deploying LLM-generated plans without automatic validation in environments where errors are costly.

Failure Modes

Generates actions that violate preconditions or use wrong objects (non-executable plans).

Relies on surface names and pattern matching; fails when action/predicate names are disguised.
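The second failure mode can be probed by renaming actions before prompting. The sketch below rewrites Blocksworld action names to meaningless tokens while leaving the plan structure intact. The mapping and the `disguise` helper are hypothetical illustrations; the names actually used in the benchmark's disguised variants are not reproduced here.

```python
import re

# Hypothetical mapping from real action names to meaningless tokens.
DISGUISE = {
    "pick-up": "act1",
    "put-down": "act2",
    "stack": "act3",
    "unstack": "act4",
}

# Match whole action names only: the lookarounds stop "stack" from
# matching inside "unstack" or inside any hyphenated identifier.
_names = sorted(DISGUISE, key=len, reverse=True)
_pattern = re.compile(
    r"(?<![\w-])(" + "|".join(re.escape(n) for n in _names) + r")(?![\w-])"
)


def disguise(text):
    """Replace every known action name in text with its disguised token."""
    return _pattern.sub(lambda m: DISGUISE[m.group(1)], text)


plan_text = "(unstack b a) (put-down b) (pick-up a) (stack a b)"
print(disguise(plan_text))
```

Because only the surface names change, a model that actually reasons over preconditions and effects should be unaffected, while one that relies on familiar names will degrade.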

Core Entities

Models

GPT-3 (davinci), Instruct-GPT3 (text-davinci-002), BLOOM (176B)

Metrics

Correct plan count / instance, Optimality (cost match), Levenshtein edit distance (plan edit), Accuracy

Datasets

Blocksworld synthetic instances (600/500 instance splits), Mystery (disguised) Blocksworld variants

Benchmarks

GPT-Plan-Benchmark (PDDL-backed Blocksworld suite)

Context Entities

Models

Fine-tuned GPT-3 (on blocksworld examples)

Metrics

p-values from t-tests (time and cognitive load)

Datasets

Human study set (50 participants; Prolific)
