PPNL: a controlled benchmark showing GPT-4 plans locally well but fails at long-term navigation

October 5, 2023 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.35

Citation Count

6

Authors

Mohamed Aghzal, Erion Plaku, Ziyu Yao

Why It Matters For Business

LLMs can handle short-range navigation when prompted interactively, but they are not yet reliable for long-distance or out-of-distribution path planning; use fine-tuned models for predictable, repeated environments and ReAct-like prompting for ad-hoc, locally-correct behavior.

Summary TLDR

The authors introduce PPNL, a synthetic grid-based benchmark to test LLMs' spatial-temporal reasoning via path planning. They test GPT-4 with four prompting styles (naive few-shot, action-and-effect, Chain-of-Thought, ReAct) and fine-tune BART/T5. Best in-context result: GPT-4+ReAct reaches 96.1% success on in-distribution 6×6 grids but shows limited long-horizon planning and low unreachable-goal detection. Fine-tuned T5 reaches ~98% in-distribution but fails to generalize to larger or denser grids. The benchmark and code will be released.

Problem Statement

Do text-only LLMs understand and execute long-horizon spatial plans? The paper builds a controlled grid-world benchmark where models must read a natural-language description of obstacles, start and goal(s), then output an action sequence that reaches the goal(s) while avoiding obstacles and obeying ordering constraints.
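The success criterion described above can be sketched as a small plan checker. This is a minimal illustration, not the benchmark's own evaluator: the action names (`up`, `down`, `left`, `right`), the `(row, col)` coordinate convention, and the function name `plan_succeeds` are all assumptions made for this example.

```python
# Assumed action vocabulary and (row, col) convention; the paper's exact
# encoding may differ.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def plan_succeeds(size, obstacles, start, goals, actions):
    """Return True if `actions` stays on the size x size grid, never enters
    an obstacle cell, and visits every goal in the order given."""
    blocked = set(obstacles)
    remaining = list(goals)
    pos = start
    # The start cell may already be the first goal.
    if remaining and pos == remaining[0]:
        remaining.pop(0)
    for action in actions:
        if action not in MOVES:
            return False  # unknown action: treat the plan as invalid
        dr, dc = MOVES[action]
        pos = (pos[0] + dr, pos[1] + dc)
        in_bounds = 0 <= pos[0] < size and 0 <= pos[1] < size
        if not in_bounds or pos in blocked:
            return False
        # Only the next goal in the required order counts as visited.
        if remaining and pos == remaining[0]:
            remaining.pop(0)
    return not remaining
```

For example, on a 3×3 grid with an obstacle at `(0, 1)`, the plan `["down", "right", "right", "up"]` from `(0, 0)` reaches `(0, 2)`, while `["right", "right"]` fails because it walks into the obstacle.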

Main Contribution

PPNL: a synthetic 2D grid benchmark for spatial-temporal reasoning and path planning.

Systematic evaluation of GPT-4 with four prompting strategies: naive few-shot, action-and-effect, Chain-of-Thought (CoT), and ReAct (interleaved reasoning+acting).

Fine-tuned baselines (BART, T5) and analysis of in-distribution vs out-of-distribution generalization.

Empirical findings: ReAct GPT-4 excels locally but lacks long-horizon planning; fine-tuned models perform well IID but fail OOD.

Release plan: dataset generation code, prompts, and implementations for reproducible research.

Key Findings

GPT-4 with ReAct achieved very high in-distribution success, but it often gets there through repeated short trials rather than a single long-horizon plan.

Numbers: Success = 96.1% (Table 3)

Action-and-effect prompting substantially improves naive few-shot performance.

Numbers: Action-effect success 75.7% vs naive 54.2% (15-shot): +21.5 pp (Table 3)

Fine-tuned small models solve IID tasks nearly perfectly but fail to generalize to larger grids.

Numbers: T5-base success IID = 97.9% vs 7×7 OOD = 54.8% (Tables 3, 4)

Models poorly detect unreachable goals.

Numbers: Unreachable accuracy ≈ 0% for GPT-4 prompting methods (Table 3)

Results

Success Rate (GPT-4, ReAct, in-distribution 6×6)

Value: 96.1%

Baseline: Naive few-shot (15-shot), 54.2%

Success Rate (Action-and-Effect vs Naive few-shot)

Value: 75.7% (Action-effect)

Baseline: 54.2% (Naive, 15-shot)

Success Rate (T5-base fine-tuned, in-distribution)

Value: 97.9%

Baseline: GPT-4 ReAct, 96.1%

Success Rate (T5-base fine-tuned, OOD 7×7)

Value: 54.8%

Baseline: T5-base IID, 97.9%

Unreachable-Goal Accuracy (GPT-4 prompting methods)

Value: ≈0%

Baseline: Fine-tuned models, up to 58.8% (BART) in some splits

Who Should Care

What To Try In 7 Days

Run ReAct-style iterative prompting when using GPT-4 for short navigation tasks to boost local success.

Fine-tune a small seq2seq model (T5) on your environment if tasks are repeatable and constrained.

Add explicit reachability checks (graph search) before asking an LLM to generate full plans.
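The reachability check in the last suggestion can be sketched with a breadth-first search over the grid. This is an illustrative sketch under assumptions, not code from the paper: it assumes 4-connected movement, `(row, col)` coordinates, and a hypothetical helper name `is_reachable`.

```python
from collections import deque


def is_reachable(size, obstacles, start, goal):
    """Breadth-first search on a size x size grid with 4-connected moves.
    Returns True if some obstacle-free path links start to goal."""
    blocked = set(obstacles)
    if start in blocked or goal in blocked:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        row, col = queue.popleft()
        if (row, col) == goal:
            return True
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (row + dr, col + dc)
            in_bounds = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if in_bounds and nxt not in blocked and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Running this before the LLM call lets the application return "unreachable" directly, which sidesteps the weak unreachable-goal detection reported for the prompting methods.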

Agent Features

Memory

  • local step-by-step state given via prompts (no persistent external memory)

Planning

  • ReAct: interleaved perceive-reason-act
  • Chain-of-Thought: stepwise internal reasoning
  • Action-and-Effect: explicit state update in prompt
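To make the action-and-effect idea concrete, the sketch below annotates each action in a demonstration with the position it leads to, so the state update is spelled out in the prompt text. The wording of each line, the action names, and the `(row, col)` convention are assumptions for illustration; the paper's actual prompt format is not reproduced here.

```python
# Assumed action vocabulary and (row, col) convention.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def action_effect_trace(start, actions):
    """Pair every action with the position it produces, one line per step.
    The line format is hypothetical, chosen only to show the idea."""
    pos = start
    lines = []
    for action in actions:
        dr, dc = MOVES[action]
        pos = (pos[0] + dr, pos[1] + dc)
        lines.append(f"{action} -> now at {pos}")
    return lines


demo = action_effect_trace((0, 0), ["right", "down"])
print("\n".join(demo))
```

A naive few-shot exemplar would list only the actions; here each exemplar step also carries its resulting state.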

Tool Use

  • No external runtime tools used by models; A* and TSP used offline for ground truth
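For a single goal, an offline ground-truth path of the kind mentioned above could be computed with a standard A* search. The sketch below is a generic textbook A* with a Manhattan-distance heuristic, not the authors' implementation; the action names, `(row, col)` convention, and the function name `astar_path` are assumptions, and the multi-goal ordering (the TSP part) is not covered.

```python
import heapq

# Assumed action vocabulary and (row, col) convention.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def astar_path(size, obstacles, start, goal):
    """Return a shortest list of actions from start to goal on a
    size x size grid, or None if the goal cannot be reached."""
    blocked = set(obstacles)
    if start in blocked or goal in blocked:
        return None

    def heuristic(cell):
        # Manhattan distance never overestimates on a 4-connected unit grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best_cost = {start: 0}
    # Heap entries: (estimated total cost, cost so far, cell, actions taken).
    frontier = [(heuristic(start), 0, start, [])]
    while frontier:
        _, cost, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return path
        if cost > best_cost.get(cell, float("inf")):
            continue  # stale entry; a cheaper route to this cell was found
        for name, (dr, dc) in MOVES.items():
            nxt = (cell[0] + dr, cell[1] + dc)
            in_bounds = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if not in_bounds or nxt in blocked:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                entry = (new_cost + heuristic(nxt), new_cost, nxt, path + [name])
                heapq.heappush(frontier, entry)
    return None
```

The length of the returned path gives the optimal step count against which a model's plan can be compared.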

Frameworks

  • ReAct
  • Chain-of-Thought
  • Action-and-Effect prompting

Is Agentic

true

Architectures

  • autoregressive (GPT-4)
  • seq2seq (BART, T5)

Optimization Features

Token Efficiency

  • ReAct increases token usage and inference cost because it needs multiple trials

Training Optimization

  • Fine-tuning on synthetic PPNL training split

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmark uses synthetic 2D grids; real-world complexity and continuous motion are not covered.
  • GPT-4 few-shot results were evaluated on sampled subsets (cost-limited), not the full test set.
  • Unreachable-goal cases are underrepresented in some splits, limiting conclusions about reachability reasoning.

When Not To Use

  • Do not rely on ReAct prompting for long-distance, single-shot planning where repeated API calls are infeasible.
  • Avoid using fine-tuned small LLMs when environment sizes or obstacle densities deviate from training data.

Failure Modes

  • Long-horizon planning failures: ReAct often succeeds locally but fails to produce long uninterrupted optimal plans.
  • Poor unreachable-goal detection, leading to wasted action sequences.
  • Fine-tuned models overfit to grid size and obstacle distributions used during training; OOD performance drops sharply.

Core Entities

Models

  • GPT-4
  • GPT-4V
  • BART-base
  • BART-large
  • T5-base
  • T5-large

Metrics

  • Success Rate
  • Optimal Rate
  • Accuracy
  • Feasible Rate
  • Distance to Goal(s)

Datasets

  • PPNL (this paper)

Benchmarks

  • PPNL

Context Entities

Models

  • Previous LLM baselines referenced (e.g., GPT variants)

Metrics

  • Standard path-planning metrics (used as comparison)

Datasets

  • ALFRED
  • TextWorld
  • ReaSCAN

Benchmarks

  • Prior grounded reasoning benchmarks compared in Table 1