Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.35
Citation Count
6
Why It Matters For Business
LLMs can handle short-range navigation when prompted interactively, but they are not yet reliable for long-distance or out-of-distribution path planning. Use fine-tuned models for predictable, repeated environments, and ReAct-style prompting for ad-hoc tasks where locally correct behavior is enough.
Summary TLDR
The authors introduce PPNL, a synthetic grid-based benchmark to test LLMs' spatial-temporal reasoning via path planning. They test GPT-4 with four prompting styles (naive few-shot, action-and-effect, Chain-of-Thought, ReAct) and fine-tune BART/T5. Best in-context result: GPT-4+ReAct reaches 96.1% success on in-distribution 6×6 grids but shows limited long-horizon planning and low unreachable-goal detection. Fine-tuned T5 reaches ~98% in-distribution but fails to generalize to larger or denser grids. The benchmark and code will be released.
Problem Statement
Do text-only LLMs understand and execute long-horizon spatial plans? The paper builds a controlled grid-world benchmark where models must read a natural-language description of obstacles, start and goal(s), then output an action sequence that reaches the goal(s) while avoiding obstacles and obeying ordering constraints.
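To make the task concrete, here is a minimal sketch of how a predicted action sequence could be checked against a grid. It is not the authors' evaluator: the action names (`up`, `down`, `left`, `right`), the `(row, col)` coordinate convention, and the function name `simulate` are all assumptions made for illustration.

```python
# Hypothetical action vocabulary; the benchmark's actual action names may differ.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def simulate(size, obstacles, start, goals, actions):
    """Return True if `actions` visits every goal in the given order
    without leaving the size x size grid or stepping on an obstacle."""
    blocked = set(obstacles)
    remaining = list(goals)
    pos = start
    # The start cell may already be the first goal.
    if remaining and pos == remaining[0]:
        remaining.pop(0)
    for action in actions:
        if action not in MOVES:
            return False  # unknown action
        dr, dc = MOVES[action]
        pos = (pos[0] + dr, pos[1] + dc)
        in_bounds = 0 <= pos[0] < size and 0 <= pos[1] < size
        if not in_bounds or pos in blocked:
            return False  # left the grid or hit an obstacle
        # Only the next goal in the required order counts as reached.
        if remaining and pos == remaining[0]:
            remaining.pop(0)
    return not remaining
```

A plan counts as successful here only if every goal has been reached, in order, by the time the sequence ends; whether the benchmark also requires the agent to stop exactly on the last goal is not specified in this summary.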
Main Contribution
PPNL: a synthetic 2D grid benchmark for spatial-temporal reasoning and path planning.
Systematic evaluation of GPT-4 with four prompting strategies: naive few-shot, action-and-effect, Chain-of-Thought (CoT), and ReAct (interleaved reasoning+acting).
Fine-tuned baselines (BART, T5) and analysis of in-distribution vs out-of-distribution generalization.
Empirical findings: GPT-4 with ReAct excels locally but lacks long-horizon planning; fine-tuned models perform well in-distribution (IID) but fail out-of-distribution (OOD).
Release plan: dataset generation code, prompts, and implementations for reproducible research.
Key Findings
GPT-4 with ReAct achieves very high in-distribution success, but it often gets there through multiple short trials rather than a single long plan.
Action-and-effect prompting substantially improves on naive few-shot prompting.
Fine-tuned small models solve IID tasks nearly perfectly but fail to generalize to larger grids.
Models detect unreachable goals poorly.
Results
Success Rate (GPT-4, ReAct, in-distribution 6×6)
Success Rate (Action-and-Effect vs Naive few-shot)
Success Rate (T5-base fine-tuned, in-distribution)
Success Rate (T5-base fine-tuned, OOD 7×7)
Accuracy
Who Should Care
What To Try In 7 Days
Run ReAct-style iterative prompting when using GPT-4 for short navigation tasks to boost local success.
Fine-tune a small seq2seq model (T5) on your environment if tasks are repeatable and constrained.
Add explicit reachability checks (graph search) before asking an LLM to generate full plans.
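The reachability check suggested in the last item can be sketched with a breadth-first search over the grid. This is an illustrative sketch, not code from the paper; the function name `is_reachable`, the `(row, col)` convention, and 4-connected movement are assumptions.

```python
from collections import deque


def is_reachable(size, obstacles, start, goal):
    """Breadth-first search on a size x size grid with 4-connected moves.
    Returns True if `goal` can be reached from `start` without crossing
    any obstacle cell."""
    blocked = set(obstacles)
    if start in blocked or goal in blocked:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            in_bounds = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if in_bounds and nxt not in blocked and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Running this before the LLM call lets a pipeline skip plan generation entirely when the goal is walled off, instead of relying on the model to detect the unreachable case.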
Agent Features
Memory
- local step-by-step state given via prompts (no persistent external memory)
Planning
- ReAct: interleaved perceive-reason-act
- Chain-of-Thought: stepwise internal reasoning
- Action-and-Effect: explicit state update in prompt
Tool Use
- No external runtime tools used by models; A* and TSP used offline for ground truth
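For reference, a minimal A* sketch for computing a shortest-path length on such a grid is shown below. It is not the authors' ground-truth generator: the function name `astar_length`, unit step costs, 4-connected movement, and the Manhattan-distance heuristic are assumptions, and the TSP-style goal ordering mentioned above for multi-goal tasks is not covered.

```python
import heapq


def astar_length(size, obstacles, start, goal):
    """Return the number of steps on a shortest 4-connected path from
    `start` to `goal` on a size x size grid, or None if no path exists."""
    blocked = set(obstacles)
    if start in blocked or goal in blocked:
        return None

    def heuristic(cell):
        # Manhattan distance never overestimates with unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best = {start: 0}  # cheapest known cost to reach each cell
    heap = [(heuristic(start), 0, start)]
    while heap:
        _, cost, cell = heapq.heappop(heap)
        if cell == goal:
            return cost
        if cost > best.get(cell, float("inf")):
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            in_bounds = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if not in_bounds or nxt in blocked:
                continue
            new_cost = cost + 1
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None
```

Comparing the length of a model's successful plan with this value is one way to decide whether the plan is optimal, and a `None` result marks the goal as unreachable.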
Frameworks
- ReAct
- Chain-of-Thought
- Action-and-Effect prompting
Is Agentic
true
Architectures
- autoregressive (GPT-4)
- seq2seq (BART, T5)
Optimization Features
Token Efficiency
- ReAct increases token usage and inference cost because it relies on multiple trials
Training Optimization
- Fine-tuning on synthetic PPNL training split
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Benchmark uses synthetic 2D grids; real-world complexity and continuous motion are not covered.
- GPT-4 few-shot results were evaluated on sampled subsets (cost-limited), not the full test set.
- Unreachable-goal cases are underrepresented in some splits, limiting conclusions about reachability reasoning.
When Not To Use
- Do not rely on ReAct prompting for long-distance, single-shot planning where repeated API calls are infeasible.
- Avoid fine-tuned small models (BART, T5) when grid sizes or obstacle densities deviate from the training data.
Failure Modes
- Long-horizon planning failures: ReAct often succeeds locally but fails to produce long uninterrupted optimal plans.
- Poor unreachable-goal detection, leading to wasted action sequences.
- Fine-tuned models overfit to grid size and obstacle distributions used during training; OOD performance drops sharply.
Core Entities
Models
- GPT-4
- GPT-4V
- BART-base
- BART-large
- T5-base
- T5-large
Metrics
- Success Rate
- Optimal Rate
- Accuracy
- Feasible Rate
- Distance to Goal(s)
Datasets
- PPNL (this paper)
Benchmarks
- PPNL
Context Entities
Models
- Previous LLM baselines referenced (e.g., GPT variants)
Metrics
- Standard path-planning metrics (used as comparison)
Datasets
- ALFRED
- TextWorld
- ReaSCAN
Benchmarks
- Prior grounded reasoning benchmarks compared in Table 1

