Overview
The protocol and repeated tests provide strong evidence that vanilla LLMs lack reliable multi‑step planning. The results hold across models and graph types, but the study does not evaluate all augmentation strategies.
Citations: 22
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 20%
Production readiness: 30%
Novelty: 60%
Why It Matters For Business
Do not assume LLMs can plan multi‑step tasks from text alone; failures scale with graph complexity and can cause incorrect or looping actions in planning applications.
Summary TLDR
The authors introduce CogEval, a cognitive‑science–inspired protocol for testing functional abilities in LLMs, then use it to test cognitive maps and planning across many models (GPT‑4, GPT‑3.5, davinci, Bard, Claude, Cohere, LLaMA, Alpaca, etc.). Results: LLMs can solve simple route tasks by memorizing text, but systematically fail on tasks that require extracting and using the latent relational structure (cognitive maps). Common failures include hallucinated edges, suboptimal long routes, and loops. Chain‑of‑Thought (BFS/DFS) prompts help in some narrow cases but do not restore robust planning. The paper argues caution for out‑of‑the‑box planning uses and offers CogEval as a reproducible,
Problem Statement
Do modern LLMs understand latent relational structure (cognitive maps) and use it for multi‑step, goal‑directed planning? Prior claims are often anecdotal or contaminated; this work creates a controlled protocol and novel prompts (inspired by human experiments) to test planning across graph types, domains, and prompt conditions.
Main Contribution
Define CogEval: a cognitive‑science–inspired protocol for systematic testing of cognitive abilities in LLMs (multiple tasks, controls, repeats, stats).
Apply CogEval to cognitive maps and planning across a range of LLMs and task types (spatial, social, object relations; various graph structures).
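The "repeats, stats" part of the protocol can be illustrated with a small harness that scores the same prompt over several generations and reports a mean and standard error, the same kind of summary shown in the Results table. This is a minimal sketch, not the authors' code: `ask_model`, `score`, and the default of 30 repeats are assumptions made for illustration.

```python
import math
import statistics
from typing import Callable


def evaluate_condition(
    ask_model: Callable[[str], str],  # hypothetical: prompt -> model answer
    score: Callable[[str], float],    # hypothetical: answer -> score in [0, 1]
    prompt: str,
    repeats: int = 30,                # illustrative default, not taken from the source
) -> tuple[float, float]:
    """Return (mean, standard error) of the score over repeated generations."""
    scores = [score(ask_model(prompt)) for _ in range(repeats)]
    mean = statistics.fmean(scores)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    if len(scores) > 1:
        se = statistics.stdev(scores) / math.sqrt(len(scores))
    else:
        se = 0.0
    return mean, se
```

Running this once per combination of model, graph, domain, and task condition would yield one (mean, SE) pair per cell, which is what makes comparisons across conditions possible.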
Key Findings
Performance depends strongly on the LLM, the graph structure, the domain, and the task condition.
GPT‑4 scores near‑perfectly on simple 1‑step path tasks but fails on multi‑step planning in community graphs.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| 1stepPath (GPT‑4) | 0.99 mean (SE 0.08) | — | — | Table 3 tasks | GPT‑4 nearly perfect on adjacent goal tasks where route memorization suffices | Table 3 |
| policyReval (GPT‑4) | 0.21 mean (SE 0.18) | — | — | Table 3 tasks (policy revaluation) | Low performance when a change requires finding a new policy | Table 3 |
What To Try In 7 Days
Run CogEval‑style prompts on your model and graph types to measure real planning reliability.
For short tasks, add explicit CoT traversal hints (BFS) and validate outputs automatically.
Where planning matters, add algorithmic checks or a planner module to verify multi‑hop plans.
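One way to implement the algorithmic check suggested above is to compare each model‑proposed route against a breadth‑first‑search reference. The sketch below assumes the task graph is available as a dictionary of adjacency sets and that a plan is a list of node names; `bfs_shortest_path` and `plan_is_acceptable` are hypothetical helper names, not part of CogEval.

```python
from collections import deque


def bfs_shortest_path(graph, start, goal):
    """Return one shortest path from start to goal, or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(graph.get(node, ())):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def plan_is_acceptable(graph, plan, start, goal):
    """Accept a plan only if it uses real edges and is as short as the BFS optimum."""
    if not plan or plan[0] != start or plan[-1] != goal:
        return False
    # Every consecutive pair must be an edge that exists in the graph.
    if any(b not in graph.get(a, ()) for a, b in zip(plan, plan[1:])):
        return False
    reference = bfs_shortest_path(graph, start, goal)
    return reference is not None and len(plan) == len(reference)
```

A rejected plan can be replaced with the BFS reference path or sent back to the model for another attempt; which fallback is appropriate depends on the application.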
Risks & Boundaries
Limitations
No access to proprietary models' training data or internals; contamination cannot be excluded but prompts were designed to reduce it.
Human experiments were non‑linguistic; converting them to text may change task dynamics.
When Not To Use
Do not rely on out‑of‑the‑box LLM responses for safety‑critical multi‑hop planning.
Avoid using LLMs alone for planning in dense community graph problems without verification.
Failure Modes
Hallucinated edges: model invents connections that don't exist.
Suboptimal routing: returns longer trajectories instead of shortest path.
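These failure modes can be flagged automatically when the ground‑truth graph is known. The sketch below is illustrative only: the label strings are invented here, a "loop" is assumed to mean any revisited node, and `optimal_hops` is assumed to come from a separate shortest‑path computation.

```python
def diagnose_plan(graph, plan, optimal_hops):
    """Label a proposed path with the failure modes it exhibits.

    graph: dict mapping each node to the set of nodes it connects to.
    plan: list of node names proposed by the model.
    optimal_hops: number of edges on a known shortest path.
    """
    labels = []
    steps = list(zip(plan, plan[1:]))
    # Hallucinated edge: a step between two nodes the graph does not connect.
    if any(b not in graph.get(a, ()) for a, b in steps):
        labels.append("hallucinated_edge")
    # Loop (assumed definition): the path visits some node more than once.
    if len(set(plan)) < len(plan):
        labels.append("loop")
    # Suboptimal routing: more steps than the shortest path requires.
    if len(steps) > optimal_hops:
        labels.append("suboptimal")
    return labels
```

Counting these labels across many responses gives a per‑model failure profile rather than a single pass/fail score.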

