CogEval: systematic tests show LLMs fail at cognitive maps and multi‑step planning

Overview

Decision Snapshot: Needs Validation

The protocol and repeated tests provide strong evidence that vanilla LLMs lack reliable multi‑step planning. The results hold across models and graph types, but the study does not evaluate all augmentation strategies.

Citations: 22

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 30%

Novelty: 60%

Authors

Ida Momennejad, Hosein Hasanbeig, Felipe Vieira, Hiteshi Sharma, Robert Osazuwa Ness, Nebojsa Jojic, Hamid Palangi, Jonathan Larson

Why It Matters For Business

Do not assume LLMs can plan multi‑step tasks from text alone; failures scale with graph complexity and can cause incorrect or looping actions in planning applications.

Who Should Care

Product Manager, ML Engineer, CTO, Engineering Lead

Summary TLDR

The authors introduce CogEval, a cognitive‑science–inspired protocol for testing functional abilities in LLMs, then use it to test cognitive maps and planning across many models (GPT‑4, GPT‑3.5, davinci, Bard, Claude, Cohere, LLaMA, Alpaca, etc.). Results: LLMs can solve simple route tasks by memorizing text, but systematically fail on tasks that require extracting and using the latent relational structure (cognitive maps). Common failures include hallucinated edges, suboptimal long routes, and loops. Chain‑of‑Thought (BFS/DFS) prompts help in some narrow cases but do not restore robust planning. The paper argues caution for out‑of‑the‑box planning uses and offers CogEval as a reproducible,

Problem Statement

Do modern LLMs understand latent relational structure (cognitive maps) and use it for multi‑step, goal‑directed planning? Prior claims are often anecdotal or contaminated; this work creates a controlled protocol and novel prompts (inspired by human experiments) to test planning across graph types, domains, and prompt conditions.

Main Contribution

Define CogEval: a cognitive‑science–inspired protocol for systematic testing of cognitive abilities in LLMs (multiple tasks, controls, repeats, stats).

Apply CogEval to cognitive maps and planning across a range of LLMs and task types (spatial, social, object relations; various graph structures).
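The "multiple tasks, controls, repeats, stats" pattern can be sketched as a small harness. This is an illustrative sketch, not the authors' code: `evaluate`, `summarize`, `stub_model`, and the two example prompts are hypothetical names and texts introduced here, and the standard error is assumed to be the usual standard error of the mean over repeated 0/1 scores.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, standard error of the mean) for a list of 0/1 scores."""
    n = len(scores)
    se = stdev(scores) / sqrt(n) if n > 1 else 0.0
    return mean(scores), se


def evaluate(ask_model, tasks, repeats=30):
    """Run every task `repeats` times and summarize the success rate.

    tasks maps a condition name to (prompt, is_correct), where is_correct
    is a function that scores one model answer as True or False.
    """
    results = {}
    for name, (prompt, is_correct) in tasks.items():
        scores = [1 if is_correct(ask_model(prompt)) else 0 for _ in range(repeats)]
        results[name] = summarize(scores)
    return results


# Hypothetical stand-in for a real model call: it always answers "B".
def stub_model(prompt):
    return "B"


# Hypothetical prompts; the condition names follow the summary's Results table.
tasks = {
    "1stepPath": ("Which room is one step from the goal?", lambda ans: ans == "B"),
    "policyReval": ("The reward has moved. Which room now?", lambda ans: ans == "C"),
}

report = evaluate(stub_model, tasks, repeats=5)
for condition, (avg, se) in report.items():
    print(f"{condition}: mean={avg:.2f} SE={se:.2f}")
```

Replacing `stub_model` with a real API call and increasing `repeats` gives per-condition means and standard errors of the same kind the Results table reports.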

Key Findings

LLM, graph, domain, and condition strongly predict performance.

Numbers: LLM χ²=2357.87; graph χ²=3431.53; condition χ²=2080.04; domain χ²=458.74 (all p<.001)

Practical Use: Expect large performance swings across models and problem graphs; validate on your specific graph types before deployment.

Evidence Ref: Table 2, Section 3.1
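For intuition about how a χ² value can arise from logistic regression deviance, here is a simplified sketch. It handles only one categorical factor at a time, where the fitted probabilities equal the per-level success rates, so the statistic is the likelihood-ratio difference between that model and an intercept-only model. The paper's regression includes several factors at once, so this will not reproduce the reported values; the counts and the name `factor_chi_squared` are made up for illustration.

```python
from math import log


def xlogy(x, y):
    """x * log(y), defined as 0 when x is 0 so empty cells contribute nothing."""
    return 0.0 if x == 0 else x * log(y)


def binomial_log_likelihood(successes, trials, p):
    """Log-likelihood of the observed successes under success probability p."""
    return xlogy(successes, p) + xlogy(trials - successes, 1 - p)


def factor_chi_squared(counts):
    """Likelihood-ratio chi-squared for one categorical factor.

    counts maps each factor level to (successes, trials).
    Returns (chi_squared, degrees_of_freedom).
    """
    total_s = sum(s for s, _ in counts.values())
    total_n = sum(n for _, n in counts.values())
    p0 = total_s / total_n  # intercept-only model: one shared success rate
    ll_null = sum(binomial_log_likelihood(s, n, p0) for s, n in counts.values())
    ll_full = sum(binomial_log_likelihood(s, n, s / n) for s, n in counts.values())
    return 2 * (ll_full - ll_null), len(counts) - 1


# Made-up counts: successes out of 10 trials for two hypothetical models.
chi2, df = factor_chi_squared({"model_a": (9, 10), "model_b": (2, 10)})
print(f"chi2={chi2:.2f}, df={df}")
```

A larger statistic means the factor explains more of the variation in success; when all levels have the same success rate the statistic is zero.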

GPT‑4 can score near perfect on simple 1‑step path tasks but fails on multi‑step community graphs.

Numbers: 1stepPath GPT‑4 mean 0.99 (SE 0.08); policyReval GPT‑4 mean 0.21 (SE 0.18)

Practical Use: High performance on simple shortcuts does not imply robust multi‑step planning; test revaluation and detour scenarios.

Evidence Ref: Table 3, Figure 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
1stepPath (GPT‑4)	0.99 mean (SE 0.08)	—	—	Table 3 tasks	GPT‑4 nearly perfect on adjacent goal tasks where route memorization suffices	Table 3
policyReval (GPT‑4)	0.21 mean (SE 0.18)	—	—	Table 3 tasks (policy revaluation)	Low performance when a change requires finding a new policy	Table 3

What To Try In 7 Days

Run CogEval‑style prompts on your model and graph types to measure real planning reliability.

For short tasks, add explicit CoT traversal hints (BFS) and validate outputs automatically.

Where planning matters, add algorithmic checks or a planner module to verify multi‑hop plans.
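One way to implement the algorithmic check suggested above is to validate each model-proposed route against the known graph. The sketch below is an assumption about how such a check might look, not part of CogEval: `check_plan` and `shortest_path_length` are hypothetical helpers, the graph is a made-up directed adjacency list, and the problem labels mirror the failure modes named in this summary (hallucinated edges, loops, suboptimal routes).

```python
from collections import deque


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the edge count of a shortest path, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def check_plan(graph, plan, start, goal):
    """Return a list of problems in a proposed node sequence (empty means OK)."""
    problems = []
    if not plan or plan[0] != start or plan[-1] != goal:
        problems.append("wrong_endpoints")
    for a, b in zip(plan, plan[1:]):
        if b not in graph.get(a, ()):
            problems.append(f"hallucinated_edge:{a}->{b}")
    if len(set(plan)) != len(plan):
        problems.append("loop")  # some node is visited more than once
    best = shortest_path_length(graph, start, goal)
    if best is not None and len(plan) - 1 > best:
        problems.append("suboptimal")
    return problems


# Made-up directed graph for illustration.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E", "A"], "E": []}

print(check_plan(graph, ["A", "B", "D", "E"], "A", "E"))
print(check_plan(graph, ["A", "E"], "A", "E"))
print(check_plan(graph, ["A", "B", "D", "A", "C", "D", "E"], "A", "E"))
```

A plan that returns a non-empty list can be rejected or sent back to the model for another attempt; for undirected graphs, list each edge in both directions.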

Agent Features

Memory

Appears to rely on trajectories memorized from the prompt rather than on structured maps

Planning

No robust out-of-the-box planning

CoT (BFS/DFS) can help in narrow cases

Frameworks

CogEval

Architectures

transformer

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://tinyurl.com/cogmaps-in-llm

Data URLs

https://tinyurl.com/cogmaps-in-llm

Risks & Boundaries

Limitations

No access to proprietary models' training data or internals; contamination cannot be excluded but prompts were designed to reduce it.

Human experiments were non‑linguistic; converting them to text may change task dynamics.

When Not To Use

Do not rely on out‑of‑the‑box LLM responses for safety‑critical multi‑hop planning.

Avoid using LLMs alone for planning in dense community graph problems without verification.

Failure Modes

Hallucinated edges: model invents connections that don't exist.

Suboptimal routing: returns longer trajectories instead of shortest path.

Core Entities

Models

gpt-4, gpt-3.5-turbo-175B, text-davinci-003, bard, cohere-xlarge-52.4B, anthropic-claude-1-52B, llama-13B, alpaca-7B, pythia-20b

Metrics

Adjusted Rand Index (ARI), success rate, logistic regression deviance / chi-squared
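For readers unfamiliar with the Adjusted Rand Index, the sketch below computes it from two label assignments using the standard pair-counting formula. The summary does not say which task ARI scores, so the example labels are made up; `adjusted_rand_index` is a name chosen here, and the function assumes at least two items (it needs Python 3.8+ for `math.comb`).

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two clusterings of the same items."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs placed together in both, in the first, in the second.
    sum_both = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings use a single cluster
    return (sum_both - expected) / (max_index - expected)


# Made-up labels: same grouping under different names, then a crossed grouping.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

ARI is 1 for identical groupings regardless of label names, near 0 for chance-level agreement, and can be negative when agreement is worse than chance.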

Benchmarks

CogEval
