Use short dialogues (not static tests) to map where LLMs fail at everyday spatial reasoning

April 22, 20236 min

Overview

Production Readiness

1

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

9

Authors

Anthony G Cohn, Jose Hernandez-Orallo

Links

Abstract / PDF

Why It Matters For Business

If your product relies on spatial commonsense (navigation, robotics, instructions), off-the-shelf LLMs can appear confident but make non-obvious errors; you must validate behavior with multi-turn tests before deployment.

Summary TLDR

The authors argue that static benchmarks hide many spatial commonsense failures. They introduce a dialectical (dialogue-based) evaluation and run manual conversations with ChatGPT (GPT-3.5 & GPT-4 variants) and Bard on a curated set of spatial problems (parthood, rotation, direction, size, affordances, object permanence). Results: fluent answers are common but so are systematic errors, contradictions, and wrong justifications. Dialectical probing reveals boundaries missed by aggregate scores. Data and full dialogs are in the appendix.

Problem Statement

Static benchmarks and aggregated scores can overstate LLM commonsense. Training-data leakage, dataset artifacts, multiple-choice shortcuts, and aggregation hide failures. The paper asks: can iterative dialogue (feeding answers back into the session) find concrete spatial-reasoning breakdowns and map model limits?

Main Contribution

Propose and demonstrate a dialectical (dialogue-driven) evaluation that probes consistency and boundary cases rather than aggregate accuracy

Curate and run many manual spatial reasoning probes (parthood, rotation, direction, size, affordances, chains, object permanence) against ChatGPT (GPT-3.5, GPT-4) and Bard

Show concrete failure modes (contradictions, wrong reasoning, overconfident linguistic heuristics) and offer directions for automated/adaptive test trees

Key Findings

LLMs often answer fluently but make basic spatial mistakes

Responding within the same chat reveals inconsistencies that isolated prompts miss

GPT-4 shows improvements but still commits nontrivial spatial errors

Errors include wrong physical inferences, self-contradiction, and linguistic heuristics used as reasoning

Results

Dialectical probe outcomes

ValueMany fluent answers but frequent incorrect or contradictory responses across spatial tasks

Model comparison (qualitative)

ValueGPT-4 often more correct than GPT-3.5 but still fails on many variants

BaselineGPT-3.5 behaviour

Robustness to minor variants

ValueHigh sensitivity: small wording changes produce different answers

Who Should Care

What To Try In 7 Days

Run short dialectical scripts: ask a spatial question, then challenge the model with variants and ask for justifications

Create a small tree of follow-up prompts for critical spatial cases in your app and record inconsistencies

Treat fluent explanations as hypotheses; require cross-checks or simulation for safety-critical spatial claims

Agent Features

Memory

  • uses chat context (short-term conversation history)

Reproducibility

Data Urls

  • Appendix of the paper (full dialogs) as referenced in the paper

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Manual, small-scale and not exhaustive across models or prompt space
  • Stochastic LLM behavior means single dialogues may not generalize
  • No automated adaptive testing infrastructure implemented
  • Comparisons are qualitative; no standard numeric benchmark reported

When Not To Use

  • Don't rely on out-of-the-box LLMs for safety-critical spatial decisions
  • Don't replace structured simulation or symbolic reasoning with LLM output for precise geometry
  • Avoid trusting single-turn answers for tasks needing consistent multi-step spatial reasoning

Failure Modes

  • Self-contradiction across follow-ups
  • Using linguistic position heuristics instead of semantic reasoning (pronoun errors)
  • Incorrect physical intuitions (e.g., larger circle inside smaller)
  • Overconfident but fallacious justifications
  • High sensitivity to wording and chat history

Core Entities

Models

  • GPT-3.5-turbo (ChatGPT variants)
  • GPT-4 (ChatGPT variants)
  • Bard (LaMDA-based)

Metrics

  • Manual correctness labels (√ / ½ / ×) summarized in Table 1 (no standard numeric metric)

Datasets

  • Commonsense Problem Page (CPP)
  • Winograd Schema Challenge (WSC) examples (used as inspiration)

Benchmarks

  • Winograd Schema
  • BIG-Bench (discussed)