Use short dialogues (not static tests) to map where LLMs fail at everyday spatial reasoning

April 22, 20236 min

Overview

Decision SnapshotNeeds Validation

The paper gives clear, replicable manual examples showing failure modes, but results are qualitative and sample-limited so treat findings as indicative rather than definitive.

Citations9

Evidence Strength0.60

Confidence0.80

Risk Signals12

Trust Signals

Findings with numeric evidence: 0/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 100%

Novelty: 60%

Authors

Anthony G Cohn, Jose Hernandez-Orallo

Links

Abstract / PDF / Data

Why It Matters For Business

If your product relies on spatial commonsense (navigation, robotics, instructions), off-the-shelf LLMs can appear confident but make non-obvious errors; you must validate behavior with multi-turn tests before deployment.

Who Should Care

Summary TLDR

The authors argue that static benchmarks hide many spatial commonsense failures. They introduce a dialectical (dialogue-based) evaluation and run manual conversations with ChatGPT (GPT-3.5 & GPT-4 variants) and Bard on a curated set of spatial problems (parthood, rotation, direction, size, affordances, object permanence). Results: fluent answers are common but so are systematic errors, contradictions, and wrong justifications. Dialectical probing reveals boundaries missed by aggregate scores. Data and full dialogs are in the appendix.

Problem Statement

Static benchmarks and aggregated scores can overstate LLM commonsense. Training-data leakage, dataset artifacts, multiple-choice shortcuts, and aggregation hide failures. The paper asks: can iterative dialogue (feeding answers back into the session) find concrete spatial-reasoning breakdowns and map model limits?

Main Contribution

Propose and demonstrate a dialectical (dialogue-driven) evaluation that probes consistency and boundary cases rather than aggregate accuracy

Curate and run many manual spatial reasoning probes (parthood, rotation, direction, size, affordances, chains, object permanence) against ChatGPT (GPT-3.5, GPT-4) and Bard

Key Findings

LLMs often answer fluently but make basic spatial mistakes

Practical UseDo not trust fluent text as evidence of correct spatial reasoning; test with follow-up questions and contradictions

Evidence RefSections 2.1-2.4 (dialogue examples) and Table 1

Responding within the same chat reveals inconsistencies that isolated prompts miss

Practical UseUse multi-turn dialogues (keep context) to expose unstable beliefs and inconsistent justifications

Evidence RefIntroduction and Section 2 (multiple prompts within conversations show reversals

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Dialectical probe outcomesMany fluent answers but frequent incorrect or contradictory responses across spatial tasksmanual spatial prompts (Sections 2.12.4)Multiple example dialogs (trophy/case, bookcase, rotation, chains) and Table 1Sections 2 and Appendix
Model comparison (qualitative)GPT-4 often more correct than GPT-3.5 but still fails on many variantsGPT-3.5 behaviourimprovement observed but not completesame prompts across modelsTable 1 summary and interleaved dialog examplesSection 2 and Appendix

What To Try In 7 Days

Run short dialectical scripts: ask a spatial question, then challenge the model with variants and ask for justifications

Create a small tree of follow-up prompts for critical spatial cases in your app and record inconsistencies

Treat fluent explanations as hypotheses; require cross-checks or simulation for safety-critical spatial claims

Agent Features

Memory
uses chat context (short-term conversation history)

Reproducibility

Code AvailableNo
Data AvailableYes
Open Source StatusUnknown
LicenseUnknown

Data URLs

Appendix of the paper (full dialogs) as referenced in the paper

Risks & Boundaries

Limitations

Manual, small-scale and not exhaustive across models or prompt space

Stochastic LLM behavior means single dialogues may not generalize

When Not To Use

Don't rely on out-of-the-box LLMs for safety-critical spatial decisions

Don't replace structured simulation or symbolic reasoning with LLM output for precise geometry

Failure Modes

Self-contradiction across follow-ups

Using linguistic position heuristics instead of semantic reasoning (pronoun errors)

Core Entities

Models

GPT-3.5-turbo (ChatGPT variants)GPT-4 (ChatGPT variants)Bard (LaMDA-based)

Metrics

Manual correctness labels (√ / ½ / ×) summarized in Table 1 (no standard numeric metric)

Datasets

Commonsense Problem Page (CPP)Winograd Schema Challenge (WSC) examples (used as inspiration)

Benchmarks

Winograd SchemaBIG-Bench (discussed)