Small, irrelevant changes to Theory-of-Mind vignettes make GPT-3.5 fail

February 16, 2023 · 6 min

Overview

Decision Snapshot: Ready For Pilot

The paper gives clear, targeted examples of brittleness in GPT-3.5. The evidence is empirical but limited to hand-directed vignettes and one main model.

Citations: 79

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 100%

Production readiness: 100%

Novelty: 100%

Authors

Tomer Ullman

Why It Matters For Business

Relying on LLMs' apparent commonsense reasoning can be risky: models may fail on small, realistic changes and produce misleading outputs in user-facing scenarios.

Summary TLDR

The author replicates and perturbs classic Theory-of-Mind (ToM) text vignettes that prior work claimed GPT-3.5 passed. Small, commonsense-preserving changes (making containers transparent, stating that the agent cannot read, adding trusted testimony, changing 'in' to 'on', or querying the mover) flip the model's answers. The paper argues that these failures show GPT-3.5 lacks robust ToM, and it calls for skeptical evaluation and more principled tests.

Problem Statement

Recent claims that large language models (LLMs) exhibit Theory-of-Mind rely on typical ToM vignettes. The paper asks: are those successes robust to small changes that should not affect a true ToM reasoner? If not, passing such tests may be superficial.

Main Contribution

Systematic perturbations of classic ToM vignettes used in prior work to test GPT-3.5.

Empirical demonstration that simple, logically irrelevant changes flip the model's answers from correct to incorrect.
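
The perturbation idea above can be sketched as a small variant generator. This is a minimal illustration, not the paper's materials: the story text is a loose paraphrase of an unexpected-contents vignette, and the names `Vignette`, `TEMPLATE`, `build_variants`, and `label_should_drive_belief` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Vignette:
    name: str
    text: str
    # True only when the label is the agent's sole source of information,
    # so answering with the label ("chocolate") is the reasonable belief.
    label_should_drive_belief: bool


# Illustrative paraphrase, NOT the paper's exact wording.
TEMPLATE = (
    "Here is a {material} bag filled with popcorn. "
    "The label on the bag says 'chocolate'. "
    "Sam finds the bag. {extra}"
    "Sam believes the bag is full of"
)


def build_variants() -> List[Vignette]:
    """Return the base vignette plus three hand-written perturbations."""
    closed = "closed paper"
    return [
        Vignette("base", TEMPLATE.format(material=closed, extra=""), True),
        Vignette(
            "transparent",
            TEMPLATE.format(material="transparent plastic", extra="Sam can see inside it. "),
            False,
        ),
        Vignette(
            "cannot_read",
            TEMPLATE.format(material=closed, extra="Sam cannot read. "),
            False,
        ),
        Vignette(
            "testimony",
            TEMPLATE.format(
                material=closed,
                extra="A friend Sam trusts says the bag holds popcorn. ",
            ),
            False,
        ),
    ]


if __name__ == "__main__":
    for v in build_variants():
        print(f"[{v.name}] {v.text}")
```

Keeping every variant on one template means the only difference between prompts is the perturbation itself, which makes a flipped answer easier to attribute.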

Key Findings

Making an opaque container transparent still leads GPT-3.5 to predict that the agent believes the labeled content (chocolate) rather than the visible content (popcorn).

Numbers: Variation 1A: P(chocolate)=95% vs P(popcorn)=0%

Practical Use: Do not treat a model's pass on an opaque-container ToM test as proof of robust belief reasoning; test perceptual-access variants.

Evidence Ref: Section 2.1.1

Stating the agent cannot read still led GPT-3.5 to attribute belief from the label.

Numbers: Variation 1B: P(chocolate)=98% when 'Sam cannot read'

Practical Use: Check whether models actually use agent knowledge constraints (like literacy) before trusting ToM judgments.

Evidence Ref: Section 2.1.2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Content prompt (original) | P(popcorn)=100%; P(chocolate)=0% | | | Unexpected-contents vignette (base) | Kosinski reported content prompt: P(popcorn)=100%, P(chocolate)=0% | Section 2.1 (summary of Kosinski results) |
| Belief prompt (original) | P(chocolate)=99% (Kosinski) | | | Unexpected-contents belief prompt | Kosinski reported belief prompt: P(chocolate)=99% | Section 2.1 (summary of Kosinski results) |

What To Try In 7 Days

Replicate key ToM prompts used in your product and add simple perturbations (visibility, testimony, relation words)

Treat single-pass vignette success as weak evidence; run targeted stress tests on agent knowledge and perceptual access

Add unit tests that check consistency across agents and scenarios (ask about all agents' beliefs)
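
One way to approach the "ask about all agents' beliefs" item is a tiny consistency check over an unexpected-transfer story. This is a rough sketch under stated assumptions: the story, the agent names, and the `Ask` wrapper are all hypothetical, and `surface_cue_stub` is a stand-in that mimics a model following a surface cue, not a real model call.

```python
from typing import Callable, Dict

# Hypothetical wrapper around whatever model you use: prompt in, short answer out.
Ask = Callable[[str], str]

# Illustrative Sally-Anne style story, not taken from the paper.
STORY = (
    "Anna puts the cat in the basket and leaves the room. "
    "While Anna is away, Ben moves the cat from the basket to the box. "
)

# Anna did not see the move; Ben performed it.
EXPECTED = {"Anna": "basket", "Ben": "box"}


def check_all_agents(ask: Ask) -> Dict[str, bool]:
    """Query every agent's belief and report which answers match expectation."""
    results = {}
    for agent, location in EXPECTED.items():
        prompt = f"{STORY}{agent} thinks the cat is in the"
        answer = ask(prompt).strip().lower()
        results[agent] = location in answer
    return results


def surface_cue_stub(prompt: str) -> str:
    """Stand-in 'model' that always names the first container in the story."""
    return "basket"


if __name__ == "__main__":
    # The stub gets the absent agent right and the mover wrong.
    print(check_all_agents(surface_cue_stub))
```

A check like this passes only when the answers for every agent are right at once, so a model that happens to get the classic question right by pattern matching is still flagged.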

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Materials and methods from Kosinski (2023) reported as publicly available in paper

Risks & Boundaries

Limitations

Experiments focus mainly on GPT-3.5; other models not exhaustively tested

Not a full automated benchmark or large-scale sweep of prompts

When Not To Use

Do not use passing of basic ToM vignettes as proof of human-like belief reasoning

Do not assume success on a narrow prompt set generalizes to real-world agent modeling

Failure Modes

Sensitivity to small, semantically irrelevant prompt changes

Overreliance on surface cues like labels instead of agent knowledge

Core Entities

Models

GPT-3.5

Metrics

Model completion probabilities for answer tokens (P(answer))
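
As a rough illustration of this metric, the sketch below turns next-token log-probabilities into P(answer) for each candidate. It assumes each answer is a single token and that you already have a token-to-logprob mapping from your model API; the function name `answer_probs` and the example values are hypothetical, and this is not necessarily how the paper or Kosinski (2023) computed the reported numbers.

```python
import math
from typing import Dict, Iterable


def answer_probs(
    token_logprobs: Dict[str, float], answers: Iterable[str]
) -> Dict[str, float]:
    """Sum the probability mass of tokens that match each candidate answer.

    Matching ignores case and surrounding whitespace, so " chocolate" and
    "Chocolate" both count toward "chocolate". Unseen answers get 0.0.
    """
    probs = {}
    for answer in answers:
        target = answer.strip().lower()
        probs[answer] = sum(
            math.exp(logprob)
            for token, logprob in token_logprobs.items()
            if token.strip().lower() == target
        )
    return probs


if __name__ == "__main__":
    # Made-up log-probabilities for illustration only.
    example = {
        " chocolate": math.log(0.90),
        "Chocolate": math.log(0.05),
        " popcorn": math.log(0.01),
    }
    print(answer_probs(example, ["chocolate", "popcorn"]))
```

Comparing these values between a base vignette and a perturbed one gives a simple numeric signal for whether the model's preferred answer moved.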

Datasets

Classic Theory-of-Mind vignettes (unexpected-contents; Sally-Anne style)

Benchmarks

Text-based Theory-of-Mind probe suite (based on Kosinski 2023 vignettes)

Context Entities

Models

Other LLMs referenced in prior work (unspecified)

Metrics

Comparison to child-level performance claims (qualitative)

Datasets

Human ToM tests and developmental paradigms (smarties/unexpected contents, Sally-Anne)

Benchmarks

Kosinski (2023) ToM evaluation materials