Small, irrelevant changes to Theory-of-Mind vignettes make GPT-3.5 fail

February 16, 2023 · 6 min

Overview

Production Readiness

1

Novelty Score

1

Cost Impact Score

1

Citation Count

79

Authors

Tomer Ullman

Why It Matters For Business

Relying on LLMs' apparent commonsense reasoning can be risky: models may fail on small, realistic changes and produce misleading outputs in user-facing scenarios.

Summary TLDR

The author replicates and perturbs classic Theory-of-Mind (ToM) text vignettes that prior work claimed GPT-3.5 passed. Small changes that leave the commonsense reading of the story clear (making containers transparent, stating that the agent cannot read, adding trusted testimony, changing 'in' to 'on', or asking about the person who moved the object) flip the model's answers. The paper argues these failures show GPT-3.5 lacks robust ToM and calls for skeptical evaluation and more principled tests.

Problem Statement

Recent claims that large language models (LLMs) exhibit Theory-of-Mind rely on typical ToM vignettes. The paper asks: are those successes robust to small changes that should not affect a true ToM reasoner? If not, passing such tests may be superficial.

Main Contribution

Systematic perturbations of classic ToM vignettes used in prior work to test GPT-3.5.

Empirical demonstration that simple, logically irrelevant changes flip model answers from correct to incorrect.

Argument for a skeptical evaluation baseline: isolated success rates can hide brittle, non-generalizable behavior.

Key Findings

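When the opaque container is made transparent, so the agent can see the contents, GPT-3.5 still predicts that the agent believes what the label says (chocolate) rather than what is visibly inside (popcorn).

Numbers: Variation 1A: P(chocolate)=95% vs P(popcorn)=0%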

Stating the agent cannot read still led GPT-3.5 to attribute belief from the label.

Numbers: Variation 1B: P(chocolate)=98% when 'Sam cannot read'

Adding trusted testimony that contradicts the label did not prevent the model from favoring the label.

Numbers: Variation 1C: P(chocolate)=97% though 'Sam believes her friend', who said popcorn

Changing containers or relations (opaque→transparent; 'in'→'on') breaks correct transfer reasoning.

Numbers: Variation 2A: P(chest)=94% even though the chest is transparent; Variation 2B: errors when using 'on'

Querying the person who actually moved the object yields inconsistent answers.

Numbers: Variation 2D: for Mark (the mover), P(basket)=99% for 'thinks' but P(box)=43% for 'will look' (split answers)

Results

Content prompt (original)

Value: P(popcorn)=100%; P(chocolate)=0%

Belief prompt (original)

Value: P(chocolate)=99% (Kosinski)

Variation 1A (transparent)

Value: P(chocolate)=95%; P(popcorn)=0%

Baseline: Content prompt P(popcorn)=100%

Variation 2A (transparent containers)

Value: P(chest)=94%; P(box)=0%

Baseline: Original transfer prompt P(basket)=98%

Who Should Care

What To Try In 7 Days

Replicate key ToM prompts used in your product and add simple perturbations (visibility, testimony, relation words)

Treat single-pass vignette success as weak evidence; run targeted stress tests on agent knowledge and perceptual access

Add unit tests that check consistency across agents and scenarios (ask about all agents' beliefs)
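
The three steps above can be sketched as a small perturbation harness. This is a minimal sketch, not the paper's procedure: the story text is an illustrative paraphrase of an unexpected-contents vignette, and the `AnswerProbs` interface, `Variant`, `build_variants`, `run_checks`, and the `label_follower` stub are all hypothetical names introduced here. In practice you would replace the stub with a call to your own model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: (prompt, candidate answers) -> probability per candidate.
AnswerProbs = Callable[[str, List[str]], Dict[str, float]]

# Illustrative paraphrase of an unexpected-contents story, not the paper's exact text.
BASE = (
    "Here is a {container} filled with popcorn. There is no chocolate in it. "
    "The label on it says 'chocolate'. Sam finds it. {access}"
)
QUESTION = "Sam believes it is full of"
CANDIDATES = ["popcorn", "chocolate"]


@dataclass
class Variant:
    name: str
    prompt: str
    expected: str  # answer a robust belief reasoner should favor


def build_variants() -> List[Variant]:
    """Build the original story plus visibility and testimony perturbations."""
    def make(name: str, container: str, access: str, expected: str) -> Variant:
        story = BASE.format(container=container, access=access)
        return Variant(name, f"{story} {QUESTION}", expected)

    return [
        make("original", "bag",
             "She cannot see inside it. She reads the label.", "chocolate"),
        make("transparent", "transparent bag",
             "She can see inside it. She reads the label.", "popcorn"),
        make("testimony", "bag",
             "She cannot see inside it. A friend she trusts tells her it holds "
             "popcorn. She believes her friend. She reads the label.", "popcorn"),
    ]


def run_checks(model: AnswerProbs, variants: List[Variant]) -> List[str]:
    """Return the names of variants whose top-ranked answer is not the expected one."""
    failures = []
    for v in variants:
        probs = model(v.prompt, CANDIDATES)
        top = max(CANDIDATES, key=lambda c: probs.get(c, 0.0))
        if top != v.expected:
            failures.append(v.name)
    return failures


def label_follower(prompt: str, candidates: List[str]) -> Dict[str, float]:
    """Stub model that always favors the label, mimicking the reported failure pattern."""
    return {c: (0.95 if c == "chocolate" else 0.05) for c in candidates}


if __name__ == "__main__":
    print(run_checks(label_follower, build_variants()))
```

A model that simply follows the label passes the original vignette and fails both perturbations, which is why success on the unperturbed prompt alone is weak evidence.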

Reproducibility

Data URLs

  • Materials and methods from Kosinski (2023), which the paper reports as publicly available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments focus mainly on GPT-3.5; other models not exhaustively tested
  • Not a full automated benchmark or large-scale sweep of prompts
  • Some reported probabilities depend on exact prompt formatting and model version

When Not To Use

  • Do not treat passing basic ToM vignettes as proof of human-like belief reasoning
  • Do not assume success on a narrow prompt set generalizes to real-world agent modeling

Failure Modes

  • Sensitivity to small, semantically irrelevant prompt changes
  • Overreliance on surface cues like labels instead of agent knowledge
  • Inconsistent attribution across different agents in the same story
  • Confusion with simple relational language ('in' vs 'on')
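
One way to catch the inconsistency failure mode is to ask the same belief question in several phrasings and compare the top answers. The sketch below assumes you already have a probability per candidate answer for each phrasing. The function names are hypothetical, and the numbers in the example are made up for illustration, not taken from the paper.

```python
from typing import Dict


def top_answer(probs: Dict[str, float]) -> str:
    """Return the candidate with the highest probability."""
    return max(probs, key=lambda answer: probs[answer])


def inconsistent_phrasings(results: Dict[str, Dict[str, float]]) -> bool:
    """True if different phrasings of one belief question yield different top answers."""
    tops = {top_answer(probs) for probs in results.values()}
    return len(tops) > 1


if __name__ == "__main__":
    # Illustrative values only (not reported results).
    example = {
        "thinks": {"basket": 0.9, "box": 0.1},
        "will look": {"basket": 0.4, "box": 0.6},
    }
    print(inconsistent_phrasings(example))
```

The same check extends to asking about every agent in a story: a robust reasoner's answers should change only when the agents' knowledge differs.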

Core Entities

Models

  • GPT-3.5

Metrics

  • Model completion probabilities for answer tokens (P(answer))
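
This summary does not spell out how P(answer) is computed. One plausible reading is the probability the model assigns to the answer word as the next completion. The sketch below assumes you can obtain a log-probability per candidate answer from your model (a hypothetical input), and shows both the raw probability and a version renormalized over the candidates.

```python
import math
from typing import Dict


def raw_probabilities(logprobs: Dict[str, float]) -> Dict[str, float]:
    """Exponentiate each log-probability; values need not sum to 1."""
    return {answer: math.exp(lp) for answer, lp in logprobs.items()}


def renormalized(logprobs: Dict[str, float]) -> Dict[str, float]:
    """Softmax over the candidates only, so the values sum to 1."""
    peak = max(logprobs.values())  # subtract the max for numerical stability
    weights = {answer: math.exp(lp - peak) for answer, lp in logprobs.items()}
    total = sum(weights.values())
    return {answer: w / total for answer, w in weights.items()}
```

The reported pairs (for example 95% and 0%) do not sum to 100%, which appears more consistent with raw completion probabilities than with renormalized ones. Either way, the note under Limitations applies: the values depend on exact prompt formatting and model version.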

Datasets

  • Classic Theory-of-Mind vignettes (unexpected-contents; Sally-Anne style)

Benchmarks

  • Text-based Theory-of-Mind probe suite (based on Kosinski 2023 vignettes)

Context Entities

Models

  • Other LLMs referenced in prior work (unspecified)

Metrics

  • Comparison to child-level performance claims (qualitative)

Datasets

  • Human ToM tests and developmental paradigms (smarties/unexpected contents, Sally-Anne)

Benchmarks

  • Kosinski (2023) ToM evaluation materials