Small, irrelevant changes to Theory-of-Mind vignettes make GPT-3.5 fail

Overview

Decision Snapshot: Ready For Pilot

The paper gives clear, targeted examples of brittleness in GPT-3.5. The evidence is empirical but limited to directed vignettes and one main model.

Citations: 79

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 100%

Production readiness: 100%

Novelty: 100%

Authors

Tomer Ullman

Links

Abstract / PDF / Data

Why It Matters For Business

Relying on LLMs' apparent commonsense reasoning can be risky: models may fail on small, realistic changes and produce misleading outputs in user-facing scenarios.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Engineering Lead

Summary TLDR

The author replicates and perturbs classic Theory-of-Mind (ToM) text vignettes that prior work claimed GPT-3.5 passed. Small, commonsense-preserving changes (making containers transparent, saying the agent cannot read, adding trusted testimony, changing 'in' to 'on', or querying the mover) flip the model's answers from correct to incorrect. The paper argues that these failures show GPT-3.5 lacks robust ToM, and it calls for skeptical evaluation and more principled tests.

Problem Statement

Recent claims that large language models (LLMs) exhibit Theory-of-Mind rely on typical ToM vignettes. The paper asks: are those successes robust to small changes that should not affect a true ToM reasoner? If not, passing such tests may be superficial.

Main Contribution

Systematic perturbations of classic ToM vignettes used in prior work to test GPT-3.5.

Empirical demonstration that simple, logically irrelevant changes flip model answers from correct to incorrect.

Key Findings

When the opaque container is made transparent, GPT-3.5 still predicts that the agent believes the label rather than the visible contents.

Numbers: Variation 1A: P(chocolate)=95% vs P(popcorn)=0%

Practical Use: Do not treat a model's pass on an opaque-container ToM test as proof of robust belief reasoning; test perceptual-access variants.

Evidence Ref: Section 2.1.1

Stating the agent cannot read still led GPT-3.5 to attribute belief from the label.

Numbers: Variation 1B: P(chocolate)=98% when 'Sam cannot read'

Practical Use: Check whether models actually use agent knowledge constraints (like literacy) before trusting ToM judgments.

Evidence Ref: Section 2.1.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Content prompt (original)	P(popcorn)=100%; P(chocolate)=0%	—	—	Unexpected-contents vignette (base)	Kosinski reported content prompt: P(popcorn)=100%, P(chocolate)=0%	Section 2.1 (summary of Kosinski results)
Belief prompt (original)	P(chocolate)=99% (Kosinski)	—	—	Unexpected-contents belief prompt	Kosinski reported belief prompt: P(chocolate)=99%	Section 2.1 (summary of Kosinski results)

What To Try In 7 Days

Replicate key ToM prompts used in your product and add simple perturbations (visibility, testimony, relation words)

Treat single-pass vignette success as weak evidence; run targeted stress tests on agent knowledge and perceptual access

Add unit tests that check consistency across agents and scenarios (ask about all agents' beliefs)
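The steps above can be sketched as a small perturbation suite. This is a minimal illustration, not the paper's materials: the vignette wording is a loose paraphrase, and the names TEMPLATE, Case, build_cases, run_suite, and the ask_model callable are all hypothetical. ask_model stands in for whatever function sends a prompt to your model and returns its completion as a string.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical template loosely paraphrasing an unexpected-contents vignette;
# not the exact wording used by Kosinski (2023) or the paper.
TEMPLATE = (
    "Here is a bag filled with popcorn. There is no chocolate in the bag. "
    "The bag is made of {material}. The label on the bag says 'chocolate'. "
    "Sam finds the bag. {extra}Sam believes the bag is full of"
)


@dataclass(frozen=True)
class Case:
    name: str
    prompt: str
    bad_answer: str  # answer that signals the model ignored the scenario details


def build_cases() -> List[Case]:
    def make(name: str, material: str, extra: str, bad: str) -> Case:
        return Case(name, TEMPLATE.format(material=material, extra=extra), bad)

    return [
        # Base case: Sam cannot see inside, so "popcorn" would be the wrong belief.
        make("base_opaque", "opaque paper", "", "popcorn"),
        # Perceptual access: Sam can see the popcorn through the bag.
        make("transparent", "transparent plastic", "", "chocolate"),
        # Knowledge constraint: the label cannot inform Sam's belief.
        make("cannot_read", "opaque paper", "Sam cannot read. ", "chocolate"),
        # Trusted testimony overrides the label.
        make("trusted_testimony", "opaque paper",
             "A friend Sam trusts says the bag contains popcorn. ", "chocolate"),
    ]


def run_suite(ask_model: Callable[[str], str]) -> List[str]:
    """Return the names of cases where the completion contains the bad answer."""
    return [c.name for c in build_cases()
            if c.bad_answer in ask_model(c.prompt).lower()]


# Stub that always follows the label, standing in for a real model call.
label_follower = lambda prompt: "chocolate"
print(run_suite(label_follower))
```

With the label-following stub, the base case passes and the three perturbed cases are reported as failures, which mirrors the pattern described in Key Findings. Substring matching on the completion is a deliberately crude check; a production test would likely compare answer-token probabilities instead.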

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

Materials and methods from Kosinski (2023) reported as publicly available in paper

Risks & Boundaries

Limitations

Experiments focus mainly on GPT-3.5; other models not exhaustively tested

Not a full automated benchmark or large-scale sweep of prompts

When Not To Use

Do not use passing of basic ToM vignettes as proof of human-like belief reasoning

Do not assume success on a narrow prompt set generalizes to real-world agent modeling

Failure Modes

Sensitivity to small, semantically irrelevant prompt changes

Overreliance on surface cues like labels instead of agent knowledge

Core Entities

Models

GPT-3.5

Metrics

Model completion probabilities for answer tokens (P(answer))

Datasets

Classic Theory-of-Mind vignettes (unexpected-contents; Sally-Anne style)

Benchmarks

Text-based Theory-of-Mind probe suite (based on Kosinski 2023 vignettes)
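The P(answer) metric listed above can be illustrated with a short sketch. It assumes your model API exposes a log-probability for each candidate answer token; how those values are obtained is outside this example. The function names and the numbers are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict


def answer_probs(logprobs: Dict[str, float]) -> Dict[str, float]:
    """Convert per-answer log-probabilities into raw probabilities."""
    return {answer: math.exp(lp) for answer, lp in logprobs.items()}


def preferred(probs: Dict[str, float]) -> str:
    """Return the candidate answer with the highest probability."""
    return max(probs, key=probs.get)


# Hypothetical log-probabilities, chosen only to illustrate the arithmetic;
# they are not values reported in the paper.
example_logprobs = {"chocolate": math.log(0.9), "popcorn": math.log(0.05)}
probs = answer_probs(example_logprobs)
print(preferred(probs))
```

The probabilities are left unnormalized, so they need not sum to 100% across the candidate answers; the remaining mass belongs to other tokens.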

Context Entities

Models

Other LLMs referenced in prior work (unspecified)

Metrics

Comparison to child-level performance claims (qualitative)

Datasets

Human ToM tests and developmental paradigms (smarties/unexpected contents, Sally-Anne)

Benchmarks

Kosinski (2023) ToM evaluation materials
