Overview
The paper gives clear, targeted examples of brittleness in GPT-3.5. The evidence is empirical but limited to hand-constructed vignettes and one main model.
Citations: 79
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 100%
Production readiness: 100%
Novelty: 100%
Why It Matters For Business
Relying on LLMs' apparent commonsense reasoning can be risky: models may fail on small, realistic changes and produce misleading outputs in user-facing scenarios.
Who Should Care
Summary TLDR
The author replicates and perturbs classic Theory-of-Mind (ToM) text vignettes that prior work claimed GPT-3.5 passed. Small, commonsense-preserving changes flip the model's answers: making containers transparent, stating that the agent cannot read, adding trusted testimony, changing 'in' to 'on', or asking about the beliefs of the agent who moved the object. The paper argues these failures show GPT-3.5 lacks robust ToM and calls for skeptical evaluation and more principled tests.
Problem Statement
Recent claims that large language models (LLMs) exhibit Theory-of-Mind rely on typical ToM vignettes. The paper asks: are those successes robust to small changes that should not affect a true ToM reasoner? If not, passing such tests may be superficial.
Main Contribution
Systematic perturbations of classic ToM vignettes used in prior work to test GPT-3.5.
Empirical demonstration that simple, logically irrelevant changes flip model answers from correct to incorrect.
Key Findings
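Making the opaque container transparent does not change GPT-3.5's answer: it still predicts that the agent believes the label rather than the contents the agent can now see.
Stating that the agent cannot read did not stop GPT-3.5 from attributing the label's content to the agent's belief.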
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Content prompt (original) | P(popcorn)=100%; P(chocolate)=0% | — | — | Unexpected-contents vignette (base) | Kosinski reported content prompt: P(popcorn)=100%, P(chocolate)=0% | Section 2.1 (summary of Kosinski results) |
| Belief prompt (original) | P(chocolate)=99% (Kosinski) | — | — | Unexpected-contents belief prompt | Kosinski reported belief prompt: P(chocolate)=99% | Section 2.1 (summary of Kosinski results) |
What To Try In 7 Days
Replicate key ToM prompts used in your product and add simple perturbations (visibility, testimony, relation words)
Treat single-pass vignette success as weak evidence; run targeted stress tests on agent knowledge and perceptual access
Add unit tests that check consistency across agents and scenarios (ask about all agents' beliefs)
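The steps above can be sketched as a small perturbation harness. This is a minimal illustration under stated assumptions, not the paper's code: the vignette text is a paraphrase of the unexpected-contents setup, and `Case`, `build_cases`, `run`, and `label_follower` are hypothetical names. `label_follower` is a stand-in for a real model call; it always answers with the label, mimicking the surface-cue failure mode listed under Failure Modes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical paraphrase of the unexpected-contents vignette (not the paper's exact text).
BASE = (
    "Here is a bag filled with popcorn. There is no chocolate in the bag. "
    "The label on the bag says 'chocolate'. Sam finds the bag. "
    "She has never seen it before. She cannot see what is inside. "
    "She reads the label."
)
QUESTION = " What does Sam believe the bag contains? Answer in one word."


@dataclass(frozen=True)
class Case:
    name: str
    prompt: str
    expected: str  # the answer a robust belief reasoner should give


def build_cases(base: str) -> List[Case]:
    """Build the base case plus two perturbations that change what Sam knows."""
    opaque = "She cannot see what is inside."
    transparent = base.replace(
        opaque,
        "The bag is made of transparent plastic, so she can see what is inside.",
    )
    # Guard against a silent no-op if the base text is edited later.
    assert transparent != base, "perturbation did not apply"
    testimony = base + (
        " Earlier, a friend she trusts told her the bag contains popcorn,"
        " and she believes her friend."
    )
    return [
        Case("base", base + QUESTION, "chocolate"),
        Case("transparent", transparent + QUESTION, "popcorn"),
        Case("testimony", testimony + QUESTION, "popcorn"),
    ]


def run(model: Callable[[str], str], cases: List[Case]) -> Dict[str, bool]:
    """Return, per case, whether the expected word appears in the model's answer."""
    return {c.name: c.expected in model(c.prompt).lower() for c in cases}


def label_follower(prompt: str) -> str:
    """Stub model (assumption): always repeats the label, ignoring Sam's knowledge."""
    return "chocolate"


if __name__ == "__main__":
    for name, passed in run(label_follower, build_cases(BASE)).items():
        print(f"{name}: {'pass' if passed else 'FAIL'}")
```

To use it against a real system, replace `label_follower` with a function that sends the prompt to your model and returns its text. A model that passes only the base case is showing the pattern the paper warns about. Substring matching is a deliberately crude scoring rule; free-form answers would need stricter parsing.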
Risks & Boundaries
Limitations
Experiments focus mainly on GPT-3.5; other models not exhaustively tested
Not a full automated benchmark or large-scale sweep of prompts
When Not To Use
Do not use passing of basic ToM vignettes as proof of human-like belief reasoning
Do not assume success on a narrow prompt set generalizes to real-world agent modeling
Failure Modes
Sensitivity to small, semantically irrelevant prompt changes
Overreliance on surface cues like labels instead of agent knowledge

