Overview
Production Readiness
1
Novelty Score
1
Cost Impact Score
1
Citation Count
79
Why It Matters For Business
Relying on LLMs' apparent commonsense reasoning can be risky: models may fail on small, realistic changes and produce misleading outputs in user-facing scenarios.
Summary TLDR
The author replicates and perturbs classic Theory-of-Mind (ToM) text vignettes that prior work claimed GPT-3.5 passed. Small, commonsense-preserving changes (making containers transparent, stating that the agent cannot read, adding trusted testimony, changing 'in' to 'on', or asking about the beliefs of the person who moved the object) flip the model's answers. The paper argues that these failures show GPT-3.5 lacks robust ToM, and it calls for skeptical evaluation and more principled tests.
Problem Statement
Recent claims that large language models (LLMs) exhibit Theory-of-Mind rely on standard ToM vignettes. The paper asks: are those successes robust to small changes that a true ToM reasoner would handle easily? If not, passing such tests may be superficial.
Main Contribution
Systematic perturbations of classic ToM vignettes used in prior work to test GPT-3.5.
Empirical demonstration that simple changes, which a genuine ToM reasoner would handle easily, turn the model's correct answers into incorrect ones.
Argument for a skeptical evaluation baseline: isolated success rates can hide brittle, non-generalizable behavior.
Key Findings
Making an opaque container transparent does not stop GPT-3.5 from predicting that the agent holds the label-based, now wrong, belief about the contents.
Even when told that the agent cannot read, GPT-3.5 still attributed a label-based belief to the agent.
Adding trusted testimony that contradicts the label did not prevent the model from favoring the label.
In the unexpected-transfer (Sally-Anne style) task, making the containers transparent or placing the object 'on' rather than 'in' a container breaks the model's previously correct reasoning.
Asking about the beliefs of the person who actually moved the object yields inconsistent answers.
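The perturbations listed above can be expressed as a small set of prompt variants. The following is a minimal sketch, not the paper's materials: the story text is an illustrative paraphrase of an unexpected-contents vignette, and the names `Vignette`, `make`, and `VARIANTS` are hypothetical. Each variant records the belief a careful human reader would attribute to the agent, so that model answers can later be compared against it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Vignette:
    name: str
    text: str
    expected: str  # belief a careful human reader would attribute to Sam


# Illustrative paraphrase of an unexpected-contents story (assumption:
# this is NOT the exact wording used in the paper or in Kosinski 2023).
SETUP = [
    "Here is a bag filled with popcorn.",
    "There is no chocolate in the bag.",
    "The label on the bag says 'chocolate'.",
    "Sam finds the bag and has never seen it before.",
]
QUESTION = "Sam believes the bag is full of"


def make(name: str, middle: List[str], expected: str) -> Vignette:
    """Join the shared setup, the variant-specific sentences, and the question."""
    return Vignette(name, " ".join(SETUP + middle + [QUESTION]), expected)


VARIANTS = [
    make(
        "original",
        ["Sam cannot see inside the bag.", "Sam reads the label."],
        "chocolate",
    ),
    make(
        "transparent",
        ["The bag is made of transparent plastic, so Sam can see inside it.",
         "Sam reads the label."],
        "popcorn",
    ),
    make(
        "trusted_testimony",
        ["Sam cannot see inside the bag.",
         "A friend Sam trusts tells her the bag contains popcorn and to ignore the label.",
         "Sam reads the label."],
        "popcorn",
    ),
]

for v in VARIANTS:
    print(f"{v.name}: expected={v.expected}")
```

Other perturbations (an agent who cannot read, 'on' instead of 'in') could be added as further `make(...)` entries in the same way.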
Results
Content prompt (original)
Belief prompt (original)
Variation 1A (transparent)
Variation 2A (transparent containers)
Who Should Care
What To Try In 7 Days
Replicate key ToM prompts used in your product and add simple perturbations (visibility, testimony, relation words)
Treat single-pass vignette success as weak evidence; run targeted stress tests on agent knowledge and perceptual access
Add unit tests that check consistency across agents and scenarios (ask about all agents' beliefs)
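A consistency check across agents might look like the sketch below. It is written under assumptions: `ask` stands for whatever function wraps your model client (hypothetical, not a real library call), `stub_model` is a deliberately naive stand-in so the example runs offline, and the story is an illustrative Sally-Anne style paraphrase rather than the paper's text.

```python
from typing import Callable, Dict

# Hypothetical type: any function that sends a prompt to a model and
# returns its text completion.
Ask = Callable[[str], str]


def belief_mismatches(story: str, expected: Dict[str, str], ask: Ask) -> Dict[str, str]:
    """Ask about every agent's belief; return {agent: answer} for wrong answers."""
    wrong = {}
    for agent, want in expected.items():
        prompt = (
            f"{story}\n"
            f"Question: Where does {agent} think the marble is?\n"
            f"Answer:"
        )
        got = ask(prompt).strip().lower()
        if want.lower() not in got:
            wrong[agent] = got
    return wrong


def stub_model(prompt: str) -> str:
    """Stand-in for a real model: always names the original location."""
    return "basket"


# Illustrative story (assumption: a paraphrase, not the paper's wording).
STORY = (
    "Sally puts the marble in the basket and leaves the room. "
    "While Sally is away, Anne moves the marble to the box."
)
EXPECTED = {"Sally": "basket", "Anne": "box"}

mismatches = belief_mismatches(STORY, EXPECTED, stub_model)
print(mismatches)
```

With the stub, only the question about Anne (the person who moved the object) fails, which is the kind of inconsistency a single-agent test would miss. In a real test suite, `stub_model` would be replaced by a call to the model under test and `mismatches` asserted to be empty.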
Reproducibility
Data URLs
- Materials and methods from Kosinski (2023) are reported as publicly available in the paper
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments focus mainly on GPT-3.5; other models not exhaustively tested
- Not a full automated benchmark or large-scale sweep of prompts
- Some reported probabilities depend on exact prompt formatting and model version
When Not To Use
- Do not treat passing basic ToM vignettes as proof of human-like belief reasoning
- Do not assume success on a narrow prompt set generalizes to real-world agent modeling
Failure Modes
- Sensitivity to small, seemingly trivial prompt changes
- Overreliance on surface cues like labels instead of agent knowledge
- Inconsistent attribution across different agents in the same story
- Confusion with simple relational language ('in' vs 'on')
Core Entities
Models
- GPT-3.5
Metrics
- Model completion probabilities for answer tokens (P(answer))
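As a rough illustration of this metric, the sketch below turns per-candidate log-probabilities into comparable probabilities. It rests on assumptions: the input values are made up for the example, the function name `answer_probabilities` is hypothetical, and renormalizing over the candidate set is a choice made here for comparison purposes, not something this summary attributes to the paper.

```python
import math
from typing import Dict


def answer_probabilities(logprobs: Dict[str, float]) -> Dict[str, float]:
    """Renormalize candidate-answer log-probabilities so they sum to 1.

    Subtracting the maximum before exponentiating keeps the computation
    numerically stable for very negative log-probabilities.
    """
    top = max(logprobs.values())
    weights = {answer: math.exp(lp - top) for answer, lp in logprobs.items()}
    total = sum(weights.values())
    return {answer: w / total for answer, w in weights.items()}


# Made-up log-probabilities for two candidate completions (not real model output).
example_logprobs = {"popcorn": -2.3, "chocolate": -0.2}
probs = answer_probabilities(example_logprobs)

for answer, p in probs.items():
    print(f"P({answer}) = {p:.3f}")
```

Comparing these values before and after a perturbation gives a graded view of how far the model's preference shifts, rather than a single pass/fail outcome.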
Datasets
- Classic Theory-of-Mind vignettes (unexpected-contents; Sally-Anne style)
Benchmarks
- Text-based Theory-of-Mind probe suite (based on Kosinski 2023 vignettes)
Context Entities
Models
- Other LLMs referenced in prior work (unspecified)
Metrics
- Comparison to child-level performance claims (qualitative)
Datasets
- Human ToM tests and developmental paradigms (smarties/unexpected contents, Sally-Anne)
Benchmarks
- Kosinski (2023) ToM evaluation materials

