Overview
The paper provides multi-dataset, multi-model evidence that LLMs are brittle on social reasoning; findings are reproducible with the released code and data.
Citations: 36
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 3/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 20%
Production readiness: 25%
Novelty: 45%
Why It Matters For Business
Don't assume LLMs understand people just because they give human-like answers; test models with adversarial and diverse benchmarks before using them for social judgments.
Who Should Care
Summary TLDR
This paper evaluates 15 large language models on six Theory-of-Mind (ToM) benchmarks and a new adversarial dataset (Adv-CSFB). Models solve some narrow ToM-style tasks (e.g., flan-t5-xxl reaches 96% on TriangleCOPA) but fall to or below baseline on others (e.g., GPT-4 scores 27% at the story level on FauxPas-EAI). Models are sensitive to the probing method and brittle under small adversarial changes, which suggests reliance on surface patterns rather than robust social reasoning. The authors release data and code and recommend multi-dataset, automatic evaluation over anecdotal claims.
Problem Statement
Do modern LLMs truly possess robust Theory-of-Mind (ToM) skills, or do they rely on shallow cues and dataset artifacts? The paper measures LLM ToM across six benchmarks, probes sensitivity to prompting styles, and introduces adversarial examples to test robustness.
Main Contribution
Large-scale meta-evaluation of 15 LLMs on six ToM-related benchmarks, comparing to most-frequent-class baselines.
New adversarial dataset Adv-CSFB that adds true-belief and adversarial variants to common false-belief tests.
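The most-frequent-class (MFC) baseline mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' released code; the function name `most_frequent_class_accuracy` and the toy labels are hypothetical.

```python
from collections import Counter


def most_frequent_class_accuracy(gold_labels):
    """Accuracy obtained by always predicting the most common gold label."""
    if not gold_labels:
        raise ValueError("gold_labels must not be empty")
    counts = Counter(gold_labels)
    # most_common(1) returns [(label, count)] for the top label.
    _, top_count = counts.most_common(1)[0]
    return top_count / len(gold_labels)


# Hypothetical toy labels, not taken from any of the paper's datasets.
toy_gold = ["yes", "no", "yes", "yes", "no"]
mfc = most_frequent_class_accuracy(toy_gold)
print(f"MFC baseline: {mfc:.0%}")
```

A model that scores at or below this number on a benchmark has shown no more than a label-frequency guess would.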
Key Findings
Some models excel on narrow ToM-style tasks but not across the board
Performance drops dramatically on some human-style social tests
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 96% | MFC 52% | +44pp | TriangleCOPA (best model flan-t5-xxl) | Figure 1; Table 7 | Table 7 |
| Accuracy | 27% | MFC ~30% | -3pp | FauxPas-EAI (GPT-4) | §6; Table 7 | Table 7 |
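The Delta column is the model's accuracy minus the MFC baseline, in percentage points. The short sketch below recomputes it from the two rows above; the helper name `delta_pp` is an assumption for illustration, and the FauxPas-EAI baseline is approximate in the source (~30%), so its delta is approximate too.

```python
def delta_pp(model_acc_pct, baseline_pct):
    """Difference between model accuracy and baseline, in percentage points."""
    return model_acc_pct - baseline_pct


# (dataset and model, accuracy %, MFC baseline %) copied from the table above.
rows = [
    ("TriangleCOPA (flan-t5-xxl)", 96, 52),
    ("FauxPas-EAI (GPT-4)", 27, 30),  # baseline is approximate (~30%)
]

for name, acc, base in rows:
    d = delta_pp(acc, base)
    verdict = "above" if d > 0 else "at or below"
    print(f"{name}: {d:+d}pp ({verdict} baseline)")
```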
What To Try In 7 Days
Run your model on a small Adv-CSFB-style set covering four perturbations: transparent containers, unreadable labels, late labels, and trusted testimony.
Compare MC, probability-based, and Chain-of-Thought prompts on your tasks and report differences.
Break datasets into question-type splits (facts vs beliefs) and check worst-case split performance.
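One way to carry out the last two steps is to tag each scored item with its probing method and question type, then report per-group accuracy and the worst group. The sketch below assumes a hypothetical record format (`probe`, `question_type`, `correct`) and toy data; it is not tied to the paper's released code.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} for records that carry a boolean 'correct'."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def worst_group(acc_by_group):
    """Return the (group, accuracy) pair with the lowest accuracy."""
    return min(acc_by_group.items(), key=lambda kv: kv[1])


# Hypothetical scored items: "mc" = multiple choice, "prob" = probability-based,
# "cot" = Chain-of-Thought. None of these values come from the paper.
records = [
    {"probe": "mc", "question_type": "fact", "correct": True},
    {"probe": "mc", "question_type": "belief", "correct": False},
    {"probe": "cot", "question_type": "fact", "correct": True},
    {"probe": "cot", "question_type": "belief", "correct": True},
    {"probe": "prob", "question_type": "fact", "correct": True},
    {"probe": "prob", "question_type": "belief", "correct": False},
]

by_probe = accuracy_by_group(records, "probe")
by_type = accuracy_by_group(records, "question_type")
print("Accuracy by probing method:", by_probe)
print("Worst question-type split:", worst_group(by_type))
```

Reporting the worst split alongside the average keeps a strong score on fact questions from masking a weak score on belief questions.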
Risks & Boundaries
Limitations
Datasets are limited in scope and size; ToM in real life is broader than these tests.
Some dataset items may be ambiguous; human label variation could affect scores.
When Not To Use
As proof that LLMs have human-like social cognition or consciousness.
To justify deploying models for sensitive social judgments without human oversight.
Failure Modes
Over-reliance on surface patterns and lexical matches (the Clever Hans effect).
High sensitivity to prompt format and probing method.

