Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.5
Citation Count
20
Why It Matters For Business
If you deploy LLMs to reason about human intentions, run controlled Theory‑of‑Mind (ToM) checks first: GPT‑4 often matches human inference patterns but is unreliable on harder inferences such as backward belief, and the other models tested usually perform worse.
Summary TLDR
The authors introduce a three-step, causal‑template method in which an LLM (GPT-4) populates abstract story templates, producing BigToM, a 5,000‑item benchmark for Theory‑of‑Mind (ToM). Human raters score BigToM items higher than crowd-sourced tests and on par with expert-written items. Evaluations show that GPT-4 mirrors human inference patterns on many ToM probes, with high accuracy on forward belief and action inference, but is less reliable on hard backward-belief tasks; the other tested models perform noticeably worse. The dataset and prompts are released for reproducible testing.
Problem Statement
Existing ToM evaluations are small, noisy, or ambiguous and lack systematic controls, so prior claims about LLMs' social reasoning are hard to interpret. We need a scalable, controlled, and diverse test set that isolates inferential steps.
Main Contribution
A causal‑template pipeline to generate controlled, model‑written ToM test items.
BigToM: a 5,000‑item benchmark with 25 conditions and built-in control contrasts.
A systematic evaluation showing GPT‑4 approximates human ToM patterns while other LLMs struggle; release of prompts and data for replication.
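As a rough illustration of the template idea, the sketch below fills a small story template and renders a true-belief/false-belief contrast by toggling whether the agent perceives the causal event. The `StoryTemplate` fields, the `render_story` and `contrast_pair` helpers, and the example story are all hypothetical; they are not the authors' actual template or prompts, and in the described method an LLM would populate the slots rather than a person.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StoryTemplate:
    """Hypothetical slots for one story; not the released BigToM schema."""
    agent: str
    context: str
    initial_belief: str
    causal_event: str
    question: str


def render_story(t: StoryTemplate, agent_sees_event: bool) -> str:
    """Render one condition by stating whether the agent perceives the event."""
    percept = (
        f"{t.agent} sees this happen."
        if agent_sees_event
        else f"{t.agent} does not see this happen."
    )
    return " ".join([t.context, t.initial_belief, t.causal_event, percept, t.question])


def contrast_pair(t: StoryTemplate) -> Dict[str, str]:
    """Build a matched pair that differs only in the percept sentence."""
    return {
        "true_belief": render_story(t, agent_sees_event=True),
        "false_belief": render_story(t, agent_sees_event=False),
    }


# Illustrative story (invented for this sketch).
example = StoryTemplate(
    agent="Maya",
    context="Maya puts a jar of honey on the kitchen shelf.",
    initial_belief="Maya believes the jar contains honey.",
    causal_event="Her roommate replaces the honey with syrup.",
    question="What does Maya believe is in the jar?",
)
for name, story in contrast_pair(example).items():
    print(f"[{name}] {story}")
```

Because the two stories share every sentence except the percept, any difference in a model's answers can be attributed to that single manipulation, which is the point of a built-in control contrast.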
Key Findings
The model-written benchmark (BigToM) is large and rated highly by humans.
GPT‑4 achieves near-perfect accuracy on forward belief and action inference but does worse on backward belief.
Other tested LLMs often fall short, especially on false‑belief and backward tasks.
Results
BigToM size
Human quality rating advantage vs SocialIQa
Accuracy
Who Should Care
What To Try In 7 Days
Run BigToM on your candidate models to map strengths and failure modes.
Use the causal‑template prompts to create targeted tests for your product's social scenarios.
If you rely on hidden‑state inference, add human review or specialized prompting for backward‑belief cases.
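The first and third steps above could be sketched as a per-condition accuracy report that flags weak conditions for human review. This is a minimal sketch under stated assumptions: the item keys (`condition`, `prompt`, `answer`), the exact-match scoring, and the 0.9 review threshold are hypothetical and do not reflect the released BigToM file format or the authors' scoring procedure. `predict` stands in for whatever function calls your model.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Mapping


def accuracy_by_condition(
    items: Iterable[Mapping[str, str]],
    predict: Callable[[str], str],
) -> Dict[str, float]:
    """Exact-match accuracy grouped by each item's condition label."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item in items:
        condition = item["condition"]
        total[condition] += 1
        prediction = predict(item["prompt"]).strip().lower()
        if prediction == item["answer"].strip().lower():
            correct[condition] += 1
    return {c: correct[c] / total[c] for c in total}


def needs_human_review(scores: Mapping[str, float], threshold: float = 0.9) -> List[str]:
    """Conditions whose accuracy falls below an (arbitrary) threshold."""
    return sorted(c for c, acc in scores.items() if acc < threshold)
```

A usage pattern would be to compute `scores = accuracy_by_condition(items, my_model)` and route every condition returned by `needs_human_review(scores)` to manual checking or specialized prompting.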
Reproducibility
License
- MIT
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Model‑generated content can reflect training biases and overproduce stereotyped contexts.
- Circularity risk: the generator is an LLM, though cross‑generation checks reduce this concern.
- Stories are syntactically similar to one another, and a small fraction (~1–3%) contain commonsense errors.
When Not To Use
- Do not rely on BigToM alone for high‑stakes decisions without human review.
- Avoid using fully automated generation in domains where the model cannot reliably imagine plausible events.
Failure Modes
- Models may anchor on an explicitly stated initial belief and ignore perceptual updates.
- Backward belief inference (inferring hidden beliefs from actions) is the weakest and least reliable capability.
- Generated dataset may underrepresent some real‑world social situations due to steering limits.
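One way to probe the anchoring failure listed above is to check whether a model gives the same answer to both halves of a true-belief/false-belief pair. The sketch below assumes that the correct answers differ within each pair, so identical answers indicate the model ignored the perceptual update. The function name `insensitivity_rate` and the pair format are hypothetical, and this is not a metric reported in the source.

```python
from typing import Callable, Sequence, Tuple


def insensitivity_rate(
    pairs: Sequence[Tuple[str, str]],
    predict: Callable[[str], str],
) -> float:
    """Fraction of (true_belief_prompt, false_belief_prompt) pairs
    on which the model returns the same answer for both prompts."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    same = sum(
        1
        for true_prompt, false_prompt in pairs
        if predict(true_prompt).strip().lower() == predict(false_prompt).strip().lower()
    )
    return same / len(pairs)
```

A rate near 1.0 would suggest the model is answering from the stated initial belief regardless of what the agent perceives; a rate near 0.0 only shows sensitivity to the manipulation, not that the answers are correct.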
Core Entities
Models
- gpt-4-0314
- gpt-3.5-turbo
- text-davinci-003
- claude-v1.3
- claude-2
- llama-65b-q5
Metrics
- Accuracy
Datasets
- BigToM
- SocialIQa
- ToMi
- Adv-CSFB
Benchmarks
- BigToM
- SocialIQa
- ToMi
- Adv-CSFB
Context Entities
Models
- gpt-4-0314
Metrics
- human quality ratings
- Accuracy
Datasets
- BigToM
Benchmarks
- SocialIQa

