Overview
The method is practical for scalable evaluation and produces high‑quality items, but model biases and some circularity risks need mitigation for high‑stakes use.
Citations: 20
Evidence Strength: 0.80
Confidence: 0.87
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 1/4
Reproducibility
Status: Code + data available
Open source: Partial
License: MIT
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 65%
Why It Matters For Business
If you deploy LLMs to reason about human intentions, test them with controlled Theory‑of‑Mind (ToM) checks first: GPT‑4 often matches human inference patterns but is unreliable on harder inferences, and the other tested models usually perform worse.
Summary TLDR
The authors introduce a three-step, causal‑template method that uses an LLM (GPT-4) to populate abstract story templates and produce BigToM, a 5,000‑item benchmark for Theory‑of‑Mind (ToM). Humans rate BigToM higher than crowd-sourced tests and comparable to expert-written items. Evaluations show GPT-4 mirrors human inference patterns on many ToM probes (high accuracy on forward belief/action) but is less reliable on hard backward-belief tasks; other tested models perform noticeably worse. The dataset and prompts are released for reproducible testing.
Problem Statement
Existing ToM evaluations are small, noisy, or ambiguous and lack systematic controls, so prior claims about LLMs' social reasoning are hard to interpret. We need a scalable, controlled, and diverse test set that isolates inferential steps.
Main Contribution
A causal‑template pipeline to generate controlled, model‑written ToM test items.
BigToM: a 5,000‑item benchmark with 25 conditions and built-in control contrasts.
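As a rough illustration of how one filled template can yield a built-in control contrast, the sketch below renders the same story twice, once where the agent perceives a change and once where they do not. The slot names, the condition labels, and the `render_item` and `control_contrast` helpers are hypothetical and are not taken from the released prompts; in the described pipeline an LLM would fill the slots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoryTemplate:
    """Hypothetical slots; an LLM would populate these in the real pipeline."""
    agent: str
    initial_state: str   # what the agent believes at the start
    changed_state: str   # the state of the world after the change


def render_item(t: StoryTemplate, perceives_change: bool) -> dict:
    """Render one test item; the percept sentence is the only thing that varies."""
    percept = (
        f"{t.agent} sees this happen."
        if perceives_change
        else f"{t.agent} does not see this happen."
    )
    story = (
        f"{t.agent} believes that {t.initial_state}. "
        f"Then something changes so that {t.changed_state}. "
        f"{percept}"
    )
    # The agent's belief updates only if the change was perceived.
    answer = t.changed_state if perceives_change else t.initial_state
    return {
        "condition": "true_belief" if perceives_change else "false_belief",
        "story": story,
        "question": f"What does {t.agent} believe now?",
        "answer": answer,
    }


def control_contrast(t: StoryTemplate) -> list[dict]:
    """Return the matched pair of items that differ only in the percept."""
    return [render_item(t, True), render_item(t, False)]
```

Because the two items differ in a single sentence, a model that answers both the same way is likely anchoring on the stated initial belief or ignoring the percept.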
Key Findings
BigToM, the model‑written benchmark, is large (5,000 items) and rated well by humans.
GPT‑4 is near‑perfect on forward belief and action inference but less accurate on backward belief inference.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| BigToM size | 5,000 test items | — | — | BigToM (model-written) | Generated from 200 templates × 25 conditions | Sec.3.3 |
| Human quality rating advantage vs socialIQa | aggregate contrast +1.152 (mean rating units) | socialIQa | — | Human rating (N participants) | Bayesian mixed model contrast BigToM - socialIQa = 1.152 (95% CI 1.066–1.244) | App. B.1, Table 3 |
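The two rows above can be sanity-checked with simple arithmetic. The sketch below recomputes the benchmark size from the template and condition counts in the Evidence column, and checks that the reported 95% interval for the rating contrast excludes zero. The function names are illustrative only.

```python
def benchmark_size(n_templates: int, n_conditions: int) -> int:
    """Each template is rendered once per condition."""
    return n_templates * n_conditions


def interval_excludes_zero(lower: float, upper: float) -> bool:
    """True when the whole interval lies on one side of zero."""
    return lower > 0 or upper < 0


# Figures copied from the Results table.
size = benchmark_size(200, 25)
bigtom_rated_higher = interval_excludes_zero(1.066, 1.244)
```

With the table's figures, `size` equals the stated 5,000 items, and the interval check is consistent with BigToM being rated higher than socialIQa.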
What To Try In 7 Days
Run BigToM on your candidate models to map strengths and failure modes.
Use the causal‑template prompts to create targeted tests for your product's social scenarios.
If you rely on hidden‑state inference, add human review or specialized prompting for backward‑belief cases.
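One way to approach the first and third steps is a small harness that scores a model per condition and flags weak conditions for human review. This is a minimal sketch under stated assumptions: the item fields (`story`, `question`, `answer`, `condition`), the exact-match scoring, and the 0.8 review threshold are hypothetical choices, not the released BigToM format or the authors' scoring method.

```python
from collections import defaultdict
from typing import Callable, Iterable


def accuracy_by_condition(
    items: Iterable[dict], model: Callable[[str], str]
) -> dict[str, float]:
    """Score a model on each condition using case-insensitive exact match."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item in items:
        prompt = f"{item['story']}\n{item['question']}"
        prediction = model(prompt).strip().lower()
        total[item["condition"]] += 1
        if prediction == item["answer"].strip().lower():
            correct[item["condition"]] += 1
    return {cond: correct[cond] / total[cond] for cond in total}


def needs_human_review(
    accuracy: dict[str, float], threshold: float = 0.8
) -> list[str]:
    """List conditions whose accuracy falls below the (assumed) threshold."""
    return sorted(c for c, acc in accuracy.items() if acc < threshold)
```

Here `model` is any callable that maps a prompt string to an answer string, so the same harness can wrap different candidate models. Reporting accuracy per condition, not as a single average, keeps a weak backward-belief score from being hidden by strong forward-belief results.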
Risks & Boundaries
Limitations
Model‑generated content can reflect training biases and overproduce stereotyped contexts.
Circularity risk: the generator is an LLM, though cross‑generation checks reduce this concern.
When Not To Use
Do not rely on BigToM alone for high‑stakes decisions without human review.
Avoid using fully automated generation in domains where the model cannot reliably imagine plausible events.
Failure Modes
Models may anchor on an explicitly stated initial belief and ignore perceptual updates.
Backward belief inference (inferring hidden beliefs from actions) is the weakest and least reliable capability.

