BigToM: a 5,000-item, model‑written benchmark that tests Theory-of‑Mind with causal templates

June 21, 2023 · 6 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical for scalable evaluation and produces high‑quality items, but model biases and some circularity risks need mitigation for high‑stakes use.

Citations: 20

Evidence Strength: 0.80

Confidence: 0.87

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

License: MIT

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 65%

Authors

Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gerstenberg, Noah D. Goodman


Why It Matters For Business

If you deploy LLMs to reason about human intentions, use controlled ToM checks: GPT‑4 often matches human patterns but is unreliable on harder inferences, and other models usually perform worse.

Who Should Care

Summary TLDR

The authors introduce a three-step, causal‑template method that uses an LLM (GPT-4) to populate abstract story templates and produce BigToM, a 5,000‑item benchmark for Theory‑of‑Mind (ToM). Humans rate BigToM higher than crowd-sourced tests and comparable to expert-written items. Evaluations show GPT-4 mirrors human inference patterns on many ToM probes (high accuracy on forward belief/action) but is less reliable on hard backward-belief tasks; other tested models perform noticeably worse. The dataset and prompts are released for reproducible testing.

Problem Statement

Existing ToM evaluations are small, noisy, or ambiguous and lack systematic controls, so prior claims about LLMs' social reasoning are hard to interpret. We need a scalable, controlled, and diverse test set that isolates inferential steps.

Main Contribution

A causal‑template pipeline to generate controlled, model‑written ToM test items.

BigToM: a 5,000‑item benchmark with 25 conditions and built-in control contrasts.
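As a rough illustration of the template idea, the sketch below fills a hand-written story template and derives a true-belief and a false-belief variant by toggling whether the agent perceives the causal event. The slot names, the example story, and the `build_story` helper are hypothetical and are not taken from the BigToM release; in the paper the slot values are written by GPT-4 rather than by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalTemplate:
    # Hypothetical slot names, chosen for illustration only.
    context: str
    initial_belief: str
    causal_event: str
    percept: str      # sentence used when the agent notices the event
    no_percept: str   # sentence used when the agent misses the event


def build_story(template: CausalTemplate, agent_perceives: bool) -> str:
    """Assemble one story variant; only the perception sentence differs."""
    perception = template.percept if agent_perceives else template.no_percept
    parts = [
        template.context,
        template.initial_belief,
        template.causal_event,
        perception,
    ]
    return " ".join(parts)


# Hand-written example values (an assumption, not a BigToM item).
template = CausalTemplate(
    context="Maya is packing a picnic basket.",
    initial_belief="She believes the thermos contains hot tea.",
    causal_event="Her brother swaps the tea for cold lemonade.",
    percept="Maya sees the swap.",
    no_percept="Maya does not see the swap.",
)

true_belief_story = build_story(template, agent_perceives=True)
false_belief_story = build_story(template, agent_perceives=False)
print(true_belief_story)
print(false_belief_story)
```

Because the two variants share every sentence except the perception line, any difference in a model's answers can be attributed to that single change, which is the kind of control contrast the benchmark is built around.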

Key Findings

Model-written benchmark (BigToM) is large and well-rated by humans.

Numbers: 5,000 items; expert structure-agreement 93.94%; expert mean quality ≈4.34/5

Practical Use: You can cheaply produce thousands of controlled ToM tests that human raters judge as high quality; use this to scale systematic evaluation instead of small hand-crafted sets.

Evidence Ref: Sec. 3.3-3.4; Expert evaluations; App. B.1

GPT‑4 performs near-perfectly on forward belief/action but worse on backward belief.

Numbers: Forward belief/action ≈99% accuracy (best prompts); backward belief ≈86% (true belief, TB) and 62% (false belief, FB), both 0‑shot

Practical Use: Expect GPT‑4 to reliably infer beliefs from direct percepts and to predict actions, but do not trust it to infer hidden beliefs from observed actions without extra support or prompting.

Evidence Ref: Tab. 2, Fig. 3, Sec. 4.1
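Numbers like these come from scoring answers separately for each condition. A minimal sketch of that bookkeeping follows, assuming each result is a `(condition, is_correct)` pair; the condition labels and the toy records are placeholders, not BigToM results or the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return {condition: fraction correct} for (condition, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        correct[condition] += int(is_correct)
    return {condition: correct[condition] / totals[condition] for condition in totals}


# Toy records for illustration only; labels are hypothetical.
records = [
    ("forward_belief", True),
    ("forward_belief", True),
    ("backward_belief_tb", True),
    ("backward_belief_tb", False),
    ("backward_belief_fb", False),
    ("backward_belief_fb", True),
    ("backward_belief_fb", False),
    ("backward_belief_fb", False),
]

scores = accuracy_by_condition(records)
for condition, accuracy in sorted(scores.items()):
    print(f"{condition}: {accuracy:.2f}")
```

Reporting accuracy per condition rather than as one pooled number is what exposes the gap between forward and backward inference.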

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
BigToM size | 5,000 test items | - | - | BigToM (model-written) | Generated from 200 templates × 25 conditions | Sec. 3.3
Human quality rating advantage vs socialIQa | aggregate contrast +1.152 (mean rating units) | socialIQa | - | Human rating (N participants) | Bayesian mixed model contrast BigToM - socialIQa = 1.152 (95% CI 1.066–1.244) | App. B.1, Table 3
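The two rows can be sanity-checked with a few lines of arithmetic. The sketch below only re-derives the item count from the template and condition counts and confirms that the reported interval excludes zero; the variable names are ours, and the figures are the ones quoted in the table.

```python
# Item count: templates multiplied by conditions.
n_templates = 200
n_conditions = 25
n_items = n_templates * n_conditions

# Reported contrast (BigToM minus socialIQa) and its 95% interval.
contrast = 1.152
ci_low, ci_high = 1.066, 1.244

# An interval that does not straddle zero indicates a consistent rating advantage.
ci_excludes_zero = ci_low > 0 or ci_high < 0
estimate_inside_ci = ci_low <= contrast <= ci_high

print(n_items, ci_excludes_zero, estimate_inside_ci)
```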

What To Try In 7 Days

Run BigToM on your candidate models to map strengths and failure modes.

Use the causal‑template prompts to create targeted tests for your product's social scenarios.

If you rely on hidden‑state inference, add human review or specialized prompting for backward‑belief cases.
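One way to act on the last suggestion is to route backward-belief items to a human reviewer while letting the other conditions through automatically. The sketch below assumes each item is a dict with a `condition` field and that backward-belief conditions share a `backward_belief` prefix; both the field name and the labels are hypothetical and would need to match your own data.

```python
def needs_human_review(condition, review_prefixes=("backward_belief",)):
    """Flag conditions where hidden beliefs are inferred from observed actions."""
    return condition.startswith(tuple(review_prefixes))


def route(items):
    """Split items into (auto, review) lists based on their condition label."""
    auto, review = [], []
    for item in items:
        if needs_human_review(item["condition"]):
            review.append(item)
        else:
            auto.append(item)
    return auto, review


# Placeholder items; ids and condition labels are assumptions.
items = [
    {"id": 1, "condition": "forward_belief"},
    {"id": 2, "condition": "backward_belief_fb"},
    {"id": 3, "condition": "forward_action"},
    {"id": 4, "condition": "backward_belief_tb"},
]

auto, review = route(items)
print([item["id"] for item in auto], [item["id"] for item in review])
```

Routing by condition keeps the review queue small: only the inference type the summary identifies as least reliable is sent to a person.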

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: MIT

Risks & Boundaries

Limitations

Model‑generated content can reflect training biases and overproduce stereotyped contexts.

Circularity risk: the items are written by GPT‑4, which is also one of the models evaluated, though cross‑generation checks reduce this concern.

When Not To Use

Do not rely on BigToM alone for high‑stakes decisions without human review.

Avoid using fully automated generation in domains where the model cannot reliably imagine plausible events.

Failure Modes

Models may anchor on an explicitly stated initial belief and ignore perceptual updates.

Backward belief inference (inferring hidden beliefs from actions) is the weakest and least reliable capability.

Core Entities

Models

gpt-4-0314, gpt-3.5-turbo, text-davinci-003, claude-v1.3, claude-2, llama-65b-q5

Metrics

Accuracy

Datasets

BigToM, socialIQa, ToMi, Adv-CSFB

Benchmarks

BigToM, socialIQa, ToMi, Adv-CSFB

Context Entities

Models

gpt-4-0314

Metrics

human quality ratings, Accuracy

Datasets

BigToM

Benchmarks

socialIQa