BigToM: a 5,000-item, model‑written benchmark that tests Theory-of‑Mind with causal templates

Overview

Decision Snapshot: Ready For Pilot

The method is practical for scalable evaluation and produces high‑quality items, but model biases and some circularity risks need mitigation for high‑stakes use.

Citations: 20

Evidence Strength: 0.80

Confidence: 0.87

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

License: MIT

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 65%

Authors

Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gerstenberg, Noah D. Goodman

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you deploy LLMs to reason about human intentions, use controlled ToM checks: GPT‑4 often matches human patterns but is unreliable on harder inferences, and other models usually perform worse.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

The authors introduce a three-step, causal‑template method that uses an LLM (GPT-4) to populate abstract story templates and produce BigToM, a 5,000‑item benchmark for Theory‑of‑Mind (ToM). Humans rate BigToM higher than crowd-sourced tests and comparable to expert-written items. Evaluations show GPT-4 mirrors human inference patterns on many ToM probes (high accuracy on forward belief/action) but is less reliable on hard backward-belief tasks; other tested models perform noticeably worse. The dataset and prompts are released for reproducible testing.

Problem Statement

Existing ToM evaluations are small, noisy, or ambiguous and lack systematic controls, so prior claims about LLMs' social reasoning are hard to interpret. We need a scalable, controlled, and diverse test set that isolates inferential steps.

Main Contribution

A causal‑template pipeline to generate controlled, model‑written ToM test items.

BigToM: a 5,000‑item benchmark with 25 conditions and built-in control contrasts.
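
The benchmark size follows from crossing templates with conditions: the Results table reports 200 templates × 25 conditions. The sketch below only illustrates that arithmetic. The function name and the integer IDs are hypothetical placeholders, not the authors' pipeline or the dataset's actual schema.

```python
from itertools import product

# Counts as reported in the Results table (Sec. 3.3).
NUM_TEMPLATES = 200
NUM_CONDITIONS = 25


def build_item_grid(num_templates: int, num_conditions: int) -> list[tuple[int, int]]:
    """Return every (template_id, condition_id) pair.

    Hypothetical helper: integer IDs stand in for the real story
    templates and condition definitions, which are not reproduced here.
    """
    return list(product(range(num_templates), range(num_conditions)))


grid = build_item_grid(NUM_TEMPLATES, NUM_CONDITIONS)
print(len(grid))  # 5000
```

Each pair corresponds to one test item, so every template appears under every condition; this full crossing is what allows the control contrasts between conditions.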

Key Findings

Model-written benchmark (BigToM) is large and well-rated by humans.

Numbers: 5,000 items; expert structure-agreement 93.94%; expert mean quality ≈4.34/5

Practical Use: You can cheaply produce thousands of controlled ToM tests that human raters judge as high quality; use this to scale systematic evaluation instead of small hand-crafted sets.

Evidence Ref: Sec. 3.3-3.4; Expert evaluations; App. B.1

GPT‑4 is near-perfect on forward belief/action but weaker on backward belief.

Numbers: Forward belief/action ≈99% accuracy (best prompts); backward belief ≈86% on true-belief (TB) and ≈62% on false-belief (FB) items, 0‑shot

Practical Use: Expect GPT‑4 to reliably infer beliefs from direct percepts and to predict actions, but do not trust it to infer hidden beliefs from observed actions without extra support or prompting.

Evidence Ref: Tab. 2, Fig. 3, Sec. 4.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BigToM size	5,000 test items	—	—	BigToM (model-written)	Generated from 200 templates × 25 conditions	Sec.3.3
Human quality rating advantage vs socialIQa	aggregate contrast +1.152 (mean rating units)	socialIQa	—	Human rating (N participants)	Bayesian mixed model contrast BigToM - socialIQa = 1.152 (95% CI 1.066–1.244)	App. B.1, Table 3

What To Try In 7 Days

Run BigToM on your candidate models to map strengths and failure modes.

Use the causal‑template prompts to create targeted tests for your product's social scenarios.

If you rely on hidden‑state inference, add human review or specialized prompting for backward‑belief cases.
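
A minimal sketch of the first and third steps above: score a model per condition, then flag weak conditions for human review. It assumes each item is a dict with the keys "story", "question", "options", "answer", and "condition". These field names, the toy items, the dummy model, and the 0.9 threshold are all hypothetical and do not reflect the released data format.

```python
from collections import defaultdict
from typing import Callable


def accuracy_by_condition(
    items: list[dict],
    answer_fn: Callable[[str, str, list[str]], str],
) -> dict[str, float]:
    """Compute accuracy separately for each condition label."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item in items:
        prediction = answer_fn(item["story"], item["question"], item["options"])
        total[item["condition"]] += 1
        if prediction == item["answer"]:
            correct[item["condition"]] += 1
    return {cond: correct[cond] / total[cond] for cond in total}


def flag_for_review(scores: dict[str, float], threshold: float = 0.9) -> list[str]:
    """Return conditions whose accuracy is below the (assumed) threshold."""
    return sorted(cond for cond, acc in scores.items() if acc < threshold)


# Toy placeholder items, not taken from BigToM.
toy_items = [
    {"story": "s1", "question": "q1", "options": ["a", "b"],
     "answer": "a", "condition": "forward_belief"},
    {"story": "s2", "question": "q2", "options": ["a", "b"],
     "answer": "a", "condition": "forward_belief"},
    {"story": "s3", "question": "q3", "options": ["a", "b"],
     "answer": "b", "condition": "backward_belief"},
]


def first_option_model(story: str, question: str, options: list[str]) -> str:
    """Dummy stand-in for a real model call: always picks the first option."""
    return options[0]


scores = accuracy_by_condition(toy_items, first_option_model)
flagged = flag_for_review(scores)
print(scores)
print(flagged)
```

Replace `first_option_model` with a wrapper around your own model call. Reporting accuracy per condition rather than a single aggregate keeps a strong forward-belief score from hiding a weak backward-belief one.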

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: MIT

Code URLs

https://sites.google.com/view/social-reasoning-lms

Data URLs

https://sites.google.com/view/social-reasoning-lms

Risks & Boundaries

Limitations

Model‑generated content can reflect training biases and overproduce stereotyped contexts.

Circularity risk: the generator is an LLM, though cross‑generation checks reduce this concern.

When Not To Use

Do not rely on BigToM alone for high‑stakes decisions without human review.

Avoid using fully automated generation in domains where the model cannot reliably imagine plausible events.

Failure Modes

Models may anchor on an explicitly stated initial belief and ignore perceptual updates.

Backward belief inference (inferring hidden beliefs from actions) is the weakest and least reliable capability.

Core Entities

Models

gpt-4-0314, gpt-3.5-turbo, text-davinci-003, claude-v1.3, claude-2, llama-65b-q5

Metrics

Accuracy

Datasets

BigToM, socialIQa, ToMi, Adv-CSFB

Benchmarks

BigToM, socialIQa, ToMi, Adv-CSFB

Context Entities

Models

gpt-4-0314

Metrics

human quality ratings, Accuracy

Datasets

BigToM

Benchmarks

socialIQa

