BigToM: a 5,000-item, model-written benchmark that tests Theory of Mind with causal templates

June 21, 2023

6 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.5

Citation Count

20

Authors

Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gerstenberg, Noah D. Goodman

Links

Abstract / PDF

Why It Matters For Business

If you deploy LLMs to reason about human intentions, run controlled Theory-of-Mind (ToM) checks first: GPT-4 often matches human patterns but is unreliable on harder inferences, and the other tested models usually perform worse.

Summary TLDR

The authors introduce a three-step, causal‑template method that uses an LLM (GPT-4) to populate abstract story templates and produce BigToM, a 5,000‑item benchmark for Theory‑of‑Mind (ToM). Humans rate BigToM higher than crowd-sourced tests and comparable to expert-written items. Evaluations show GPT-4 mirrors human inference patterns on many ToM probes (high accuracy on forward belief/action) but is less reliable on hard backward-belief tasks; other tested models perform noticeably worse. The dataset and prompts are released for reproducible testing.

Problem Statement

Existing ToM evaluations are small, noisy, or ambiguous and lack systematic controls, so prior claims about LLMs' social reasoning are hard to interpret. We need a scalable, controlled, and diverse test set that isolates inferential steps.

Main Contribution

A causal‑template pipeline to generate controlled, model‑written ToM test items.

BigToM: a 5,000‑item benchmark with 25 conditions and built-in control contrasts.

A systematic evaluation showing GPT‑4 approximates human ToM patterns while other LLMs struggle; release of prompts and data for replication.
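As a rough illustration of the template idea (not the authors' actual pipeline), the sketch below hard-codes one story in place of the LLM fill step and derives a true-belief/false-belief pair that differs only in whether the agent perceives the change. The `StoryTemplate` fields, the `build_items` helper, and the example story are all hypothetical and are not taken from the released prompts or data.

```python
from dataclasses import dataclass


@dataclass
class StoryTemplate:
    # Hypothetical slots; in the described pipeline an LLM would fill these in.
    context: str
    initial_belief: str
    causal_event: str
    percept_aware: str     # the agent observes the causal event
    percept_unaware: str   # the agent does not observe the causal event
    belief_question: str
    answer_before: str     # world state before the causal event
    answer_after: str      # world state after the causal event


def build_items(t: StoryTemplate) -> list[dict]:
    """Return a true-belief and a false-belief item that differ only in the percept."""
    variants = [
        ("true_belief", t.percept_aware, t.answer_after),
        ("false_belief", t.percept_unaware, t.answer_before),
    ]
    items = []
    for condition, percept, answer in variants:
        story = " ".join([t.context, t.initial_belief, t.causal_event, percept])
        items.append({
            "condition": condition,
            "story": story,
            "question": t.belief_question,
            "answer": answer,
        })
    return items


# Hand-written example story (an assumption for illustration only).
EXAMPLE = StoryTemplate(
    context="Mara is packing a picnic basket.",
    initial_belief="She sees that the thermos is full of tea.",
    causal_event="Her brother drinks the tea and leaves the thermos empty.",
    percept_aware="Mara watches her brother drink the tea.",
    percept_unaware="Mara does not see her brother drink the tea.",
    belief_question="Does Mara believe the thermos is full or empty?",
    answer_before="full",
    answer_after="empty",
)

items = build_items(EXAMPLE)
for item in items:
    print(item["condition"], "->", item["answer"])
```

Because the two stories share every sentence except the percept, any difference in a model's answers can be attributed to how it handles the agent's perceptual access, which is the kind of control contrast the benchmark is built around.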

Key Findings

Model-written benchmark (BigToM) is large and well-rated by humans.

Numbers: 5,000 items; expert structure-agreement 93.94%; expert mean quality ≈4.34/5

GPT‑4 performs near-perfect on forward belief/action but worse on backward belief.

Numbers: Forward belief/action ≈99% accuracy (best prompts); backward belief ≈86% (TB) and 62% (FB) 0-shot

Other tested LLMs often fall short, especially on false‑belief and backward tasks.

Numbers: Non-GPT-4 models show much lower accuracies across key conditions (see Tab. 7; ranges ~40–90% depending on model/condition)

Results

BigToM size

Value: 5,000 test items

Human quality rating advantage vs socialIQa

Value: aggregate contrast +1.152 (mean rating units)

Baseline: socialIQa

Accuracy (forward belief)

Value: ≈99% (true belief; forward)

Baseline: human baseline not stated for forward belief

Accuracy (backward belief)

Value: ≈86% (true belief) / 62% (false belief) 0-shot

Baseline: humans: 82% (TB) and 72% (FB)
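To make the backward-belief comparison concrete, the short calculation below subtracts the human baseline from the GPT-4 0-shot accuracy in each condition. It uses only the rounded percentages reported above, so the gaps are approximate; the variable names are illustrative.

```python
# Backward-belief accuracies in percent, as reported above (rounded).
gpt4_zero_shot = {"true_belief": 86, "false_belief": 62}
human = {"true_belief": 82, "false_belief": 72}

# Positive gap: the model is above the human baseline; negative: below it.
gaps = {cond: gpt4_zero_shot[cond] - human[cond] for cond in gpt4_zero_shot}
print(gaps)  # {'true_belief': 4, 'false_belief': -10}
```

On these figures GPT-4 sits about 4 points above humans on true-belief items but about 10 points below them on false-belief items, which is where the reliability concern comes from.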

Who Should Care

What To Try In 7 Days

Run BigToM on your candidate models to map strengths and failure modes.

Use the causal‑template prompts to create targeted tests for your product's social scenarios.

If you rely on hidden‑state inference, add human review or specialized prompting for backward‑belief cases.
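One way to map strengths and failure modes is to score predictions per condition and flag any condition that falls below a chosen bar for human review. The sketch below assumes a hypothetical record format (`condition`, `answer`, `prediction`) and uses toy records; it does not reflect the released BigToM file layout, and the 0.8 threshold is an arbitrary placeholder.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Exact-match accuracy per condition (case- and whitespace-insensitive)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r["condition"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["condition"]] += 1
    return {cond: correct[cond] / totals[cond] for cond in totals}


def needs_review(acc, threshold=0.8):
    """Conditions whose accuracy is below the threshold, sorted by name."""
    return sorted(cond for cond, value in acc.items() if value < threshold)


# Toy records standing in for model outputs (assumed format, not real data).
records = [
    {"condition": "forward_belief", "answer": "empty", "prediction": "Empty"},
    {"condition": "forward_belief", "answer": "full", "prediction": "full"},
    {"condition": "backward_belief", "answer": "empty", "prediction": "empty"},
    {"condition": "backward_belief", "answer": "full", "prediction": "empty"},
]

acc = accuracy_by_condition(records)
flagged = needs_review(acc)
print(acc)
print(flagged)
```

Exact-match scoring is the simplest choice; if your models answer in free text rather than picking an option, you would need a more forgiving comparison.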

Reproducibility

License

  • MIT

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Model‑generated content can reflect training biases and overproduce stereotyped contexts.
  • Circularity risk: the generator is an LLM, though cross‑generation checks reduce this concern.
  • Stories are syntactically similar to one another, and a small fraction (~1–3%) contain commonsense errors.

When Not To Use

  • Do not rely on BigToM alone for high‑stakes decisions without human review.
  • Avoid using fully automated generation in domains where the model cannot reliably imagine plausible events.

Failure Modes

  • Models may anchor on an explicitly stated initial belief and ignore perceptual updates.
  • Backward belief inference (inferring hidden beliefs from actions) is the weakest and least reliable capability.
  • Generated dataset may underrepresent some real‑world social situations due to steering limits.

Core Entities

Models

  • gpt-4-0314
  • gpt-3.5-turbo
  • text-davinci-003
  • claude-v1.3
  • claude-2
  • llama-65b-q5

Metrics

  • Accuracy

Datasets

  • BigToM
  • socialIQa
  • ToMi
  • Adv-CSFB

Benchmarks

  • BigToM
  • socialIQa
  • ToMi
  • Adv-CSFB

Context Entities

Models

  • gpt-4-0314

Metrics

  • human quality ratings
  • Accuracy

Datasets

  • BigToM

Benchmarks

  • socialIQa