T4D: a new test that asks LLMs to act on theory-of-mind; FaR prompting raises GPT-4 from 50% to 71%.

Overview

Decision Snapshot: Needs Validation

FaR gives a practical, low-cost way to improve action choices in social scenarios, but it is brittle to incorrect foresight and tested mainly on templated story data.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 60%

Authors

Pei Zhou, Aman Madaan, Srividya Pranavi Potharaju, Aditya Gupta, Kevin R. McKee, Ari Holtzman, Jay Pujara, Xiang Ren, Swaroop Mishra, Aida Nematzadeh, Shyam Upadhyay, Manaal Faruqui

Links

Abstract / PDF / Data

Why It Matters For Business

If you build agents that must act on what people believe, test them on action-oriented ToM tasks; a structured 'foresee+reflect' prompt can improve decisions cheaply but needs guardrails against bad foresight.

Who Should Care

Product Manager, ML Engineer, CTO

Summary TLDR

The authors introduce Thinking for Doing (T4D), a new zero-shot benchmark that asks models to pick actions based on inferred mental states (Theory-of-Mind, ToM). Standard LLMs (GPT-4, GPT-3.5, PaLM 2) score well on ToM question-answering tests such as ToMi but perform poorly on T4D (GPT-4: 50% vs. human ~90%). The authors propose a zero-shot prompt, Foresee and Reflect (FaR), which first structures the model's reasoning about likely future events and then uses that reasoning to prune candidate actions. FaR boosts GPT-4 to ~71% and generalizes to out-of-distribution story variants and Faux Pas scenarios. FaR helps, but it is brittle to incorrect future predictions and relies on structured story formats.

Problem Statement

Existing ToM tests ask models to state beliefs but not to choose actions based on those beliefs. Real agents must infer others' mental states and then act. The paper asks: can LLMs connect inferred mental states to pragmatic actions, and can a prompt structure improve that ability?

Main Contribution

Introduce T4D, a zero-shot evaluation that converts ToMi stories into action-choice tasks requiring Theory-of-Mind.

Diagnose a core bottleneck: LLMs fail to find the implicit inference steps (who lacks information) needed to choose the right action.
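To make the T4D idea concrete, here is a minimal sketch of how an action-choice probe and its accuracy score could be represented. The class `ActionProbe`, the function `accuracy`, and the example story are all illustrative assumptions; the story is written in the style of a ToMi item and is not taken from the dataset or the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ActionProbe:
    """One T4D-style item: a story, candidate actions, and the gold action."""
    story: str
    question: str
    options: Sequence[str]
    gold: str  # expected to be one of `options`


def accuracy(probes: Sequence[ActionProbe],
             choose: Callable[[ActionProbe], str]) -> float:
    """Fraction of probes for which `choose` returns the gold option."""
    if not probes:
        raise ValueError("need at least one probe")
    correct = sum(1 for p in probes if choose(p) == p.gold)
    return correct / len(probes)


# Hypothetical probe: the right action depends on who lacks information.
probe = ActionProbe(
    story=("Tom puts his backpack in the closet. Tom leaves the room. "
           "Ella moves the backpack to the basement. "
           "Tom and Ella both plan to use the backpack soon."),
    question="You saw everything. Who would benefit most from being told "
             "where the backpack is?",
    options=("Tom", "Ella", "Neither"),
    gold="Tom",
)


def first_option(p: ActionProbe) -> str:
    """Stand-in for a model call: always picks the first option."""
    return p.options[0]


print(accuracy([probe], first_option))
```

In practice `choose` would wrap a model call; keeping it as a plain callable lets the same scoring code compare a baseline prompt against any alternative.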

Key Findings

LLMs score well on ToMi (inference questions), with GPT-4 at 93%, but much worse on T4D (action choices).

Numbers: GPT-4: ToMi 93% vs T4D 50% (Table 1)

Practical Use: Don't assume good ToM QA implies correct action choices; test agents with action-oriented tasks before deployment.

Evidence Ref: Table 1

FaR prompt increases GPT-4 zero-shot accuracy on T4D from ~50% to ~71%.

Numbers: GPT-4 base 50.2% → FaR 71.4% (Table 3 / Fig. 5)

Practical Use: Add structured foresee+reflect prompting to agent decision pipelines to substantially improve action selection without fine-tuning.

Evidence Ref: Fig. 5, Table 3
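A foresee-then-reflect prompt can be assembled as a plain template. The sketch below is an assumption about what such a prompt might look like: the wording of both steps and the function name `build_far_prompt` are illustrative and do not reproduce the paper's exact FaR prompt.

```python
def build_far_prompt(observation: str, question: str, options: list[str]) -> str:
    """Assemble a two-step foresee-then-reflect prompt (illustrative wording)."""
    # Label the options A, B, C, ... so the model can answer with one letter.
    lettered = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    )
    return (
        f"Observation:\n{observation}\n\n"
        f"Question: {question}\n"
        f"Options:\n{lettered}\n\n"
        "Step 1 (Foresee): For each character, describe what is likely to "
        "happen next and what problem they might run into.\n"
        "Step 2 (Reflect): For each option, say whether it would help with "
        "a problem foreseen in Step 1, and rule out options that would not.\n"
        "Final answer: reply with a single option letter."
    )


prompt = build_far_prompt(
    observation="Tom leaves the room. Ella moves his backpack to the basement.",
    question="Who would benefit most from being told where the backpack is?",
    options=["Tom", "Ella", "Neither"],
)
print(prompt)
```

Keeping the foresight step separate from the final answer also makes it possible to inspect or validate the predicted futures before acting on them.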

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4 93%; PaLM2-S 87%; GPT-3.5 74%	Human 100%	—	ToMi (inference QA)	Table 1: LLMs match ToMi inference performance closely	Table 1
Accuracy	GPT-4 50.2%; PaLM2-S 16%; GPT-3.5 15%	Human 90%*	—	T4D (converted from ToMi)	Table 1: action-choice task is harder	Table 1
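The Delta column in the table is blank, so the following sketch works out the gaps from figures quoted in this summary (GPT-4: 93% on ToMi, 50.2% on T4D, 71.4% on T4D with FaR). The helper name `deltas` is illustrative.

```python
def deltas(baseline: float, value: float) -> tuple[float, float]:
    """Return (absolute change in points, relative change in percent)."""
    absolute = value - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


gpt4_tomi = 93.0      # ToMi accuracy (Table 1)
gpt4_t4d_base = 50.2  # T4D accuracy, base prompt
gpt4_t4d_far = 71.4   # T4D accuracy, FaR prompt

print(deltas(gpt4_t4d_base, gpt4_t4d_far))  # (21.2, 42.2)
print(deltas(gpt4_tomi, gpt4_t4d_base))     # (-42.8, -46.0)
```

FaR therefore adds about 21 points on T4D, a roughly 42% relative gain, while GPT-4 with FaR still sits about 19 points below the reported ~90% human score.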

What To Try In 7 Days

Convert a few user-facing decision scenarios into T4D-style action probes and measure current agent accuracy.

Implement the FaR prompt in your inference pipeline and A/B test action choices against your baseline.

Add a simple verification layer to detect and reject implausible foresight outputs (reduces noisy-foresight risk).
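One way to approach the verification step is a guard that falls back to the baseline action whenever the foresight text fails a sanity check. This is a minimal sketch with a deliberately crude heuristic (the foresight must be non-empty and mention at least one known story character); the function names and the heuristic are assumptions, not something the paper specifies.

```python
from typing import Sequence


def foresight_passes_checks(foresight: str, characters: Sequence[str]) -> bool:
    """Crude check: foresight is non-empty and names at least one story character."""
    text = foresight.strip().lower()
    return bool(text) and any(name.lower() in text for name in characters)


def guarded_action(far_action: str, baseline_action: str, foresight: str,
                   characters: Sequence[str], options: Sequence[str]) -> str:
    """Use the FaR action only if it is a valid option and its foresight passes checks."""
    if far_action in options and foresight_passes_checks(foresight, characters):
        return far_action
    return baseline_action


options = ["Tom", "Ella", "Neither"]
characters = ["Tom", "Ella"]

# Foresight grounded in the story: the FaR action is accepted.
print(guarded_action("Tom", "Neither",
                     "Tom will look for the backpack in the closet.",
                     characters, options))

# Foresight about an unknown character: fall back to the baseline action.
print(guarded_action("Tom", "Neither",
                     "Maria will miss her train.",
                     characters, options))
```

A production check would likely need something stronger, such as a second model pass or rule-based consistency checks against the story, but the fallback structure stays the same.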

Agent Features

Memory

short-term observations (story context)

Planning

single-step action selection

theory-of-mind-based action planning via prompts

Frameworks

FaR (Foresee and Reflect prompt)

Is Agentic

Yes

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

ToMi dataset (public); T4D conversion described in Appendix A

Risks & Boundaries

Limitations

T4D is built from templated ToMi stories; real-world interactions are more open-ended.

FaR is sensitive to erroneous foresight; wrong future predictions can hurt more than help.

When Not To Use

For high-stakes actions without verification (FaR can hallucinate futures).

When inputs are long or involve non-templated social interactions, where the prompt's implicit assumptions may break.

Failure Modes

Noisy or hallucinated foresight leads the model to select the wrong action (accuracy can fall below baseline).

Over-reliance on prompt format causes brittleness to slight phrasing changes.
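Brittleness to phrasing can be probed by running the same scenario under several paraphrased prompts and checking how often the chosen action stays the same. The sketch below assumes a hypothetical `choose` callable standing in for the model call; `agreement_rate` is an illustrative name, and the metric is a simple consistency measure rather than one reported in the paper.

```python
from collections import Counter
from typing import Callable, Sequence


def agreement_rate(variants: Sequence[str], choose: Callable[[str], str]) -> float:
    """Share of prompt variants whose answer equals the most common answer."""
    if not variants:
        raise ValueError("need at least one prompt variant")
    answers = [choose(v) for v in variants]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


# Three paraphrases of the same question (hypothetical).
variants = [
    "Who should be told where the backpack is?",
    "Which person needs to know the backpack's location?",
    "Whom would you inform about the backpack?",
]


def stub_model(prompt: str) -> str:
    """Stand-in for a model call: answers differently for one phrasing."""
    return "Ella" if "inform" in prompt else "Tom"


print(agreement_rate(variants, stub_model))
```

A low agreement rate on scenarios whose correct action should not depend on wording is a signal that the prompt format, rather than the reasoning, is driving the answer.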

Core Entities

Models

GPT-4, GPT-3.5-turbo (ChatGPT), PaLM 2-S (Bison), PaLM 2-L (Unicorn)

Metrics

Accuracy

Datasets

ToMi (converted); T4D (this work); Faux Pas set (Shapira et al.); Sclar et al. story-structure challenge sets (D1, D2, D3)

Benchmarks

ToMi, T4D
