T4D: a new test that asks LLMs to act on theory-of-mind; FaR prompting raises GPT‑4 from 50% to 71%.

October 4, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

FaR gives a practical, low-cost way to improve action choices in social scenarios, but it is brittle to incorrect foresight and tested mainly on templated story data.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 60%

Authors

Pei Zhou, Aman Madaan, Srividya Pranavi Potharaju, Aditya Gupta, Kevin R. McKee, Ari Holtzman, Jay Pujara, Xiang Ren, Swaroop Mishra, Aida Nematzadeh, Shyam Upadhyay, Manaal Faruqui

Why It Matters For Business

If you build agents that must act on what people believe, test them on action-oriented ToM tasks; a structured 'foresee+reflect' prompt can improve decisions cheaply but needs guardrails against bad foresight.

Summary TLDR

The authors introduce Thinking for Doing (T4D), a new zero-shot benchmark that asks models to pick actions based on inferred mental states (Theory-of-Mind). Standard LLMs (GPT-4, GPT-3.5, PaLM 2) do well on ToM question tests but perform poorly on T4D (GPT‑4: 50% vs human ~90%). They propose a zero-shot prompt, Foresee and Reflect (FaR), which structures thinking about likely futures and then prunes actions; FaR boosts GPT‑4 to ~71% and generalizes to out-of-distribution story variants and Faux Pas scenarios. FaR helps but is brittle to incorrect future predictions and relies on structured story formats.

Problem Statement

Existing ToM tests ask models to state beliefs but not to choose actions based on those beliefs. Real agents must infer others' mental states and then act. The paper asks: can LLMs connect inferred mental states to pragmatic actions, and can a prompt structure improve that ability?

Main Contribution

Introduce T4D, a zero-shot evaluation that converts ToMi stories into action-choice tasks requiring Theory-of-Mind.

Diagnose a core bottleneck: LLMs fail to find the implicit inference steps (who lacks information) needed to choose the right action.
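To make the shift from belief questions to action choices concrete, here is a minimal sketch of how a T4D-style item could be represented and scored. The `Probe` class, the `accuracy` helper, the sample story, and the stand-in agent are all hypothetical illustrations; the actual conversion procedure is the one described in the paper's Appendix A, which this sketch does not reproduce.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Probe:
    """One action-choice item: a story, a question about what to do, and options."""
    story: str
    question: str
    options: Sequence[str]
    answer: str  # the option a well-informed helper should pick


def accuracy(agent: Callable[[Probe], str], probes: Sequence[Probe]) -> float:
    """Fraction of probes where the agent's chosen option matches the answer."""
    if not probes:
        raise ValueError("need at least one probe")
    correct = sum(agent(p).strip().lower() == p.answer.lower() for p in probes)
    return correct / len(probes)


# Hypothetical item in the spirit of a false-belief story (not taken from the dataset).
probe = Probe(
    story=(
        "Sally puts the apple in the basket and leaves the room. "
        "While she is away, Anne moves the apple to the box. "
        "Sally and Anne both plan to use the apple soon."
    ),
    question="Who would benefit most from being told where the apple is?",
    options=("Sally", "Anne", "Nobody"),
    answer="Sally",
)


def first_option_agent(p: Probe) -> str:
    """Trivial stand-in for a model call: always picks the first option."""
    return p.options[0]


print(accuracy(first_option_agent, [probe]))
```

Note that the question never asks what Sally believes. The agent has to work out on its own that Sally missed the move, which is the implicit inference step the paper identifies as the bottleneck.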

Key Findings

LLMs score well on ToMi (inference questions; GPT-4 reaches 93%) but much worse on T4D (action choices).

Numbers: GPT-4 ToMi 93% vs T4D 50% (Table 1)

Practical Use: Don't assume good ToM QA implies correct action choices; test agents with action-oriented tasks before deployment.

Evidence Ref: Table 1

FaR prompt increases GPT-4 zero-shot accuracy on T4D from ~50% to ~71%.

Numbers: GPT-4 base 50.2% → FaR 71.4% (Table 3 / Fig. 5)

Practical Use: Add structured foresee+reflect prompting to agent decision pipelines to substantially improve action selection without fine-tuning.

Evidence Ref: Fig. 5, Table 3
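As a rough illustration of what a foresee+reflect prompt could look like, the sketch below assembles a two-stage zero-shot prompt. The function name `build_far_prompt` and all of the prompt wording are assumptions made for this example; they are not the authors' exact FaR prompt, which should be taken from the paper.

```python
def build_far_prompt(story: str, question: str, options: list[str]) -> str:
    """Assemble a two-stage foresee-then-reflect prompt (illustrative wording only)."""
    # Label options A, B, C, ... so the model can answer with a single letter.
    option_lines = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    )
    return (
        f"Story:\n{story}\n\n"
        f"Question: {question}\n"
        f"Options:\n{option_lines}\n\n"
        "Step 1 (Foresee): For each character, state what they currently believe, "
        "what they are likely to do next, and what problem they may run into.\n"
        "Step 2 (Reflect): For each option, explain whether it helps with a "
        "problem you foresaw, and rule out options that do not.\n"
        "Final answer: reply with a single option letter."
    )


# Hypothetical usage with a made-up story.
prompt = build_far_prompt(
    story="Sally puts the apple in the basket and leaves. Anne moves it to the box.",
    question="Who would benefit most from being told where the apple is?",
    options=["Sally", "Anne", "Nobody"],
)
print(prompt)
```

Keeping the foresight step separate from the pruning step makes it possible to log or inspect the predicted futures before they influence the final choice.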

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | GPT-4 93%; PaLM2-S 87%; GPT-3.5 74% | Human 100% | | ToMi (inference QA) | Table 1: LLMs match ToMi inference performance closely | Table 1 |
| Accuracy | GPT-4 50.2%; PaLM2-S 16%; GPT-3.5 15% | Human 90%* | | T4D (converted from ToMi) | Table 1: action-choice task is harder | Table 1 |
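The deltas implied by the figures in this summary can be worked out directly. The short calculation below uses only numbers quoted on this page (the human T4D figure is approximate); the variable names are illustrative.

```python
# Accuracy figures (percent) as reported in this summary.
gpt4_tomi = 93.0
gpt4_t4d_base = 50.2
gpt4_t4d_far = 71.4
human_t4d = 90.0  # approximate

# Gap between answering belief questions and choosing actions (GPT-4).
inference_to_action_gap = round(gpt4_tomi - gpt4_t4d_base, 1)

# Absolute and relative gain from FaR prompting on T4D (GPT-4).
far_gain_points = round(gpt4_t4d_far - gpt4_t4d_base, 1)
far_gain_relative = round(100 * (gpt4_t4d_far - gpt4_t4d_base) / gpt4_t4d_base, 1)

# Distance still left to human performance after FaR.
remaining_gap_to_human = round(human_t4d - gpt4_t4d_far, 1)

print(inference_to_action_gap, far_gain_points, far_gain_relative, remaining_gap_to_human)
```

On these numbers, FaR recovers roughly half of the 42.8-point gap between ToMi and T4D for GPT-4 and still leaves it about 18.6 points below the approximate human score.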

What To Try In 7 Days

Convert a few user-facing decision scenarios into T4D-style action probes and measure current agent accuracy.

Implement the FaR prompt in your inference pipeline and A/B test action choices against your baseline.

Add a simple verification layer to detect and reject implausible foresight outputs (reduces noisy-foresight risk).
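One possible shape for the verification layer is sketched below: accept the FaR choice only when its foresight passes cheap sanity checks, and otherwise fall back to the baseline agent. The function names, the specific checks, and the stub agents are all assumptions for illustration. The checks are a heuristic filter and do not establish that the foresight is correct.

```python
import re
from typing import Callable, Sequence, Tuple


def foresight_is_plausible(foresight: str, characters: Sequence[str]) -> bool:
    """Heuristic filter only: non-empty text that names a known story character."""
    text = foresight.strip()
    if not text:
        return False
    return any(re.search(rf"\b{re.escape(name)}\b", text) for name in characters)


def choose_with_guard(
    far_agent: Callable[[str], Tuple[str, str]],
    baseline_agent: Callable[[str], str],
    story: str,
    characters: Sequence[str],
    options: Sequence[str],
) -> str:
    """Use the FaR choice only if its foresight passes the checks; else fall back."""
    foresight, action = far_agent(story)
    if foresight_is_plausible(foresight, characters) and action in options:
        return action
    return baseline_agent(story)


# Hypothetical stand-ins for real model calls.
def good_far_agent(story: str) -> Tuple[str, str]:
    return ("Sally will look in the basket and fail to find the apple.", "Tell Sally")


def noisy_far_agent(story: str) -> Tuple[str, str]:
    return ("A dragon arrives and eats the apple.", "Tell Sally")


def baseline_agent(story: str) -> str:
    return "Do nothing"


story = "Sally puts the apple in the basket and leaves. Anne moves it to the box."
options = ["Tell Sally", "Tell Anne", "Do nothing"]
print(choose_with_guard(good_far_agent, baseline_agent, story, ["Sally", "Anne"], options))
print(choose_with_guard(noisy_far_agent, baseline_agent, story, ["Sally", "Anne"], options))
```

Falling back to the baseline rather than retrying keeps the worst case at baseline accuracy on rejected items, which matters because bad foresight can push accuracy below it.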

Agent Features

Memory
short-term observations (story context)
Planning
single-step action selection; theory-of-mind-based action planning via prompts
Frameworks
FaR (Foresee and Reflect prompt)
Is Agentic

Yes

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

ToMi dataset (public); T4D conversion described in Appendix A

Risks & Boundaries

Limitations

T4D is built from templated ToMi stories; real-world interactions are more open-ended.

FaR is sensitive to erroneous foresight; wrong future predictions can hurt more than help.

When Not To Use

For high-stakes actions without verification (FaR can hallucinate futures).

When inputs are long or involve non-templated social interactions, where the prompt's implicit assumptions break.

Failure Modes

Noisy or hallucinated foresight leads the model to select the wrong action (accuracy can fall below baseline).

Over-reliance on prompt format causes brittleness to slight phrasing changes.

Core Entities

Models

GPT-4, GPT-3.5-turbo (ChatGPT), PaLM 2-S (Bison), PaLM 2-L (Unicorn)

Metrics

Accuracy

Datasets

ToMi (converted); T4D (this work); Faux Pas set (Shapira et al.); Sclar et al. story-structure challenge sets (D1, D2, D3)

Benchmarks

ToMi, T4D