T4D: a new test that asks LLMs to act on theory-of-mind; FaR prompting raises GPT‑4 from 50% to 71%.

October 4, 2023 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

6

Authors

Pei Zhou, Aman Madaan, Srividya Pranavi Potharaju, Aditya Gupta, Kevin R. McKee, Ari Holtzman, Jay Pujara, Xiang Ren, Swaroop Mishra, Aida Nematzadeh, Shyam Upadhyay, Manaal Faruqui

Why It Matters For Business

If you build agents that must act on what people believe, test them on action-oriented ToM tasks; a structured 'foresee+reflect' prompt can improve decisions cheaply but needs guardrails against bad foresight.

Summary TLDR

The authors introduce Thinking for Doing (T4D), a new zero-shot benchmark that asks models to pick actions based on inferred mental states (Theory-of-Mind). Standard LLMs (GPT-4, GPT-3.5, PaLM 2) do well on ToM question tests but perform poorly on T4D (GPT‑4: 50% vs human ~90%). They propose a zero-shot prompt, Foresee and Reflect (FaR), which structures thinking about likely futures and then prunes actions; FaR boosts GPT‑4 to ~71% and generalizes to out-of-distribution story variants and Faux Pas scenarios. FaR helps but is brittle to incorrect future predictions and relies on structured story formats.

Problem Statement

Existing ToM tests ask models to state beliefs but not to choose actions based on those beliefs. Real agents must infer others' mental states and then act. The paper asks: can LLMs connect inferred mental states to pragmatic actions, and can a prompt structure improve that ability?

Main Contribution

Introduce T4D, a zero-shot evaluation that converts ToMi stories into action-choice tasks requiring Theory-of-Mind.

Diagnose a core bottleneck: LLMs fail to find the implicit inference steps (who lacks information) needed to choose the right action.

Propose FaR (Foresee and Reflect), a zero-shot prompt that structures future prediction and action-aware reflection, improving zero-shot accuracy and generalization.
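A minimal sketch of how a Foresee-and-Reflect style prompt might be assembled. The template wording, the name `build_far_prompt`, and the example story and question are illustrative assumptions; the paper's exact prompt text is not reproduced here.

```python
from typing import List

# Hypothetical template: the wording is an assumption for illustration,
# not the prompt text used in the paper.
FAR_TEMPLATE = """{story}

Question: {question}
Options:
{options}

Step 1 (Foresee): For each character, describe what they are likely to do next
and what challenge they may run into, given what they do and do not know.

Step 2 (Reflect): For each option, state whether it helps a character with a
challenge identified in Step 1.

Final answer: reply with the letter of the single best option."""


def build_far_prompt(story: str, question: str, options: List[str]) -> str:
    """Fill the two-stage (foresee, then reflect) template for one probe."""
    # Label options A, B, C, ... so the model can answer with a single letter.
    lettered = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    )
    return FAR_TEMPLATE.format(
        story=story.strip(), question=question, options=lettered
    )


if __name__ == "__main__":
    # Made-up example in the spirit of a ToMi-style story.
    print(
        build_far_prompt(
            story=(
                "Tom puts his backpack in the closet. Tom leaves. "
                "Ella moves the backpack to the kitchen. Tom returns."
            ),
            question="Who would benefit from receiving helpful information?",
            options=["Tom", "Ella", "Neither"],
        )
    )
```

Keeping both stages in one template keeps this a single zero-shot call; splitting them into two calls would make the foresight text easier to inspect before it influences the final choice.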

Key Findings

LLMs score well on ToMi inference questions (GPT-4 reaches 93%) but much worse on T4D action choices.

Numbers: GPT-4: ToMi 93% vs T4D 50% (Table 1)

FaR prompt increases GPT-4 zero-shot accuracy on T4D from ~50% to ~71%.

Numbers: GPT-4 base 50.2% → FaR 71.4% (Table 3 / Fig. 5)

Human agreement on T4D is high, so tasks are clear to people.

Numbers: Human raters: ≥17/20 agreement per instance; >90% of instances reach ≥95% agreement (Section 3.3)

FaR generalizes to diverse story structures and Faux Pas scenarios, often outperforming other zero-shot prompts.

Numbers: Faux Pas, GPT-4: base 31% → FaR 76% (Table 5); story-structure tests also show large FaR gains (Table 4)

FaR is sensitive to incorrect future predictions: noisy foresight can drop performance below baseline.

Numbers: GPT-4 FaR with noisy foresight 42% vs FaR 71% and base 50% (Table 3)

Results

Accuracy on ToMi

Value: GPT-4 93%; PaLM 2-S 87%; GPT-3.5 74%

Baseline: Human 100%

Accuracy on T4D

Value: GPT-4 50.2%; PaLM 2-S 16%; GPT-3.5 15%

Baseline: Human 90%*

FaR improvement (GPT-4)

Value: 71.4%

Baseline: base 50.2%

FaR ablation (GPT-4)

Value: FaR-NoForesee 53.2%; FaR-NoReflect 59.7%; FaR-NoisyForesee 42%

Baseline: FaR 71.4%

Faux Pas generalization (GPT-4)

Value: FaR 76%

Baseline: base 31%
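To make the ablation easier to read, the GPT-4 figures listed above can be restated as point gains over the 50.2% base prompt. This is only arithmetic on the reported numbers; the names `BASE`, `variants`, and `gain_over_base` are illustrative.

```python
# GPT-4 accuracies on T4D (percent), copied from the results listed above.
BASE = 50.2
variants = {
    "FaR": 71.4,
    "FaR-NoForesee": 53.2,
    "FaR-NoReflect": 59.7,
    "FaR-NoisyForesee": 42.0,
}


def gain_over_base(acc: float, base: float = BASE) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(acc - base, 1)


gains = {name: gain_over_base(acc) for name, acc in variants.items()}

if __name__ == "__main__":
    for name, gain in gains.items():
        print(f"{name:18s} {gain:+.1f} points vs. base")
```

Read this way, full FaR adds 21.2 points, dropping the foresee step leaves only 3.0, dropping the reflect step leaves 9.5, and noisy foresight lands 8.2 points below the base prompt.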

Who Should Care

What To Try In 7 Days

Convert a few user-facing decision scenarios into T4D-style action probes and measure current agent accuracy.

Implement the FaR prompt in your inference pipeline and A/B test action choices against your baseline.

Add a simple verification layer to detect and reject implausible foresight outputs (reduces noisy-foresight risk).
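The first two steps above can be sketched as a small harness that scores any agent on action-choice probes and compares two agents side by side. `Probe`, `accuracy`, `compare`, the example scenarios, and the stub agents are all hypothetical; a real agent would wrap a model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Probe:
    """One action-choice item: scenario, candidate actions, index of the right one."""
    scenario: str
    actions: Sequence[str]
    correct: int


# An agent is any callable that maps a probe to the index of its chosen action.
Agent = Callable[[Probe], int]


def accuracy(agent: Agent, probes: List[Probe]) -> float:
    """Fraction of probes on which the agent picks the labeled action."""
    if not probes:
        raise ValueError("need at least one probe")
    hits = sum(1 for p in probes if agent(p) == p.correct)
    return hits / len(probes)


def compare(baseline: Agent, candidate: Agent, probes: List[Probe]) -> Dict[str, float]:
    """Score both agents on the same probes and report the difference."""
    base_acc = accuracy(baseline, probes)
    cand_acc = accuracy(candidate, probes)
    return {"baseline": base_acc, "candidate": cand_acc, "delta": cand_acc - base_acc}


if __name__ == "__main__":
    probes = [
        Probe(
            "A customer was given the old return address; it changed while they were offline.",
            ["Do nothing", "Send the customer the new address"],
            correct=1,
        ),
        Probe(
            "A teammate saw the meeting move to 3pm and accepted the new invite.",
            ["Do nothing", "Remind the teammate of the new time"],
            correct=0,
        ),
    ]
    always_first = lambda p: 0      # stand-in for a baseline agent
    oracle = lambda p: p.correct    # stand-in for a perfect agent
    print(compare(always_first, oracle, probes))
```

This compares point accuracies only; with a handful of probes the difference is noisy, so a larger probe set is needed before drawing conclusions.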

Agent Features

Memory

  • short-term observations (story context)

Planning

  • single-step action selection
  • theory-of-mind-based action planning via prompts

Frameworks

  • FaR (Foresee and Reflect prompt)

Is Agentic

true

Reproducibility

Data Urls

  • ToMi dataset (public); T4D conversion described in Appendix A

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • T4D is built from templated ToMi stories; real-world interactions are more open-ended.
  • FaR is sensitive to erroneous foresight; wrong future predictions can hurt more than help.
  • Experiments use a handful of LLMs and zero-shot calls; no fine-tuning or deployed-agent tests reported.
  • Code and full prompts are only partially described; reproduction may need prompt tuning.

When Not To Use

  • For high-stakes actions without verification (FaR can hallucinate futures).
  • When inputs are long contexts or non-templated social interactions, where the prompt's implicit assumptions can break.
  • If you cannot detect or correct spurious foresight steps.
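One inexpensive way to flag spurious foresight is to sample the FaR chain several times and trust its answer only when the samples agree. This is a self-consistency style vote swapped in for illustration, not a method from the paper; `guarded_choice` and the `min_share` threshold are assumptions.

```python
from collections import Counter
from typing import Sequence


def guarded_choice(
    far_choices: Sequence[str], fallback: str, min_share: float = 0.6
) -> str:
    """Return the majority FaR answer if enough samples agree, else the fallback.

    far_choices: final answers from several independent FaR samples.
    fallback:    answer from the plain (non-FaR) prompt or a safe default.
    min_share:   minimum fraction of samples that must back the top answer.
    """
    if not far_choices:
        return fallback
    top, count = Counter(far_choices).most_common(1)[0]
    return top if count / len(far_choices) >= min_share else fallback


if __name__ == "__main__":
    print(guarded_choice(["A", "A", "B"], fallback="C"))  # two of three agree
    print(guarded_choice(["A", "B", "C"], fallback="D"))  # no agreement
```

Agreement across samples does not prove the foresight is correct; it only filters out cases where the predicted futures are unstable, so high-stakes actions still need a separate check.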

Failure Modes

  • Noisy or hallucinated foresight leads the model to select the wrong action (accuracy can fall below baseline).
  • Over-reliance on prompt format causes brittleness to slight phrasing changes.
  • FaR may not scale to multi-step real-world plans requiring persistent memory.

Core Entities

Models

  • GPT-4
  • GPT-3.5-turbo (ChatGPT)
  • PaLM 2-S (Bison)
  • PaLM 2-L (Unicorn)

Metrics

  • Accuracy

Datasets

  • ToMi (converted)
  • T4D (this work)
  • Faux Pas set (Shapira et al.)
  • Sclar et al. story-structure challenge sets (D1, D2, D3)

Benchmarks

  • ToMi
  • T4D