Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
6
Why It Matters For Business
If you build agents that must act on what people believe, test them on action-oriented ToM tasks. A structured Foresee and Reflect (FaR) prompt can improve decisions cheaply, but it needs guardrails against bad foresight.
Summary TLDR
The authors introduce Thinking for Doing (T4D), a new zero-shot benchmark that asks models to pick actions based on inferred mental states (Theory of Mind, ToM). Standard LLMs (GPT-4, GPT-3.5, PaLM 2) do well on ToM question-answering tests but perform poorly on T4D (GPT-4: ~50% vs. human ~90%). They propose a zero-shot prompt, Foresee and Reflect (FaR), which has the model first predict likely futures and then prune candidate actions. FaR boosts GPT-4 to ~71% and generalizes to out-of-distribution story variants and Faux Pas scenarios. FaR helps, but it is brittle to incorrect future predictions and relies on structured story formats.
Problem Statement
Existing ToM tests ask models to state beliefs but not to choose actions based on those beliefs. Real agents must infer others' mental states and then act. The paper asks: can LLMs connect inferred mental states to pragmatic actions, and can a prompt structure improve that ability?
Main Contribution
Introduce T4D, a zero-shot evaluation that converts ToMi stories into action-choice tasks requiring Theory-of-Mind.
Diagnose a core bottleneck: LLMs fail to find the implicit inference steps (who lacks information) needed to choose the right action.
Propose FaR (Foresee and Reflect), a zero-shot prompt that structures future prediction and action-aware reflection, improving zero-shot accuracy and generalization.
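A minimal sketch of how a FaR-style prompt could be assembled for one T4D-style item. The `ActionProbe` class, the `build_far_prompt` function, and the step wording are all hypothetical illustrations. They are not the paper's actual template, which this summary does not reproduce.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActionProbe:
    """One T4D-style item: a story, candidate actions, and the correct index.

    Hypothetical container for illustration; not a format defined by the paper.
    """
    story: str
    actions: List[str]
    answer_index: int


def build_far_prompt(probe: ActionProbe) -> str:
    """Assemble a two-stage prompt: foresee likely futures, then reflect on actions.

    The instruction wording below is an assumption, not the authors' prompt.
    """
    # Label candidate actions A, B, C, ... so the model can answer with a letter.
    options = "\n".join(
        f"{chr(ord('A') + i)}. {action}" for i, action in enumerate(probe.actions)
    )
    return (
        f"Story:\n{probe.story}\n\n"
        f"Candidate actions:\n{options}\n\n"
        "Step 1 (Foresee): For each character, state what they believe now, "
        "what they are likely to do next, and what problem they might face.\n"
        "Step 2 (Reflect): For each candidate action, say whether it helps "
        "with a problem from Step 1, and rule out actions that do not.\n"
        "Final answer: reply with the letter of the single best action."
    )


probe = ActionProbe(
    story="Sam moved the keys while Ana was out of the room.",
    actions=["Tell Ana where the keys are", "Tell Sam where the keys are"],
    answer_index=0,
)
prompt = build_far_prompt(probe)
```

The resulting string can be sent to any chat model as a single zero-shot call. Keeping the foresee and reflect steps as separate numbered instructions makes it easier to inspect where a wrong answer originated.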
Key Findings
LLMs score near-perfect on ToMi (inference questions) but much worse on T4D (action choices).
The FaR prompt raises GPT-4 zero-shot accuracy on T4D from ~50% to ~71%.
Human agreement on T4D is high, so tasks are clear to people.
FaR generalizes to diverse story structures and Faux Pas scenarios, often outperforming other zero-shot prompts.
FaR is sensitive to incorrect future predictions: noisy foresight can drop performance below baseline.
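To put the headline numbers in perspective, the short calculation below works through the gain using the approximate figures quoted in this summary (~50% baseline, ~71% with FaR, ~90% human). The function name `gain_summary` is a hypothetical helper, and the inputs are rounded values, so the outputs should be read as rough estimates.

```python
# Approximate accuracies in percent, taken from the summary above.
BASELINE_PCT = 50  # GPT-4 zero-shot on T4D
FAR_PCT = 71       # GPT-4 with the FaR prompt on T4D
HUMAN_PCT = 90     # approximate human accuracy on T4D


def gain_summary(baseline: float, improved: float, ceiling: float) -> dict:
    """Return absolute gain, relative gain, and share of the gap to a ceiling closed."""
    if baseline <= 0 or ceiling <= baseline:
        raise ValueError("expected 0 < baseline < ceiling")
    absolute = improved - baseline
    return {
        "absolute_points": absolute,                      # percentage points gained
        "relative_gain": absolute / baseline,             # gain relative to baseline
        "gap_closed": absolute / (ceiling - baseline),    # share of human gap closed
    }


summary = gain_summary(BASELINE_PCT, FAR_PCT, HUMAN_PCT)
```

With these inputs the gain is about 21 percentage points, roughly a 42% relative improvement, which closes a little over half of the gap between the baseline and human performance.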
Results
Accuracy
FaR improvement (GPT-4)
FaR ablation (GPT-4)
Faux Pas generalization (GPT-4)
Who Should Care
What To Try In 7 Days
Convert a few user-facing decision scenarios into T4D-style action probes and measure current agent accuracy.
Implement the FaR prompt in your inference pipeline and A/B test action choices against your baseline.
Add a simple verification layer to detect and reject implausible foresight outputs (reduces noisy-foresight risk).
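The three steps above can be sketched as a small harness: score a chooser on action probes, compare a baseline against a FaR-style variant, and fall back to the baseline when the foresight looks implausible. Everything here is an assumption for illustration: the probe tuple layout, the `Foresight` dictionary format, and the plausibility rule (every named character must appear in the story) are not from the paper, and a production check would need to be stronger.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical formats, chosen only for this sketch.
Probe = Tuple[str, Sequence[str], int]          # (story, candidate actions, correct index)
Chooser = Callable[[str, Sequence[str]], int]   # returns the index of the chosen action
Foresight = Dict[str, str]                      # character name -> predicted next step


def accuracy(chooser: Chooser, probes: Sequence[Probe]) -> float:
    """Fraction of probes where the chooser picks the correct action."""
    if not probes:
        raise ValueError("need at least one probe")
    correct = sum(
        1 for story, actions, gold in probes if chooser(story, actions) == gold
    )
    return correct / len(probes)


def foresight_is_plausible(foresight: Foresight, story: str) -> bool:
    """Crude check: non-empty, and every named character appears in the story."""
    if not foresight:
        return False
    story_lower = story.lower()
    return all(
        bool(name.strip())
        and name.lower() in story_lower
        and bool(prediction.strip())
        for name, prediction in foresight.items()
    )


def make_guarded_chooser(
    foresee: Callable[[str], Foresight],
    choose_with_foresight: Callable[[str, Sequence[str], Foresight], int],
    choose_baseline: Chooser,
) -> Chooser:
    """Use foresight only when it passes the check; otherwise use the baseline."""
    def chooser(story: str, actions: Sequence[str]) -> int:
        foresight = foresee(story)
        if foresight_is_plausible(foresight, story):
            return choose_with_foresight(story, actions, foresight)
        return choose_baseline(story, actions)
    return chooser
```

In an A/B test, `accuracy` would be called once with the baseline chooser and once with the guarded chooser on the same probes. The model calls themselves are passed in as functions, so the harness can be exercised with stubs before wiring in a real LLM client.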
Agent Features
Memory
- short-term observations (story context)
Planning
- single-step action selection
- theory-of-mind-based action planning via prompts
Frameworks
- FaR (Foresee and Reflect prompt)
Is Agentic
true
Reproducibility
Data Urls
- ToMi dataset (public); T4D conversion described in Appendix A
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- T4D is built from templated ToMi stories; real-world interactions are more open-ended.
- FaR is sensitive to erroneous foresight; wrong future predictions can hurt more than help.
- Experiments use a handful of LLMs and zero-shot calls; no fine-tuning or deployed-agent tests reported.
- Code and full prompts are only partially described; reproduction may need prompt tuning.
When Not To Use
- For high-stakes actions without verification (FaR can hallucinate futures).
- When input contexts are long, or when social interactions are non-templated and the implicit assumptions break.
- If you cannot detect or correct spurious foresight steps.
Failure Modes
- Noisy or hallucinated foresight leads the model to select the wrong action (accuracy can fall below baseline).
- Over-reliance on prompt format causes brittleness to slight phrasing changes.
- FaR may not scale to multi-step real-world plans requiring persistent memory.
Core Entities
Models
- GPT-4
- GPT-3.5-turbo (ChatGPT)
- PaLM 2-S (Bison)
- PaLM 2-L (Unicorn)
Metrics
- Accuracy
Datasets
- ToMi (converted)
- T4D (this work)
- Faux Pas set (Shapira et al.)
- Sclar et al. story-structure challenge sets (D1,D2,D3)
Benchmarks
- ToMi
- T4D

