Overview
FaR gives a practical, low-cost way to improve action choices in social scenarios, but it is brittle to incorrect foresight and tested mainly on templated story data.
Citations: 6
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/5
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 40%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
If you build agents that must act on what people believe, test them on action-oriented ToM tasks; a structured 'foresee+reflect' prompt can improve decisions cheaply but needs guardrails against bad foresight.
Summary TLDR
The authors introduce Thinking for Doing (T4D), a zero-shot benchmark that asks models to pick actions based on inferred mental states (Theory of Mind, ToM). GPT-4, GPT-3.5, and PaLM 2 do well on ToM inference questions but perform poorly on T4D (GPT-4: about 50% versus roughly 90% for humans). The authors propose a zero-shot prompt, Foresee and Reflect (FaR), which structures the model's reasoning in two stages: first predict likely future events, then prune the candidate actions. FaR raises GPT-4 to about 71% and generalizes to out-of-distribution story variants and Faux Pas scenarios. It remains brittle to incorrect future predictions and relies on structured story formats.
Problem Statement
Existing ToM tests ask models to state beliefs but not to choose actions based on those beliefs. Real agents must infer others' mental states and then act. The paper asks: can LLMs connect inferred mental states to pragmatic actions, and can a prompt structure improve that ability?
Main Contribution
Introduce T4D, a zero-shot evaluation that converts ToMi stories into action-choice tasks requiring Theory-of-Mind.
Diagnose a core bottleneck: LLMs fail to find the implicit inference steps (who lacks information) needed to choose the right action.
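To make the action-choice format concrete, here is a minimal Python sketch of what a T4D-style item and its scoring could look like. The `ActionProbe` class, the `accuracy` helper, and the example story are hypothetical illustrations, not the benchmark's actual data format or evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ActionProbe:
    """One T4D-style item (hypothetical schema): story, candidate actions, correct index."""
    story: str
    actions: Sequence[str]
    answer: int  # index into `actions`


def accuracy(probes: Sequence[ActionProbe], choose: Callable[[ActionProbe], int]) -> float:
    """Fraction of probes where `choose` returns the labelled action index."""
    if not probes:
        raise ValueError("need at least one probe")
    correct = sum(1 for p in probes if choose(p) == p.answer)
    return correct / len(probes)


# Invented example in the spirit of a ToMi-derived story.
probe = ActionProbe(
    story=(
        "Ana put the keys in the drawer and left the room. "
        "While she was away, Ben moved the keys to the shelf. "
        "Ana and Ben both plan to use the keys soon."
    ),
    actions=("Tell Ana where the keys are", "Tell Ben where the keys are", "Do nothing"),
    answer=0,  # Ana is the one who lacks the information
)


def always_first(p: ActionProbe) -> int:
    """Trivial stand-in policy; replace with a call to your model."""
    return 0


print(accuracy([probe], always_first))
```

The story never states who needs help. The model has to infer that Ana missed the move before it can pick the right action, which is the implicit inference step the paper identifies as the bottleneck.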
Key Findings
LLMs score far higher on ToMi inference questions than on T4D action choices (GPT-4: 93% vs. 50.2%).
The FaR prompt raises GPT-4 zero-shot accuracy on T4D from ~50% to ~71%.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | GPT-4 93%; PaLM2-S 87%; GPT-3.5 74% | Human 100% | — | ToMi (inference QA) | Table 1: LLMs match ToMi inference performance closely | Table 1 |
| Accuracy | GPT-4 50.2%; PaLM2-S 16%; GPT-3.5 15% | Human 90%* | — | T4D (converted from ToMi) | Table 1: action-choice task is harder | Table 1 |
What To Try In 7 Days
Convert a few user-facing decision scenarios into T4D-style action probes and measure current agent accuracy.
Implement the FaR prompt in your inference pipeline and A/B test action choices against your baseline.
Add a simple verification layer to detect and reject implausible foresight outputs (reduces noisy-foresight risk).
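As a starting point for the second step, the sketch below assembles a two-stage foresee-then-reflect prompt in Python. The function name `build_far_prompt` and the prompt wording are assumptions made for illustration; they are not the paper's exact prompt text.

```python
from typing import Sequence


def build_far_prompt(story: str, actions: Sequence[str]) -> str:
    """Assemble a foresee-then-reflect prompt (illustrative wording, not the paper's)."""
    # Label the candidate actions A, B, C, ... so the model can answer with one letter.
    options = "\n".join(
        f"{chr(ord('A') + i)}. {action}" for i, action in enumerate(actions)
    )
    return (
        f"Story:\n{story}\n\n"
        f"Candidate actions:\n{options}\n\n"
        "Step 1 (Foresee): For each character, describe what they are likely "
        "to do next and what problem they might run into.\n"
        "Step 2 (Reflect): For each candidate action, say whether it helps with "
        "a problem from Step 1, then discard the actions that do not.\n"
        "Final answer: reply with the single letter of the best action."
    )


prompt = build_far_prompt(
    "Ana put the keys in the drawer and left. Ben moved them to the shelf.",
    ["Tell Ana where the keys are", "Tell Ben where the keys are", "Do nothing"],
)
print(prompt)
```

For an A/B test, send the same story and action list once through this template and once through your existing prompt, then compare accuracy on the same labelled probes.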
Agent Features
Memory
Planning
Frameworks
Is Agentic: Yes
Risks & Boundaries
Limitations
T4D is built from templated ToMi stories; real-world interactions are more open-ended.
FaR is sensitive to erroneous foresight; wrong future predictions can hurt more than help.
When Not To Use
For high-stakes actions without verification (FaR can hallucinate futures).
When input contexts are long, or when social interactions are non-templated and the prompt's implicit assumptions break.
Failure Modes
Noisy or hallucinated foresight leads the model to select the wrong action (accuracy can fall below baseline).
Over-reliance on prompt format causes brittleness to slight phrasing changes.
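One way to limit the damage from noisy foresight is to fall back to a baseline choice whenever the predicted future fails a plausibility check. The Python sketch below shows that control flow under stated assumptions: `choose_with_guard` and `mentions_known_character` are hypothetical helpers, and the character-name check is a deliberately crude placeholder for whatever validation suits your domain.

```python
from typing import Callable, Sequence, Tuple


def mentions_known_character(foresight: str, characters: Sequence[str]) -> bool:
    """Crude plausibility check: the foresight names at least one story character."""
    text = foresight.lower()
    return any(name.lower() in text for name in characters)


def choose_with_guard(
    far_choice: Callable[[], Tuple[str, int]],
    baseline_choice: Callable[[], int],
    foresight_ok: Callable[[str], bool],
) -> int:
    """Use the foresight-based action only if its foresight passes the check."""
    foresight, action = far_choice()
    if foresight_ok(foresight):
        return action
    # Foresight rejected: fall back to the baseline prompt's action.
    return baseline_choice()


characters = ["Ana", "Ben"]


def check(foresight: str) -> bool:
    return mentions_known_character(foresight, characters)


# Stubbed model calls for illustration; replace with real inference calls.
plausible = choose_with_guard(lambda: ("Ana will look in the drawer.", 0), lambda: 2, check)
rejected = choose_with_guard(lambda: ("The dragon will fly away.", 1), lambda: 2, check)
print(plausible, rejected)
```

A guard like this cannot catch foresight that is plausible but wrong, so it reduces rather than removes the failure mode described above.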

