LLMs show some social reasoning but fail adversarial and robustness tests

Overview

Decision Snapshot: Needs Validation

The paper provides multi-dataset, multi-model evidence that LLMs are brittle on social reasoning; findings are reproducible with the released code and data.

Citations: 36

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 25%

Novelty: 45%

Authors

Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, Vered Shwartz

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Don't assume LLMs understand people just because they give human-like answers; test models with adversarial and diverse benchmarks before using them for social judgments.

Who Should Care

Product Manager, ML Engineer, Founder, Data Scientist

Summary TLDR

This paper evaluates 15 large language models on six Theory-of-Mind (ToM) benchmarks and a new adversarial dataset (Adv-CSFB). Models solve some narrow ToM-style tasks (e.g., flan-t5-xxl reaches 96% on TriangleCOPA) but fail or drop to baseline level on others (e.g., GPT-4 scores 27% at the story level on FauxPas-EAI). Models are sensitive to the probing method and brittle to small adversarial changes, which suggests reliance on surface patterns rather than robust social reasoning. The authors release data and code and recommend multi-dataset, automatic evaluations over anecdotal claims.

Problem Statement

Do modern LLMs truly possess robust Theory-of-Mind (ToM) skills, or do they rely on shallow cues and dataset artifacts? The paper measures LLM ToM across six benchmarks, probes sensitivity to prompting styles, and introduces adversarial examples to test robustness.

Main Contribution

Large-scale meta-evaluation of 15 LLMs on six ToM-related benchmarks, comparing to most-frequent-class baselines.

New adversarial dataset Adv-CSFB that adds true-belief and adversarial variants to common false-belief tests.
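The most-frequent-class (MFC) baseline used for comparison can be sketched as below. This is a minimal illustration, not the authors' code: the function names and the toy labels are hypothetical and chosen only to show how a model's accuracy is read against the baseline.

```python
from collections import Counter


def mfc_baseline_accuracy(gold_labels):
    """Accuracy obtained by always predicting the most frequent gold label."""
    if not gold_labels:
        raise ValueError("gold_labels must not be empty")
    _, top_count = Counter(gold_labels).most_common(1)[0]
    return top_count / len(gold_labels)


def accuracy(predictions, gold_labels):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have equal length")
    if not gold_labels:
        raise ValueError("gold_labels must not be empty")
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)


# Toy data (hypothetical, not taken from the paper).
gold = ["A", "B", "A", "A", "B"]
preds = ["A", "B", "B", "A", "B"]

baseline = mfc_baseline_accuracy(gold)  # "A" occurs 3 times out of 5
model_acc = accuracy(preds, gold)       # 4 of 5 predictions match
print(f"MFC baseline: {baseline:.2f}, model: {model_acc:.2f}")
```

A model that does not clearly beat this baseline on a dataset has shown little beyond the label distribution.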

Key Findings

Some models excel on narrow ToM-style tasks but not across the board

Numbers: TriangleCOPA: flan-t5-xxl 96% vs MFC 52%

Practical Use: High accuracy on one dataset doesn't mean general social reasoning; test multiple datasets before trusting a model's ToM.

Evidence Ref: Figure 1; Table 7

Performance drops dramatically on some human-style social tests

Numbers: FauxPas-EAI story-level: GPT-4 27% (below simple baseline)

Practical Use: Don't use LLM answers on subtle social judgments or faux-pas detection in production without extra checks or human oversight.

Evidence Ref: §6; Table 7

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	96%	MFC 52%	+44pp	TriangleCOPA (best model flan-t5-xxl)	Figure 1; Table 7	Table 7
Accuracy	27%	MFC ~30%	-3pp	FauxPas-EAI (GPT-4)	§6; Table 7	Table 7
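
The Delta column is the difference between the model's accuracy and the baseline, in percentage points (pp). A small sketch of that arithmetic using the two rows above follows; the helper name `delta_pp` is hypothetical, and the FauxPas-EAI baseline is the approximate value ("~30%") given in the table.

```python
def delta_pp(value_pct, baseline_pct):
    """Difference in percentage points between a score and its baseline."""
    return value_pct - baseline_pct


# (dataset, model accuracy %, MFC baseline %) as listed in the table above.
rows = [
    ("TriangleCOPA", 96, 52),
    ("FauxPas-EAI", 27, 30),  # baseline is approximate (~30%)
]

for dataset, value, baseline in rows:
    print(f"{dataset}: {delta_pp(value, baseline):+d}pp")
```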

What To Try In 7 Days

Run your model on a small Adv-CSFB-style set with variations such as transparent containers, unreadable labels, late labels, and trusted testimony.

Compare multiple-choice (MC), probability-based, and Chain-of-Thought prompts on your tasks and report the differences.

Break datasets into question-type splits (facts vs beliefs) and check worst-case split performance.
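
The split-level check and the probing-method comparison above can be combined in one report. The sketch below assumes each evaluation result has already been reduced to a record with a probing method, a question-type split, and a correctness flag; the record fields, the method names, and the toy data are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_split(records):
    """Map (method, split) -> accuracy from dicts with method/split/correct."""
    totals = defaultdict(lambda: [0, 0])  # (method, split) -> [correct, total]
    for r in records:
        key = (r["method"], r["split"])
        totals[key][0] += int(r["correct"])
        totals[key][1] += 1
    return {key: c / n for key, (c, n) in totals.items()}


def worst_split_per_method(acc):
    """For each probing method, return its lowest-scoring (split, accuracy)."""
    worst = {}
    for (method, split), value in acc.items():
        if method not in worst or value < worst[method][1]:
            worst[method] = (split, value)
    return worst


# Toy records (hypothetical): two probing methods, two question types.
records = [
    {"method": "mc", "split": "fact", "correct": True},
    {"method": "mc", "split": "fact", "correct": True},
    {"method": "mc", "split": "belief", "correct": True},
    {"method": "mc", "split": "belief", "correct": False},
    {"method": "cot", "split": "fact", "correct": True},
    {"method": "cot", "split": "fact", "correct": False},
    {"method": "cot", "split": "belief", "correct": False},
    {"method": "cot", "split": "belief", "correct": False},
]

acc = accuracy_by_split(records)
worst = worst_split_per_method(acc)
for method, (split, value) in sorted(worst.items()):
    print(f"{method}: worst split = {split} ({value:.0%})")
```

Reporting the worst split rather than the average keeps a strong score on fact questions from masking a weak score on belief questions.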

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/salavi/Clever_Hans_or_N-ToM

Data URLs

https://github.com/salavi/Clever_Hans_or_N-ToM

Risks & Boundaries

Limitations

Datasets are limited in scope and size; ToM in real life is broader than these tests.

Some dataset items may be ambiguous; human label variation could affect scores.

When Not To Use

As proof that LLMs have human-like social cognition or consciousness.

To justify deploying models for sensitive social judgments without human oversight.

Failure Modes

Over-reliance on surface patterns and lexical matches (the Clever Hans effect).

High sensitivity to prompt format and probing method.
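
One way to surface these failure modes in your own tests is to score variants of the same scenario jointly, so that a scenario counts as solved only when every variant (for example, the false-belief and the true-belief version) is answered correctly. The sketch below is an assumption about how such a check might be set up, not the paper's scoring procedure; the tuple layout, variant names, and toy results are hypothetical.

```python
from collections import defaultdict


def joint_accuracy(results):
    """Share of scenarios whose variants were all answered correctly.

    results: iterable of (scenario_id, variant_name, correct) tuples.
    """
    by_scenario = defaultdict(list)
    for scenario_id, _variant, correct in results:
        by_scenario[scenario_id].append(bool(correct))
    if not by_scenario:
        raise ValueError("results must not be empty")
    solved = sum(all(flags) for flags in by_scenario.values())
    return solved / len(by_scenario)


# Toy results (hypothetical): the model handles both variants of s1,
# but only the false-belief variant of s2.
results = [
    ("s1", "false_belief", True),
    ("s1", "true_belief", True),
    ("s2", "false_belief", True),
    ("s2", "true_belief", False),
]

per_item = sum(correct for _, _, correct in results) / len(results)
joint = joint_accuracy(results)
print(f"per-item accuracy: {per_item:.2f}, joint accuracy: {joint:.2f}")
```

A large gap between per-item and joint accuracy suggests the model is matching a familiar pattern rather than tracking who knows what.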

Core Entities

Models

gpt-4-0314, gpt-3.5-turbo-0301, text-davinci-003, text-davinci-002, flan-t5-xxl, flan-ul2, j2-jumbo-instruct, j2-grande-instruct, j2-jumbo, j2-grande, j2-large, flan-t5-xl, flan-t5-large, flan-t5-base, flan-t5-small

Metrics

Accuracy

Datasets

TriangleCOPA, SocialIQa, ToMi, ToMi', ToM-k, Epistemic Reasoning, FauxPas-EAI, Adv-CSFB

Benchmarks

Adv-CSFB, ToMi', SocialIQa, TriangleCOPA, FauxPas-EAI, Epistemic Reasoning

You May Also Want to Read

Short, natural-looking token sequences can flip LLM judges to say 'Yes' on wrong answers; discovery and a small LoRA defense

Quantizing large multilingual LLMs often hides big drops for non‑Latin languages and hard tasks

Metamorphic tests show many LLM agents give different answers to the same problem when phrased differently

A meta-agent that auto-generates persona-driven adversarial tests and judges agents to find deeper failures fast

LLMsPark: a game-theory benchmark that tests LLMs as strategic, social agents
