LLMs show some social reasoning but fail adversarial and robust tests

May 24, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

The paper provides multi-dataset, multi-model evidence that LLMs are brittle on social reasoning; findings are reproducible with the released code and data.

Citations: 36

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 25%

Novelty: 45%

Authors

Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, Vered Shwartz

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Don't assume LLMs understand people just because they give human-like answers; test models with adversarial and diverse benchmarks before using them for social judgments.

Summary TLDR

This paper evaluates 15 large language models on six Theory-of-Mind (ToM) benchmarks and a new adversarial dataset (Adv-CSFB). Models solve some narrow ToM-style tasks (e.g., flan-t5-xxl reaches 96% on TriangleCOPA) but drop to or below baseline on others (e.g., GPT-4 reaches only 27% story-level accuracy on FauxPas-EAI). Results are sensitive to the probing method and brittle under small adversarial changes, which suggests reliance on surface patterns rather than robust social reasoning. The authors release data and code and recommend automatic, multi-dataset evaluation over anecdotal claims.

Problem Statement

Do modern LLMs truly possess robust Theory-of-Mind (ToM) skills, or do they rely on shallow cues and dataset artifacts? The paper measures LLM ToM across six benchmarks, probes sensitivity to prompting styles, and introduces adversarial examples to test robustness.

Main Contribution

Large-scale meta-evaluation of 15 LLMs on six ToM-related benchmarks, each compared against a most-frequent-class (MFC) baseline.

New adversarial dataset Adv-CSFB that adds true-belief and adversarial variants to common false-belief tests.
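The comparison against a most-frequent-class baseline can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names are hypothetical, and it assumes the baseline is computed over the gold labels of the evaluated set.

```python
from collections import Counter


def mfc_accuracy(gold_labels):
    """Accuracy obtained by always predicting the most frequent gold label.

    Assumption: the baseline is computed on the same labels being evaluated.
    """
    if not gold_labels:
        raise ValueError("gold_labels must be non-empty")
    _, count = Counter(gold_labels).most_common(1)[0]
    return count / len(gold_labels)


def accuracy(predictions, gold_labels):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have equal length")
    if not gold_labels:
        raise ValueError("gold_labels must be non-empty")
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)


def delta_pp(model_acc, baseline_acc):
    """Difference between two accuracies, in percentage points."""
    return round((model_acc - baseline_acc) * 100, 1)
```

With the figures reported here, `delta_pp(0.96, 0.52)` gives the +44pp gap on TriangleCOPA, and `delta_pp(0.27, 0.30)` gives the roughly -3pp gap on FauxPas-EAI.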

Key Findings

Some models excel on narrow ToM-style tasks but not across the board

Numbers: TriangleCOPA, flan-t5-xxl 96% vs. MFC 52%

Practical Use: High accuracy on one dataset doesn't mean general social reasoning; test multiple datasets before trusting a model's ToM.

Evidence Ref: Figure 1; Table 7

Performance drops dramatically on some human-style social tests

Numbers: FauxPas-EAI story-level, GPT-4 27% (below simple baseline)

Practical Use: Don't use LLM answers on subtle social judgments or faux-pas detection in production without extra checks or human oversight.

Evidence Ref: §6; Table 7

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 96% | MFC 52% | +44pp | TriangleCOPA (best model: flan-t5-xxl) | Figure 1; Table 7 | Table 7
Accuracy | 27% | MFC ~30% | -3pp | FauxPas-EAI (GPT-4) | §6; Table 7 | Table 7

What To Try In 7 Days

Run your model on a small Adv-CSFB-style set with variants such as transparent containers, unreadable labels, late labels, and trusted testimony.

Compare multiple-choice (MC), probability-based, and Chain-of-Thought prompts on your tasks and report the differences.

Break datasets into question-type splits (facts vs beliefs) and check worst-case split performance.
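The split-level and prompt-level checks above can be sketched as a small grouping helper. This is an illustrative sketch under stated assumptions: the record fields `question_type`, `method`, and `correct` are hypothetical names for your own evaluation logs, not a format defined by the paper.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group value: accuracy} for records grouped by `key`.

    Each record is a dict with a boolean "correct" field (hypothetical schema).
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [num correct, num total]
    for rec in records:
        bucket = totals[rec[key]]
        bucket[0] += int(rec["correct"])
        bucket[1] += 1
    return {group: c / n for group, (c, n) in totals.items()}


def worst_split(records, key="question_type"):
    """Return (group, accuracy) for the lowest-scoring group under `key`."""
    scores = accuracy_by_group(records, key)
    if not scores:
        raise ValueError("records must be non-empty")
    worst = min(scores, key=scores.get)
    return worst, scores[worst]


if __name__ == "__main__":
    # Toy, made-up records purely to show the call pattern.
    demo = [
        {"question_type": "fact", "method": "mc", "correct": True},
        {"question_type": "belief", "method": "mc", "correct": False},
        {"question_type": "belief", "method": "cot", "correct": True},
    ]
    print(accuracy_by_group(demo, "method"))
    print(worst_split(demo))
```

Grouping by `method` exposes sensitivity to the probing style, while `worst_split` surfaces the weakest question type, which an aggregate accuracy can hide.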

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Datasets are limited in scope and size; ToM in real life is broader than these tests.

Some dataset items may be ambiguous; human label variation could affect scores.

When Not To Use

As proof that LLMs have human-like social cognition or consciousness.

To justify deploying models for sensitive social judgments without human oversight.

Failure Modes

Over-reliance on surface patterns and lexicon matches (Clever Hans effect).

High sensitivity to prompt format and probing method.

Core Entities

Models

gpt-4-0314, gpt-3.5-turbo-0301, text-davinci-003, text-davinci-002, flan-t5-xxl, flan-ul2, j2-jumbo-instruct, j2-grande-instruct, j2-jumbo, j2-grande, j2-large, flan-t5-xl, flan-t5-large, flan-t5-base, flan-t5-small

Metrics

Accuracy

Datasets

TriangleCOPA, SocialIQa, ToMi, ToMi', ToM-k, Epistemic Reasoning, FauxPas-EAI, Adv-CSFB

Benchmarks

Adv-CSFB, ToMi', SocialIQa, TriangleCOPA, FauxPas-EAI, Epistemic Reasoning