Survey of how LLMs reason strategically in multi-agent games, economics, and social simulations

Overview

Decision Snapshot: Needs Validation

The survey compiles existing studies and examples, but many claims rely on varied, nonstandardized evaluations; expect medium evidence strength and limited out-of-the-box production readiness.

Citations: 6

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/1

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 40%

Authors

Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang, Adrian de Wynter, Yan Xia, Wenshan Wu, Ting Song, Man Lan, Furu Wei

Links

Abstract / PDF

Why It Matters For Business

LLM-driven agents can model multi-party dynamics (negotiations, markets, simulations) and improve decision-making, but measurement and domain alignment matter more than raw model size.

Who Should Care

Product Manager, CTO, ML Engineer, Founder, Data Scientist

Summary TLDR

This short survey organizes work on using large language models (LLMs) for strategic reasoning—anticipating and influencing other agents in multi-player settings. It defines strategic reasoning, groups applications into societal, economic, game-theory, and gaming domains, reviews methods (prompting, modular agents, theory-of-mind, imitation/RL), and argues for unified benchmarks and mixed quantitative/qualitative evaluation. The paper flags gaps: missing standard benchmarks, uncertain scaling effects, and bias risks.

Problem Statement

Strategic reasoning means predicting and shaping others' actions in dynamic multi-agent settings. The field now has many ad hoc LLM uses across games, economics, and social simulation, but lacks a unified taxonomy, standardized benchmarks, and clear knowledge of what model sizes or methods reliably deliver human-like strategic abilities.

Main Contribution

Defines strategic reasoning for LLMs and contrasts it with other reasoning types.

Provides a taxonomy of application scenarios: societal simulation, economic simulation, game theory, and gaming.

Key Findings

LLM strategic work spans four scenario families: societal, economic, game-theory, and gaming.

Numbers: 4 scenario categories

Practical Use: Map your use case to one of these categories first, then pick datasets, metrics, and baselines tailored to that domain.

Evidence Ref: Section 3, Figure 2

A modular agent (OG-Narrator) reported a tenfold profit boost over baselines in a bargaining context.

Numbers: 10× profitability vs baselines

Practical Use: For negotiation tasks, combine deterministic modules (e.g., a controlled quote generator) with an LLM narrator to improve returns quickly.

Evidence Ref: Section 4, OG-Narrator example
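
The modular pattern described above (a deterministic quote module paired with a narrator) can be sketched as follows. This is a minimal illustration under stated assumptions, not the OG-Narrator implementation: the class `QuoteGenerator`, the linear concession schedule, and the functions `narrate` and `seller_turn` are all hypothetical, and the narrator is a plain template standing in for an LLM call.

```python
from dataclasses import dataclass


@dataclass
class QuoteGenerator:
    """Hypothetical deterministic price module: concede linearly from an
    opening ask toward a reservation price over a fixed number of rounds."""
    opening_ask: float
    reservation: float
    rounds: int  # assumed to be at least 2

    def quote(self, round_idx: int) -> float:
        # Clamp the round index so the quote never drops below the reservation price.
        step = min(max(round_idx, 0), self.rounds - 1)
        span = self.opening_ask - self.reservation
        return round(self.opening_ask - span * step / (self.rounds - 1), 2)


def narrate(price: float) -> str:
    """Stand-in for the LLM narrator. A real system would ask a model to
    phrase the offer; a fixed template keeps this sketch runnable offline."""
    return f"I can offer this item for ${price:.2f}."


def seller_turn(gen: QuoteGenerator, round_idx: int, buyer_bid: float):
    """Accept if the bid meets the current quote, otherwise counter with it."""
    price = gen.quote(round_idx)
    if buyer_bid >= price:
        return "accept", buyer_bid, f"Deal at ${buyer_bid:.2f}."
    return "counter", price, narrate(price)


gen = QuoteGenerator(opening_ask=100.0, reservation=60.0, rounds=5)
for i, bid in enumerate([50.0, 65.0, 85.0]):
    print(seller_turn(gen, i, bid))
```

The design point is that the price never passes through the language model, so the agent cannot talk itself into a quote below its reservation price; the model only controls wording.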

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
profitability (bargaining)	10×	prior baselines	10×	bargaining/OG-Narrator study	OG-Narrator incorporates deterministic quote module and LLM narrator	Section 4

What To Try In 7 Days

Run a small multi-agent simulation (e.g., auction or negotiation) with an off-the-shelf LLM and log win/profit outcomes.

Experiment with task-specific prompts and a simple deterministic module (price proposal or rule engine) to compare returns.

Evaluate both outcomes (win rate, profit) and process signals (opponent prediction accuracy, belief updates).
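
As a starting point for the steps above, the sketch below runs a repeated first-price sealed-bid auction and logs both outcome metrics (win rate, profit) and a process signal (opponent prediction error with a simple belief update). Everything here is an illustrative assumption: the function `simulate`, the value and bid ranges, the bidding rule, and the update rate are hypothetical, and the random opponent is a stub where an LLM-driven agent would be plugged in.

```python
import random
from statistics import mean


def simulate(n_rounds: int = 200, seed: int = 0) -> dict:
    """Repeated first-price sealed-bid auction: one agent vs. one opponent.
    All numeric ranges and rates are illustrative assumptions."""
    rng = random.Random(seed)
    belief = 50.0  # agent's running estimate of the opponent's bid
    wins, profits, errors = 0, [], []
    for _ in range(n_rounds):
        value = rng.uniform(50, 100)   # agent's private value for the item
        opp_bid = rng.uniform(40, 80)  # opponent policy (stub for an LLM agent)
        # Bid slightly above the predicted opponent bid, never above own value.
        bid = min(value, belief + 1.0)
        won = bid > opp_bid
        wins += won
        profits.append(value - bid if won else 0.0)
        errors.append(abs(belief - opp_bid))  # process signal: prediction error
        belief += 0.1 * (opp_bid - belief)    # belief update (exponential average)
    return {
        "win_rate": wins / n_rounds,
        "mean_profit": mean(profits),
        "mean_prediction_error": mean(errors),
    }


print(simulate())
```

Fixing the seed makes runs comparable, so swapping the bidding rule for a prompt-driven agent or a rule engine changes only one variable at a time.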

Agent Features

Memory

short-term dialog history, retrieval memory (historical game logs), multi-frame summaries

Planning

K-level reasoning, chain-of-thought summaries, multi-frame summarization
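
To make K-level reasoning concrete, here is a small worked sketch using the classic "guess p times the average" game, which is an assumption chosen for illustration and not a setup taken from the survey. A level-0 player guesses a fixed anchor; a level-k player best-responds to a population of level-(k-1) players. The function name `k_level_guess` and its defaults are hypothetical.

```python
def k_level_guess(k: int, p: float = 2 / 3, level0: float = 50.0) -> float:
    """Guess of a level-k reasoner in the 'guess p times the average' game.

    Level 0 guesses `level0`. Level k assumes everyone else reasons at
    level k-1 and therefore guesses p times the level-(k-1) guess.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    guess = level0
    for _ in range(k):
        guess = p * guess  # best response to a population of level-(k-1) players
    return guess


for k in range(4):
    print(k, round(k_level_guess(k), 2))
```

Each extra level of reasoning shrinks the guess by a factor of p, so deeper reasoners move toward the equilibrium guess of 0.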

Tool Use

external retrieval, deterministic submodules (quote generators), summarization modules

Frameworks

Alympics, LLMArena, GTBench, OpenToM

Is Agentic

Yes

Architectures

LLM-based agents (GPT-family)

Collaboration

multi-agent coordination, opponent modeling / theory-of-mind

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

No standard unified benchmark across diverse strategic domains.

Unclear mapping from model size/configuration to strategic ability.

When Not To Use

High‑stakes or safety‑critical decisions that need verifiable guarantees.

Real‑time control where latency and sensor integration dominate.

Failure Modes

Hallucinated strategies or incorrect parsing of action spaces.

Bias amplification in social or political simulations.

Core Entities

Models

GPT-4, general LLMs (GPT-family and similar)

Metrics

win rate, survival rate, reward, Normalized Relative Advantage (NRA), TrueSkill, accuracy

Benchmarks

GTBench, LLMArena, Alympics, OpenToM, BigToM, WarAgent, AucArena, CompeteAI

Context Entities

Models

ChessGPT, Retroformer, Thinker, Suspicion-Agent
