Survey of how LLMs reason strategically in multi-agent games, economics, and social simulations

April 1, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

The survey compiles existing studies and examples, but many claims rely on varied, nonstandardized evaluations; expect medium evidence strength and limited out-of-the-box production readiness.

Citations: 6

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/1

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 40%

Authors

Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang, Adrian de Wynter, Yan Xia, Wenshan Wu, Ting Song, Man Lan, Furu Wei

Why It Matters For Business

LLM-driven agents can model multi-party dynamics (negotiations, markets, simulations) and improve decision-making, but measurement and domain alignment matter more than raw model size.

Summary TLDR

This short survey organizes work on using large language models (LLMs) for strategic reasoning—anticipating and influencing other agents in multi-player settings. It defines strategic reasoning, groups applications into societal, economic, game-theory, and gaming domains, reviews methods (prompting, modular agents, theory-of-mind, imitation/RL), and argues for unified benchmarks and mixed quantitative/qualitative evaluation. The paper flags gaps: missing standard benchmarks, uncertain scaling effects, and bias risks.

Problem Statement

Strategic reasoning means predicting and shaping others' actions in dynamic multi-agent settings. The field now has many ad hoc LLM uses across games, economics, and social simulation, but lacks a unified taxonomy, standardized benchmarks, and clear knowledge of what model sizes or methods reliably deliver human-like strategic abilities.

Main Contribution

Define strategic reasoning for LLMs and contrast it with other reasoning types.

Taxonomy of application scenarios: societal simulation, economic simulation, game theory, and gaming.

Key Findings

LLM strategic work spans four scenario families: societal, economic, game-theory, and gaming.

Numbers: 4 scenario categories

Practical Use: Map your use case to one of these categories first; pick datasets, metrics, and baselines tailored to that domain.

Evidence Ref: Section 3, Figure 2

A modular agent (OG-Narrator) reported a tenfold profit boost over baselines in a bargaining context.

Numbers: 10× profitability vs baselines

Practical Use: For negotiation tasks, combine deterministic modules (e.g., controlled quote generator) with an LLM narrator to improve returns quickly.

Evidence Ref: Section 4, OG-Narrator example
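
The split between a deterministic offer module and a language model that only words the offer can be sketched as follows. This is a minimal illustration under stated assumptions, not the OG-Narrator implementation: the class names `QuoteGenerator` and `ModularBargainer`, the linear concession schedule, and the `template_narrator` function (a plain string template standing in for an LLM call) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QuoteGenerator:
    """Hypothetical deterministic offer module: concede linearly from an
    opening price to a reservation price over a fixed number of rounds."""
    opening: float
    reservation: float
    rounds: int

    def quote(self, round_idx: int) -> float:
        # Clamp the round index so late rounds never pass the reservation price.
        step = min(max(round_idx, 0), self.rounds - 1)
        frac = step / (self.rounds - 1) if self.rounds > 1 else 1.0
        return round(self.opening + frac * (self.reservation - self.opening), 2)


def template_narrator(price: float, round_idx: int) -> str:
    """Stand-in for an LLM call (assumption): it only phrases a given price."""
    return f"Round {round_idx + 1}: I can offer {price:.2f}."


@dataclass
class ModularBargainer:
    quotes: QuoteGenerator
    narrate: Callable[[float, int], str]

    def act(self, round_idx: int) -> tuple[float, str]:
        price = self.quotes.quote(round_idx)          # the number comes from code
        return price, self.narrate(price, round_idx)  # the wording comes from the narrator


seller = ModularBargainer(
    QuoteGenerator(opening=100.0, reservation=60.0, rounds=5),
    template_narrator,
)
offers = [seller.act(i) for i in range(5)]
for price, message in offers:
    print(price, "|", message)
```

Because the narrator never chooses the price, swapping the template for a real model call would change the wording but not the numbers, which keeps the pricing policy auditable.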

Results

Metric: profitability (bargaining)
Value: 10×
Baseline: prior baselines
Delta: 10×
Split / Dataset: bargaining/OG-Narrator study
Evidence: OG-Narrator incorporates deterministic quote module and LLM narrator
Evidence Ref: Section 4

What To Try In 7 Days

Run a small multi-agent simulation (e.g., auction or negotiation) with an off-the-shelf LLM and log win/profit outcomes.

Experiment with task-specific prompts and a simple deterministic module (price proposal or rule engine) to compare returns.

Evaluate both outcomes (win rate, profit) and process signals (opponent prediction accuracy, belief updates).
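
A starting harness for the steps above might look like the sketch below. It simulates a repeated first-price sealed-bid auction and logs outcome metrics (win rate, mean profit) alongside one process signal (how well agent A predicts B's bid). Everything here is an assumption made for illustration: the bidders are scripted with fixed shading factors in place of LLM agents, values are drawn uniformly from 0 to 100, and A's "belief" is just a running mean of B's past bids with a prior guess of 50.

```python
import random
from statistics import mean


def shaded_bid(value: float, shade: float) -> float:
    """Scripted stand-in for an LLM bidder: bid a fixed fraction of private value."""
    return value * shade


def run_auctions(n_rounds: int = 1000, seed: int = 0) -> dict:
    rng = random.Random(seed)  # seeded so runs are repeatable
    b_bid_sum, b_bid_count = 0.0, 0
    rows = []
    for _ in range(n_rounds):
        v_a, v_b = rng.uniform(0, 100), rng.uniform(0, 100)
        # Process signal: A predicts B's bid from the running mean of B's past
        # bids, falling back to a prior guess of 50.0 before any history exists.
        predicted_b = b_bid_sum / b_bid_count if b_bid_count else 50.0
        bid_a = shaded_bid(v_a, 0.5)  # agent under test
        bid_b = shaded_bid(v_b, 0.9)  # baseline opponent
        a_wins = bid_a > bid_b        # highest bid wins and pays its own bid
        rows.append({
            "a_wins": a_wins,
            "profit_a": v_a - bid_a if a_wins else 0.0,
            "profit_b": 0.0 if a_wins else v_b - bid_b,
            "pred_error": abs(predicted_b - bid_b),
        })
        b_bid_sum += bid_b  # belief update after observing B's bid
        b_bid_count += 1
    return {
        "rounds": len(rows),
        "win_rate_a": mean(1.0 if r["a_wins"] else 0.0 for r in rows),
        "mean_profit_a": mean(r["profit_a"] for r in rows),
        "mean_profit_b": mean(r["profit_b"] for r in rows),
        "mean_pred_error": mean(r["pred_error"] for r in rows),
    }


summary = run_auctions()
print(summary)
```

To compare a prompted LLM against a rule-based module, one option is to replace `shaded_bid` for agent A with a function that calls the model and parses a numeric bid, keeping the logging unchanged so the two runs are directly comparable.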

Agent Features

Memory: short-term dialog history, retrieval memory (historical game logs), multi-frame summaries

Planning: K-level reasoning, chain-of-thought summaries, multi-frame summarization

Tool Use: external retrieval, deterministic submodules (quote generators), summarization modules

Frameworks: Alympics, LLMArena, GTBench, OpenToM

Is Agentic: Yes

Architectures: LLM-based agents (GPT-family)

Collaboration: multi-agent coordination, opponent modeling / theory-of-mind
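
The K-level reasoning entry under Planning can be illustrated with the classical level-k model from behavioral game theory, applied to a "guess p times the average" game. This sketch shows the recursive idea only and is not the prompting method used in the surveyed work. The function name `level_k_guess`, the level-0 guess of 50, and the simplification that a player ignores their own effect on the average are assumptions.

```python
def level_k_guess(k: int, p: float = 2 / 3, level0_guess: float = 50.0) -> float:
    """Guess of a level-k player in a 'guess p times the average' game.

    Level 0 plays a fixed naive guess. Level k assumes every other player is
    level k-1 and responds with p times that guess (own influence on the
    average is ignored, which is reasonable with many players).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    guess = level0_guess
    for _ in range(k):
        guess = p * guess  # one more step of "I think that they think..."
    return guess


guesses = [level_k_guess(k) for k in range(4)]
for depth, guess in enumerate(guesses):
    print(f"level {depth}: {guess:.2f}")
```

With p below 1, each extra level of reasoning pushes the guess closer to 0, the game's Nash equilibrium, so the depth at which an agent stops is itself a measurable signal of strategic sophistication.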

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

No standard unified benchmark across diverse strategic domains.

Unclear mapping from model size/configuration to strategic ability.

When Not To Use

High‑stakes or safety‑critical decisions that need verifiable guarantees.

Real‑time control where latency and sensor integration dominate.

Failure Modes

Hallucinated strategies or incorrect parsing of action spaces.

Bias amplification in social or political simulations.

Core Entities

Models

GPT-4, general LLMs (GPT-family and similar)

Metrics

win rate, survival rate, reward, Normalized Relative Advantage (NRA), TrueSkill, accuracy

Benchmarks

GTBench, LLMArena, Alympics, OpenToM, BigToM, WarAgent, AucArena, CompeteAI

Context Entities

Models

ChessGPT, Retroformer, Thinker, Suspicion-Agent