Survey of how LLMs reason strategically in multi-agent games, economics, and social simulations

April 1, 2024 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.4

Cost Impact Score

0.3

Citation Count

6

Authors

Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang, Adrian de Wynter, Yan Xia, Wenshan Wu, Ting Song, Man Lan, Furu Wei

Links

Abstract / PDF

Why It Matters For Business

LLM-driven agents can model multi-party dynamics (negotiations, markets, simulations) and improve decision-making, but measurement and domain alignment matter more than raw model size.

Summary TLDR

This short survey organizes work on using large language models (LLMs) for strategic reasoning: anticipating and influencing other agents in multi-player settings. It defines strategic reasoning, groups applications into societal, economic, game-theory, and gaming domains, reviews methods (prompting, modular agents, theory-of-mind, imitation/RL), and argues for unified benchmarks and mixed quantitative/qualitative evaluation. The paper flags three gaps: missing standard benchmarks, uncertain scaling effects, and bias risks.

Problem Statement

Strategic reasoning means predicting and shaping others' actions in dynamic multi-agent settings. The field now has many ad hoc LLM uses across games, economics, and social simulation, but lacks a unified taxonomy, standardized benchmarks, and clear knowledge of what model sizes or methods reliably deliver human-like strategic abilities.

Main Contribution

Define strategic reasoning for LLMs and contrast it with other reasoning types.

Taxonomy of application scenarios: societal simulation, economic simulation, game theory, and gaming.

Survey methods to improve strategic reasoning: prompt engineering, modular agents, theory-of-mind, and imitation/RL.

Review evaluation practices and call for unified benchmarks and mixed quantitative/qualitative metrics.

Identify open challenges and research directions, including benchmark design and limits of scaling.

Key Findings

LLM strategic work spans four scenario families: societal, economic, game-theory, and gaming.

Numbers: 4 scenario categories

A modular agent (OG-Narrator) was reported to achieve a tenfold profit increase over baselines in a bargaining setting.

Numbers: 10× profitability vs baselines

Evaluations use outcome metrics (win/survival rates) and process metrics (opponent prediction accuracy) together.

There is no widely adopted unified benchmark for strategic reasoning.

Results

Metric: profitability (bargaining)

Value: 10×

Baseline: prior baselines

Who Should Care

What To Try In 7 Days

Run a small multi-agent simulation (e.g., auction or negotiation) with an off-the-shelf LLM and log win/profit outcomes.

Experiment with task-specific prompts and a simple deterministic module (price proposal or rule engine) to compare returns.

Evaluate both outcomes (win rate, profit) and process signals (opponent prediction accuracy, belief updates).
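
The three steps above can be sketched as a toy one-shot bargaining loop. This is a minimal sketch under stated assumptions, not a method from the survey: `seller_policy` and `buyer_policy` are hypothetical random stand-ins for the places where an LLM call would go, and the cost/value ranges and split-the-difference pricing rule are invented for illustration. It logs an outcome signal (profit, deal rate) and a process signal (how far the buyer's predicted ask was from the seller's actual ask).

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class RoundLog:
    deal: bool
    buyer_profit: float
    seller_profit: float
    prediction_error: float  # |buyer's predicted ask - seller's actual ask|


def seller_policy(cost: float, rng: random.Random) -> float:
    """Hypothetical stand-in for an LLM seller: ask for cost plus a markup."""
    return cost * (1.0 + rng.uniform(0.1, 0.5))


def buyer_policy(value: float, rng: random.Random) -> tuple[float, float]:
    """Hypothetical stand-in for an LLM buyer.

    Returns (offer, predicted_ask). The buyer never offers above its value.
    """
    predicted_ask = value * rng.uniform(0.6, 1.0)
    return min(value, predicted_ask), predicted_ask


def play_round(rng: random.Random) -> RoundLog:
    # Illustrative private cost and value; the ranges are assumptions.
    cost = rng.uniform(20.0, 60.0)
    value = rng.uniform(40.0, 100.0)
    ask = seller_policy(cost, rng)
    offer, predicted_ask = buyer_policy(value, rng)
    deal = offer >= ask
    # If the offer meets the ask, split the difference; otherwise no trade.
    price = (offer + ask) / 2.0 if deal else 0.0
    return RoundLog(
        deal=deal,
        buyer_profit=value - price if deal else 0.0,
        seller_profit=price - cost if deal else 0.0,
        prediction_error=abs(predicted_ask - ask),
    )


def simulate(n_rounds: int, seed: int = 0) -> list[RoundLog]:
    rng = random.Random(seed)  # seeded so runs are repeatable
    return [play_round(rng) for _ in range(n_rounds)]


def summarize(logs: list[RoundLog]) -> dict[str, float]:
    """Outcome metrics (deal rate, profit) plus one process metric."""
    return {
        "deal_rate": mean(1.0 if log.deal else 0.0 for log in logs),
        "mean_buyer_profit": mean(log.buyer_profit for log in logs),
        "mean_seller_profit": mean(log.seller_profit for log in logs),
        "mean_prediction_error": mean(log.prediction_error for log in logs),
    }


if __name__ == "__main__":
    print(summarize(simulate(200, seed=0)))
```

Swapping either policy function for a real model call leaves the logging and summary code unchanged, which makes it easy to compare prompts or modules on the same seeds.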

Agent Features

Memory

  • short-term dialog history
  • retrieval memory (historical game logs)
  • multi-frame summaries

Planning

  • K-level reasoning
  • chain-of-thought summaries
  • multi-frame summarization
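
K-level reasoning from the list above can be illustrated with the classic "guess p times the average" game. This is a minimal sketch, assuming a level-0 player guesses a fixed anchor of 50 and a level-k player best-responds to a population of level-(k-1) players while ignoring its own small effect on the average. The function names and default values are illustrative and do not come from the survey.

```python
def k_level_guess(k: int, p: float = 2 / 3, level0_guess: float = 50.0) -> float:
    """Guess of a level-k player in a 'guess p times the average' game.

    Level 0 plays the anchor guess. Each higher level assumes everyone
    else is one level below and responds with p times their guess.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    guess = level0_guess
    for _ in range(k):
        guess = p * guess
    return guess


def respond_to(opponent_level: int, p: float = 2 / 3) -> float:
    """Best response if opponents are believed to reason at opponent_level."""
    return k_level_guess(opponent_level + 1, p=p)


if __name__ == "__main__":
    for level in range(4):
        print(level, round(k_level_guess(level), 2))
```

Deeper levels push the guess toward zero, so the useful question for an agent is how deep its opponents are likely to reason, not how deep it can reason itself.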

Tool Use

  • external retrieval
  • deterministic submodules (quote generators)
  • summarization modules
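
The "deterministic submodule" idea can be sketched as a split between a rule that picks the number and a component that only words it. This is a hypothetical illustration of the general pattern, not the OG-Narrator implementation: the linear concession rule, the function names, and the template narrator (standing in for an LLM call) are all assumptions.

```python
from typing import Callable


def linear_concession_quote(
    round_idx: int, total_rounds: int, start_price: float, floor_price: float
) -> float:
    """Deterministic seller quote: concede linearly from start to floor."""
    if total_rounds <= 1:
        return round(floor_price, 2)
    # Clamp progress to [0, 1] so out-of-range rounds stay within bounds.
    progress = min(max(round_idx / (total_rounds - 1), 0.0), 1.0)
    return round(start_price - progress * (start_price - floor_price), 2)


def template_narrator(price: float) -> str:
    """Placeholder for an LLM that turns a fixed price into a message."""
    return f"I can offer this item for ${price:.2f}."


def make_offer(
    round_idx: int,
    total_rounds: int,
    start_price: float,
    floor_price: float,
    narrate: Callable[[float], str] = template_narrator,
) -> tuple[float, str]:
    """The rule decides the price; the narrator only decides the wording."""
    price = linear_concession_quote(round_idx, total_rounds, start_price, floor_price)
    return price, narrate(price)


if __name__ == "__main__":
    for i in range(5):
        print(make_offer(i, 5, start_price=100.0, floor_price=60.0))
```

Because the price never passes through the language model, the agent cannot hallucinate a quote below its floor, which addresses one of the failure modes listed later on this page.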

Frameworks

  • Alympics
  • LLMArena
  • GTBench
  • OpenToM

Is Agentic

true

Architectures

  • LLM-based agents (GPT-family)

Collaboration

  • multi-agent coordination
  • opponent modeling / theory-of-mind

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • No standard unified benchmark across diverse strategic domains.
  • Unclear mapping from model size/configuration to strategic ability.
  • Potential social and political biases when simulating human interactions.
  • Heterogeneous evaluation methods prevent direct comparisons.

When Not To Use

  • High-stakes or safety-critical decisions that need verifiable guarantees.
  • Real-time control where latency and sensor integration dominate.
  • Environments where precise numeric optimization is required without human-readable reasoning.

Failure Modes

  • Hallucinated strategies or incorrect parsing of action spaces.
  • Bias amplification in social or political simulations.
  • Brittleness to nonstationary or adversarial opponents.
  • Overreliance on scaling rather than structured modules or feedback.

Core Entities

Models

  • GPT-4
  • general LLMs (GPT-family and similar)

Metrics

  • win rate
  • survival rate
  • reward
  • Normalized Relative Advantage (NRA)
  • TrueSkill
  • Accuracy
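
Two of the metrics above are simple enough to sketch. Win rate is the share of games won. The NRA formula used here, (sum of A's rewards minus sum of B's rewards) divided by the sum of absolute rewards of both players, is an assumption about how the metric is commonly defined; the summary only names it, so check the benchmark's own definition before relying on it. The function names are illustrative.

```python
from typing import Sequence


def win_rate(outcomes: Sequence[str]) -> float:
    """Fraction of games labeled 'win'; draws and losses count as non-wins."""
    if not outcomes:
        return 0.0
    return sum(1 for outcome in outcomes if outcome == "win") / len(outcomes)


def normalized_relative_advantage(
    rewards_a: Sequence[float], rewards_b: Sequence[float]
) -> float:
    """Assumed NRA form: value in [-1, 1], positive when A out-earns B."""
    total_abs = sum(abs(r) for r in rewards_a) + sum(abs(r) for r in rewards_b)
    if total_abs == 0:
        return 0.0  # no reward signal at all, treat as no advantage
    return (sum(rewards_a) - sum(rewards_b)) / total_abs


if __name__ == "__main__":
    print(win_rate(["win", "loss", "win", "draw"]))
    print(normalized_relative_advantage([1, 1, -1], [-1, -1, 1]))
```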

Benchmarks

  • GTBench
  • LLMArena
  • Alympics
  • OpenToM
  • BigToM
  • WarAgent
  • AucArena
  • CompeteAI

Context Entities

Models

  • ChessGPT
  • Retroformer
  • Thinker
  • Suspicion-Agent