LLMsPark: a game-theory benchmark that tests LLMs as strategic, social agents

September 20, 2025 · 6 min

Overview

Decision Snapshot: Needs Validation

The benchmark fills a gap by testing interactive, multi-agent strategic behavior with standard games and Elo ratings. Results show clear, game-specific strengths, but running the benchmark requires compute for parallel LLM runs.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Junhao Chen, Jingbo Sun, Xiang Li, Haidong Xin, Yuhao Xue, Yibin Xu, Hao Zhao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMsPark helps pick the right model for interactive or strategic applications because models differ by game-style strength; test candidate models in scenario-like games before deployment.

Summary TLDR

LLMsPark is a public, text-only benchmark that runs LLMs as autonomous players in five classic games (Prisoner's Dilemma, Trust, Nim, Dictator, Who Is Spy). It evaluates strategic and social behaviors (cooperation, deception, leadership) using per-game scores and an Elo-style ranking (K=32). Results on 15 models show no single winner: GPT-4 leads multi-round Prisoner's Dilemma and Trust; Qwen-14B-Chat tops Who Is Spy; other open-source models sometimes beat commercial ones. The platform and results are available at the project site.
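The Elo-style ranking mentioned above can be illustrated with the standard Elo update rule. This is a minimal sketch, not the benchmark's own code: K=32 comes from the summary, while the starting rating of 1200, the 400-point scale, and the function names are assumptions made for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that player A beats player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


# Two models start level (assumed starting rating of 1200); model A wins once.
new_a, new_b = elo_update(1200.0, 1200.0, score_a=1.0)
print(new_a, new_b)  # 1216.0 1184.0
```

With equal ratings the expected score is 0.5, so a win moves each rating by half of K. An upset against a stronger opponent moves ratings further, which is why the order and pairing of games matters when comparing models.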

Problem Statement

Most LLM benchmarks test static knowledge or single-turn tasks and miss multi-agent, interactive strategic behavior. Teams need a reproducible, game-based testbed to measure planning, deception, trust, and long-term strategy across models.

Main Contribution

A game-theoretic benchmark (LLMsPark) that runs LLMs as autonomous players across five classic games.

Behavioral analysis that categorizes emergent strategies (trust, confrontation, pretense, leadership, deception).

Key Findings

GPT-4 performed best in multi-round Prisoner's Dilemma and Trust Game.

Numbers: Prisoner's Dilemma (Multi) score = 1285.80; Trust (Multi) = 1247.80

Practical Use: Use GPT-4 when you need long-horizon social strategy and trust-building in multi-round interactions.

Evidence Ref: Table 1
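To see why multi-round play rewards different behavior than a single round, here is a small iterated Prisoner's Dilemma sketch. The payoff values are the classic textbook ones and the two scripted strategies are stand-ins for LLM players; none of this is taken from the benchmark itself.

```python
from typing import Callable, List, Tuple

# Classic textbook payoffs (an assumption, not taken from the benchmark):
# (my_move, their_move) -> my payoff; "C" = cooperate, "D" = defect.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

History = List[Tuple[str, str]]
Strategy = Callable[[History], str]


def tit_for_tat(history: History) -> str:
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not history else history[-1][1]


def always_defect(history: History) -> str:
    """Defect regardless of what happened before."""
    return "D"


def play(a: Strategy, b: Strategy, rounds: int) -> Tuple[int, int]:
    """Play `rounds` rounds; each history entry is (own move, opponent move)."""
    hist_a: History = []
    hist_b: History = []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = a(hist_a), b(hist_b)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return score_a, score_b


print(play(tit_for_tat, always_defect, 1))   # (0, 5)
print(play(tit_for_tat, always_defect, 10))  # (9, 14)
print(play(tit_for_tat, tit_for_tat, 10))    # (30, 30)
```

In one round, defection wins outright. Over ten rounds, two reciprocating players each earn more than the defector does, which is the kind of long-horizon effect a multi-round variant is meant to expose.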

Qwen-14B-Chat achieved top performance on the social-deduction game Who Is Spy.

Numbers: Who Is Spy score = 77 (highest among tested models)

Practical Use: For tasks needing subtle camouflage, selective voting, and multi-turn information filtering, test Qwen-14B-Chat as a candidate.

Evidence Ref: Table 1
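The voting step of a social-deduction round can be sketched as a simple tally. The rule used here (the most-accused player is eliminated, and a tie for first place eliminates nobody) and the function name are assumptions for illustration; the benchmark's actual Who Is Spy rules may differ.

```python
from collections import Counter
from typing import Dict, Optional


def eliminate(votes: Dict[str, str]) -> Optional[str]:
    """Return the most-accused player, or None if there is no clear leader.

    `votes` maps each voter's name to the name of the player they accuse.
    """
    if not votes:
        return None
    ranked = Counter(votes.values()).most_common(2)
    # A tie for first place means nobody is eliminated (assumed rule).
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


round_votes = {"A": "C", "B": "C", "C": "A", "D": "C"}
print(eliminate(round_votes))           # C
print(eliminate({"A": "B", "B": "A"}))  # None
```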

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Prisoner's Dilemma (Multi) - GPT-4 | 1285.80 | | | Prisoner's Dilemma (multi-round) | Table 1 reports GPT-4 PD (Multi) = 1285.80 | Table 1 |
| Who Is Spy - Qwen-14B-Chat | 77 | | | Who Is Spy (multi-player social deduction) | Table 1 lists Who Is Spy score = 77 for Qwen-14B-Chat (highest) | Table 1 |

What To Try In 7 Days

Run the full LLMsPark suite, or only the Who Is Spy test, on candidate models to check multi-turn reasoning and deception handling.

Compare models in single-round vs multi-round variants to surface short-term vs long-term strategy differences.

Add a small subset of open-source models (e.g., Phoenix, Baichuan) to your evaluation shortlist; they can beat commercial models in some games.

Agent Features

Memory: short-term memory and history tracking (record past rounds)
Planning: multi-round planning and opponent modeling
Frameworks: LLMsPark agent framework
Is Agentic: Yes
Architectures: Perception-Brain-Action generic agent architecture
Collaboration: multi-agent interaction and social reasoning
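The memory and architecture entries above can be pictured as a small perceive, think, act loop with a bounded history. This is a hypothetical sketch: the class name `GameAgent`, its methods, and the toy `echo_brain` function are invented for illustration and are not the LLMsPark agent framework's API. The `brain` callable stands in for a real LLM call.

```python
from collections import deque
from typing import Callable, Deque, List


class GameAgent:
    """Hypothetical agent: perceive -> think (brain) -> act, with bounded memory."""

    def __init__(self, brain: Callable[[str], str], memory_size: int = 5):
        self.brain = brain  # stand-in for an LLM call: prompt in, text out
        self.memory: Deque[str] = deque(maxlen=memory_size)  # short-term memory

    def perceive(self, observation: str) -> None:
        """Record what happened in the latest round; oldest entries drop off."""
        self.memory.append(observation)

    def act(self, legal_moves: List[str]) -> str:
        """Build a prompt from remembered rounds and ask the brain for a move."""
        prompt = "History:\n" + "\n".join(self.memory)
        prompt += "\nChoose one of: " + ", ".join(legal_moves)
        reply = self.brain(prompt).strip()
        # Fall back to the first legal move if the reply is not usable.
        return reply if reply in legal_moves else legal_moves[0]


def echo_brain(prompt: str) -> str:
    """Toy brain: defect once an opponent defection appears in the history."""
    return "defect" if "opponent defected" in prompt else "cooperate"


agent = GameAgent(echo_brain, memory_size=2)
print(agent.act(["cooperate", "defect"]))  # cooperate
agent.perceive("round 1: opponent defected")
print(agent.act(["cooperate", "defect"]))  # defect
```

Bounding the history with `deque(maxlen=...)` keeps prompts short, at the cost of forgetting older rounds. Validating the reply against the legal moves guards against free-form model output.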

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Some models use rigid strategies and fail to adapt to novel scenarios.

Latency and compute cost hinder fast-paced game performance.

When Not To Use

For single-turn classification or standard QA tasks where static benchmarks suffice.

For multimodal games; LLMsPark targets text-only agent interactions.

Failure Modes

Over-elaboration reveals a spy role, hurting social-deduction performance.

Safety-aligned default behavior leads to predictable, suboptimal short-term moves.

Core Entities

Models

Baichuan-7B, Baichuan2-7B-Chat, Phoenix-Inst-Chat-7B, ChatGLM-6B, ChatGLM2-6B, ChatGLM-Pro, ChatYuan-Large-v2, SFT, Dolly-v2-12B, CharacterGLM, GPT-3.5-Turbo, GPT-4, RWKV-4-World-7B, MiniMax-abab5-Chat, Qwen-14B-Chat

Metrics

Elo-style rating (K=32), Per-game score (game-specific payoff), Win/lose/draw outcomes

Datasets

Prisoner's Dilemma (simulated games), Trust Game (simulated games), Nim (simulated games), Dictator Game (simulated games), Who Is Spy (simulated social deduction games)
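As a concrete example of how one of these simulated games assigns payoffs, here is a sketch of a single Trust Game round in its common textbook form. The multiplier of 3, the endowment, and the function name are assumptions; the benchmark's exact parameters are not stated here.

```python
def trust_game_payoffs(endowment, sent, returned, multiplier=3.0):
    """Payoffs for one round of a textbook trust (investment) game.

    The investor sends `sent` out of `endowment`; the trustee receives
    `sent * multiplier` and gives back `returned`.
    """
    if not 0 <= sent <= endowment:
        raise ValueError("sent must be between 0 and the endowment")
    received = sent * multiplier
    if not 0 <= returned <= received:
        raise ValueError("returned must be between 0 and the amount received")
    investor = endowment - sent + returned
    trustee = received - returned
    return investor, trustee


# Investor sends half; trustee splits the tripled amount evenly.
print(trust_game_payoffs(10, 5, 7.5))  # (12.5, 7.5)
# Investor sends nothing, so no trust is extended and nothing is gained.
print(trust_game_payoffs(10, 0, 0))    # (10, 0.0)
```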

Benchmarks

LLMsPark