LLMsPark: a game-theory benchmark that tests LLMs as strategic, social agents

September 20, 2025 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Junhao Chen, Jingbo Sun, Xiang Li, Haidong Xin, Yuhao Xue, Yibin Xu, Hao Zhao

Why It Matters For Business

Models differ in which kinds of games they are strong at, so LLMsPark can help you choose a model for interactive or strategic applications. Test candidate models in games that resemble your scenario before deployment.

Summary TLDR

LLMsPark is a public, text-only benchmark that runs LLMs as autonomous players in five classic games (Prisoner's Dilemma, Trust, Nim, Dictator, Who Is Spy). It evaluates strategic and social behaviors (cooperation, deception, leadership) using per-game scores and an Elo-style ranking (K=32). Results on 15 models show no single winner: GPT-4 leads multi-round Prisoner's Dilemma and Trust; Qwen-14B-Chat tops Who Is Spy; other open-source models sometimes beat commercial ones. The platform and results are available at the project site.

Problem Statement

Most LLM benchmarks test static knowledge or single-turn tasks and miss multi-agent, interactive strategic behavior. Teams need a reproducible, game-based testbed to measure planning, deception, trust, and long-term strategy across models.

Main Contribution

A game-theoretic benchmark (LLMsPark) that runs LLMs as autonomous players across five classic games.

Behavioral analysis that categorizes emergent strategies (trust, confrontation, pretense, leadership, deception).

Public release of the platform, model evaluations, and leaderboards at the project website.

Key Findings

GPT-4 performed best in multi-round Prisoner's Dilemma and Trust Game.

Numbers: Prisoner's Dilemma (Multi) score = 1285.80; Trust (Multi) = 1247.80

Qwen-14B-Chat achieved top performance on the social-deduction game Who Is Spy.

Numbers: Who Is Spy score = 77 (highest among tested models)

No model is uniformly best; different models top different games.

Numbers: Top scorers vary by game (e.g., GPT-4 PD multi 1285.80; Phoenix-Inst-Chat-7B PD single 1236.50; Qwen-14B-Chat Who Is Spy 77)

Some open-source models outperform or rival commercial models in specific settings.

Numbers: Phoenix-Inst-Chat-7B PD (Single) = 1236.50 > GPT-4 PD (Single) = 1062.07

Only a subset of models handled the complexity of Who Is Spy.

Numbers: 6 models handled Who Is Spy: Moss-Moon-003-SFT, ChatGLM-Pro, GPT-3.5-Turbo, GPT-4, MiniMax-abab5-Chat, Qwen-14B-Chat

Results

Prisoner's Dilemma (Multi) - GPT-4

Value: 1285.80

Who Is Spy - Qwen-14B-Chat

Value: 77

Prisoner's Dilemma (Single) - Phoenix-Inst-Chat-7B

Value: 1236.50

Baseline: GPT-4 single PD = 1062.07

Elo update K

Value: 32

What To Try In 7 Days

Run the full LLMsPark suite, or at least the Who Is Spy game, on candidate models to check multi-turn reasoning and deception handling.

Compare models in single-round vs multi-round variants to surface short-term vs long-term strategy differences.

Add a small subset of open-source models (e.g., Phoenix, Baichuan) to your evaluation shortlist; they can beat commercial models in some games.
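
The single-round vs multi-round comparison suggested above can be prototyped before any LLM is wired in. The following is a minimal sketch, not the LLMsPark harness: the payoff values (5/3/1/0) are the common textbook Prisoner's Dilemma numbers and are an assumption, since the payoffs used by the benchmark are not stated here. The names `play`, `always_defect`, and `tit_for_tat` are hypothetical, and the two hand-written policies stand in for model-backed players.

```python
from typing import Callable, List, Tuple

Move = str                              # "C" (cooperate) or "D" (defect)
History = List[Tuple[Move, Move]]       # (own move, opponent move) per round
Policy = Callable[[History], Move]

# Assumed textbook payoffs: (payoff to player A, payoff to player B).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def always_defect(history: History) -> Move:
    """Stub policy: defect regardless of history."""
    return "D"


def tit_for_tat(history: History) -> Move:
    """Stub policy: cooperate first, then copy the opponent's last move."""
    return "C" if not history else history[-1][1]


def play(policy_a: Policy, policy_b: Policy, rounds: int) -> Tuple[int, int]:
    """Play `rounds` rounds and return the total payoff of each player."""
    hist_a: History = []
    hist_b: History = []
    total_a = total_b = 0
    for _ in range(rounds):
        move_a = policy_a(hist_a)
        move_b = policy_b(hist_b)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        total_a += pay_a
        total_b += pay_b
        # Each player sees the round from its own point of view.
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return total_a, total_b


if __name__ == "__main__":
    print("single round:", play(always_defect, tit_for_tat, 1))
    print("ten rounds:  ", play(always_defect, tit_for_tat, 10))
    print("ten rounds, both tit-for-tat:", play(tit_for_tat, tit_for_tat, 10))
```

Under these assumed payoffs, defecting wins the single round, while two cooperating players earn more over ten rounds than a defector does. This is the short-term vs long-term gap the comparison is meant to surface. To test a real model, replace a stub policy with a function that prompts the model with the history and parses its reply.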

Agent Features

Memory

  • short-term memory and history tracking (record past rounds)

Planning

  • multi-round planning and opponent modeling

Frameworks

  • LLMsPark agent framework

Is Agentic

true

Architectures

  • Perception-Brain-Action generic agent architecture

Collaboration

  • multi-agent interaction and social reasoning
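
As an illustration of the Memory and Architectures entries above, here is a minimal sketch of a perception-brain-action loop with bounded short-term memory. It is not the LLMsPark agent framework: the class `GameAgent`, its methods, and the `llm` callable (prompt in, text out) are all hypothetical names chosen for this example. The keyword-matching parser is also an assumption about how a free-text reply might be mapped to a legal move.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GameAgent:
    name: str
    llm: Callable[[str], str]          # hypothetical: prompt -> free-text reply
    max_memory: int = 5                # rounds to keep; assumed to be >= 1
    memory: List[str] = field(default_factory=list)

    def perceive(self, observation: str) -> None:
        """Perception: record the latest round and drop the oldest entries."""
        self.memory.append(observation)
        del self.memory[:-self.max_memory]

    def think(self, rules: str) -> str:
        """Brain: build a prompt from rules plus remembered history."""
        history = "\n".join(self.memory) or "(no previous rounds)"
        prompt = f"{rules}\nHistory:\n{history}\nYour move:"
        return self.llm(prompt)

    def act(self, reply: str, legal_moves: List[str], default: str) -> str:
        """Action: map the reply to the first legal move it mentions."""
        text = reply.lower()
        for move in legal_moves:
            if move.lower() in text:
                return move
        return default                 # fall back when the reply is unparseable


if __name__ == "__main__":
    # A fixed-reply stub stands in for a real model call.
    agent = GameAgent("stub", llm=lambda prompt: "I will defect this round.")
    for i in range(7):
        agent.perceive(f"round {i}: you cooperated, opponent defected")
    reply = agent.think("Choose 'cooperate' or 'defect'.")
    print(agent.act(reply, ["cooperate", "defect"], default="cooperate"))
```

Keeping the parser separate from the model call makes it easy to log replies that fall back to the default move, which is one way to spot models that cannot follow a game's output format.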

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Some models use rigid strategies and fail to adapt to novel scenarios.
  • Latency and compute cost hinder fast-paced game performance.
  • Models often rely on known tactics and rarely explore novel strategies.

When Not To Use

  • For single-turn classification or standard QA tasks where static benchmarks suffice.
  • For multimodal games; LLMsPark targets text-only agent interactions.
  • When compute budget cannot support concurrent multi-model evaluations.

Failure Modes

  • Over-elaboration reveals a spy role, hurting social-deduction performance.
  • Safety-aligned default behavior leads to predictable, suboptimal short-term moves.
  • Strategic rigidity causes consistent losses against adaptive opponents.

Core Entities

Models

  • Baichuan-7B
  • Baichuan2-7B-Chat
  • Phoenix-Inst-Chat-7B
  • ChatGLM-6B
  • ChatGLM2-6B
  • ChatGLM-Pro
  • ChatYuan-Large-v2
  • Moss-Moon-003-SFT
  • Dolly-v2-12B
  • CharacterGLM
  • GPT-3.5-Turbo
  • GPT-4
  • RWKV-4-World-7B
  • MiniMax-abab5-Chat
  • Qwen-14B-Chat

Metrics

  • Elo-style rating (K=32)
  • Per-game score (game-specific payoff)
  • Win/lose/draw outcomes
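
To make the Elo-style metric concrete, the sketch below applies the standard Elo update with K=32 to a single win/lose/draw outcome. Only the K value comes from the summary above. The logistic expected-score formula with a 400-point scale is the conventional Elo definition and is an assumption about what "Elo-style" means here, as are the function names and the starting rating of 1200 used in the example.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Conventional Elo expectation that A beats B (400-point scale assumed)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings after one game.

    score_a is 1.0 if A wins, 0.5 for a draw, and 0.0 if A loses.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Two models start level (1200 is an assumed starting rating); A wins.
    print(elo_update(1200.0, 1200.0, 1.0))
    # A draw between equally rated models changes nothing.
    print(elo_update(1200.0, 1200.0, 0.5))
```

With K=32, a win between equally rated players moves each rating by 16 points, and the two ratings always change by equal and opposite amounts.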

Datasets

  • Prisoner's Dilemma (simulated games)
  • Trust Game (simulated games)
  • Nim (simulated games)
  • Dictator Game (simulated games)
  • Who Is Spy (simulated social deduction games)

Benchmarks

  • LLMsPark