LLMsPark: a game-theory benchmark that tests LLMs as strategic, social agents

Overview

Decision Snapshot: Needs Validation

The benchmark fills a gap by testing interactive, multi-agent strategic behavior with standard games and Elo ratings. Results show clear, game-specific strengths, but running the LLMs in parallel requires compute.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Junhao Chen, Jingbo Sun, Xiang Li, Haidong Xin, Yuhao Xue, Yibin Xu, Hao Zhao

Why It Matters For Business

LLMsPark helps teams pick the right model for interactive or strategic applications, because model strength varies by game type. Test candidate models in scenario-like games before deployment.

Who Should Care

Product Manager, ML Engineer, CTO, Engineering Lead

Summary TLDR

LLMsPark is a public, text-only benchmark that runs LLMs as autonomous players in five classic games (Prisoner's Dilemma, Trust, Nim, Dictator, Who Is Spy). It evaluates strategic and social behaviors (cooperation, deception, leadership) using per-game scores and an Elo-style ranking (K=32). Results on 15 models show no single winner: GPT-4 leads multi-round Prisoner's Dilemma and Trust; Qwen-14B-Chat tops Who Is Spy; other open-source models sometimes beat commercial ones. The platform and results are available at the project site.
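The Elo-style ranking mentioned above can be illustrated with a minimal sketch. The summary only states that K=32 is used; the 400-point logistic scale, the base-10 expectation formula, and the 1200 starting rating below are the conventional Elo defaults and are assumptions here, not details confirmed by the source. The function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Conventional Elo expected score of player A against player B.

    The 400-point scale is the standard Elo default (assumed, not from the source).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one game.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k=32 matches the K-factor reported in the summary.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two models start level (1200 is an assumed starting rating); A wins once.
    print(update_elo(1200.0, 1200.0, score_a=1.0))
```

With equal starting ratings the expected score is 0.5, so a single win moves the winner up by 16 points and the loser down by 16. A win against a much higher-rated opponent would move the ratings by closer to the full 32 points.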

Problem Statement

Most LLM benchmarks test static knowledge or single-turn tasks and miss multi-agent, interactive strategic behavior. Teams need a reproducible, game-based testbed to measure planning, deception, trust, and long-term strategy across models.

Main Contribution

A game-theoretic benchmark (LLMsPark) that runs LLMs as autonomous players across five classic games.

Behavioral analysis that categorizes emergent strategies (trust, confrontation, pretense, leadership, deception).

Key Findings

GPT-4 performed best in multi-round Prisoner's Dilemma and Trust Game.

Numbers: Prisoner's Dilemma (Multi) score = 1285.80; Trust (Multi) = 1247.80

Practical Use: Use GPT-4 when you need long-horizon social strategy and trust-building in multi-round interactions.

Evidence Ref: Table 1

Qwen-14B-Chat achieved top performance on the social-deduction game Who Is Spy.

Numbers: Who Is Spy score = 77 (highest among tested models)

Practical Use: For tasks needing subtle camouflage, selective voting, and multi-turn information filtering, test Qwen-14B-Chat as a candidate.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Prisoner's Dilemma (Multi) - GPT-4	1285.80	—	—	Prisoner's Dilemma (multi-round)	Table 1 reports GPT-4 PD (Multi) = 1285.80	Table 1
Who Is Spy - Qwen-14B-Chat	77	—	—	Who Is Spy (multi-player social deduction)	Table 1 lists Who Is Spy score = 77 for Qwen-14B-Chat (highest)	Table 1

What To Try In 7 Days

Run the full LLMsPark suite, or only the Who Is Spy game, on candidate models to check multi-turn reasoning and deception handling.

Compare models in single-round vs multi-round variants to surface short-term vs long-term strategy differences.

Add a small subset of open-source models (e.g., Phoenix, Baichuan) to your evaluation shortlist; they can beat commercial models in some games.
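The single-round versus multi-round comparison suggested above can be prototyped before any model is wired in. The following is a minimal sketch, not the LLMsPark harness: the payoff values are the common textbook Prisoner's Dilemma numbers (an assumption, since the summary does not give the benchmark's payoffs), and the two scripted strategies are hypothetical stand-ins for LLM players.

```python
from typing import Callable, List, Tuple

Move = str                              # "C" (cooperate) or "D" (defect)
History = List[Tuple[Move, Move]]       # (own move, opponent move) per round
Player = Callable[[History], Move]

# Textbook payoffs, assumed for illustration only (not taken from the paper).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def always_defect(history: History) -> Move:
    """Short-term strategy: defect regardless of history."""
    return "D"


def tit_for_tat(history: History) -> Move:
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not history else history[-1][1]


def play(player_a: Player, player_b: Player, rounds: int) -> Tuple[int, int]:
    """Play the given number of rounds and return total payoffs (a, b)."""
    hist_a: History = []
    hist_b: History = []
    total_a = total_b = 0
    for _ in range(rounds):
        move_a, move_b = player_a(hist_a), player_b(hist_b)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        total_a += pay_a
        total_b += pay_b
        # Each player sees the round from its own point of view.
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return total_a, total_b


if __name__ == "__main__":
    for rounds in (1, 5):
        print(rounds, play(tit_for_tat, always_defect, rounds))
        print(rounds, play(tit_for_tat, tit_for_tat, rounds))
```

To evaluate real models, each scripted function would be replaced by a call that sends the history to a model and parses its reply into "C" or "D". Under these assumed payoffs, defection wins a single round, while mutual cooperation accumulates more over five rounds, which is the short-term versus long-term gap the comparison is meant to surface.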

Agent Features

Memory

short-term memory and history tracking (record past rounds)

Planning

multi-round planning and opponent modeling

Frameworks

LLMsPark agent framework

Is Agentic

Yes

Architectures

Perception-Brain-Action generic agent architecture

Collaboration

multi-agent interaction and social reasoning
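The agent features listed above (short-term memory, history tracking, and a Perception-Brain-Action loop) can be sketched roughly as follows. This is an illustrative reading of those labels, not the LLMsPark implementation: the `Agent` class, its method names, the prompt layout, and the fallback rule for unparseable replies are all hypothetical, and `brain` is a plain callable standing in for a real LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agent:
    """Hypothetical Perception-Brain-Action agent with short-term memory."""
    name: str
    brain: Callable[[str], str]            # stand-in for an LLM: prompt -> reply
    memory: List[str] = field(default_factory=list)
    max_memory: int = 5                    # assumed window of remembered rounds

    def perceive(self, observation: str) -> str:
        """Perception: combine recent history and the new observation into a prompt."""
        recent = self.memory[-self.max_memory:]
        history = "\n".join(recent) if recent else "(no previous rounds)"
        return f"History:\n{history}\nObservation: {observation}\nYour move:"

    def act(self, observation: str, legal_moves: List[str]) -> str:
        """Brain + Action: query the model, validate the reply, record the round."""
        reply = self.brain(self.perceive(observation)).strip()
        # Fall back to the first legal move if the reply cannot be used as-is.
        move = reply if reply in legal_moves else legal_moves[0]
        self.memory.append(f"{observation} -> {move}")
        return move


if __name__ == "__main__":
    # A fixed-reply stub replaces the model so the sketch runs offline.
    agent = Agent("demo", brain=lambda prompt: "defect")
    print(agent.act("round 1", ["cooperate", "defect"]))
    print(agent.perceive("round 2"))
```

Validating the reply against the legal moves matters in practice because free-text model output does not always map onto a game action; a real harness might re-prompt instead of falling back to a default.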

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://llmsparks.github.io/

Data URLs

https://llmsparks.github.io/

Risks & Boundaries

Limitations

Some models use rigid strategies and fail to adapt to novel scenarios.

Latency and compute cost hinder fast-paced game performance.

When Not To Use

For single-turn classification or standard QA tasks where static benchmarks suffice.

For multimodal games; LLMsPark targets text-only agent interactions.

Failure Modes

Over-elaboration reveals a spy role, hurting social-deduction performance.

Safety-aligned default behavior leads to predictable, suboptimal short-term moves.

Core Entities

Models

Baichuan-7B, Baichuan2-7B-Chat, Phoenix-Inst-Chat-7B, ChatGLM-6B, ChatGLM2-6B, ChatGLM-Pro, ChatYuan-Large-v2, SFT, Dolly-v2-12B, CharacterGLM, GPT-3.5-Turbo, GPT-4, RWKV-4-World-7B, MiniMax-abab5-Chat, Qwen-14B-Chat

Metrics

Elo-style rating (K=32), Per-game score (game-specific payoff), Win/lose/draw outcomes

Datasets

Prisoner's Dilemma (simulated games), Trust Game (simulated games), Nim (simulated games), Dictator Game (simulated games), Who Is Spy (simulated social deduction games)

Benchmarks

LLMsPark
