Overview
The benchmark fills a gap by testing interactive, multi-agent strategic behavior using standard games and Elo ratings. Results show clear, game-specific strengths, but the benchmark needs compute to run LLMs in parallel.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Models differ in strength by game style, so LLMsPark helps pick the right model for interactive or strategic applications. Test candidate models in scenario-like games before deployment.
Who Should Care
Summary TLDR
LLMsPark is a public, text-only benchmark that runs LLMs as autonomous players in five classic games (Prisoner's Dilemma, Trust, Nim, Dictator, Who Is Spy). It evaluates strategic and social behaviors (cooperation, deception, leadership) using per-game scores and an Elo-style ranking (K=32). Results on 15 models show no single winner: GPT-4 leads multi-round Prisoner's Dilemma and Trust; Qwen-14B-Chat tops Who Is Spy; other open-source models sometimes beat commercial ones. The platform and results are available at the project site.
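The Elo-style ranking mentioned above can be illustrated with a minimal sketch. The summary states only that K=32; the logistic expected-score formula with a 400-point scale is the conventional Elo definition and is assumed here, and the function names `expected_score` and `elo_update` are hypothetical, not taken from LLMsPark.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of player A against player B (standard Elo)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 matches the K-factor reported in the summary; the rest is assumed.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated models; A wins one game.
    print(elo_update(1200.0, 1200.0, 1.0))
```

With equal starting ratings the expected score is 0.5, so a single win moves each rating by half of K in opposite directions.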
Problem Statement
Most LLM benchmarks test static knowledge or single-turn tasks and miss multi-agent, interactive strategic behavior. Teams need a reproducible, game-based testbed to measure planning, deception, trust, and long-term strategy across models.
Main Contribution
A game-theoretic benchmark (LLMsPark) that runs LLMs as autonomous players across five classic games.
Behavioral analysis that categorizes emergent strategies (trust, confrontation, pretense, leadership, deception).
Key Findings
GPT-4 performed best in multi-round Prisoner's Dilemma and Trust Game.
Qwen-14B-Chat achieved top performance on the social-deduction game Who Is Spy.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Prisoner's Dilemma (Multi) - GPT-4 | 1285.80 | — | — | Prisoner's Dilemma (multi-round) | Table 1 reports GPT-4 PD (Multi) = 1285.80 | Table 1 |
| Who Is Spy - Qwen-14B-Chat | 77 | — | — | Who Is Spy (multi-player social deduction) | Table 1 lists Who Is Spy score = 77 for Qwen-14B-Chat (highest) | Table 1 |
What To Try In 7 Days
Run LLMsPark, or just its Who Is Spy game, on candidate models to check multi-turn reasoning and deception handling.
Compare models in single-round vs multi-round variants to surface short-term vs long-term strategy differences.
Add a small subset of open-source models (e.g., Phoenix, Baichuan) to your evaluation shortlist; they can beat commercial models in some games.
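The single-round versus multi-round comparison suggested above can be prototyped before wiring in real models. The sketch below is an illustration under stated assumptions, not LLMsPark's harness: the payoff values are the textbook Prisoner's Dilemma matrix (not taken from the paper), and the scripted policies `always_defect` and `tit_for_tat` are hypothetical stand-ins for LLM players.

```python
from typing import Callable, List, Tuple

Move = str  # "C" (cooperate) or "D" (defect)
Policy = Callable[[List[Move], List[Move]], Move]

# Assumed textbook payoffs: (payoff to A, payoff to B) for each move pair.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def always_defect(own: List[Move], opp: List[Move]) -> Move:
    """Stand-in policy: defect regardless of history."""
    return "D"


def tit_for_tat(own: List[Move], opp: List[Move]) -> Move:
    """Stand-in policy: cooperate first, then copy the opponent's last move."""
    return opp[-1] if opp else "C"


def play(policy_a: Policy, policy_b: Policy, rounds: int) -> Tuple[int, int]:
    """Play the given number of rounds and return cumulative scores (A, B)."""
    hist_a: List[Move] = []
    hist_b: List[Move] = []
    score_a = score_b = 0
    for _ in range(rounds):
        # Each policy sees its own history first, then the opponent's.
        move_a = policy_a(hist_a, hist_b)
        move_b = policy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print("single round:", play(tit_for_tat, always_defect, 1))
    print("ten rounds:  ", play(tit_for_tat, always_defect, 10))
```

To evaluate a real model, one would replace a scripted policy with a function that prompts the model with the move history and parses its reply into "C" or "D"; that wrapper is left out here because its prompt format would be a guess.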
Agent Features
Memory
Planning
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Risks & Boundaries
Limitations
Some models use rigid strategies and fail to adapt to novel scenarios.
Latency and compute cost hinder fast-paced game performance.
When Not To Use
For single-turn classification or standard QA tasks where static benchmarks suffice.
For multimodal games; LLMsPark targets text-only agent interactions.
Failure Modes
Over-elaborate descriptions reveal a model's spy role, hurting social-deduction performance.
Safety-aligned default behavior leads to predictable, suboptimal short-term moves.

