Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
Models differ in which game styles they are strong at, so LLMsPark helps teams pick the right model for interactive or strategic applications. Test candidate models in scenario-like games before deployment.
Summary TLDR
LLMsPark is a public, text-only benchmark that runs LLMs as autonomous players in five classic games (Prisoner's Dilemma, Trust, Nim, Dictator, Who Is Spy). It evaluates strategic and social behaviors (cooperation, deception, leadership) using per-game scores and an Elo-style ranking (K=32). Results on 15 models show no single winner: GPT-4 leads multi-round Prisoner's Dilemma and Trust; Qwen-14B-Chat tops Who Is Spy; other open-source models sometimes beat commercial ones. The platform and results are available at the project site.
Problem Statement
Most LLM benchmarks test static knowledge or single-turn tasks and miss multi-agent, interactive strategic behavior. Teams need a reproducible, game-based testbed to measure planning, deception, trust, and long-term strategy across models.
Main Contribution
A game-theoretic benchmark (LLMsPark) that runs LLMs as autonomous players across five classic games.
Behavioral analysis that categorizes emergent strategies (trust, confrontation, pretense, leadership, deception).
Public release of the platform, model evaluations, and leaderboards at the project website.
Key Findings
GPT-4 performed best in multi-round Prisoner's Dilemma and Trust Game.
Qwen-14B-Chat achieved top performance on the social-deduction game Who Is Spy.
No model is uniformly best; different models top different games.
Some open-source models outperform or rival commercial models in specific settings.
Only a subset of models handled the complexity of Who Is Spy.
Results
Prisoner's Dilemma (Multi) - GPT-4
Who Is Spy - Qwen-14B-Chat
Prisoner's Dilemma (Single) - Phoenix-Inst-Chat-7B
Elo update K - 32
Who Should Care
What To Try In 7 Days
Run LLMsPark, or only its Who Is Spy game, on candidate models to check multi-turn reasoning and deception handling.
Compare models in single-round vs multi-round variants to surface short-term vs long-term strategy differences.
Add a small subset of open-source models (e.g., Phoenix, Baichuan) to your evaluation shortlist; they can beat commercial models in some games.
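The single-round versus multi-round comparison can be prototyped offline before any model is plugged in. The sketch below is a minimal iterated Prisoner's Dilemma loop with per-player history tracking. It is not LLMsPark's implementation: the payoff values (5/3/1/0), the function names, and the two scripted strategies standing in for LLM players are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Move = str  # "C" (cooperate) or "D" (defect)
History = List[Tuple[Move, Move]]  # (own move, opponent move) for each past round
Strategy = Callable[[History], Move]

# Assumed textbook payoffs; the benchmark's actual values are not given in this summary.
PAYOFFS: Dict[Tuple[Move, Move], Tuple[int, int]] = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def always_defect(history: History) -> Move:
    """Hypothetical stand-in player: ignores history and always defects."""
    return "D"


def tit_for_tat(history: History) -> Move:
    """Hypothetical stand-in player: cooperates first, then copies the opponent's last move."""
    return "C" if not history else history[-1][1]


def play(player_a: Strategy, player_b: Strategy, rounds: int) -> Tuple[int, int]:
    """Play `rounds` rounds and return the cumulative payoffs (a, b)."""
    hist_a: History = []
    hist_b: History = []
    score_a = score_b = 0
    for _ in range(rounds):
        # Each player sees only its own view of the past rounds.
        move_a, move_b = player_a(hist_a), player_b(hist_b)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return score_a, score_b
```

Calling `play(tit_for_tat, always_defect, 1)` and `play(tit_for_tat, always_defect, 10)` contrasts the single-round and multi-round settings. To test a real model, replace a scripted strategy with a function that formats the history into a prompt and parses the model's reply into "C" or "D".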
Agent Features
Memory
- short-term memory and history tracking (record past rounds)
Planning
- multi-round planning and opponent modeling
Frameworks
- LLMsPark agent framework
Is Agentic
true
Architectures
- Perception-Brain-Action generic agent architecture
Collaboration
- multi-agent interaction and social reasoning
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Some models use rigid strategies and fail to adapt to novel scenarios.
- Latency and compute cost hinder fast-paced game performance.
- Models often rely on known tactics and rarely explore novel strategies.
When Not To Use
- For single-turn classification or standard QA tasks where static benchmarks suffice.
- For multimodal games; LLMsPark targets text-only agent interactions.
- When compute budget cannot support concurrent multi-model evaluations.
Failure Modes
- Over-elaboration reveals a spy role, hurting social-deduction performance.
- Safety-aligned default behavior leads to predictable, suboptimal short-term moves.
- Strategic rigidity causes consistent losses against adaptive opponents.
Core Entities
Models
- Baichuan-7B
- Baichuan2-7B-Chat
- Phoenix-Inst-Chat-7B
- ChatGLM-6B
- ChatGLM2-6B
- ChatGLM-Pro
- ChatYuan-Large-v2
- SFT
- Dolly-v2-12B
- CharacterGLM
- GPT-3.5-Turbo
- GPT-4
- RWKV-4-World-7B
- MiniMax-abab5-Chat
- Qwen-14B-Chat
Metrics
- Elo-style rating (K=32)
- Per-game score (game-specific payoff)
- Win/lose/draw outcomes
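For readers unfamiliar with Elo-style ratings, the sketch below shows the standard Elo update applied to one win/lose/draw outcome. Only K=32 comes from the source. The 400-point logistic scale and the 1500 starting rating used in the usage note are the usual Elo conventions and are assumed here; the benchmark's exact variant may differ.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings after one game.

    score_a is 1.0 if A wins, 0.5 for a draw, and 0.0 if A loses.
    k=32 matches the K value reported for the benchmark.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b
```

With two models both starting at an assumed rating of 1500, `elo_update(1500, 1500, 1.0)` returns `(1516.0, 1484.0)`: the winner gains half of K because a win between equally rated players was expected with probability 0.5.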
Datasets
- Prisoner's Dilemma (simulated games)
- Trust Game (simulated games)
- Nim (simulated games)
- Dictator Game (simulated games)
- Who Is Spy (simulated social deduction games)
Benchmarks
- LLMsPark

