Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
SmartPlay gives a quick, standardized way to test LLMs on interactive tasks that matter to automation: planning, handling randomness, and navigation. Use it to find failure modes before deploying agents.
Summary TLDR
SmartPlay is a released benchmark and API that turns six games (Bandits, Rock-Paper-Scissors, Tower of Hanoi, Messenger, Crafter, and a simplified Minecraft) into text-based agent tasks. It defines 9 agent capabilities (planning, learning from interactions, spatial reasoning, etc.) and automated metrics (reward, completion rate, score). Experiments show that GPT-4 variants lead other LLMs but still fall well short of human performance on complex tasks, with large gaps on Crafter, Tower of Hanoi, and Minecraft. Use SmartPlay to stress-test agent-like behaviors such as long-horizon planning, handling randomness, and spatial navigation.
Problem Statement
There is no standard, interactive benchmark that measures how well LLMs act as agents in environments that require planning, handling randomness, reasoning about spatial layout, and learning from interactions. Existing LLM tests focus on static reasoning or conversation, leaving a gap in agent evaluation.
Main Contribution
Public benchmark (SmartPlay) converting six games into text-based agent tasks with a unified OpenAI Gym API.
A capability taxonomy of 9 skills (e.g., planning, spatial reasoning, learning from interactions) and per-game difficulty grading.
Automated metrics (reward, completion rate, score) and recommended evaluation protocol.
Evaluation of 9 popular LLMs showing clear gaps between models and humans and highlighting weak skills (planning, spatial reasoning, error handling).
Released code and examples at github.com/microsoft/SmartPlay
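To make the "unified Gym API" idea concrete, here is a minimal sketch of a Gym-style interaction loop over a text-based game. `ToyTwoArmedBandit`, `run_episode`, and `greedy_policy` are hypothetical names invented for illustration; they are not taken from the SmartPlay codebase, and the payout probabilities are arbitrary. The sketch assumes the classic Gym convention in which `step` returns `(observation, reward, done, info)`, and the policy function stands in for an LLM call.

```python
import random


class ToyTwoArmedBandit:
    """Hypothetical text-based bandit; NOT SmartPlay's implementation."""

    def __init__(self, payout_probs=(0.2, 0.8), horizon=50, seed=0):
        self.payout_probs = payout_probs  # assumed, arbitrary values
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return "You face two slot machines. Choose arm 0 or arm 1."

    def step(self, action):
        # Classic Gym-style return: observation, reward, done, info.
        reward = 1.0 if self.rng.random() < self.payout_probs[action] else 0.0
        self.steps += 1
        done = self.steps >= self.horizon
        obs = f"You pulled arm {action} and received reward {reward}."
        return obs, reward, done, {"step": self.steps}


def greedy_policy(obs, history):
    """Placeholder for an LLM: try each arm once, then pick the best mean."""
    pulls = {0: [], 1: []}
    for action, reward in history:
        pulls[action].append(reward)
    for arm in (0, 1):
        if not pulls[arm]:
            return arm
    return max(pulls, key=lambda a: sum(pulls[a]) / len(pulls[a]))


def run_episode(env, policy):
    """Run one episode and return the total reward and the action log."""
    obs = env.reset()
    total, done, history = 0.0, False, []
    while not done:
        action = policy(obs, history)
        obs, reward, done, _info = env.step(action)
        history.append((action, reward))
        total += reward
    return total, history


if __name__ == "__main__":
    total, history = run_episode(ToyTwoArmedBandit(), greedy_policy)
    print(f"total reward over {len(history)} steps: {total}")
```

Because every game exposes the same `reset`/`step` shape, the same `run_episode` loop can in principle be reused across tasks, with only the policy's prompt changing.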
Key Findings
GPT-4 variants outperform other LLMs on SmartPlay games.
State-of-the-art LLMs still lag humans on complex agent tasks.
Open-source LLMs perform much worse than GPT-4 variants.
Spatial reasoning is a consistent weakness across models.
Results
GPT-4-0613 normalized score on Crafter
GPT-4-0314 normalized score on Minecraft
GPT-4-0613 normalized score on Bandits
llama-2-13b normalized score on Bandits
What To Try In 7 Days
Run SmartPlay on your current LLMs and compare Crafter and Minecraft performance to spot planning or spatial weaknesses.
Log and visualize action histories to detect forgetting and contradictory navigation behaviors.
Add simple state tracking (short-term memory) or action filters and re-run targeted games to measure improvement.
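The state-tracking and action-filter ideas above could be prototyped along these lines. This is a sketch under stated assumptions: `ShortTermMemory` and `filter_action` are hypothetical helpers, the action names are made up, and the substring matching is deliberately naive (it assumes no valid action name is contained in another).

```python
from collections import deque


class ShortTermMemory:
    """Keep only the most recent (observation, action) pairs for the prompt."""

    def __init__(self, max_steps=5):
        self.buffer = deque(maxlen=max_steps)  # oldest entries drop off

    def add(self, observation, action):
        self.buffer.append((observation, action))

    def as_prompt(self):
        return "\n".join(
            f"Observation: {obs}\nAction: {act}" for obs, act in self.buffer
        )


def filter_action(raw_reply, valid_actions, fallback):
    """Return the valid action mentioned earliest in the reply, else fallback.

    Guards against hallucinated actions by never passing free-form model
    text to the environment.
    """
    reply = raw_reply.lower()
    hits = [
        (reply.find(action.lower()), action)
        for action in valid_actions
        if action.lower() in reply
    ]
    return min(hits)[1] if hits else fallback


if __name__ == "__main__":
    actions = ["move left", "move right", "noop"]  # assumed action set
    memory = ShortTermMemory(max_steps=3)
    reply = "I think I should Move Left to reach the table."
    chosen = filter_action(reply, actions, fallback="noop")
    memory.add("A table is to your west.", chosen)
    print(memory.as_prompt())
```

Re-running a targeted game with and without the filter gives a direct measure of how many failures come from invalid actions rather than from poor decisions.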
Agent Features
Memory
- short-term history tracking (rollout history)
Planning
- in-context planning
- long-horizon planning
Tool Use
- action selection via environment API
Frameworks
- OpenAI Gym-style environment loop
Is Agentic
true
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Visual tasks are simplified into text descriptions, which loses low-level perception detail.
- Manuals/context strings sometimes omit necessary crafting details (Crafter), creating partial observability.
- Benchmark covers a limited set of games and may not capture every real-world agent scenario.
- Evaluation depends on human-normalized baselines collected by the authors.
When Not To Use
- When you need pixel-level vision or continuous low-level motor control benchmarking.
- As the sole validation for safety-critical autonomous systems.
- If you require end-to-end embodied agents with learned low-level controllers.
Failure Modes
- Forgetting intermediate world state and giving contradictory navigation commands.
- Hallucinating actions or misparsing manuals, leading to invalid moves.
- Poor recovery from mistakes that require multi-step re-planning.
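One way to surface the contradictory-navigation failure mode from logged action histories is to count immediate reversals, such as "left" followed by "right". The sketch below is an illustrative heuristic, not a metric defined by SmartPlay; the action names in `OPPOSITES` and the function names are assumptions.

```python
# Assumed movement vocabulary; adapt to the action names of the game in use.
OPPOSITES = {"left": "right", "right": "left", "up": "down", "down": "up"}


def count_reversals(actions):
    """Count consecutive pairs where a move is undone by the next move."""
    return sum(
        1 for prev, cur in zip(actions, actions[1:]) if OPPOSITES.get(prev) == cur
    )


def reversal_rate(actions):
    """Fraction of consecutive movement pairs that are reversals.

    Non-movement actions (e.g. "noop") are dropped first, so a reversal
    separated by a non-movement action is still counted.
    """
    moves = [a for a in actions if a in OPPOSITES]
    if len(moves) < 2:
        return 0.0
    return count_reversals(moves) / (len(moves) - 1)


if __name__ == "__main__":
    log = ["left", "right", "left", "up", "up", "down"]
    print(count_reversals(log), reversal_rate(log))
```

A high reversal rate does not prove the model forgot the world state (backtracking is sometimes correct), so it is best used as a flag for episodes worth inspecting by hand.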
Core Entities
Models
- GPT-4-0613
- GPT-4-0314
- text-davinci-003
- Claude
- Bard
- llama-2-13b
- llama-13b
- vicuna-13b
Metrics
- reward
- completion rate
- score
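The three metrics can be aggregated over episodes roughly as follows. This is a sketch, not SmartPlay's evaluation code: the function names are hypothetical, and the human normalization is assumed here to be a plain ratio of the model's score to a human baseline score, which may differ from the exact formula used by the authors.

```python
def mean_reward(episode_rewards):
    """Average total reward across episodes."""
    if not episode_rewards:
        raise ValueError("need at least one episode")
    return sum(episode_rewards) / len(episode_rewards)


def completion_rate(completed_flags):
    """Fraction of episodes in which the task was completed."""
    if not completed_flags:
        raise ValueError("need at least one episode")
    return sum(1 for done in completed_flags if done) / len(completed_flags)


def human_normalized(model_score, human_score):
    """Assumed normalization: model score as a fraction of the human baseline."""
    if human_score == 0:
        raise ValueError("human baseline must be non-zero")
    return model_score / human_score


if __name__ == "__main__":
    # Made-up numbers purely to show the call pattern.
    print(mean_reward([3.0, 5.0, 4.0]))
    print(completion_rate([True, False, True, True]))
    print(human_normalized(model_score=0.5, human_score=2.0))
```

Under this assumption a value of 1.0 means parity with the human baseline, which makes scores comparable across games whose raw scales differ.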
Datasets
- SmartPlay games (Bandits, Rock-Paper-Scissors, Tower of Hanoi, Messenger, Crafter, Minecraft)
Benchmarks
- SmartPlay
- BanditTwoArmedHighLowFixed-v0
- RockPaperScissorBasic-v0
- Hanoi3Disk-v0
- MessengerL1-v0
- MessengerL2-v0
- Crafter-v0
- MinedojoCreative0-v0

