SmartPlay: a multi-game benchmark to test LLMs as interactive agents

October 2, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

SmartPlay is ready for research and pre-production stress tests; it highlights real gaps but is not a turnkey solution for production agent deployments.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Yue Wu, Xuan Tang, Tom M. Mitchell, Yuanzhi Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SmartPlay gives a quick, standardized way to test LLMs on interactive tasks that matter to automation: planning, handling randomness, and navigation. Use it to find failure modes before deploying agents.

Who Should Care

Summary TLDR

SmartPlay is a released benchmark and API that turns six games (Bandits, Rock-Paper-Scissors, Tower of Hanoi, Messenger, Crafter, simplified Minecraft) into text-based agent tasks. It defines 9 agent capabilities (planning, learning from interactions, spatial reasoning, etc.) and automated metrics (reward, completion rate, score). Experiments show GPT-4 variants lead other LLMs but still fall well short of human performance on complex tasks (big gaps on Crafter, Hanoi, Minecraft). Use SmartPlay to stress-test agent-like behaviors such as long-horizon planning, handling randomness, and spatial navigation.

Problem Statement

There is no standard, interactive benchmark that measures how well LLMs act as agents in environments with planning, randomness, spatial layout, and learning from interactions. Existing LLM tests focus on static reasoning or conversation, leaving a gap for agent evaluation.

Main Contribution

Public benchmark (SmartPlay) converting six games into text-based agent tasks with a unified OpenAI Gym API.

A capability taxonomy of 9 skills (e.g., planning, spatial reasoning, learning from interactions) and per-game difficulty grading.
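To make the "unified Gym API" idea concrete, here is a minimal sketch of a Gym-style interaction loop over a text environment. It does not use the SmartPlay package: `ToyTextBandit`, `run_episode`, and `always_arm_one` are hypothetical names, and the toy bandit only assumes the classic `reset()` / `step(action)` convention that returns an observation, reward, done flag, and info dict.

```python
import random


class ToyTextBandit:
    """Hypothetical two-armed bandit exposing a Gym-style reset/step interface."""

    def __init__(self, p_arms=(0.2, 0.8), horizon=50, seed=0):
        self.p_arms = p_arms          # payout probability of each arm (assumed values)
        self.horizon = horizon        # number of pulls per episode
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return "You face two slot machines. Choose arm 0 or arm 1."

    def step(self, action):
        reward = 1.0 if self.rng.random() < self.p_arms[action] else 0.0
        self.t += 1
        done = self.t >= self.horizon
        obs = f"You pulled arm {action} and received reward {reward:.0f}."
        return obs, reward, done, {}


def run_episode(env, policy):
    """Run one episode; `policy` maps a text observation to an action."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(obs)
        obs, reward, done, _info = env.step(action)
        total += reward
    return total


def always_arm_one(obs):
    # Stand-in for an LLM call that reads the text observation and picks an action.
    return 1


total_reward = run_episode(ToyTextBandit(), always_arm_one)
print(total_reward)
```

In a real evaluation, the policy function would wrap a model call, and the same loop would run unchanged across games because only the observation text and action set differ.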

Key Findings

GPT-4 variants outperform other LLMs on SmartPlay games.

Numbers: >20% gap vs. other proprietary models on most games

Practical Use: For agent-style tasks, prefer GPT-4-class models for best out-of-the-box behavior but expect limits on harder domains.

Evidence Ref: Section 5.1, Table 2

State-of-the-art LLMs still lag humans on complex agent tasks.

Numbers: Human minus GPT-4: Hanoi ~10%, Minecraft ~40%, Crafter ~70%

Practical Use: Do not rely solely on current LLMs for long-horizon planning or 3D navigation; combine with planning modules or specialized control.

Evidence Ref: Section 5.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GPT-4-0613 normalized score on Crafter | 0.26 (human=1.0) | Human baseline = 1.0 | -0.74 | Crafter-v0 | Table 2: GPT-4-0613 = 0.26 | Section 5.1, Table 2
GPT-4-0314 normalized score on Minecraft | 0.59 (human=1.0) | Human baseline = 1.0 | -0.41 | MinedojoCreative0-v0 | Table 2: GPT-4-0314 = 0.59 | Section 5.1, Table 2
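The deltas in the results above are the human-normalized score minus the human baseline of 1.0. The short sketch below reproduces that arithmetic from the two reported scores; the names `HUMAN_BASELINE`, `scores`, and `delta_vs_human` are illustrative and not part of the benchmark.

```python
HUMAN_BASELINE = 1.0  # scores are normalized so that human performance equals 1.0

# Normalized scores as reported in the results above (Table 2 of the paper).
scores = {
    ("GPT-4-0613", "Crafter-v0"): 0.26,
    ("GPT-4-0314", "MinedojoCreative0-v0"): 0.59,
}


def delta_vs_human(score, baseline=HUMAN_BASELINE):
    """Gap to the human baseline, rounded to two decimals to hide float noise."""
    return round(score - baseline, 2)


deltas = {key: delta_vs_human(value) for key, value in scores.items()}

for (model, env_name), delta in deltas.items():
    print(f"{model} on {env_name}: {delta:+.2f}")
```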

What To Try In 7 Days

Run SmartPlay on your current LLMs and compare Crafter and Minecraft performance to spot planning or spatial weaknesses.

Log and visualize action histories to detect forgetting and contradictory navigation behaviors.

Add simple state tracking (short-term memory) or action filters and re-run targeted games to measure improvement.
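As one possible starting point for the third step, the sketch below pairs a bounded rollout history with a simple action filter. It is an assumption-laden illustration, not SmartPlay code: `RolloutMemory`, `filter_action`, the prompt format, and the action strings are all hypothetical.

```python
from collections import deque


class RolloutMemory:
    """Keeps the last `max_steps` (observation, action) pairs for prompting."""

    def __init__(self, max_steps=5):
        self.steps = deque(maxlen=max_steps)  # oldest entries drop off automatically

    def add(self, observation, action):
        self.steps.append((observation, action))

    def as_prompt(self):
        # Hypothetical prompt format; adapt to whatever the model expects.
        return "\n".join(f"obs: {o} | action: {a}" for o, a in self.steps)


def filter_action(proposed, valid_actions, fallback):
    """Return the model's action if it is valid, otherwise a safe fallback."""
    cleaned = proposed.strip().lower()
    return cleaned if cleaned in valid_actions else fallback


memory = RolloutMemory(max_steps=2)
memory.add("wall to the north", "move east")
memory.add("tree to the east", "collect wood")
memory.add("holding wood", "move south")  # evicts the oldest step

valid = {"move east", "move south", "collect wood", "noop"}
print(memory.as_prompt())
print(filter_action("  Move East ", valid, fallback="noop"))
print(filter_action("teleport home", valid, fallback="noop"))
```

Re-running a targeted game with and without these two pieces gives a rough measure of how much of the failure rate comes from forgetting versus invalid actions.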

Agent Features

Memory
short-term history tracking (rollout history)
Planning
in-context planning, long-horizon planning
Tool Use
action selection via environment API
Frameworks
OpenAI Gym-style environment loop
Is Agentic

Yes

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Visual tasks are simplified into text descriptions, which loses low-level perception detail.

Manuals/context strings sometimes omit necessary crafting details (Crafter), creating partial observability.

When Not To Use

When you need pixel-level vision or continuous low-level motor control benchmarking.

As the sole validation for safety-critical autonomous systems.

Failure Modes

Forgetting intermediate world state and giving contradictory navigation commands.

Hallucinating actions or mis-parsing manuals leading to invalid moves.
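The contradictory-navigation failure mode can be screened for in logged action histories. The sketch below flags immediate reversals (for example, "north" directly followed by "south") under the assumption that actions are logged as plain direction strings; `OPPOSITES`, `find_reversals`, and `reversal_rate` are hypothetical helpers. A reversal is not always a mistake, so treat a high rate as a signal to inspect the trace rather than as proof of forgetting.

```python
# Assumed direction vocabulary; real environments may use different action names.
OPPOSITES = {"north": "south", "south": "north", "east": "west", "west": "east"}


def find_reversals(actions):
    """Return indices i where actions[i] directly undoes actions[i - 1]."""
    return [
        i
        for i in range(1, len(actions))
        if OPPOSITES.get(actions[i - 1]) == actions[i]
    ]


def reversal_rate(actions):
    """Fraction of consecutive action pairs that are immediate reversals."""
    if len(actions) < 2:
        return 0.0
    return len(find_reversals(actions)) / (len(actions) - 1)


log = ["north", "south", "east", "east", "west"]
print(find_reversals(log))
print(reversal_rate(log))
```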

Core Entities

Models

GPT-4-0613, GPT-4-0314, text-davinci-003, Claude, Bard, llama-2-13b, llama-13b, vicuna-13b

Metrics

reward, completion rate, score
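As a rough illustration of how such metrics might be aggregated across episodes, the sketch below computes a mean reward and a completion rate. The exact definitions used by the benchmark are not given here, so both formulas are assumptions, `summarize` is a hypothetical helper, and the episode records are placeholder values rather than results from the paper.

```python
from statistics import mean

# Placeholder episode records for illustration only (not benchmark results).
episodes = [
    {"reward": 1.0, "completed": True},
    {"reward": 0.0, "completed": False},
    {"reward": 0.5, "completed": True},
    {"reward": 0.5, "completed": True},
]


def summarize(records):
    """Assumed definitions: average reward, and share of episodes completed."""
    return {
        "mean_reward": mean(r["reward"] for r in records),
        "completion_rate": sum(r["completed"] for r in records) / len(records),
    }


summary = summarize(episodes)
print(summary)
```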

Datasets

SmartPlay games (Bandits, RPS, Hanoi, Messenger, Crafter, Minecraft)

Benchmarks

SmartPlay, BanditTwoArmedHighLowFixed-v0, RockPaperScissorBasic-v0, Hanoi3Disk-v0, MessengerL1-v0, MessengerL2-v0, Crafter-v0, MinedojoCreative0-v0