SmartPlay: a multi-game benchmark to test LLMs as interactive agents

Overview

Decision Snapshot: Needs Validation

SmartPlay is ready for research and pre-production stress tests; it highlights real gaps but is not a turnkey solution for production agent deployments.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Yue Wu, Xuan Tang, Tom M. Mitchell, Yuanzhi Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SmartPlay gives a quick, standardized way to test LLMs on interactive tasks that matter to automation: planning, handling randomness, and navigation. Use it to find failure modes before deploying agents.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

SmartPlay is a released benchmark and API that turns six games (Bandits, Rock-Paper-Scissors, Tower of Hanoi, Messenger, Crafter, simplified Minecraft) into text-based agent tasks. It defines 9 agent capabilities (planning, learning from interactions, spatial reasoning, etc.) and automated metrics (reward, completion rate, score). Experiments show GPT-4 variants lead other LLMs but still fall short of human performance on the harder tasks, with gaps of roughly 70% on Crafter, 40% on Minecraft, and 10% on Hanoi. Use SmartPlay to stress-test agent-like behaviors such as long-horizon planning, handling randomness, and spatial navigation.

Problem Statement

There is no standard, interactive benchmark that measures how well LLMs act as agents in environments with planning, randomness, spatial layout, and learning from interactions. Existing LLM tests focus on static reasoning or conversation, leaving a gap for agent evaluation.

Main Contribution

Public benchmark (SmartPlay) converting six games into text-based agent tasks with a unified OpenAI Gym API.

A capability taxonomy of 9 skills (e.g., planning, spatial reasoning, learning from interactions) and per-game difficulty grading.
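To make the Gym-style interaction concrete, here is a minimal sketch of the reset/step loop an agent runs against a text-based game. The `ToyTwoArmedBandit` class, its payout probabilities, and the `always_arm_one` policy are hypothetical stand-ins written for illustration; they are not SmartPlay's actual classes or environment IDs, and a real run would replace the policy with an LLM call.

```python
import random


class ToyTwoArmedBandit:
    """Hypothetical stand-in for a text-based game (not a SmartPlay class)."""

    def __init__(self, payout_probs=(0.2, 0.8), max_steps=50, seed=0):
        self.payout_probs = payout_probs
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        # Return the initial text observation.
        self.steps = 0
        return "You face two slot machines. Choose arm 0 or arm 1."

    def step(self, action):
        # Classic Gym-style return value: (observation, reward, done, info).
        self.steps += 1
        reward = 1.0 if self.rng.random() < self.payout_probs[action] else 0.0
        obs = f"You pulled arm {action} and received reward {reward}."
        done = self.steps >= self.max_steps
        return obs, reward, done, {"step": self.steps}


def run_episode(env, policy):
    """Run one episode; `policy` maps the text history to an action."""
    history = [env.reset()]
    total_reward, done = 0.0, False
    while not done:
        action = policy(history)
        obs, reward, done, _info = env.step(action)
        history.append(obs)
        total_reward += reward
    return total_reward, history


def always_arm_one(history):
    # Placeholder for an LLM call that would read `history` and pick an action.
    return 1


total_reward, history = run_episode(ToyTwoArmedBandit(), always_arm_one)
```

Because every game sits behind the same loop, swapping the environment is the only change needed to test the same agent on a different capability.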

Key Findings

GPT-4 variants outperform other LLMs on SmartPlay games.

Numbers: >20% gap vs other proprietary models on most games

Practical Use: For agent-style tasks, prefer GPT-4-class models for best out-of-the-box behavior but expect limits on harder domains.

Evidence Ref: Section 5.1, Table 2

State-of-the-art LLMs still lag humans on complex agent tasks.

Numbers: Human minus GPT-4: Hanoi ~10%, Minecraft ~40%, Crafter ~70%

Practical Use: Do not rely solely on current LLMs for long-horizon planning or 3D navigation; combine with planning modules or specialized control.

Evidence Ref: Section 5.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GPT-4-0613 normalized score on Crafter	0.26 (human=1.0)	Human baseline = 1.0	-0.74	Crafter-v0	Table 2: GPT-4-0613 = 0.26	Section 5.1, Table 2
GPT-4-0314 normalized score on Minecraft	0.59 (human=1.0)	Human baseline = 1.0	-0.41	MinedojoCreative0-v0	Table 2: GPT-4-0314 = 0.59	Section 5.1, Table 2
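
The Delta column is the normalized score minus the human baseline of 1.0. A short sketch of that arithmetic, using only the two values reported in the table (the function and dictionary names are illustrative):

```python
def delta_vs_human(normalized_score, human_baseline=1.0):
    """Gap to the human baseline; negative means the model is below human."""
    return round(normalized_score - human_baseline, 2)


# Normalized scores taken from the Results table above.
results = {
    "Crafter-v0 (GPT-4-0613)": 0.26,
    "MinedojoCreative0-v0 (GPT-4-0314)": 0.59,
}

deltas = {name: delta_vs_human(score) for name, score in results.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```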

What To Try In 7 Days

Run SmartPlay on your current LLMs and compare Crafter and Minecraft performance to spot planning or spatial weaknesses.

Log and visualize action histories to detect forgetting and contradictory navigation behaviors.

Add simple state tracking (short-term memory) or action filters and re-run targeted games to measure improvement.
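
One way the state-tracking and action-filter step could look is sketched below. Everything here is an assumption for illustration: `FilteredAgent`, the `propose` callable (a stand-in for an LLM call), the action names, and the fallback rule are not part of SmartPlay.

```python
from collections import deque


class FilteredAgent:
    """Wraps an action proposer with short-term memory and a validity filter."""

    def __init__(self, propose, valid_actions, fallback, memory_size=5):
        self.propose = propose                    # stand-in for an LLM call
        self.valid_actions = set(valid_actions)
        self.fallback = fallback                  # used when a proposal is invalid
        self.memory = deque(maxlen=memory_size)   # keeps only recent observations
        self.rejected = 0                         # how often the filter stepped in

    def act(self, observation):
        self.memory.append(observation)
        action = self.propose(list(self.memory))
        if action not in self.valid_actions:
            self.rejected += 1
            action = self.fallback
        return action


# Toy proposer: returns an invalid action on the first step, then "north".
agent = FilteredAgent(
    propose=lambda mem: "fly" if len(mem) == 1 else "north",
    valid_actions=["north", "south", "east", "west"],
    fallback="south",
    memory_size=2,
)
first = agent.act("You are in a forest.")
second = agent.act("You see a river to the north.")
```

Comparing `agent.rejected` and the game score with and without the wrapper gives a rough measure of how much the filter helps on a given game.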

Agent Features

Memory

short-term history tracking (rollout history)

Planning

in-context planning, long-horizon planning

Tool Use

action selection via environment API

Frameworks

OpenAI Gym-style environment loop

Is Agentic

Yes

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/microsoft/SmartPlay

Data URLs

https://github.com/microsoft/SmartPlay

Risks & Boundaries

Limitations

Visual tasks are simplified into text descriptions, which loses low-level perception detail.

Manuals/context strings sometimes omit necessary crafting details (Crafter), creating partial observability.

When Not To Use

When you need pixel-level vision or continuous low-level motor control benchmarking.

As the sole validation for safety-critical autonomous systems.

Failure Modes

Forgetting intermediate world state and giving contradictory navigation commands.

Hallucinating actions or mis-parsing manuals leading to invalid moves.
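
Both failure modes can be roughly flagged from logged action histories. The sketch below counts immediate reversals (a crude proxy for contradictory navigation) and the share of actions outside the allowed set (a proxy for hallucinated or invalid moves). The action names and both heuristics are assumptions for illustration, not metrics defined by SmartPlay.

```python
# Hypothetical movement vocabulary; adapt to the action names of the game under test.
OPPOSITES = {"north": "south", "south": "north", "east": "west", "west": "east"}


def count_reversals(actions):
    """Count steps that immediately undo the previous move."""
    return sum(
        1
        for prev, cur in zip(actions, actions[1:])
        if OPPOSITES.get(prev) == cur
    )


def invalid_action_rate(actions, valid_actions):
    """Fraction of logged actions that are not in the allowed set."""
    if not actions:
        return 0.0
    invalid = sum(1 for a in actions if a not in valid_actions)
    return invalid / len(actions)


log = ["north", "south", "north", "east", "jump"]
reversals = count_reversals(log)
invalid_rate = invalid_action_rate(log, set(OPPOSITES))
```

A reversal is not always a mistake (backtracking can be deliberate), so a high count is a prompt to inspect the trajectory rather than proof of forgetting.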

Core Entities

Models

GPT-4-0613, GPT-4-0314, text-davinci-003, Claude, Bard, llama-2-13b, llama-13b, vicuna-13b

Metrics

reward, completion rate, score

Datasets

SmartPlay games (Bandits, RPS, Hanoi, Messenger, Crafter, Minecraft)

Benchmarks

SmartPlay, BanditTwoArmedHighLowFixed-v0, RockPaperScissorBasic-v0, Hanoi3Disk-v0, MessengerL1-v0, MessengerL2-v0, Crafter-v0, MinedojoCreative0-v0

