SmartPlay: a multi-game benchmark to test LLMs as interactive agents

October 2, 2023 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Yue Wu, Xuan Tang, Tom M. Mitchell, Yuanzhi Li

Links

Abstract / PDF

Why It Matters For Business

SmartPlay gives a quick, standardized way to test LLMs on interactive tasks that matter to automation: planning, handling randomness, and navigation. Use it to find failure modes before deploying agents.

Summary TLDR

SmartPlay is a released benchmark and API that turns six games (Bandits, Rock-Paper-Scissors, Tower of Hanoi, Messenger, Crafter, simplified Minecraft) into text-based agent tasks. It defines 9 agent capabilities (planning, learning from interactions, spatial reasoning, etc.) and automated metrics (reward, completion rate, score). Experiments show GPT-4 variants lead other LLMs but still fall well short of human performance on complex tasks (big gaps on Crafter, Hanoi, Minecraft). Use SmartPlay to stress-test agent-like behaviors such as long-horizon planning, handling randomness, and spatial navigation.

Problem Statement

There is no standard, interactive benchmark that measures how well LLMs act as agents in environments with planning, randomness, spatial layout, and learning from interactions. Existing LLM tests focus on static reasoning or conversation, leaving a gap for agent evaluation.

Main Contribution

Public benchmark (SmartPlay) converting six games into text-based agent tasks with a unified OpenAI Gym API.

A capability taxonomy of 9 skills (e.g., planning, spatial reasoning, learning from interactions) and per-game difficulty grading.

Automated metrics (reward, completion rate, score) and recommended evaluation protocol.

Evaluation of 9 popular LLMs showing clear gaps between models and humans and highlighting weak skills (planning, spatial reasoning, error handling).

Released code and examples at github.com/microsoft/SmartPlay
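The Gym-style interaction loop mentioned above can be sketched without the real package. The block below is a minimal illustration, not SmartPlay's own API: `ToyTwoArmedBandit`, `greedy_agent`, and `run_episode` are hypothetical names, the environment is a self-contained stand-in that returns text observations, and the 4-tuple returned by `step` follows the classic OpenAI Gym convention (an assumption about the interface, not something this summary confirms). In a real run, the placeholder policy would be replaced by an LLM call.

```python
import random


class ToyTwoArmedBandit:
    """Hypothetical stand-in for a text-based, Gym-style environment."""

    def __init__(self, payout_probs=(0.2, 0.8), horizon=50, seed=0):
        self.payout_probs = payout_probs
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        # Return the initial text observation.
        self.t = 0
        return "You face two slot machines. Choose an arm: 0 or 1."

    def step(self, action):
        # Classic Gym-style return: (observation, reward, done, info).
        reward = 1.0 if self.rng.random() < self.payout_probs[action] else 0.0
        self.t += 1
        done = self.t >= self.horizon
        obs = f"You pulled arm {action} and received reward {reward}."
        return obs, reward, done, {}


def greedy_agent(history):
    """Placeholder policy (where an LLM call would go): try each arm once,
    then pick the arm with the best average reward so far."""
    stats = {0: [0.0, 0], 1: [0.0, 0]}  # arm -> [total reward, pulls]
    for action, reward in history:
        stats[action][0] += reward
        stats[action][1] += 1
    for arm in (0, 1):
        if stats[arm][1] == 0:
            return arm
    return max((0, 1), key=lambda a: stats[a][0] / stats[a][1])


def run_episode(env, agent):
    """Run one episode and return the total reward and the action history."""
    env.reset()
    history, total, done = [], 0.0, False
    while not done:
        action = agent(history)
        _obs, reward, done, _info = env.step(action)
        history.append((action, reward))
        total += reward
    return total, history


total, history = run_episode(ToyTwoArmedBandit(), greedy_agent)
print(f"episode length={len(history)}, total reward={total}")
```

Keeping the agent as a plain function of the history makes it easy to swap in different models while the loop and the logging stay unchanged.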

Key Findings

GPT-4 variants outperform other LLMs on SmartPlay games.

Numbers: >20% gap vs. other proprietary models on most games

State-of-the-art LLMs still lag humans on complex agent tasks.

Numbers: Human minus GPT-4: Hanoi ~10%, Minecraft ~40%, Crafter ~70%

Open-source LLMs perform much worse than GPT-4 variants.

Numbers: Open-source <50% of GPT-4 on simple tasks; ~1/8 on harder tasks

Spatial reasoning is a consistent weakness across models.

Numbers: Best model reaches ~60% of human baseline on Minecraft

Results

GPT-4-0613 normalized score on Crafter

Value: 0.26 (human = 1.0)

Baseline: Human baseline = 1.0

GPT-4-0314 normalized score on Minecraft

Value: 0.59 (human = 1.0)

Baseline: Human baseline = 1.0

GPT-4-0613 normalized score on Bandits

Value: 1.00 (human = 1.0)

Baseline: Human baseline = 1.0

llama-2-13b normalized score on Bandits

Value: 0.50 (human = 1.0)

Baseline: Human baseline = 1.0
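As a quick check on how these normalized scores relate to the human gaps quoted under Key Findings, the arithmetic can be written out. This sketch uses only the four values listed above and assumes the gap is simply the human baseline (1.0) minus the normalized score; the helper name `gap_to_human` is illustrative.

```python
# Normalized scores as reported in the Results list above (human = 1.0).
reported = {
    ("GPT-4-0613", "Crafter"): 0.26,
    ("GPT-4-0314", "Minecraft"): 0.59,
    ("GPT-4-0613", "Bandits"): 1.00,
    ("llama-2-13b", "Bandits"): 0.50,
}
HUMAN_BASELINE = 1.0


def gap_to_human(score, human=HUMAN_BASELINE):
    """Gap between the human baseline and a normalized model score."""
    return round(human - score, 2)


gaps = {key: gap_to_human(score) for key, score in reported.items()}
for (model, game), gap in gaps.items():
    print(f"{model} on {game}: {gap:.2f} below human")
```

Under that assumption, the Crafter gap comes out at 0.74 and the Minecraft gap at 0.41, in line with the rounded ~70% and ~40% figures.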

Who Should Care

What To Try In 7 Days

Run SmartPlay on your current LLMs and compare Crafter and Minecraft performance to spot planning or spatial weaknesses.

Log and visualize action histories to detect forgetting and contradictory navigation behaviors.

Add simple state tracking (short-term memory) or action filters and re-run targeted games to measure improvement.
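One way to start on the logging step is to scan an action log for moves that immediately undo the previous move, a rough proxy for contradictory navigation. The sketch below assumes actions are logged as plain direction strings; the action names and the helpers `find_reversals` and `reversal_rate` are hypothetical, not part of SmartPlay.

```python
# Hypothetical action vocabulary; a real game will have its own names.
OPPOSITES = {"north": "south", "south": "north", "east": "west", "west": "east"}


def find_reversals(actions):
    """Return indices where a move immediately undoes the previous move."""
    reversals = []
    for i in range(1, len(actions)):
        if OPPOSITES.get(actions[i - 1]) == actions[i]:
            reversals.append(i)
    return reversals


def reversal_rate(actions):
    """Fraction of consecutive action pairs that are direct reversals."""
    pairs = len(actions) - 1
    if pairs <= 0:
        return 0.0
    return len(find_reversals(actions)) / pairs


log = ["north", "north", "south", "east", "west", "east", "noop"]
print(find_reversals(log))  # indices of back-and-forth moves
print(reversal_rate(log))
```

A high reversal rate does not prove the model forgot the world state (backtracking is sometimes correct), but it is a cheap signal for picking episodes to inspect by hand.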

Agent Features

Memory

  • short-term history tracking (rollout history)
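Short-term history tracking can be approximated with a fixed-size buffer of recent steps that is rendered back into the prompt. The following is a minimal sketch under that assumption; `RolloutHistory` and its text format are illustrative choices, not the benchmark's implementation.

```python
from collections import deque


class RolloutHistory:
    """Keep the last `max_steps` (observation, action, reward) records."""

    def __init__(self, max_steps=5):
        self.steps = deque(maxlen=max_steps)  # oldest entries drop off automatically
        self.count = 0  # total steps seen, used to label entries

    def add(self, observation, action, reward):
        self.steps.append((self.count, observation, action, reward))
        self.count += 1

    def as_prompt(self):
        """Render the retained steps as text to prepend to the next query."""
        return "\n".join(
            f"t={t}: observed '{obs}', took '{act}', reward {rew}"
            for t, obs, act, rew in self.steps
        )


history = RolloutHistory(max_steps=2)
history.add("at start", "north", 0.0)
history.add("wall ahead", "east", 0.0)
history.add("found key", "pickup", 1.0)
print(history.as_prompt())
```

The window size trades context length against how far back the agent can "remember", so it is worth varying when testing long-horizon games.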

Planning

  • in-context planning
  • long-horizon planning

Tool Use

  • action selection via environment API

Frameworks

  • OpenAI Gym-style environment loop

Is Agentic

true

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Visual tasks are simplified into text descriptions, which loses low-level perception detail.
  • Manuals/context strings sometimes omit necessary crafting details (Crafter), creating partial observability.
  • Benchmark covers a limited set of games and may not capture every real-world agent scenario.
  • Evaluation depends on human-normalized baselines collected by the authors.

When Not To Use

  • When you need pixel-level vision or continuous low-level motor control benchmarking.
  • As the sole validation for safety-critical autonomous systems.
  • If you require end-to-end embodied agents with learned low-level controllers.

Failure Modes

  • Forgetting intermediate world state and giving contradictory navigation commands.
  • Hallucinating actions or mis-parsing manuals leading to invalid moves.
  • Poor recovery from mistakes that require multi-step re-planning.
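The invalid-move failure mode can be partly contained with an action filter between the model and the environment. The sketch below is one crude way to do it, assuming the environment exposes a list of valid lowercase action strings and tolerates a fallback action; `filter_action`, the action names, and the `"noop"` fallback are hypothetical.

```python
def filter_action(llm_output, valid_actions, fallback="noop"):
    """Map free-form model text to a valid action.

    Returns (action, matched). `matched` is False when nothing in the
    output corresponds to a valid action and the fallback is used.
    Assumes `valid_actions` contains lowercase strings.
    """
    text = llm_output.strip().lower()
    if text in valid_actions:
        return text, True
    # Check longer action names first so a short name does not shadow
    # a longer one that contains it.
    for action in sorted(valid_actions, key=len, reverse=True):
        if action in text:
            return action, True
    return fallback, False


actions = ["move north", "move south", "noop"]
print(filter_action("I will Move North now.", actions))
print(filter_action("teleport home", actions))
```

Counting how often `matched` is False per model gives a simple hallucinated-action rate that can be reported alongside the game metrics. Substring matching is deliberately simple and can misfire on outputs that mention several actions.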

Core Entities

Models

  • GPT-4-0613
  • GPT-4-0314
  • text-davinci-003
  • Claude
  • Bard
  • llama-2-13b
  • llama-13b
  • vicuna-13b

Metrics

  • reward
  • completion rate
  • score
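To make the first two metrics concrete, here is a small sketch of how they might be aggregated over logged episodes. The episode record format (`rewards`, `completed`) and the `summarize` helper are assumptions for illustration. Score is left out because this summary does not give its definition.

```python
def summarize(episodes):
    """Aggregate mean total reward and completion rate over episodes.

    Each episode is assumed to be a dict with a list of per-step
    `rewards` and a boolean `completed` flag.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    total_rewards = [sum(ep["rewards"]) for ep in episodes]
    completed = sum(1 for ep in episodes if ep["completed"])
    return {
        "mean_reward": sum(total_rewards) / n,
        "completion_rate": completed / n,
    }


episodes = [
    {"rewards": [1, 0, 1], "completed": True},
    {"rewards": [0, 0], "completed": False},
]
print(summarize(episodes))
```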

Datasets

  • SmartPlay games (Bandits, RPS, Hanoi, Messenger, Crafter, Minecraft)

Benchmarks

  • SmartPlay
  • BanditTwoArmedHighLowFixed-v0
  • RockPaperScissorBasic-v0
  • Hanoi3Disk-v0
  • MessengerL1-v0
  • MessengerL2-v0
  • Crafter-v0
  • MinedojoCreative0-v0