How LLMs are being used to build game-playing agents: memory, reasoning, perception, and multi-agent design

April 2, 2024 | 8 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

6

Authors

Sihao Hu, Tiansheng Huang, Gaowen Liu, Ramana Rao Kompella, Fatih Ilhan, Selim Furkan Tekin, Yichang Xu, Zachary Yahn, Ling Liu

Links

Abstract / PDF

Why It Matters For Business

Game agents are a practical lab for building interactive AI: solutions for memory, robust reasoning, and hybrid control transfer to real automation, simulations, and multi-agent coordination systems used in product testing and virtual worlds.

Summary TLDR

This paper surveys research that uses large language models (LLMs) as the brain of game agents. It proposes a compact reference architecture with three single-agent modules—working & long-term memory, reasoning, and perception-action interfaces—and a complementary multi-agent layer for communication and organization. The authors map six game genres to concrete agent requirements and summarize practical techniques (context-extension, compression, chain-of-thought variants, reflective loops, code-as-policy, hybrid LLM+low-level controllers). The survey flags latency, memory structuring, and evaluation gaps as the main engineering hurdles.

Problem Statement

LLMs are powerful at language but are trained on static text and lack mechanisms for continuous, grounded interaction. Games provide a reproducible, diverse testbed for building and testing interactive LLM-based agents, but agent design is fragmented: how to add memory, reliable reasoning, perception–action grounding, and scalable multi-agent coordination remains unclear.

Main Contribution

A unified reference architecture for LLM-based game agents: memory, reasoning, perception-action interfaces, and a multi-agent extension for communication and organization.

A challenge-centered taxonomy linking six game genres (action, adventure, role-playing, strategy, simulation, sandbox) to concrete agent design requirements.

A synthesis of practical techniques: context-extension, memory compression/structuring, chain-of-thought variants, reflective learning, code-as-policy, and hybrid controllers.

A curated, up-to-date bibliography and a public paper list to track fast-moving work.

Key Findings

Carrying the previous step's thought into the next prompt (LastThoughts) raises win rate and cuts short-term inconsistent actions.

Numbers: win rate 0.4217 → 0.4667; consecutive switch rate 0.2442 → 0.0861
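The carryover idea can be illustrated with a minimal sketch. It assumes a hypothetical `llm` callable that returns a dict with `thought` and `action` keys; the prompt wording and function names are illustrative and are not taken from PokéLLMon.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical LLM interface: takes a prompt, returns {"thought": ..., "action": ...}.
LLM = Callable[[str], Dict[str, str]]

def build_prompt(observation: str, last_thought: Optional[str]) -> str:
    """Prepend the previous step's thought so the model sees its own prior intent."""
    parts = []
    if last_thought:
        parts.append(f"Previous thought: {last_thought}")
    parts.append(f"Current state: {observation}")
    parts.append("Reply with a thought and an action.")
    return "\n".join(parts)

def run_episode(llm: LLM, observations: List[str]) -> List[str]:
    """Run one episode, carrying each step's thought into the next prompt."""
    last_thought: Optional[str] = None
    actions = []
    for obs in observations:
        reply = llm(build_prompt(obs, last_thought))
        actions.append(reply["action"])
        last_thought = reply["thought"]  # carried into the next step
    return actions

# Stub model for demonstration: records the prompts it receives.
seen_prompts: List[str] = []

def stub_llm(prompt: str) -> Dict[str, str]:
    seen_prompts.append(prompt)
    return {"thought": "keep attacking with the current Pokémon", "action": "attack"}

actions = run_episode(stub_llm, ["turn 1", "turn 2"])
```

The first prompt has no carried thought; every later prompt starts with the thought from the step before it, which is what gives the model a reason to stay on its previous course.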

Position and attention tricks extend LLM context windows by orders of magnitude.

Numbers: PI ~32K tokens; YaRN ~128K tokens; LongRoPE ~2M tokens
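As a rough illustration of the simplest of these methods, Position Interpolation (PI) rescales position indices so that a longer sequence maps back into the position range seen during training. The sketch below assumes rotary position embeddings (RoPE) with the common base of 10000; the lengths 4096 and 32768 and the function names are illustrative choices, not values from the survey.

```python
from typing import List

def rope_angles(position: float, dim: int, base: float = 10000.0) -> List[float]:
    """Rotation angles for one position under standard RoPE (dim must be even)."""
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]

def interpolated_position(position: int, train_len: int, target_len: int) -> float:
    """Position Interpolation: scale the index by train_len / target_len."""
    return position * train_len / target_len

train_len, target_len = 4096, 32768  # illustrative lengths
# The last position of the extended window lands inside the trained range.
last = interpolated_position(target_len - 1, train_len, target_len)
angles = rope_angles(last, dim=8)
```

YaRN and LongRoPE refine this idea by scaling different frequency bands differently; that non-uniform scaling is not shown here.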

LLMs alone struggle at frame-rate, low-latency control tasks and often underperform RL agents in those settings.

Large multi-agent societies can emerge but require architectural and social priors to scale.

Numbers: Project Sid ran hundreds to thousands of agents

Results

Win Rate (PokéLLMon)

Value: LLM (GPT-4o) 0.4217; LastThoughts 0.4667

Baseline: LLM (GPT-4o) 0.4217

Consecutive Switch Rate (short-term consistency)

Value: LLM 0.2442; LastThoughts 0.0861

Baseline: LLM 0.2442
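To put the win rate and consecutive switch rate figures on a common scale, the short calculation below converts them to relative changes against the baseline. The helper name `relative_change` is illustrative.

```python
def relative_change(baseline: float, value: float) -> float:
    """Fractional change of value relative to baseline."""
    return (value - baseline) / baseline

win_gain = relative_change(0.4217, 0.4667)     # win rate
switch_drop = relative_change(0.2442, 0.0861)  # consecutive switch rate

print(f"win rate: {win_gain:+.1%}")                   # +10.7%
print(f"consecutive switch rate: {switch_drop:+.1%}")  # -64.7%
```

In relative terms the consistency gain is much larger than the win-rate gain: the switch rate falls by roughly two thirds while the win rate rises by about a tenth.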

Context window capacity (examples)

Value: PI ~32K; YaRN ~128K; LongRoPE ~2M tokens

Baseline: standard LLM windows (e.g., a few thousand to 32K tokens)

Who Should Care

What To Try In 7 Days

Add a simple step-to-step thought carryover (LastThoughts) to reduce inconsistent actions.

Implement a small long-term store (vector DB) plus importance-based write-back for episodic memory.

Separate high-level LLM planning from a low-level controller for latency-sensitive tasks and compare win rates.
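The planner/controller split in the last item can be prototyped before any model is wired in. The sketch below is a toy under stated assumptions: a one-dimensional world, a `planner` callable standing in for an LLM call, and a fixed `replan_every` interval. All names are hypothetical.

```python
from typing import Callable, Tuple

def low_level_step(position: int, goal: int) -> int:
    """Fast controller: move one cell toward the current goal each tick."""
    if position < goal:
        return position + 1
    if position > goal:
        return position - 1
    return position

def run(planner: Callable[[int], int], start: int, ticks: int,
        replan_every: int) -> Tuple[int, int]:
    """Call the slow planner only every `replan_every` ticks; act on every tick."""
    position, goal, planner_calls = start, start, 0
    for tick in range(ticks):
        if tick % replan_every == 0:
            goal = planner(position)  # stand-in for a slow LLM call
            planner_calls += 1
        position = low_level_step(position, goal)
    return position, planner_calls

# Stub planner that always targets cell 5.
final_position, calls = run(lambda pos: 5, start=0, ticks=10, replan_every=5)
```

Here the controller acts on all 10 ticks while the planner is called only twice, which is the property that keeps a slow model off the per-frame critical path.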

Agent Features

Memory

  • working memory (context extension, compression, active maintenance)
  • long-term memory (vector DB, key-value, tree/graph, parametric)
  • consolidation and importance scoring
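A long-term store with importance-gated writes can be sketched without any external service. The example below is a toy: `embed` is a bag-of-words stand-in for a learned encoder, the importance scores are supplied by hand rather than by a model, and the class and parameter names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would use a learned encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class EpisodicMemory:
    def __init__(self, importance_threshold: float):
        self.threshold = importance_threshold
        self.entries: List[Tuple[str, Counter]] = []

    def write(self, text: str, importance: float) -> bool:
        """Write back only events whose importance clears the threshold."""
        if importance < self.threshold:
            return False
        self.entries.append((text, embed(text)))
        return True

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

memory = EpisodicMemory(importance_threshold=0.5)
memory.write("the boss is weak to fire attacks", importance=0.9)
memory.write("walked past a tree", importance=0.1)  # below threshold, dropped
memory.write("the key is hidden in the north cave", importance=0.8)
top = memory.retrieve("where is the key", k=1)
```

Swapping `embed` for a real encoder and the list for a vector database changes the storage layer but not the write-gating logic.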

Planning

  • Chain-of-Thought
  • search-based reasoning (Self-Consistency, Tree-of-Thoughts)
  • reflective reasoning
  • hierarchical planning
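Of the search-based methods listed, Self-Consistency is the simplest to sketch: sample several independent answers and keep the most frequent one. The example assumes a hypothetical `sample_answer` callable that wraps one stochastic LLM call and returns only the final answer; a canned sampler stands in for the model here.

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Sample n answers and return the one that occurs most often."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Stub sampler replaying fixed answers in place of a real model.
canned = iter(["move north", "move east", "move north", "move north", "wait"])
chosen = self_consistency(lambda: next(canned), n_samples=5)
```

Each sample is a separate model call, so the cost grows linearly with `n_samples`. That matters for the latency concerns the survey raises.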

Tool Use

  • code-as-policy
  • API / programmatic actions
  • skill libraries (reusable primitives)
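A skill library can be sketched as a registry of named primitives that a generated plan is allowed to call. This is a deliberately restricted variant of code-as-policy: instead of executing arbitrary generated code, it runs a list of (skill name, arguments) steps against a whitelist. The skill names and the `generated_program` value are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

SKILLS: Dict[str, Callable[..., str]] = {}

def skill(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function as a reusable primitive under its own name."""
    SKILLS[fn.__name__] = fn
    return fn

@skill
def move_to(target: str) -> str:
    return f"moved to {target}"

@skill
def mine(block: str) -> str:
    return f"mined {block}"

def execute(program: List[Tuple[str, tuple]]) -> List[str]:
    """Run each step, rejecting any skill that is not in the library."""
    log = []
    for name, args in program:
        if name not in SKILLS:
            raise KeyError(f"unknown skill: {name}")
        log.append(SKILLS[name](*args))
    return log

# Stand-in for a plan that an LLM would emit.
generated_program = [("move_to", ("tree",)), ("mine", ("log",))]
log = execute(generated_program)
```

Rejecting unknown skill names gives a cheap check against hallucinated actions before anything reaches the game.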

Frameworks

  • LLMGA single-agent framework
  • Multi-LLMGA communication + organizational framework

Is Agentic

true

Architectures

  • LLM-centered
  • hybrid LLM + low-level controller
  • multimodal LLMs

Collaboration

  • communication protocols (observations, beliefs, intentions)
  • organizational topology (centralized, decentralized, hierarchical)
  • task and role allocation (prefixed, dynamic, emergent)
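The three collaboration axes above can be combined in a small sketch: typed messages (observation, belief, intention), a centralized topology, and dynamic role allocation. The `Message` and `Coordinator` classes and the allocation rule (an agent's latest stated intention becomes its role) are illustrative assumptions, not a protocol described in the survey.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Message:
    sender: str
    kind: str     # "observation", "belief", or "intention"
    content: str

class Coordinator:
    """Centralized topology: all agents report to one coordinator."""

    def __init__(self) -> None:
        self.inbox: List[Message] = []

    def receive(self, message: Message) -> None:
        self.inbox.append(message)

    def allocate_roles(self) -> Dict[str, str]:
        """Dynamic allocation: each agent's latest intention becomes its role."""
        roles: Dict[str, str] = {}
        for m in self.inbox:
            if m.kind == "intention":
                roles[m.sender] = m.content
        return roles

hub = Coordinator()
hub.receive(Message("agent_a", "observation", "enemy base to the east"))
hub.receive(Message("agent_a", "intention", "scout"))
hub.receive(Message("agent_b", "intention", "gather"))
roles = hub.allocate_roles()
```

A decentralized variant would drop the coordinator and let each agent keep its own inbox and derive roles locally.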

Optimization Features

Token Efficiency

  • memory compression
  • context compression (gist tokens, soft token compression)

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmarks are often templated and shallow, limiting tests of true open-ended generalization.
  • Many demonstrations require heavy compute and sequential API calls, raising cost and latency barriers.
  • Empirical comparisons are fragmented; methods are evaluated on different environments and with varying metrics.

When Not To Use

  • When you need strict real-time, frame-level control solely from an LLM (use hybrid controllers instead).
  • When the task cannot tolerate LLM hallucinations or inconsistent persona behavior without strong verification.
  • When you need fully reproducible large-scale multi-agent dynamics without investing in custom infrastructure.

Failure Modes

  • Short-term decision inconsistency (action flip-flopping) without active maintenance.
  • Role drift over long dialogues or multi-episode play unless role memory is enforced.
  • Latency bottlenecks in action games that turn stronger reasoning into worse performance.
  • Scaling collapse in multi-agent systems when single-threaded cognition becomes the bottleneck.

Core Entities

Models

  • GPT-4o
  • GPT-3.5
  • GPT-4 Vision
  • multimodal LLMs (general mention)

Metrics

  • win rate
  • task success rate
  • consecutive switch rate
  • map coverage
  • number of unique items collected

Datasets

  • TextWorld
  • Jericho
  • ALFWorld
  • ScienceWorld
  • MineDojo
  • Minecraft (various platforms)
  • Crafter
  • Atari 2600
  • StarCraft II
  • PokéLLMon
  • AvalonBench
  • PokerBench
  • ChessGPT datasets

Benchmarks

  • ALFWorld
  • TextWorld
  • MineDojo
  • Crafter
  • PokéLLMon
  • llm-colosseum (Street Fighter)