How LLMs are being used to build game-playing agents: memory, reasoning, perception, and multi-agent design

April 2, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The survey aggregates many prototype systems and reproducible benchmarks. Patterns are clear, but most methods remain at experimental or research-grade maturity; productization needs engineering for latency, memory scaling, and robust evaluation.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Sihao Hu, Tiansheng Huang, Gaowen Liu, Ramana Rao Kompella, Fatih Ilhan, Selim Furkan Tekin, Yichang Xu, Zachary Yahn, Ling Liu

Links

Abstract / PDF / Code

Why It Matters For Business

Game agents are a practical lab for building interactive AI: solutions for memory, robust reasoning, and hybrid control transfer to real automation, simulations, and multi-agent coordination systems used in product testing and virtual worlds.

Who Should Care

Summary TLDR

This paper surveys research that uses large language models (LLMs) as the brain of game agents. It proposes a compact reference architecture with three single-agent modules—working & long-term memory, reasoning, and perception-action interfaces—and a complementary multi-agent layer for communication and organization. The authors map six game genres to concrete agent requirements and summarize practical techniques (context-extension, compression, chain-of-thought variants, reflective loops, code-as-policy, hybrid LLM+low-level controllers). The survey flags latency, memory structuring, and evaluation gaps as the main engineering hurdles.

Problem Statement

LLMs are powerful at language but are trained on static text and lack mechanisms for continuous, grounded interaction. Games provide a reproducible, diverse testbed for building and testing interactive LLM-based agents, but agent design is fragmented: how to add memory, reliable reasoning, perception–action grounding, and scalable multi-agent coordination remains unclear.

Main Contribution

A unified reference architecture for LLM-based game agents: memory, reasoning, perception-action interfaces, and a multi-agent extension for communication and organization.

A challenge-centered taxonomy linking six game genres (action, adventure, role-playing, strategy, simulation, sandbox) to concrete agent design requirements.

Key Findings

Carrying the previous step's thought into the next prompt (LastThoughts) raises win rate and cuts short-term inconsistent actions.

Numbers: Win rate 0.4217 → 0.4667; consecutive switch rate 0.2442 → 0.0861

Practical Use: Implement a lightweight step-to-step trace (carry the last thought forward) to stabilize decisions and improve win rates in turn-based games.

Evidence Ref: Table 2 (PokéLLMon)
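
The carryover described above can be sketched in a few lines. This is a minimal illustration, not the PokéLLMon implementation: the prompt wording, the `THOUGHT: ... | ACTION: ...` reply format, and the `fake_llm` stand-in are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class StepResult:
    thought: str
    action: str


def build_prompt(state: str, last_thought: Optional[str]) -> str:
    """Compose the turn prompt, prepending the previous thought if there is one."""
    parts = []
    if last_thought:
        parts.append(f"Previous thought: {last_thought}")
    parts.append(f"Current state: {state}")
    parts.append("Reply as 'THOUGHT: <reasoning> | ACTION: <action>'.")
    return "\n".join(parts)


def parse_reply(reply: str) -> StepResult:
    """Split a 'THOUGHT: ... | ACTION: ...' reply into its two fields."""
    thought_part, _, action_part = reply.partition("| ACTION:")
    thought = thought_part.replace("THOUGHT:", "", 1).strip()
    return StepResult(thought=thought, action=action_part.strip())


def run_episode(states: Sequence[str], llm: Callable[[str], str]) -> List[str]:
    """Play through the states, carrying each thought into the next prompt."""
    last_thought: Optional[str] = None
    actions = []
    for state in states:
        result = parse_reply(llm(build_prompt(state, last_thought)))
        actions.append(result.action)
        last_thought = result.thought  # keep only one step of history
    return actions


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    if "Previous thought" in prompt:
        return "THOUGHT: keep pressing the advantage | ACTION: attack"
    return "THOUGHT: opponent is weak to water | ACTION: attack"
```

Keeping only the most recent thought bounds the prompt growth to one extra line per turn, which is the "lightweight" part of the suggestion; a longer trace would trade tokens for more context.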

Position and attention tricks extend LLM context windows by orders of magnitude.

Numbers: PI ~32K tokens; YaRN ~128K tokens; LongRoPE ~2M tokens

Practical Use: Use positional interpolation or progressive RoPE scaling to handle much longer game histories without full model retraining.

Evidence Ref: Section 3.1 (Context Extension)
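
As a rough illustration of positional interpolation (PI), the sketch below rescales position indices so that a longer target window maps back into the range seen during training, then computes standard rotary (RoPE) angles from the rescaled position. The training length of 4096 and the head dimension in the usage note are assumed values for the example, and the short fine-tuning pass that usually accompanies PI is not shown.

```python
from typing import List


def rope_angles(position: float, dim: int, base: float = 10000.0) -> List[float]:
    """Rotary angles for one position: position * base^(-2i/dim) for each pair i."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def interpolated_position(position: int, train_len: int, target_len: int) -> float:
    """Linearly rescale a position so [0, target_len) maps into [0, train_len)."""
    scale = min(1.0, train_len / target_len)  # never stretch short contexts
    return position * scale


def pi_rope_angles(
    position: int,
    dim: int,
    train_len: int,
    target_len: int,
    base: float = 10000.0,
) -> List[float]:
    """RoPE angles computed on the interpolated (down-scaled) position."""
    return rope_angles(interpolated_position(position, train_len, target_len), dim, base)
```

For instance, with `train_len=4096` and `target_len=32768`, the last position of the extended window (32767) is mapped to 4095.875, so the model never sees a rotary angle outside the range it was trained on.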

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Win Rate (PokéLLMon) | LLM (GPT-4o) 0.4217; LastThoughts 0.4667 | LLM (GPT-4o) 0.4217 | +0.0449 (relative +10.6%) | PokéLLMon Battles | Table 2 reports per-method win rates | Table 2 (PokéLLMon) |
| Consecutive Switch Rate (short-term consistency) | LLM 0.2442; LastThoughts 0.0861 | LLM 0.2442 | -0.1581 (relative -64.7%) | PokéLLMon Battles | Table 2 measures consecutive switches as a proxy for instability | Table 2 (PokéLLMon) |
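
To track the same quantities in your own runs, a small helper along these lines may be enough. The definition of consecutive switch rate used here (the share of adjacent turn pairs in which both actions are a switch) is an assumption for illustration; the benchmark's exact definition may differ. The `delta` helper computes the absolute and relative change against a baseline.

```python
from typing import Sequence, Tuple


def consecutive_switch_rate(actions: Sequence[str]) -> float:
    """Fraction of adjacent turn pairs in which both actions are 'switch'.

    Assumed definition for illustration; check the benchmark's own metric.
    """
    if len(actions) < 2:
        return 0.0
    pairs = list(zip(actions, actions[1:]))
    both_switch = sum(1 for a, b in pairs if a == "switch" and b == "switch")
    return both_switch / len(pairs)


def delta(baseline: float, value: float) -> Tuple[float, float]:
    """Return (absolute change, relative change as a fraction of the baseline)."""
    diff = value - baseline
    return diff, diff / baseline
```

Calling `delta(0.2442, 0.0861)` recomputes the switch-rate change reported above: an absolute drop of about 0.1581, or roughly 64.7% relative to the baseline.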

What To Try In 7 Days

Add a simple step-to-step thought carryover (LastThoughts) to reduce inconsistent actions.

Implement a small long-term store (vector DB) plus importance-based write-back for episodic memory.

Separate high-level LLM planning from a low-level controller for latency-sensitive tasks and compare win rates.
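
The long-term store with importance-based write-back suggested above could start as something like the following. It is a sketch under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector DB, and the importance score is assumed to come from elsewhere (for example, an LLM rating or a game-reward heuristic).

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy bag-of-words embedding; a real system would call an embedding model."""
    return dict(Counter(text.lower().split()))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class EpisodicMemory:
    threshold: float = 0.5  # hypothetical cut-off for what is worth keeping
    entries: List[Tuple[Dict[str, float], str, float]] = field(default_factory=list)

    def write_back(self, text: str, importance: float) -> bool:
        """Store the event only if its importance clears the threshold."""
        if importance < self.threshold:
            return False
        self.entries.append((embed(text), text, importance))
        return True

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text, _ in ranked[:k]]
```

Filtering at write time keeps the store small, which matters more than retrieval quality in a first prototype; swapping `embed` and the list for a real embedding model and vector index should not change the interface.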

Agent Features

Memory
working memory (context extension, compression, active maintenance); long-term memory (vector DB, key-value, tree/graph, parametric); consolidation and importance scoring
Planning
Chain-of-Thought; search-based reasoning (Self-Consistency, Tree-of-Thoughts); reflective reasoning; hierarchical planning
Tool Use
code-as-policy; API / programmatic actions; skill libraries (reusable primitives)
Frameworks
LLMGA single-agent framework; Multi-LLMGA communication + organizational framework
Is Agentic

Yes

Architectures
LLM-centered; hybrid LLM + low-level controller; multimodal LLMs
Collaboration
communication protocols (observations, beliefs, intentions); organizational topology (centralized, decentralized, hierarchical); task and role allocation (prefixed, dynamic, emergent)
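
The hybrid LLM + low-level controller architecture listed above can be pictured as two loops running at different rates: a slow planner that picks a subgoal, and a cheap controller that acts on every tick. The sketch below is an illustration only; the grid world, the `replan_every` interval, and the planner callable (standing in for an LLM call) are assumptions.

```python
from typing import Callable, List, Tuple

Position = Tuple[int, int]


def low_level_step(pos: Position, goal: Position) -> Position:
    """Cheap per-tick controller: move one cell toward the goal on each axis."""
    def toward(a: int, b: int) -> int:
        return a + (b > a) - (b < a)
    return (toward(pos[0], goal[0]), toward(pos[1], goal[1]))


def run_hybrid(
    start: Position,
    planner: Callable[[Position], Position],
    ticks: int,
    replan_every: int,
) -> Tuple[List[Position], int]:
    """Run the controller every tick and the (slow) planner every N ticks."""
    pos, goal = start, start
    planner_calls = 0
    trace = []
    for tick in range(ticks):
        if tick % replan_every == 0:
            goal = planner(pos)  # the expensive call, e.g. an LLM request
            planner_calls += 1
        pos = low_level_step(pos, goal)
        trace.append(pos)
    return trace, planner_calls
```

With `replan_every=3` and six ticks, the planner is called twice while the controller acts six times, which is the latency saving the hybrid design is after.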

Optimization Features

Token Efficiency
memory compression; context compression (gist tokens, soft token compression)

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Benchmarks are often templated and shallow, limiting tests of true open-ended generalization.

Many demonstrations require heavy compute and sequential API calls, raising cost and latency barriers.

When Not To Use

When you need strict real-time, frame-level control solely from an LLM (use hybrid controllers instead).

When the task cannot tolerate LLM hallucinations or inconsistent persona behavior without strong verification.
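
One simple form of the verification mentioned above is to check every proposed action against the game's legal action set before executing it, retrying a bounded number of times and then falling back to a safe default. This is a sketch with assumed names (`propose`, `max_retries`, `fallback`), not a technique taken from the survey.

```python
from typing import Callable, Optional, Sequence


def verified_action(
    propose: Callable[[], str],
    legal_actions: Sequence[str],
    max_retries: int = 2,
    fallback: Optional[str] = None,
) -> str:
    """Return a legal action, re-querying the proposer a bounded number of times."""
    for _ in range(max_retries + 1):
        candidate = propose().strip().lower()
        if candidate in legal_actions:
            return candidate
    # Every attempt was illegal: use the caller's fallback or the first legal action
    return fallback if fallback is not None else legal_actions[0]
```

The check only guards against illegal or malformed actions; it does not catch actions that are legal but strategically wrong, so it complements rather than replaces stronger verification.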

Failure Modes

Short-term decision inconsistency (action flip-flopping) without active maintenance.

Role drift over long dialogues or multi-episode play unless role memory is enforced.

Core Entities

Models

GPT-4o; GPT-3.5; GPT-4 Vision; multimodal LLMs (general mention)

Metrics

win rate; task success rate; consecutive switch rate; map coverage; number of unique items collected

Datasets

TextWorld; Jericho; ALFWorld; ScienceWorld; MineDojo; Minecraft (various platforms); Crafter; Atari 2600; StarCraft II; PokéLLMon; AvalonBench; PokerBench; ChessGPT datasets

Benchmarks

ALFWorld; TextWorld; MineDojo; Crafter; PokéLLMon; llm-colosseum (Street Fighter)