Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
6
Why It Matters For Business
Game agents are a practical lab for building interactive AI: techniques developed for memory, robust reasoning, and hybrid control carry over to real-world automation, simulations, and multi-agent coordination systems used in product testing and virtual worlds.
Summary TLDR
This paper surveys research that uses large language models (LLMs) as the brain of game agents. It proposes a compact reference architecture with three single-agent modules—working & long-term memory, reasoning, and perception-action interfaces—and a complementary multi-agent layer for communication and organization. The authors map six game genres to concrete agent requirements and summarize practical techniques (context-extension, compression, chain-of-thought variants, reflective loops, code-as-policy, hybrid LLM+low-level controllers). The survey flags latency, memory structuring, and evaluation gaps as the main engineering hurdles.
Problem Statement
LLMs are powerful at language but are trained on static text and lack mechanisms for continuous, grounded interaction. Games provide a reproducible, diverse testbed for building and testing interactive LLM-based agents, but agent design is fragmented: how to add memory, reliable reasoning, perception–action grounding, and scalable multi-agent coordination remains unclear.
Main Contribution
A unified reference architecture for LLM-based game agents: memory, reasoning, perception-action interfaces, and a multi-agent extension for communication and organization.
A challenge-centered taxonomy linking six game genres (action, adventure, role-playing, strategy, simulation, sandbox) to concrete agent design requirements.
A synthesis of practical techniques: context-extension, memory compression/structuring, chain-of-thought variants, reflective learning, code-as-policy, and hybrid controllers.
A curated, up-to-date bibliography and a public paper list to track fast-moving work.
Key Findings
Carrying the previous step's thought into the next prompt (LastThoughts) raises win rate and reduces short-term action inconsistency, as measured by consecutive switch rate.
Positional-encoding and attention modifications can extend LLM context windows by orders of magnitude.
LLMs alone struggle at frame-rate, low-latency control tasks and often underperform RL agents in those settings.
Large multi-agent societies can emerge but require architectural and social priors to scale.
Results
Win Rate (PokéLLMon)
Consecutive Switch Rate (short-term consistency)
Context window capacity (examples)
Who Should Care
What To Try In 7 Days
Add a simple step-to-step thought carryover (LastThoughts) to reduce inconsistent actions.
Implement a small long-term store (vector DB) plus importance-based write-back for episodic memory.
Separate high-level LLM planning from a low-level controller for latency-sensitive tasks and compare win rates.
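The thought-carryover suggestion above can be sketched as a small loop that feeds each step's thought into the next prompt. This is a minimal illustration under stated assumptions, not the surveyed implementation: `fake_llm`, `build_prompt`, `parse_reply`, and the `THOUGHT: ... ACTION: ...` reply format are hypothetical stand-ins for a real model call and its output convention.

```python
from typing import Callable, List, Tuple


def build_prompt(observation: str, last_thought: str) -> str:
    """Compose a prompt that includes the previous step's thought, if any."""
    carry = f"Previous thought: {last_thought}\n" if last_thought else ""
    return f"{carry}Observation: {observation}\nReply as 'THOUGHT: ... ACTION: ...'"


def parse_reply(reply: str) -> Tuple[str, str]:
    """Split a reply of the assumed form 'THOUGHT: <thought> ACTION: <action>'."""
    thought_part, _, action = reply.partition("ACTION:")
    thought = thought_part.replace("THOUGHT:", "", 1).strip()
    return thought, action.strip()


def run_episode(observations: List[str], llm: Callable[[str], str]):
    """Run one episode, carrying each step's thought into the next prompt."""
    last_thought = ""
    actions, prompts = [], []
    for obs in observations:
        prompt = build_prompt(obs, last_thought)
        prompts.append(prompt)
        last_thought, action = parse_reply(llm(prompt))
        actions.append(action)
    return actions, prompts


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; keeps the plan when one is carried over."""
    if "Previous thought:" in prompt:
        return "THOUGHT: keep attacking with the current plan ACTION: attack"
    return "THOUGHT: opponent is weak to water, attack ACTION: attack"


actions, prompts = run_episode(["turn 1", "turn 2"], fake_llm)
```

Swapping `fake_llm` for a real client and logging `prompts` makes it straightforward to compare runs with and without the carried thought.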
Agent Features
Memory
- working memory (context extension, compression, active maintenance)
- long-term memory (vector DB, key-value, tree/graph, parametric)
- consolidation and importance scoring
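One way to picture a long-term store with importance scoring is the sketch below. It is a toy under stated assumptions: the bag-of-words `embed` function stands in for a learned encoder and a real vector database, and the `EpisodicMemory` class, its `threshold` value, and the "similarity times importance" ranking are all illustrative choices rather than anything prescribed by the survey.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a learned encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class EpisodicMemory:
    threshold: float = 0.5  # assumed importance cutoff for write-back
    entries: List[Tuple[str, float, Counter]] = field(default_factory=list)

    def write(self, text: str, importance: float) -> bool:
        """Store the event only if its importance reaches the threshold."""
        if importance < self.threshold:
            return False
        self.entries.append((text, importance, embed(text)))
        return True

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        """Return the k entries ranked by similarity weighted by importance."""
        q = embed(query)
        ranked = sorted(
            self.entries, key=lambda e: cosine(q, e[2]) * e[1], reverse=True
        )
        return [text for text, _, _ in ranked[:k]]


mem = EpisodicMemory(threshold=0.5)
mem.write("found a diamond pickaxe in the cave", importance=0.9)
mem.write("walked two steps north", importance=0.1)  # dropped: below threshold
```

In practice the importance score would come from the model or a heuristic; here it is passed in by hand to keep the write-back rule visible.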
Planning
- Chain-of-Thought
- search-based reasoning (Self-Consistency, Tree-of-Thoughts)
- reflective reasoning
- hierarchical planning
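Of the planning styles listed, Self-Consistency is the simplest to sketch: sample several reasoning chains and take a majority vote over their final answers. The snippet below assumes a hypothetical `sample` callable that returns only the final action of one sampled chain; `make_stub_sampler` is a deterministic stand-in for a stochastic model.

```python
import itertools
from collections import Counter
from typing import Callable, Iterable


def self_consistent_action(prompt: str, sample: Callable[[str], str], n: int = 5) -> str:
    """Sample n final actions and return the most frequent one (majority vote)."""
    votes = Counter(sample(prompt) for _ in range(n))
    return votes.most_common(1)[0][0]


def make_stub_sampler(answers: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical sampler that cycles through fixed answers instead of calling a model."""
    it = itertools.cycle(list(answers))
    return lambda prompt: next(it)


sampler = make_stub_sampler(
    ["move north", "move east", "move north", "move north", "move east"]
)
best = self_consistent_action("Which way to the exit?", sampler, n=5)
```

Each extra sample is another model call, so `n` trades decision stability against the cost and latency concerns noted elsewhere in this summary.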
Tool Use
- code-as-policy
- API / programmatic actions
- skill libraries (reusable primitives)
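A skill library of reusable primitives might look like the sketch below. It makes a deliberate simplification: instead of executing model-generated code directly, the model is assumed to emit a plan as a list of `(skill_name, arguments)` pairs, which the library validates and dispatches. `SkillLibrary`, `move_to`, and `mine` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple


class SkillLibrary:
    """Registry of named primitives that a generated plan is allowed to call."""

    def __init__(self) -> None:
        self._skills: Dict[str, Callable[..., str]] = {}

    def register(self, name: str):
        """Decorator that adds a function to the library under the given name."""
        def decorator(fn: Callable[..., str]) -> Callable[..., str]:
            self._skills[name] = fn
            return fn
        return decorator

    def run(self, plan: List[Tuple[str, dict]]) -> List[str]:
        """Execute each step in order; reject any skill that is not registered."""
        log = []
        for name, kwargs in plan:
            if name not in self._skills:
                raise KeyError(f"unknown skill: {name}")
            log.append(self._skills[name](**kwargs))
        return log


skills = SkillLibrary()


@skills.register("move_to")
def move_to(x: int, y: int) -> str:
    return f"moved to ({x}, {y})"


@skills.register("mine")
def mine(block: str) -> str:
    return f"mined {block}"


# A plan as the model might emit it (hand-written here for illustration).
plan = [("move_to", {"x": 3, "y": 4}), ("mine", {"block": "iron_ore"})]
log = skills.run(plan)
```

Restricting execution to registered skills keeps hallucinated actions from reaching the game, at the cost of less expressive plans than free-form code.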
Frameworks
- LLMGA single-agent framework
- Multi-LLMGA communication + organizational framework
Is Agentic
true
Architectures
- LLM-centered
- hybrid LLM + low-level controller
- multimodal LLMs
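The hybrid architecture can be illustrated with a toy control loop in which a slow planner sets goals occasionally and a fast rule-based controller acts on every tick. Everything here is an assumption made for illustration: the one-dimensional world, `stub_planner` (standing in for an LLM call), and the `replan_every` interval.

```python
from typing import Callable, List, Tuple


def low_level_controller(position: int, goal: int) -> int:
    """Fast per-tick rule: step one unit toward the current goal."""
    if position < goal:
        return 1
    if position > goal:
        return -1
    return 0


def run_hybrid(
    planner: Callable[[int], int], start: int, ticks: int, replan_every: int
) -> Tuple[List[int], int]:
    """Call the slow planner every `replan_every` ticks; act on every tick."""
    position, goal, planner_calls = start, start, 0
    trajectory = []
    for tick in range(ticks):
        if tick % replan_every == 0:
            goal = planner(position)  # expensive call, made sparingly
            planner_calls += 1
        position += low_level_controller(position, goal)
        trajectory.append(position)
    return trajectory, planner_calls


def stub_planner(position: int) -> int:
    """Hypothetical stand-in for an LLM that picks a high-level target."""
    return 5


trajectory, planner_calls = run_hybrid(stub_planner, start=0, ticks=8, replan_every=4)
```

The controller issues eight actions while the planner is consulted only twice, which is the point of the split: planning latency no longer bounds the action rate.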
Collaboration
- communication protocols (observations, beliefs, intentions)
- organizational topology (centralized, decentralized, hierarchical)
- task and role allocation (predefined, dynamic, emergent)
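The communication and topology items above can be made concrete with a typed message and a routing rule per topology. This is a sketch under assumptions: the `Message` fields, the `recipients` function, and the `hub` and `parent` parameters are hypothetical, and real systems may route quite differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MESSAGE_KINDS = {"observation", "belief", "intention"}


@dataclass(frozen=True)
class Message:
    sender: str
    kind: str  # one of MESSAGE_KINDS
    content: str

    def __post_init__(self) -> None:
        if self.kind not in MESSAGE_KINDS:
            raise ValueError(f"unknown message kind: {self.kind}")


def recipients(
    sender: str,
    agents: List[str],
    topology: str,
    hub: Optional[str] = None,
    parent: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return who receives a message from `sender` under the given topology."""
    others = [a for a in agents if a != sender]
    if topology == "decentralized":
        return others  # every agent talks to every other agent
    if topology == "centralized":
        if hub is None:
            raise ValueError("centralized topology needs a hub")
        return others if sender == hub else [hub]  # all traffic passes the hub
    if topology == "hierarchical":
        tree = parent or {}
        up = [tree[sender]] if sender in tree else []
        down = [a for a in agents if tree.get(a) == sender]
        return up + down  # talk only to direct manager and direct reports
    raise ValueError(f"unknown topology: {topology}")


agents = ["a", "b", "c"]
msg = Message(sender="b", kind="intention", content="I will scout the east gate")
to = recipients(msg.sender, agents, "hierarchical", parent={"b": "a", "c": "b"})
```

Counting the recipients per message under each topology gives a rough feel for how communication cost grows with the number of agents.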
Optimization Features
Token Efficiency
- memory compression
- context compression (gist tokens, soft token compression)
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Benchmarks are often templated and shallow, limiting tests of true open-ended generalization.
- Many demonstrations require heavy compute and sequential API calls, raising cost and latency barriers.
- Empirical comparisons are fragmented; methods are evaluated on different environments and with varying metrics.
When Not To Use
- When you need strict real-time, frame-level control solely from an LLM (use hybrid controllers instead).
- When the task cannot tolerate LLM hallucinations or inconsistent persona behavior without strong verification.
- When you need fully reproducible large-scale multi-agent dynamics without investing in custom infrastructure.
Failure Modes
- Short-term decision inconsistency (action flip-flopping) without active maintenance.
- Role drift over long dialogues or multi-episode play unless role memory is enforced.
- Latency bottlenecks in action games, where slower but stronger reasoning can yield worse in-game performance.
- Scaling collapse in multi-agent systems when single-threaded cognition becomes the bottleneck.
Core Entities
Models
- GPT-4o
- GPT-3.5
- GPT-4 Vision
- multimodal LLMs (general mention)
Metrics
- win rate
- task success rate
- consecutive switch rate
- map coverage
- number of unique items collected
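Two of the metrics above are easy to compute from episode logs, as the sketch below shows. The definition of consecutive switch rate used here is an assumption, since this summary does not define it: the fraction of adjacent action pairs in which both actions are a switch. The function names and the `"switch"` label are likewise hypothetical.

```python
from typing import List, Sequence


def win_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of episodes won; 0.0 when there are no episodes."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def consecutive_switch_rate(actions: List[str], switch_label: str = "switch") -> float:
    """Assumed definition: share of adjacent action pairs that are both switches."""
    if len(actions) < 2:
        return 0.0
    pairs = zip(actions, actions[1:])
    consecutive = sum(
        1 for prev, cur in pairs if prev == switch_label and cur == switch_label
    )
    return consecutive / (len(actions) - 1)


wins = win_rate([True, False, True, True])
csr = consecutive_switch_rate(["attack", "switch", "switch", "attack", "switch"])
```

If a benchmark publishes its own definition of either metric, that definition should replace the assumed one before comparing numbers.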
Datasets
- TextWorld
- Jericho
- ALFWorld
- ScienceWorld
- MineDojo
- Minecraft (various platforms)
- Crafter
- Atari 2600
- StarCraft II
- PokéLLMon
- AvalonBench
- PokerBench
- ChessGPT datasets
Benchmarks
- ALFWorld
- TextWorld
- MineDojo
- Crafter
- PokéLLMon
- llm-colosseum (Street Fighter)

