Multi‑agent ReAct Game Master outperforms prompt‑only GM in solo RPGs

February 26, 2025 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Nicolai Hejlesen Jørgensen, Sarmilan Tharmabalan, Ilhan Aslan, Nicolai Brodersen Hansen, Timothy Merritt

Links

Abstract / PDF

Why It Matters For Business

In a 12-player study, an agentic ReAct design with a dedicated memory agent raised ratings for immersion, story coherence, and intent to replay. Studios could use AI Game Masters of this kind to scale solo-play experiences and increase engagement.

Summary TLDR

The authors built ChatRPG, a text-based solo role-playing system, and compared two Game Master designs. v1 relied on a long engineered prompt with GPT‑4. v2 split responsibilities between two ReAct agents (Narrator and Archivist) that call JSON tools to act and update persistent state. A counterbalanced user study (N=12) found that v2 received higher ratings for perceived intelligence, immersion, mastery, coherence, and curiosity. Code is published. Main limitations: small sample, prompt sensitivity, and API content filters.

Problem Statement

Solo tabletop-style role-playing needs a dependable Game Master (GM). Simple prompt-only LLMs can produce engaging text but struggle with long-term coherence, state tracking, and complex actions. The paper asks whether an agentic, tool-enabled ReAct design with a dedicated memory agent improves player experience over a prompt-only approach.

Main Contribution

Built ChatRPG v1 (prompt-engineered, state-in-prompt) and v2 (multi-agent ReAct with tools and persistent state).

Designed two specialized agents: Narrator (story & actions) and Archivist (persistent game state/memory).

Evaluated both versions in a counterbalanced user study (N=12) with quantitative PXI-derived measures and qualitative interviews.

Released implementation and full prompts to help replication and adaptation.
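A counterbalanced design means half the participants play one version first and half play the other first, so order effects cancel out on average. The sketch below shows one way to build such a schedule for 12 participants. The function name, the seed, and the shuffle-then-alternate procedure are illustrative assumptions, not the authors' documented protocol.

```python
import random

def counterbalance(participant_ids, conditions=("v1", "v2"), seed=0):
    """Assign each participant a condition order, using both orders equally often.

    Hypothetical helper: the paper reports a counterbalanced study but this
    exact assignment procedure is an assumption.
    """
    ids = list(participant_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)  # randomize which participant receives which order
    orders = [tuple(conditions), tuple(reversed(conditions))]
    # Alternate the two orders over the shuffled list.
    return {pid: orders[i % 2] for i, pid in enumerate(ids)}

schedule = counterbalance(range(1, 13))
for pid in sorted(schedule):
    print(pid, "plays", " then ".join(schedule[pid]))
```

With an even number of participants, each order is used by exactly half of them.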

Key Findings

Players rated the agentic v2 higher on multiple engagement measures.

Numbers: N=12; 9/14 constructs significant; example: Mastery 0.68→2.33 (p=0.004)

v2 achieved statistically significant improvements in coherence and immersion.

Numbers: Coherent story 1.00→2.25 (p=0.04); Immersion 1.64→2.42 (p=0.034)

Prompt-only v1 degrades as conversation context grows and hits model context limits.

ReAct design required careful few‑shot tool descriptions and prompt tuning; wording strongly affects behavior.

Results

Mastery (PXI)

Value: v1 mean=0.68, v2 mean=2.33

Baseline: v1

Coherent story

Value: v1 mean=1.00, v2 mean=2.25

Baseline: v1

Immersion (PXI)

Value: v1 mean=1.64, v2 mean=2.42

Baseline: v1

Ease of control

Value: v1 mean=2.08, v2 mean=2.81

Baseline: v1

Likely to play again

Value: v1 mean=1.58, v2 mean=2.50

Baseline: v1

Overall satisfaction

Value: v1 mean=1.08, v2 mean=2.17

Baseline: v1
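To compare the size of the gaps across measures, the reported means can be subtracted directly. The short sketch below does only that arithmetic on the six values listed in this section. These are raw differences in means, not effect sizes or significance tests, and the variable names are illustrative.

```python
# Reported (v1 mean, v2 mean) pairs, copied from the results above.
reported_means = {
    "Mastery (PXI)": (0.68, 2.33),
    "Coherent story": (1.00, 2.25),
    "Immersion (PXI)": (1.64, 2.42),
    "Ease of control": (2.08, 2.81),
    "Likely to play again": (1.58, 2.50),
    "Overall satisfaction": (1.08, 2.17),
}

def mean_differences(means):
    """Return v2 minus v1 for each measure, rounded to two decimals."""
    return {name: round(v2 - v1, 2) for name, (v1, v2) in means.items()}

diffs = mean_differences(reported_means)
# Print measures from largest to smallest gap.
for name, diff in sorted(diffs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} +{diff:.2f}")
```

Mastery shows the largest raw gap (1.65 points) and Ease of control the smallest (0.73 points).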

What To Try In 7 Days

Prototype a prompt-only GM to validate basic narrative style and UI.

Add a single background process that stores structured JSON state instead of stuffing history into prompts.

Implement one ReAct tool (e.g., battle resolution) and few-shot examples to test action/tool wiring with an LLM.
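The second and third steps above can be prototyped without any LLM. The sketch below keeps game state in a JSON file and exposes one battle-resolution function that a tool-calling agent could later invoke. The state layout, the file name, and the dice rule (d20, hit on 11 or higher, 1 to 6 damage) are all hypothetical choices for illustration, not ChatRPG's actual rules.

```python
import json
import random
import tempfile
from pathlib import Path

def load_state(path):
    """Load game state from a JSON file, or start with an empty state."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text(encoding="utf-8"))
    return {"characters": {}, "environment": {}}

def save_state(path, state):
    """Persist game state as JSON so it survives between turns."""
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")

def battle(state, attacker, defender, rng):
    """Resolve one attack with a hypothetical rule and update the defender's hp."""
    for name in (attacker, defender):
        if name not in state["characters"]:
            raise KeyError(f"unknown character: {name}")
    roll = rng.randint(1, 20)
    hit = roll >= 11  # assumed hit threshold
    damage = rng.randint(1, 6) if hit else 0
    target = state["characters"][defender]
    target["hp"] = max(0, target["hp"] - damage)
    return {"attacker": attacker, "defender": defender, "roll": roll,
            "hit": hit, "damage": damage, "defender_hp": target["hp"]}

# Usage: one turn against a temporary state file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "campaign.json"
    state = load_state(path)
    state["characters"] = {"hero": {"hp": 12}, "goblin": {"hp": 8}}
    result = battle(state, "hero", "goblin", random.Random(7))
    save_state(path, state)
    reloaded = load_state(path)
    print(result)
```

Passing the random generator in explicitly keeps the tool testable: a seeded generator gives repeatable rolls even though the narrative model itself is not deterministic.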

Agent Features

Memory

  • Archivist persistent JSON game state (structured memory)

Planning

  • ReAct: Thought-Action-Observation trajectories
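A Thought-Action-Observation trajectory can be illustrated with a small loop and a scripted stand-in for the model. In this sketch the text format ("Action:", "Action Input:", "Final Answer:") follows a common ReAct prompting convention and is an assumption here, as are the `react_loop` function, the stub model, and the arguments of the `WoundCharacter` tool. A real system would replace `scripted_llm` with an actual LLM call.

```python
import json
import re

# Matches "Action: <ToolName>" followed by "Action Input: {json}".
ACTION_RE = re.compile(r"Action:\s*(\w+)\s*\nAction Input:\s*(\{.*\})", re.DOTALL)

def react_loop(llm, tools, question, max_steps=5):
    """Alternate model replies and tool observations until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip(), transcript
        match = ACTION_RE.search(reply)
        if match is None:
            observation = "Could not parse an action."
        else:
            name, raw_args = match.groups()
            try:
                observation = tools[name](**json.loads(raw_args))
            except (KeyError, TypeError, json.JSONDecodeError) as exc:
                observation = f"Tool call failed: {exc!r}"
        # Feed the tool result back so the next reply can use it.
        transcript += f"Observation: {observation}\n"
    raise RuntimeError("no final answer within max_steps")

hp = {"goblin": 8}

def wound_character(name, amount):
    """Hypothetical tool: subtract hit points and report the new total."""
    hp[name] -= amount
    return f"{name} now has {hp[name]} hp"

def scripted_llm(transcript):
    """Stand-in for a real model: acts once, then answers."""
    if "Observation:" not in transcript:
        return ("Thought: The player attacks, so I should resolve damage.\n"
                "Action: WoundCharacter\n"
                'Action Input: {"name": "goblin", "amount": 3}')
    return "Thought: I know the outcome.\nFinal Answer: Your blade bites; the goblin staggers."

answer, transcript = react_loop(
    scripted_llm, {"WoundCharacter": wound_character}, "I swing my sword at the goblin."
)
print(transcript)
```

Failed or unparseable tool calls are returned to the model as observations rather than raised, which gives the model a chance to correct itself on the next step.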

Tool Use

  • JSON tool calls: Battle, WoundCharacter, HealCharacter, UpdateCharacter, UpdateEnvironment
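Because the model emits tool calls as JSON text, it helps to validate each call before it touches game state. The sketch below checks a call against the five tool names listed above. The envelope shape (`"tool"` and `"args"` keys) and the required argument names are assumptions for illustration; the paper's actual schemas may differ.

```python
import json

# Tool names come from the summary; the argument sets are hypothetical.
REQUIRED_ARGS = {
    "Battle": {"attacker", "defender"},
    "WoundCharacter": {"name", "amount"},
    "HealCharacter": {"name", "amount"},
    "UpdateCharacter": {"name", "fields"},
    "UpdateEnvironment": {"name", "fields"},
}

def parse_tool_call(raw):
    """Parse one JSON tool call and return (tool, args), or raise ValueError."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    tool = call.get("tool")
    args = call.get("args", {})
    if tool not in REQUIRED_ARGS:
        raise ValueError(f"unknown tool: {tool!r}")
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")
    missing = REQUIRED_ARGS[tool] - set(args)
    if missing:
        raise ValueError(f"{tool} is missing arguments: {sorted(missing)}")
    return tool, args

tool, args = parse_tool_call(
    '{"tool": "HealCharacter", "args": {"name": "hero", "amount": 4}}'
)
print(tool, args)
```

Rejecting malformed calls with a clear message lets the calling agent retry instead of silently corrupting state.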

Frameworks

  • LangChain (pattern referenced)
  • OpenAI API (LLM provider)

Is Agentic

true

Architectures

  • multi-agent (Narrator + Archivist)

Collaboration

  • separation of duties: narrator (narrative+actions) vs archivist (state updates)
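The separation of duties can be pictured as a two-stage turn: the Narrator produces prose, then the Archivist turns that prose into state updates. The sketch below wires the two stages together with stub callables in place of LLM calls. The class names mirror the agent roles, but the method names, the `GameState` fields, and the update format are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class GameState:
    """Persistent state owned by the Archivist (layout is an assumption)."""
    characters: dict = field(default_factory=dict)
    environment: dict = field(default_factory=dict)

class Narrator:
    """Produces narrative text; never writes to state directly."""
    def __init__(self, llm):
        self.llm = llm

    def respond(self, player_input, state):
        prompt = f"State: {state}\nPlayer: {player_input}\nNarrate the outcome."
        return self.llm(prompt)

class Archivist:
    """Extracts facts from the narration and merges them into state."""
    def __init__(self, extractor):
        self.extractor = extractor

    def record(self, narration, state):
        updates = self.extractor(narration)
        for name, fields in updates.get("characters", {}).items():
            state.characters.setdefault(name, {}).update(fields)
        state.environment.update(updates.get("environment", {}))
        return state

def play_turn(narrator, archivist, state, player_input):
    """One turn: narrate first, then archive what the narration established."""
    narration = narrator.respond(player_input, state)
    archivist.record(narration, state)
    return narration

# Stubs standing in for the two LLM-backed agents.
def fake_llm(prompt):
    return "You enter the crypt. A skeleton named Rattle rises."

def fake_extractor(narration):
    return {"characters": {"Rattle": {"type": "skeleton", "hostile": True}},
            "environment": {"location": "crypt"}}

state = GameState()
narration = play_turn(Narrator(fake_llm), Archivist(fake_extractor), state, "I open the door.")
print(narration)
print(state)
```

Keeping state writes inside the Archivist means narrative wording can be tuned without risking the bookkeeping, and vice versa.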

Optimization Features

Token Efficiency

  • reduces token growth by storing structured state externally instead of entire conversation
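The token-growth claim can be checked with a rough simulation. The sketch below compares the prompt size when the full history is resent every turn against the size when only a state snapshot and the latest turn are sent. Whitespace-separated word counts stand in for tokens, and the snapshot is assumed to stay the same size, both of which are simplifications; real state also grows, only more slowly than the transcript.

```python
import json

def approx_tokens(text):
    """Crude token proxy: count whitespace-separated words."""
    return len(text.split())

def prompt_sizes(turns, state_snapshot):
    """Return per-turn prompt sizes for history-in-prompt vs. external state."""
    history = []
    history_sizes, state_sizes = [], []
    snapshot_size = approx_tokens(json.dumps(state_snapshot))
    for turn in turns:
        history.append(turn)
        # Prompt-only design: the whole conversation is resent each turn.
        history_sizes.append(approx_tokens("\n".join(history)))
        # External-state design: snapshot plus the latest turn only.
        state_sizes.append(snapshot_size + approx_tokens(turn))
    return history_sizes, state_sizes

turns = [f"Turn {i}: the hero explores another room of the dungeon." for i in range(1, 21)]
state = {"hero": {"hp": 12, "location": "dungeon"}}
history_sizes, state_sizes = prompt_sizes(turns, state)
print("history-in-prompt:", history_sizes[0], "->", history_sizes[-1])
print("external state:   ", state_sizes[0], "->", state_sizes[-1])
```

Under these assumptions the history-based prompt grows linearly with the number of turns while the state-based prompt stays flat.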

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Small sample size (N=12) limits statistical power and external validity.
  • Behavior is sensitive to prompt wording and few-shot examples; brittle tuning required.
  • Dependence on closed-source LLM (GPT‑4) and restrictive API content filters affects behavior.
  • Study participants were mostly young and experienced with games, limiting demographic diversity.

When Not To Use

  • When you need deterministic, exactly reproducible outputs across runs.
  • When content moderation will block core gameplay (e.g., violent fantasy) and no alternative model is available.
  • When you cannot afford the API cost, or need fully offline deployment but cannot self-host a model.

Failure Modes

  • Hallucinations that introduce inconsistent facts or NPC traits.
  • Context window overflow if state is not properly summarized or persisted.
  • Tools invoked too often or too rarely, leading to broken combat or missed state updates.
  • API moderation interruptions that abort or alter narratives mid-session.

Core Entities

Models

  • GPT‑4 (OpenAI)

Metrics

  • Player Experience Inventory (PXI) constructs: Ease of control, Mastery, Immersion, Curiosity, Cohere
  • Custom survey items (story interest, likely to play again, satisfaction)