Multi‑agent ReAct Game Master outperforms prompt‑only GM in solo RPGs

February 26, 2025 · 6 min

Overview

Decision Snapshot: Needs Validation

The paper gives a practical, runnable system and measured user gains, but the small sample, prompt brittleness, and reliance on proprietary LLMs limit generalizability.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Nicolai Hejlesen Jørgensen, Sarmilan Tharmabalan, Ilhan Aslan, Nicolai Brodersen Hansen, Timothy Merritt

Links

Abstract / PDF / Code

Why It Matters For Business

An agentic ReAct design with a memory agent measurably raises player immersion, coherence, and replay intent; studios can add AI Game Masters that scale solo-play experiences and increase engagement.

Who Should Care

Summary TLDR

The authors built ChatRPG, a text-based solo role-playing system, and compared two Game Master designs. v1 relied on a long engineered prompt with GPT‑4. v2 split responsibilities between two ReAct agents (Narrator + Archivist) that call JSON tools to act and update persistent state. In a counterbalanced user study (N=12), v2 received higher ratings for perceived intelligence, immersion, mastery, coherence, and curiosity. Code is published. Main limits: small sample, prompt sensitivity, and API content filters.

Problem Statement

Solo tabletop-style role-playing needs a dependable Game Master (GM). Simple prompt-only LLMs can produce engaging text but struggle with long-term coherence, state tracking, and complex actions. The paper asks whether an agentic, tool-enabled ReAct design with a dedicated memory agent improves player experience over a prompt-only approach.

Main Contribution

Built ChatRPG v1 (prompt-engineered, state-in-prompt) and v2 (multi-agent ReAct with tools and persistent state).

Designed two specialized agents: Narrator (story & actions) and Archivist (persistent game state/memory).
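The division of labor between the two agents can be illustrated with a minimal sketch. It is not the authors' implementation: the class layout, the `GameState` fields, and `stub_llm` (a placeholder for a real model call) are all assumptions made for illustration. The point it shows is that only the Archivist mutates state, while the Narrator reads a summary of it.

```python
from dataclasses import dataclass, field


@dataclass
class GameState:
    """Persistent state owned by the Archivist (field names are hypothetical)."""
    location: str = "village square"
    characters: dict = field(default_factory=dict)  # name -> description


class Archivist:
    """The only component allowed to mutate the game state."""

    def __init__(self, state: GameState):
        self.state = state

    def record(self, updates: dict) -> None:
        # In a real system an LLM would extract `updates` from the narration;
        # here the caller passes them in directly.
        if "location" in updates:
            self.state.location = updates["location"]
        self.state.characters.update(updates.get("characters", {}))

    def summary(self) -> str:
        names = ", ".join(sorted(self.state.characters)) or "nobody"
        return f"Location: {self.state.location}. Known characters: {names}."


class Narrator:
    """Produces narration from the player's input plus the Archivist's summary."""

    def __init__(self, archivist: Archivist, llm):
        self.archivist = archivist
        self.llm = llm  # any callable str -> str; a stub stands in for the model

    def respond(self, player_input: str) -> str:
        prompt = f"{self.archivist.summary()}\nPlayer: {player_input}\nNarrator:"
        return self.llm(prompt)


def stub_llm(prompt: str) -> str:
    # Placeholder for a real model call; echoes the state line it was given.
    return "You look around. " + prompt.splitlines()[0]


archivist = Archivist(GameState())
narrator = Narrator(archivist, stub_llm)
archivist.record({"location": "dark forest", "characters": {"Mira": "a wary ranger"}})
reply = narrator.respond("I look around.")
print(reply)
```

Keeping all writes in one class makes it easier to audit how the story state changed over a session.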

Key Findings

Players rated the agentic v2 higher on multiple engagement measures.

Numbers: N=12; 9/14 constructs significant; example: Mastery 0.68 → 2.33 (p=0.004)

Practical Use: Use a ReAct multi-agent design with a memory agent to improve immersion, perceived intelligence, and player mastery in text RPGs.

Evidence Ref: Table 1 (PXI constructs) and comparative study section

v2 achieved statistically significant improvements in coherence and immersion.

Numbers: Coherent story 1.00 → 2.25 (p=0.04); Immersion 1.64 → 2.42 (p=0.034)

Practical Use: Separate narration and state-tracking responsibilities to maintain story coherence during longer sessions.

Evidence Ref: Table 1 (Coherent story, Immersion rows)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Mastery (PXI) | v1 mean=0.68, v2 mean=2.33 | v1 | +1.65 | N=12, paired test | t=-3.683, p=0.004 | Table 1
Coherent story | v1 mean=1.00, v2 mean=2.25 | v1 | +1.25 | N=12, paired test | t=-2.322, p=0.04 | Table 1
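As a sanity check on the rows above, the deltas and p-values can be recomputed from the reported means and t statistics. The sketch below assumes a two-sided paired t-test with 11 degrees of freedom (N - 1 for 12 participants), which the summary does not state explicitly. It integrates the Student's t density numerically so that no statistics library is needed.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat: float, df: int, steps: int = 2000) -> float:
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 <= T <= |t|)
    return 1 - 2 * area


DF = 12 - 1  # assumption: paired t-test over all 12 participants

rows = {
    "Mastery": (0.68, 2.33, -3.683),
    "Coherent story": (1.00, 2.25, -2.322),
}
for name, (v1, v2, t_stat) in rows.items():
    delta = round(v2 - v1, 2)
    p = two_sided_p(t_stat, DF)
    print(f"{name}: delta={delta:+.2f}, p={p:.3f}")
```

Under that assumption the recomputed p-values should land close to the reported 0.004 and 0.04; a larger gap would suggest a different test or sample size was used.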

What To Try In 7 Days

Prototype a prompt-only GM to validate basic narrative style and UI.

Add a single background process that stores structured JSON state instead of stuffing history into prompts.

Implement one ReAct tool (e.g., battle resolution) and few-shot examples to test action/tool wiring with an LLM.
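The second and third steps above can be prototyped together in a few dozen lines. This is a sketch under stated assumptions, not ChatRPG's code: the state layout, the file name `game_state.json`, the damage rule, and the shape of the JSON tool call are all hypothetical. Only the tool name `Battle` comes from the paper summary.

```python
import json
import random
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read the game state, or start a fresh one if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"characters": {"Hero": {"hp": 20}, "Goblin": {"hp": 8}}}


def save_state(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")


def battle(state: dict, attacker: str, defender: str, rng: random.Random) -> str:
    """Hypothetical battle tool: one attack roll, damage applied to the state."""
    damage = rng.randint(1, 6)
    target = state["characters"][defender]
    target["hp"] = max(0, target["hp"] - damage)
    return f"{attacker} hits {defender} for {damage}; {defender} has {target['hp']} HP left."


TOOLS = {"Battle": battle}


def run_tool_call(raw_call: str, state: dict, rng: random.Random) -> str:
    """Dispatch a JSON tool call such as the model might emit."""
    call = json.loads(raw_call)
    tool = TOOLS[call["tool"]]
    return tool(state, rng=rng, **call["args"])


state_path = Path(tempfile.mkdtemp()) / "game_state.json"
state = load_state(state_path)
observation = run_tool_call(
    '{"tool": "Battle", "args": {"attacker": "Hero", "defender": "Goblin"}}',
    state,
    random.Random(0),
)
save_state(state_path, state)
print(observation)
```

Resolving the dice roll in code rather than in the model keeps combat outcomes out of the LLM's hands, and the saved file lets the next turn start from the updated state instead of the full chat history.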

Agent Features

Memory
Archivist persistent JSON game state (structured memory)
Planning
ReAct: Thought-Action-Observation trajectories
Tool Use
JSON tool calls: Battle, WoundCharacter, HealCharacter, UpdateCharacter, UpdateEnvironment
Frameworks
LangChain (pattern referenced), OpenAI API (LLM provider)
Is Agentic

Yes

Architectures
multi-agent (Narrator + Archivist)
Collaboration
separation of duties: narrator (narrative+actions) vs archivist (state updates)
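To make the Thought-Action-Observation pattern concrete, here is a minimal loop driven by a scripted stand-in for the LLM. The `Action:` / `Action Input:` text format, the tool argument names, and `fake_llm` are assumptions for illustration; only the tool names `WoundCharacter` and `HealCharacter` are taken from the list above.

```python
import json
import re

state = {"Hero": {"hp": 12, "max_hp": 20}}


def wound_character(name: str, amount: int) -> str:
    state[name]["hp"] = max(0, state[name]["hp"] - amount)
    return f"{name} now has {state[name]['hp']} HP"


def heal_character(name: str, amount: int) -> str:
    state[name]["hp"] = min(state[name]["max_hp"], state[name]["hp"] + amount)
    return f"{name} now has {state[name]['hp']} HP"


TOOLS = {"WoundCharacter": wound_character, "HealCharacter": heal_character}

# Scripted stand-in for the LLM: one reply per step of the trajectory.
SCRIPT = iter([
    'Thought: The potion should restore health.\n'
    'Action: HealCharacter\n'
    'Action Input: {"name": "Hero", "amount": 5}',
    'Thought: The state is updated, so I can narrate.\n'
    'Final Answer: Warmth spreads through you as the potion takes effect.',
])


def fake_llm(transcript: str) -> str:
    return next(SCRIPT)


ACTION_RE = re.compile(r"Action: (\w+)\nAction Input: (\{.*\})", re.DOTALL)


def react(player_input: str, max_steps: int = 4) -> str:
    transcript = f"Player: {player_input}"
    for _ in range(max_steps):
        reply = fake_llm(transcript)
        transcript += "\n" + reply
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = ACTION_RE.search(reply)
        if match is None:
            raise ValueError("model reply contained neither an action nor a final answer")
        # Run the requested tool and feed its result back as an observation.
        observation = TOOLS[match.group(1)](**json.loads(match.group(2)))
        transcript += f"\nObservation: {observation}"
    raise RuntimeError("no final answer within the step budget")


answer = react("I drink the healing potion.")
print(answer)
print(state["Hero"]["hp"])
```

The step budget guards against a model that keeps calling tools without ever producing a final narration.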

Optimization Features

Token Efficiency
Reduces token growth by storing structured state externally instead of resending the entire conversation
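The effect on prompt size can be seen in a toy comparison. This sketch uses character counts as a rough proxy for tokens, and the state fields, the turn text, and the two-exchange window are arbitrary assumptions rather than values from the paper.

```python
import json

state = {"location": "dark forest", "characters": {"Hero": {"hp": 20}}, "turn": 0}
history = []


def full_history_prompt(player_input: str) -> str:
    """Prompt-only style: the whole transcript is resent every turn."""
    return "\n".join(history + [f"Player: {player_input}"])


def state_prompt(player_input: str, recent_turns: int = 2) -> str:
    """State-based style: compact JSON state plus only the latest exchanges."""
    recent = history[-2 * recent_turns:]
    return "\n".join([json.dumps(state)] + recent + [f"Player: {player_input}"])


sizes = []  # (full-history length, state-based length) per turn
for turn in range(1, 31):
    player_input = f"I take action number {turn}."
    sizes.append((len(full_history_prompt(player_input)), len(state_prompt(player_input))))
    history.append(f"Player: {player_input}")
    history.append(f"Narrator: Something happens in response to action {turn}, described at some length.")
    state["turn"] = turn

print("turn 1: ", sizes[0])
print("turn 30:", sizes[-1])
```

The full-history prompt grows with every turn, while the state-based prompt stays roughly constant once the recent-turn window is full. The trade-off is that anything not captured in the state is forgotten.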

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Small sample size (N=12) limits statistical power and external validity.

Behavior is sensitive to prompt wording and few-shot examples, so tuning is brittle.

When Not To Use

When you need deterministic, exactly reproducible outputs across runs.

When content moderation will block core gameplay (e.g., violent fantasy) and no alternative model is available.

Failure Modes

Hallucinations that introduce inconsistent facts or NPC traits.

Context window overflow if state is not properly summarized or persisted.

Core Entities

Models

ChatGPT-4 (OpenAI)

Metrics

Player Experience Inventory (PXI) constructs: Ease of control, Mastery, Immersion, Curiosity, Cohere…

Custom survey items (story interest, likely to play again, satisfaction)