Multi‑agent ReAct Game Master outperforms prompt‑only GM in solo RPGs

Overview

Decision Snapshot: Needs Validation

The paper gives a practical, runnable system and measured user gains, but the small sample, prompt brittleness, and reliance on proprietary LLMs limit generalizability.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Nicolai Hejlesen Jørgensen, Sarmilan Tharmabalan, Ilhan Aslan, Nicolai Brodersen Hansen, Timothy Merritt

Links

Abstract / PDF / Code

Why It Matters For Business

In the study, an agentic ReAct design with a memory agent measurably raised player immersion, story coherence, and replay intent. Studios could add AI Game Masters to scale solo-play experiences and increase engagement.

Who Should Care

Product Manager, ML Engineer, CTO, Engineering Lead, Founder

Summary TLDR

The authors built ChatRPG, a text-based solo role-playing system, and compared two Game Master designs. v1 relied on prompt engineering with GPT‑4 and kept game state in the prompt. v2 split responsibilities between two ReAct agents (Narrator + Archivist) that call JSON tools to act and update persistent state. In a counterbalanced user study (N=12), v2 received higher ratings for perceived intelligence, immersion, mastery, coherence, and curiosity. Code is published. Main limits: small sample, prompt sensitivity, and API content filters.

Problem Statement

Solo tabletop-style role-playing needs a dependable Game Master (GM). Simple prompt-only LLMs can produce engaging text but struggle with long-term coherence, state tracking, and complex actions. The paper asks whether an agentic, tool-enabled ReAct design with a dedicated memory agent improves player experience over a prompt-only approach.

Main Contribution

Built ChatRPG v1 (prompt-engineered, state-in-prompt) and v2 (multi-agent ReAct with tools and persistent state).

Designed two specialized agents: Narrator (story & actions) and Archivist (persistent game state/memory).
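The division of labor between the two agents can be illustrated with a minimal sketch. It is not the authors' implementation: both agents are stubbed with plain functions instead of LLM calls, and the names `GameState`, `narrator_turn`, `archivist_turn`, and `play_turn` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GameState:
    """Hypothetical structured state owned by the Archivist."""
    characters: dict = field(default_factory=dict)  # name -> {"hp": int, ...}
    environment: str = "unknown"


def narrator_turn(player_input: str, state: GameState) -> str:
    """Stub Narrator: a real system would prompt an LLM with a state summary."""
    return f"In the {state.environment}, you {player_input}. A goblin appears."


def archivist_turn(narrative: str, state: GameState) -> GameState:
    """Stub Archivist: a real system would ask an LLM to extract state changes."""
    if "goblin" in narrative.lower() and "Goblin" not in state.characters:
        state.characters["Goblin"] = {"hp": 10}
    return state


def play_turn(player_input: str, state: GameState) -> tuple[str, GameState]:
    narrative = narrator_turn(player_input, state)  # storytelling only
    state = archivist_turn(narrative, state)        # state tracking only
    return narrative, state


state = GameState(characters={"Hero": {"hp": 20}}, environment="forest")
narrative, state = play_turn("draw your sword", state)
print(narrative)
print(sorted(state.characters))
```

The point of the split is that the Narrator never edits state directly. It only reads the state, while the Archivist is the single writer.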

Key Findings

Players rated the agentic v2 higher on multiple engagement measures.

Numbers: N=12; 9/14 constructs significant; example: Mastery 0.68→2.33 (p=0.004)

Practical Use: Use a ReAct multi-agent design with a memory agent to improve immersion, perceived intelligence, and player mastery in text RPGs.

Evidence Ref: Table 1 (PXI constructs) and comparative study section

v2 achieved statistically significant improvements in coherence and immersion.

Numbers: Coherent story 1.00→2.25 (p=0.04); Immersion 1.64→2.42 (p=0.034)

Practical Use: Separate narration and state-tracking responsibilities to maintain story coherence during longer sessions.

Evidence Ref: Table 1 (Coherent story, Immersion rows)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Mastery (PXI)	v1 mean=0.68, v2 mean=2.33	v1	+1.65	N=12, paired test	t=-3.683, p=0.004	Table 1
Coherent story	v1 mean=1.00, v2 mean=2.25	v1	+1.25	N=12, paired test	t=-2.322, p=0.04	Table 1
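As a sanity check on the table, the reported p-values can be recomputed from the t statistics. The sketch below assumes a two-sided paired t-test with N-1 = 11 degrees of freedom. The table says only "paired test", so the exact test is an assumption. It uses only the standard library and integrates the Student's t density numerically.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t: float, df: int, steps: int = 10_000) -> float:
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # P(0 <= T <= |t|)
    return 2 * (0.5 - central_half)


DF = 12 - 1  # assumption: paired t-test on N=12 players
for name, t in [("Mastery", -3.683), ("Coherent story", -2.322)]:
    print(f"{name}: p = {two_sided_p(t, DF):.3f}")
```

Under that assumption, both recomputed values should land close to the p-values in the table, which suggests the reported statistics are internally consistent.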

What To Try In 7 Days

Prototype a prompt-only GM to validate basic narrative style and UI.

Add a single background process that stores structured JSON state instead of stuffing history into prompts.

Implement one ReAct tool (e.g., battle resolution) and few-shot examples to test action/tool wiring with an LLM.
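For the second step above, a minimal sketch of a structured JSON state store might look like the following. The file name, the state layout, and the helper names `load_state`, `save_state`, and `state_for_prompt` are all hypothetical. The sketch covers only the storage part, not the background process itself.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Return the saved state, or a fresh empty one if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"characters": {}, "environment": {}}


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file, then swap it in, so a crash cannot leave a partial file."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(state, tmp, indent=2)
    os.replace(tmp_name, path)


def state_for_prompt(state: dict) -> str:
    """Compact JSON snapshot to place in the prompt instead of the full chat history."""
    return json.dumps(state, separators=(",", ":"), sort_keys=True)


with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "campaign.json"
    state = load_state(path)
    state["characters"]["Hero"] = {"hp": 20, "location": "tavern"}
    save_state(path, state)
    reloaded = load_state(path)
    print(state_for_prompt(reloaded))
```

Only the compact snapshot goes into the prompt each turn, so the prompt no longer has to carry the full conversation.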

Agent Features

Memory

Archivist persistent JSON game state (structured memory)

Planning

ReAct: Thought-Action-Observation trajectories
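A Thought-Action-Observation loop can be sketched as follows. This is an illustrative skeleton rather than the ChatRPG code: the LLM is replaced by a scripted stub so the loop runs offline, the text format (`Thought:`, `Action:`, `Action Input:`, `Final Answer:`) follows the common ReAct convention, and the `RollDice` tool is hypothetical.

```python
import re
from typing import Callable

# Hypothetical scripted "LLM": returns canned steps so the loop runs offline.
SCRIPT = iter([
    "Thought: The player attacks, so I need a dice roll.\nAction: RollDice\nAction Input: 20",
    "Thought: I have the result.\nFinal Answer: Your blade strikes true.",
])


def fake_llm(prompt: str) -> str:
    return next(SCRIPT)


def roll_dice(sides: str) -> str:
    """Hypothetical tool; a fixed result keeps the example deterministic."""
    return f"rolled 17 on a d{sides}"


TOOLS: dict[str, Callable[[str], str]] = {"RollDice": roll_dice}
ACTION_RE = re.compile(r"Action:\s*(\w+)\s*\nAction Input:\s*(.*)")


def react(question: str, llm: Callable[[str], str], max_steps: int = 5) -> tuple[str, list[str]]:
    trajectory = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm("\n".join(trajectory))  # model sees the whole trajectory so far
        trajectory.append(step)
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip(), trajectory
        match = ACTION_RE.search(step)
        if match is None:
            trajectory.append("Observation: could not parse an action.")
            continue
        name, arg = match.group(1), match.group(2).strip()
        tool = TOOLS.get(name)
        result = tool(arg) if tool else f"unknown tool {name}"
        trajectory.append(f"Observation: {result}")  # fed back on the next step
    return "No answer within the step limit.", trajectory


answer, trajectory = react("I swing my sword at the goblin.", fake_llm)
print(answer)
```

The step limit guards against a model that never produces a final answer, and parse failures are returned to the model as observations instead of raising.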

Tool Use

JSON tool calls: Battle, WoundCharacter, HealCharacter, UpdateCharacter, UpdateEnvironment
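To make the JSON tool-call idea concrete, here is a small dispatcher sketch. The tool names `WoundCharacter` and `HealCharacter` come from the list above. The JSON envelope (`tool`, `args`), the argument names, and the state layout are assumptions made for illustration, not the schema used in the paper.

```python
import json

state = {"characters": {"Hero": {"hp": 12, "max_hp": 20}}}


def wound_character(name: str, damage: int) -> str:
    character = state["characters"][name]
    character["hp"] = max(0, character["hp"] - damage)  # never below zero
    return f"{name} has {character['hp']} HP left"


def heal_character(name: str, amount: int) -> str:
    character = state["characters"][name]
    character["hp"] = min(character["max_hp"], character["hp"] + amount)  # capped
    return f"{name} has {character['hp']} HP"


# Tool names come from the list above; argument names are assumptions.
TOOLS = {"WoundCharacter": wound_character, "HealCharacter": heal_character}


def dispatch(raw_call: str) -> str:
    """Parse a JSON tool call and run it; errors become observations, not crashes."""
    try:
        call = json.loads(raw_call)
        tool = TOOLS[call["tool"]]
        return tool(**call["args"])
    except json.JSONDecodeError:
        return "error: tool call was not valid JSON"
    except (KeyError, TypeError) as exc:
        return f"error: bad tool call ({exc!r})"


print(dispatch('{"tool": "WoundCharacter", "args": {"name": "Hero", "damage": 5}}'))
print(dispatch('{"tool": "HealCharacter", "args": {"name": "Hero", "amount": 50}}'))
print(dispatch('{"tool": "CastSpell", "args": {}}'))
```

Returning errors as strings lets the agent see the failure as an observation and retry with a corrected call.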

Frameworks

LangChain (pattern referenced)

OpenAI API (LLM provider)

Is Agentic

Yes

Architectures

multi-agent (Narrator + Archivist)

Collaboration

separation of duties: narrator (narrative+actions) vs archivist (state updates)

Optimization Features

Token Efficiency

reduces token growth by storing structured state externally instead of entire conversation
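A rough back-of-the-envelope comparison shows why external state limits token growth. The numbers below (300 tokens per turn, an 800-token state snapshot) are illustrative assumptions, not measurements from the paper, and the snapshot is treated as roughly constant in size.

```python
def history_tokens(turn: int, tokens_per_turn: int) -> int:
    """Prompt size when every turn so far is replayed in the prompt."""
    return turn * tokens_per_turn


def state_tokens(tokens_per_turn: int, state_size: int) -> int:
    """Prompt size when only a state snapshot plus the latest turn is sent."""
    return state_size + tokens_per_turn


TOKENS_PER_TURN = 300  # assumed average, not from the paper
STATE_SIZE = 800       # assumed snapshot size, not from the paper

for turn in (1, 10, 50):
    full = history_tokens(turn, TOKENS_PER_TURN)
    compact = state_tokens(TOKENS_PER_TURN, STATE_SIZE)
    print(f"turn {turn:>2}: full history = {full:>6}, state snapshot = {compact}")
```

Under these assumptions the full-history prompt grows linearly with the number of turns while the snapshot prompt stays flat. The snapshot costs more on the very first turns and wins as the session gets longer.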

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/KarmaKamikaze/ChatRPG

Risks & Boundaries

Limitations

Small sample size (N=12) limits statistical power and external validity.

Behavior is sensitive to prompt wording and few-shot examples, so the system is brittle and needs careful tuning.

When Not To Use

When you need deterministic, exactly reproducible outputs across runs.

When content moderation will block core gameplay (e.g., violent fantasy) and no alternative model is available.

Failure Modes

Hallucinations that introduce inconsistent facts or NPC traits.

Context window overflow if state is not properly summarized or persisted.

Core Entities

Models

GPT-4 (OpenAI)

Metrics

Player Experience Inventory (PXI) constructs: Ease of control, Mastery, Immersion, Curiosity, Cohere…

Custom survey items (story interest, likely to play again, satisfaction)

