Share a recurrent transformer memory so agents coordinate implicitly and solve long narrow‑corridor pathfinding

January 22, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The method shows solid empirical gains on benchmarks and ships with a public codebase, but its assumptions (perfect localization, synchronized moves) and the lack of theoretical guarantees limit immediate deployment in safety-critical systems.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SRMT offers a lightweight way to improve decentralized multi-robot coordination without centralized control; it can cut coordination failures and extend policies trained on small maps to larger deployments.

Who Should Care

Summary TLDR

SRMT adds a shared recurrent memory to transformer-based agent policies so agents can read/write a global workspace. On toy bottleneck tasks and the POGEMA benchmark, SRMT improves coordination versus non-sharing and communication baselines, generalizes to much longer corridors than seen in training, and scales to large lifelong MAPF scenarios. Code is available on GitHub for reproducible experiments.

Problem Statement

Coordinating many decentralized agents is hard because each agent sees only local observations and explicit communication protocols are costly or brittle. The paper asks: can a globally shared recurrent memory (a broadcast workspace) let decentralized transformer agents exchange information implicitly, avoid deadlocks, and generalize to larger map sizes?

Main Contribution

Shared Recurrent Memory Transformer (SRMT): a multi-agent transformer that pools each agent's recurrent memory and broadcasts it globally via cross-attention.

Empirical tests showing SRMT outperforms several MARL and memory/communication baselines on a two-agent Bottleneck task and is competitive on the POGEMA benchmark.
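
The pooling-and-broadcast step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-head scaled dot-product attention, omits the learned query/key/value projections a real transformer layer would have, and the names `read_shared_memory`, `queries`, and `memories` are hypothetical. Each agent's query attends over the list of all agents' memory vectors and receives a weighted mix of them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def read_shared_memory(queries, memories):
    """Let every agent attend over the pooled memories of all agents.

    queries:  one query vector per agent (assumption: derived from its own state)
    memories: one personal memory vector per agent; the shared pool is this list
    Returns one read-out vector per agent (an attention-weighted mix of memories).
    Learned projections are omitted to keep the sketch short.
    """
    dim = len(memories[0])
    scale = math.sqrt(dim)
    readouts = []
    for q in queries:
        weights = softmax([dot(q, m) / scale for m in memories])
        readout = [sum(w * m[i] for w, m in zip(weights, memories))
                   for i in range(dim)]
        readouts.append(readout)
    return readouts

# Two agents with toy 2-dimensional memories.
memories = [[1.0, 0.0], [0.0, 1.0]]
queries = [[0.0, 0.0], [10.0, 0.0]]
readouts = read_shared_memory(queries, memories)
```

In this toy case the first agent's zero query weights both memories equally, while the second agent's query aligns with the first memory and reads almost exclusively from it. No explicit message is sent; information moves only through what each agent wrote into its own memory.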

Key Findings

SRMT keeps near-perfect cooperative success on long corridors after training on short ones.

Numbers: trained on corridors of 3 to 30 cells; CSR ≈1.0 up to 400 cells, drops to 0.8 beyond 400

Practical Use: If you train on short layouts, SRMT can generalize to much longer narrow passages, so reuse of small-map training is feasible for longer deployments.

Evidence Ref: Figure 4; text in Section 4.1

Under sparse rewards SRMT maintains performance while other baselines fail.

Numbers: SRMT outperforms baselines on Sparse and Moving Negative reward settings (CSR/ISR/SoC metrics)

Practical Use: SRMT is a good option when reward signals are rare or delayed; the shared memory helps agents learn coordination with weak feedback.

Evidence Ref: Figure 3 and Appendix A.2
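
The three metrics named above can be computed from per-agent episode outcomes. The sketch below assumes the usual definitions: CSR is 1 only if every agent reached its goal, ISR is the share of agents that did, and SoC is the total number of steps taken by all agents. The exact definitions should be checked against the paper; the function and field names here are hypothetical.

```python
def episode_metrics(reached_goal, steps_to_goal):
    """Compute metrics for a single episode.

    reached_goal:  list of bools, one per agent
    steps_to_goal: list of ints, steps each agent took (assumption: the
                   episode length is used when an agent never arrives)
    """
    isr = sum(reached_goal) / len(reached_goal)   # share of successful agents
    csr = 1.0 if all(reached_goal) else 0.0       # all agents must succeed
    soc = sum(steps_to_goal)                      # lower is better
    return {"CSR": csr, "ISR": isr, "SoC": soc}

def average_metrics(episodes):
    """Average each metric over a list of per-episode metric dicts."""
    n = len(episodes)
    return {k: sum(e[k] for e in episodes) / n for k in ("CSR", "ISR", "SoC")}

# Two toy two-agent episodes: one full success, one partial failure.
episodes = [
    episode_metrics([True, True], [12, 15]),
    episode_metrics([True, False], [10, 50]),
]
summary = average_metrics(episodes)
```

With these toy inputs, `summary` holds a CSR of 0.5, an ISR of 0.75, and a mean SoC of 43.5, which shows how a single stuck agent lowers CSR far more sharply than ISR.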

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Cooperative Success Rate (CSR) | ≈1.0 up to corridor length 400, drops to 0.8 beyond 400 (Sparse reward) | RMT and other baselines | SRMT > baselines on Sparse and Moving Negative | Bottleneck task, evaluation corridors 5 to 1000 | Figure 4 and Section 4.1 | Figure 4
Top performance on Moving Negative reward | SRMT is top-1 for CSR/ISR/SoC up to corridor length 1000 | MAMBA, QPLEX, ATM, RATE, RRNN, RNN | SRMT outperforms all listed baselines | Bottleneck task with Moving Negative reward | Section 4.1 and Figure 4 | Figure 4

What To Try In 7 Days

Run the authors' SRMT code on POGEMA to reproduce baseline bottleneck results.

Swap in SRMT's shared-memory block for existing transformer agents to test coordination gains on your maps.

Combine SRMT with a simple heuristic planner (Follower-style) on high-congestion maps to see throughput gains.

Agent Features

Memory
Per-agent recurrent memory vectors; Global pooled memory broadcast each step; Memory head updates personal memory
Planning
Integration with heuristic planner (Follower) optional
Tool Use
ResNet spatial encoder; GPT-2 style attention block; Cross-attention to shared memory
Frameworks
POGEMA; Sample Factory; Hugging Face Transformers
Is Agentic

Yes

Architectures
Transformer-based policy with memory tokens; Shared recurrent memory (global workspace)
Collaboration
Implicit communication via shared memory; Decentralized training and execution (no central controller)

Optimization Features

Infra Optimization
Reported training runs on a single Tesla P100 (MAPF models ~1 hour per run)
Model Optimization
Memory tokens as compact recurrent state
System Optimization
Batching via Sample Factory for many environments
Training Optimization
Shared homogeneous policy across agents; Grid search for entropy coefficient and learning rate
Inference Optimization
Single forward pass with cross-attention to shared memory
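
The grid search over entropy coefficient and learning rate listed under Training Optimization can be sketched as below. This is an illustration only: the candidate values are made up, and `dummy_train_and_score` is a hypothetical stand-in for a real training run that would return an evaluation score such as CSR.

```python
import itertools

def grid_search(train_and_score, entropy_coeffs, learning_rates):
    """Evaluate every (entropy coefficient, learning rate) pair; keep the best."""
    best_config, best_score = None, float("-inf")
    for ent, lr in itertools.product(entropy_coeffs, learning_rates):
        score = train_and_score(entropy_coeff=ent, learning_rate=lr)
        if score > best_score:
            best_config = {"entropy_coeff": ent, "learning_rate": lr}
            best_score = score
    return best_config, best_score

def dummy_train_and_score(entropy_coeff, learning_rate):
    """Hypothetical stand-in for training: score peaks at an arbitrary point."""
    return -abs(entropy_coeff - 0.01) - 10 * abs(learning_rate - 1e-3)

# Candidate values are illustrative, not taken from the paper.
best_config, best_score = grid_search(
    dummy_train_and_score,
    entropy_coeffs=[0.001, 0.01, 0.1],
    learning_rates=[1e-4, 1e-3, 1e-2],
)
```

In practice `train_and_score` would launch a full training job per configuration, so the grid is usually kept small and each run is seeded for comparability.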

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Assumes perfect agent localization and mapping.

Assumes synchronized and accurate action execution.

When Not To Use

When agents have noisy localization or unreliable actuators.

When formal safety or liveness guarantees are required.

Failure Modes

Performance drops outside tested regimes (CSR falls once corridor length exceeds 400 in the Sparse setting).

Potential deadlocks if memory initialization or write/read policies are poor.

Core Entities

Models

SRMT; RMT; ATM; RATE; RATE_gen; RRNN; MAMBA; QPLEX; RNN; Empty; RHCR; Follower; MATS-LP; SCRIMP

Metrics

Cooperative Success Rate (CSR); Individual Success Rate (ISR); Sum-of-Costs (SoC); Average throughput; Performance (relative throughput); Pathfinding (optimality); Congestion; Cooperation; Out-of-Distribution; Scalability

Datasets

POGEMA; MovingAI; MovingAI-tiles; Bottleneck toy maps; Mazes; Random; Warehouse

Benchmarks

POGEMA benchmark; Bottleneck task; Lifelong MAPF (LMAPF)