Replace slow online search with learned hierarchical agents to cut responder reallocation time from minutes to fractions of a second

Overview

Decision Snapshot: Ready For Pilot

The method shows a large latency reduction and modest response-time gains on real datasets; training complexity and city-specific tuning limit immediate turnkey use in new cities.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Amutheezan Sivagnanam, Ava Pettet, Hunter Lee, Ayan Mukhopadhyay, Abhishek Dubey, Aron Laszka

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Switching from search-based planners to learned hierarchical agents delivers sub-second reallocation decisions and modestly shorter ambulance response times on real-city data, enabling practical real-time deployment and lower operational latency.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Data Scientist

Summary TLDR

The paper replaces an MCTS-based hierarchical planner for emergency responder repositioning with learned actor-critic agents. Low-level agents use TransformerXL to handle a variable number of responders per region; a high-level agent allocates responders across regions and uses the low-level critics to estimate rewards. Continuous actor outputs are discretized with combinatorial solvers (max-weight matching and min-cost flow). On real data from Nashville and Seattle, the learned system lowers decision latency from about 3 minutes to about 0.22 seconds and reduces average response times by several seconds relative to MCTS. Training is the main cost: it takes days per agent.
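
As a rough illustration of the discretization step, the sketch below turns a matrix of continuous scores into a one-to-one responder-to-depot assignment by maximum-weight matching. The score matrix layout, the function name `discretize_by_matching`, and the brute-force search are assumptions made for this example; the paper's own actor output format and solver may differ, and a polynomial-time matching algorithm would be needed beyond toy sizes.

```python
from itertools import permutations


def discretize_by_matching(scores):
    """Pick a distinct depot for each responder, maximizing the total score.

    scores[i][j] is a hypothetical continuous preference of responder i
    for depot j. Brute force over all assignments: toy sizes only.
    """
    n_responders = len(scores)
    n_depots = len(scores[0])
    if n_responders > n_depots:
        raise ValueError("need at least as many depots as responders")

    best_total, best_assignment = float("-inf"), None
    for depots in permutations(range(n_depots), n_responders):
        total = sum(scores[i][d] for i, d in enumerate(depots))
        if total > best_total:
            best_total, best_assignment = total, list(depots)
    return best_assignment, best_total


# Illustrative scores only (2 responders, 3 depots), not from the paper.
scores = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
]
assignment, total = discretize_by_matching(scores)
print(assignment)  # [1, 0]
```

Here a greedy choice would send responder 0 to depot 0 (score 0.90) and leave responder 1 with a poor option; the matching instead gives depot 0 to responder 1, which yields a higher total.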

Problem Statement

Proactively repositioning ambulances is a combinatorial, time-sensitive decision problem. The current hierarchical planner uses Monte-Carlo tree search (MCTS) and can take minutes per decision, which is too slow for emergency response. The challenge is to make decisions near-instantly without losing solution quality in a huge, variable, discrete state-action space.

Main Contribution

A hierarchical multi-agent RL system that replaces MCTS low- and high-level planners with actor-critic agents to make near-instant reallocation decisions.

Design of fixed-size feature projections and TransformerXL-based low-level actors to handle variable numbers of responders per region.

Key Findings

Decision latency cut from minutes to fractions of a second.

Numbers: 0.22 s per decision vs 3 min (≈180 s) for MCTS

Practical Use: Replace MCTS with the learned actor to achieve real-time reallocation (sub-second latency) in live ERM systems.

Evidence Ref: Sec 4.4; Fig. 4

Average response time reduced versus MCTS on evaluated city data.

Numbers: Nashville: −5 s (5-region) to −13 s (7-region); Seattle: −10 s on average

Practical Use: Deploying learned hierarchical agents can shave multiple seconds from expected ambulance response times on historical workloads.

Evidence Ref: Sec 4.4; Fig. 4, Fig. 10

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Decision computation time per decision	0.22 s (DDPG agent)	3 min (MCTS)	≈818× faster than MCTS	Nashville evaluation	Sec 4.4 reports 0.22s vs 3 minutes for MCTS	Sec 4.4; Fig.4
Average response time change vs MCTS	−5s (5-region) to −13s (7-region) in Nashville	MCTS	savings of 5–13 seconds	Nashville (10 chains)	Sec 4.4; Fig.4a–c	Sec 4.4; Fig.4
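
The speedup in the first row follows directly from the two reported latencies; a minimal check, using only the figures quoted in the table (variable names are illustrative):

```python
# Latencies as reported in the table above.
mcts_seconds = 3 * 60        # 3 minutes per decision
learned_seconds = 0.22       # learned agent, seconds per decision

speedup = mcts_seconds / learned_seconds
print(f"{speedup:.0f}x")  # 818x
```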

What To Try In 7 Days

Run the provided code on one held-out historical incident chain to measure latency and response-time deltas.

Swap an MCTS low-level planner with the pre-trained TrXL low-level actor and measure end-to-end decision latency.

Validate the reward-estimation trick by comparing high-level actions using LLP critics versus raw response-time rewards on a single region.
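
For the latency measurements suggested above, a small timing harness is usually enough. The sketch below is one possible shape, assuming the planner under test can be wrapped as a Python callable that takes a state; `measure_latency` and `dummy_policy` are hypothetical names, and the dummy policy is only a stand-in for the released agents or an MCTS planner.

```python
import statistics
import time


def measure_latency(policy, states, repeats=3):
    """Time policy(state) for every state and summarize in seconds."""
    samples = []
    for state in states:
        for _ in range(repeats):
            start = time.perf_counter()
            policy(state)
            samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "n": len(samples),
        "mean_s": statistics.fmean(samples),
        "median_s": statistics.median(samples),
        "max_s": samples[-1],
    }


def dummy_policy(state):
    # Hypothetical stand-in: replace with a call into the real planner.
    return sorted(state)


stats = measure_latency(dummy_policy, [[3, 1, 2], [5, 4]], repeats=2)
print(stats["n"])  # 4
```

Reporting the median and maximum alongside the mean helps show whether occasional slow decisions would matter in a live setting.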

Agent Features

Memory

experience replay buffer
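
A generic experience replay buffer, of the kind commonly paired with DDPG, can be sketched as below. This is a textbook-style illustration, not the authors' implementation; the class name, transition layout, and capacity are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # deque with maxlen drops the oldest transition once full.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(state=step, action=0.0, reward=-1.0, next_state=step + 1, done=False)
batch = buffer.sample(2)
print(len(buffer))  # 3
```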

Planning

hierarchical coordination; high-level region allocation; low-level depot assignment

Tool Use

maximum-weight matching; minimum-cost flow; contraction hierarchies; OSRM

Frameworks

hierarchical decomposition (regions / depots); DDPG training loop

Is Agentic

Yes

Architectures

actor-critic; DDPG; TransformerXL (TrXL) actor; MLP critics

Collaboration

independent low-level agents coordinated by a high-level agent; low-level critics used to inform high-level rewards

Optimization Features

Infra Optimization

CPU-based training reported; training time varies by region size

Model Optimization

architecture search for TrXL hyperparameters

System Optimization

map variable-sized state to fixed-size features; use TrXL attention to handle variable responder counts
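
One simple way to map a variable-length input to a fixed-size vector is to pool it into summary statistics. The sketch below illustrates that general idea only; the paper's actual feature projection is not described in this summary, and `pool_features` and its choice of statistics are assumptions.

```python
def pool_features(values):
    """Summarize a variable-length list as [count, mean, min, max].

    `values` could be, for example, per-responder travel times in a region
    (a hypothetical input). The output length is always 4.
    """
    if not values:
        # Empty region: return zeros so the vector size stays fixed.
        return [0.0, 0.0, 0.0, 0.0]
    return [
        float(len(values)),
        sum(values) / len(values),
        float(min(values)),
        float(max(values)),
    ]


print(pool_features([2, 4, 6]))  # [3.0, 4.0, 2.0, 6.0]
print(pool_features([]))         # [0.0, 0.0, 0.0, 0.0]
```

Pooling discards per-responder detail, which is one reason an attention-based model may be preferred where individual responders need to be distinguished.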

Training Optimization

experience replay; actor-critic (DDPG); reward estimation via LLP critics

Inference Optimization

continuous actor outputs for fast forward pass (sub-second inference); discretization via fast combinatorial solvers

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://doi.org/10.6084/m9.figshare.25872640

Data URLs

https://doi.org/10.6084/m9.figshare.25872640

https://data.seattle.gov/dataset/Collisions-All-Years/pfun-q57u

Risks & Boundaries

Limitations

Training low-level agents can take many days for regions with many depots (up to 14 days).

Approach depends on incident-rate forecasts and time-varying travel models; quality of inputs affects outcomes.

When Not To Use

You lack historical incident data or travel-time models to train and evaluate the agents.

You cannot afford multiple days of offline training per region topology.

Failure Modes

Inaccurate low-level critics misestimate regional value, leading to bad high-level reallocations.

Actor outputs map poorly under the discretization step in rare combinatorial edge cases.

Core Entities

Models

DDPG; TransformerXL (TrXL); GTrXL; LSTM; MCTS; DRLSN

Metrics

average response time; decision computation time; training convergence time

Datasets

Nashville ERM dataset (36 depots, 26 responders; 60 chains)

Seattle public collisions dataset (City of Seattle 2022; 60 chains)

Benchmarks

MCTS (Pettet et al., 2022); p-median-based policy; greedy policy; static policy; DRLSN (Ji et al., 2019)

Context Entities

Models

Feudal / hierarchical RL; Attention-based coordination (TrXL)

Metrics

p-values from paired permutation tests
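
A paired permutation test compares two policies evaluated on the same incident chains by randomly swapping the labels within each pair. The sketch below is a minimal exact version (all sign flips enumerated), assuming a two-sided test on the mean difference; the function name is hypothetical, the numbers are synthetic and not taken from the paper, and the paper's exact test statistic may differ.

```python
from itertools import product


def paired_permutation_test(a, b):
    """Exact two-sided paired permutation (sign-flip) test on the mean difference.

    Enumerates all 2**n sign assignments, so it is practical only for small n.
    """
    if len(a) != len(b):
        raise ValueError("samples must be paired and of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n

    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if stat >= observed - 1e-12:  # tolerance for float rounding
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Synthetic response times in seconds for 10 chains (illustration only).
baseline = [300, 280, 310, 295, 305, 290, 285, 315, 300, 298]
candidate = [t - 5 for t in baseline]  # uniformly 5 s faster on every chain

p_value = paired_permutation_test(baseline, candidate)
print(p_value)  # 0.001953125
```

With every chain improving by the same amount, only the all-positive and all-negative sign patterns match the observed statistic, giving 2 / 1024.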

Datasets

Historical incident chains and rate forecasts

Benchmarks

Prior hierarchical planning (Pettet et al., 2022)

