Replace slow online search with learned hierarchical agents to cut responder reallocation time from minutes to fractions of a second

May 21, 20249 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Amutheezan Sivagnanam, Ava Pettet, Hunter Lee, Ayan Mukhopadhyay, Abhishek Dubey, Aron Laszka

Links

Abstract / PDF

Why It Matters For Business

Switching from search-based planners to learned hierarchical agents delivers sub-second reallocation decisions and modestly shorter ambulance response times on real-city data, enabling practical real-time deployment and lower operational latency.

Summary TLDR

The paper replaces an MCTS-based hierarchical planner for emergency responder repositioning with learned actor-critic agents. Low-level agents use TransformerXL to handle a variable number of responders per region; high-level agent allocates responders across regions and uses low-level critics to estimate rewards. Continuous actor outputs are discretized with combinatorial solvers (max-weight matching and min-cost flow). On real data from Nashville and Seattle the learned system lowers decision latency from ~3 minutes to ~0.22 seconds and reduces average response times by several seconds vs MCTS, while training requires days per agent.

Problem Statement

Proactively repositioning ambulances is a combinatorial, time-sensitive decision problem. The current hierarchical planner uses Monte-Carlo tree search (MCTS) and can take minutes per decision, which is too slow for emergency response. The challenge is to make decisions near-instantly without losing solution quality in a huge, variable, discrete state-action space.

Main Contribution

A hierarchical multi-agent RL system that replaces MCTS low- and high-level planners with actor-critic agents to make near-instant reallocation decisions.

Design of fixed-size feature projections and TransformerXL-based low-level actors to handle variable numbers of responders per region.

Combinatorial discretization: max-weight matching (low level) and minimum-cost flow (high level) to convert continuous actor outputs to valid allocations.

A reward-estimation trick for the high-level agent: use low-level critics, weighted by regional incident rates, to reduce reward noise.

Empirical evaluation on real-world Nashville and Seattle data showing large latency reduction and modest response-time gains.

Open-source code and data release for reproduction and baselines.

Key Findings

Decision latency cut from minutes to fractions of a second.

Numbers0.22s per decision vs 3 min (≈180s) for MCTS

Average response time reduced versus MCTS on evaluated city data.

NumbersNashville: −5s (5-region) to −13s (7-region); Seattle: −10s on average

Centralized learned policy performs much worse than hierarchical variant.

NumbersCentralized variant is +66s worse on evaluated chains

Training cost and time are non-trivial.

NumbersLow-level agents: ~1 day up to 14 days; High-level agent: ~2 days

Policies are robust to moderate observation noise.

NumbersLittle performance loss up to ±20% multiplicative noise

Results

Decision computation time per decision

Value0.22 s (DDPG agent)

Baseline3 min (MCTS)

Average response time change vs MCTS

Value−5s (5-region) to −13s (7-region) in Nashville

BaselineMCTS

Average response time change vs MCTS (Seattle)

Value−10s on average

BaselineMCTS

Centralized vs Hierarchical

Value+66 s (centralized worse)

BaselineHierarchical learned agent

Training time (convergence)

ValueLow-level: ~1 day (≤8 depots) to 14 days; High-level: ~2 days

Who Should Care

What To Try In 7 Days

Run the provided code on one held-out historical incident chain to measure latency and response-time deltas.

Swap an MCTS low-level planner with the pre-trained TrXL low-level actor and measure end-to-end decision latency.

Validate the reward-estimation trick by comparing high-level actions using LLP critics versus raw response-time rewards on a single region.

Agent Features

Memory

  • experience replay buffer

Planning

  • hierarchical coordination
  • high-level region allocation
  • low-level depot assignment

Tool Use

  • maximum-weight matching
  • minimum-cost flow
  • contraction hierarchies
  • OSRM

Frameworks

  • hierarchical decomposition (regions / depots)
  • DDPG training loop

Is Agentic

true

Architectures

  • actor-critic
  • DDPG
  • TransformerXL (TrXL) actor
  • MLP critics

Collaboration

  • independent low-level agents coordinated by a high-level agent
  • low-level critics used to inform high-level rewards

Optimization Features

Infra Optimization

  • CPU-based training reported; training time varies by region size

Model Optimization

  • architecture search for TrXL hyperparameters

System Optimization

  • map variable-sized state to fixed-size features
  • use TrXL attention to handle variable responder counts

Training Optimization

  • experience replay
  • actor-critic (DDPG)
  • reward estimation via LLP critics

Inference Optimization

  • continuous actor outputs for fast forward pass (sub-second inference)
  • discretization via fast combinatorial solvers

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Training low-level agents can take many days for regions with many depots (up to 14 days).
  • Approach depends on incident-rate forecasts and time-varying travel models; quality of inputs affects outcomes.
  • Assumes depots modeled as single capacity by default; multi-capacity depots require modeling workarounds.
  • High-level rewards rely on low-level critics; critic bias can propagate to poor region allocations.

When Not To Use

  • You lack historical incident data or travel-time models to train and evaluate the agents.
  • You cannot afford multiple days of offline training per region topology.
  • You need a controller that guarantees optimality under adversarial or highly nonstationary demand without retraining.

Failure Modes

  • Inaccurate low-level critics misestimate regional value, leading to bad high-level reallocations.
  • Actor outputs map poorly under the discretization step in rare combinatorial edge cases.
  • Unseen, extreme demand patterns degrade learned policies compared to planning-based methods.

Core Entities

Models

  • DDPG
  • TransformerXL (TrXL)
  • GTrXL
  • LSTM
  • MCTS
  • DRLSN

Metrics

  • average response time
  • decision computation time
  • training convergence time

Datasets

  • Nashville ERM dataset (36 depots, 26 responders; 60 chains)
  • Seattle public collisions dataset (City of Seattle 2022; 60 chains)

Benchmarks

  • MCTS (Pettet et al., 2022)
  • p-median-based policy
  • greedy policy
  • static policy
  • DRLSN (Ji et al., 2019)

Context Entities

Models

  • Feudal / hierarchical RL
  • Attention-based coordination (TrXL)

Metrics

  • p-values from paired permutation tests

Datasets

  • Historical incident chains and rate forecasts

Benchmarks

  • Prior hierarchical planning (Pettet et al., 2022)