HiMPo: learn feudal (multi-level) message-passing policies and use upper-level advantage signals to train lower levels.

July 31, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.4

Citation Count

0

Authors

Tommaso Marzi, Cesare Alippi, Andrea Cini


Why It Matters For Business

If your product uses many cooperating agents (robot fleets, multi-robot exploration, distributed sensors), HiMPo offers a way to combine hierarchical planning with local message-passing. In the reported experiments, this reduces greedy, short-term behavior and improves coordination without handcrafting low-level rewards.

Summary TLDR

This paper introduces HiMPo, a method that combines feudal hierarchical reinforcement learning (HRL) with message-passing (graph) communication. Upper-level managers send goals to lower levels; lower levels are trained using reward signals derived from the upper-level advantage function (a measure of how good a high-level goal was). The authors prove these level-specific rewards align with the global task and show empirical gains over strong PPO-based baselines on three multi-agent benchmarks: a custom Level-Based Foraging with Survival (LBFwS) environment, VMAS Sampling, and SMACv2. The implementation, HiMPPO, is built on PPO.
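The reward idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the one-step form of the advantage over a goal window, and the choice to hand every worker the same scalar are hypothetical and are not taken from the paper.

```python
def manager_advantage(window_rewards, v_start, v_end, gamma=0.99):
    """One-step advantage of a manager's goal (illustrative form).

    window_rewards: environment rewards collected while the goal was active.
    v_start, v_end: the manager's value estimates at the start and end
                    of the goal window (assumed to come from a critic).
    """
    discounted = 0.0
    for k, r in enumerate(window_rewards):
        discounted += (gamma ** k) * r
    bootstrap = (gamma ** len(window_rewards)) * v_end
    return discounted + bootstrap - v_start


def worker_rewards(advantage, worker_ids):
    """Assumption: each worker under the manager gets the same scalar reward."""
    return {w: advantage for w in worker_ids}


adv = manager_advantage([1.0, 1.0], v_start=0.5, v_end=0.0, gamma=0.5)
rewards = worker_rewards(adv, ["w0", "w1"])
```

In this sketch the workers never see the environment reward directly; they only see how much better than expected the manager's goal turned out to be.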

Problem Statement

Decentralized multi-agent learners face partial observability and non-stationarity that hurt coordination and long-horizon planning. Existing hierarchical designs need ad-hoc, hand-crafted intrinsic rewards for low levels and are hard to scale in multi-agent settings with message passing.

Main Contribution

A new feudal HRL method for multi-agent systems that integrates message-passing (graph) policies across multiple levels (HiMPo).

A practical reward-assignment scheme: lower-level rewards are based on the advantage function of the immediate upper level, avoiding manual reward shaping.

Theoretical proofs showing level-specific objectives are aligned with the joint/global return under stated assumptions.

PPO-based implementation (HiMPPO) and empirical validation on LBFwS, VMAS Sampling, and SMACv2; multiple ablations examine topology, hierarchy depth, and reward schemes.

Key Findings

HiMPPO sustains coordinated, non-greedy team strategies on a hard cooperative foraging task when baselines fail.

Numbers: LBFwS: 10 agents; experiments averaged over 8 runs; on LBFwS-Hard only HiMPPO avoided greedy individual play (Fig. 2).

A dynamic 3-level hierarchical graph improves exploration and final return in a continuous multi-robot sampling task.

Numbers: VMAS Sampling: K=2, α=5; results averaged over 6 runs; dynamic 3-level HiMPPO outperformed static-3-level and 2-level Hi

Advantage-based local rewards (no direct external reward exposure to workers) yield better sample efficiency than exposing workers to environment rewards or giving only external signals.

Numbers: Ablations in VMAS Sampling: comparison across HiMPPO, HiMPPO-FL, HiMPPO-NL, HiMPPO-ER averaged over 6 runs (Fig. 6) show:

Results

LBFwS performance (Easy/Medium/Hard)

Value: HiMPPO achieves high returns across difficulties; baselines fail on Hard by converging to greedy behavior

Baseline: IPPO, MAPPO, GPPO

VMAS Sampling average return

Value: HiMPPO outperforms baselines, with larger gains for bigger team sizes

Baseline: IPPO, MAPPO, GPPO

SMACv2 win rate

Value: HiMPPO achieves competitive win-rate compared to IPPO and MAPPO but is less sample efficient

Baseline: IPPO, MAPPO, GPPO

Ablation: communication topology

Value: HiMPPO beats fixed-topology GPPO variants (star, complete, path, cycle) on larger systems

Baseline: GPPO variants

Ablation: reward scheme

Value: Advantage-based local rewards (HiMPPO) improve sample efficiency; removing local rewards or exposing workers to external

Baseline: HiMPPO-FL, HiMPPO-NL, HiMPPO-ER

Who Should Care

What To Try In 7 Days

Re-implement top-down goals over your current decentralized PPO agents: add a manager that emits Gaussian goals every α steps and train workers to maximize manager-derived advantag

If agents are mobile, build a dynamic hierarchy from agent positions (e.g., spatial clustering) and compare static vs dynamic partitions.

Run an ablation: replace worker environment reward with manager-advantage-based intrinsic reward and check sample efficiency.
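For the first suggestion, a manager that holds a goal for α steps can be prototyped without any RL library. The sketch below is a hypothetical helper, not the authors' code: the class name `GoalEmitter` is invented, and `mean` and `std` are passed in by hand where a real manager policy would output them from a network.

```python
import random


class GoalEmitter:
    """Hypothetical manager head: samples a Gaussian goal every alpha steps."""

    def __init__(self, alpha, seed=0):
        self.alpha = alpha
        self.rng = random.Random(seed)
        self.goal = None
        self.num_samples = 0

    def step(self, t, mean, std):
        # Resample only at the start of each alpha-step window;
        # otherwise keep returning the goal that is already active.
        if self.goal is None or t % self.alpha == 0:
            self.goal = [self.rng.gauss(m, std) for m in mean]
            self.num_samples += 1
        return self.goal


emitter = GoalEmitter(alpha=5)
goals = [emitter.step(t, mean=[0.0, 1.0], std=0.1) for t in range(10)]
```

Over ten steps with α=5 the emitter samples twice, so workers condition on the same goal for five consecutive steps before it changes.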

Agent Features

Memory

  • temporal abstraction via multi-step goals (goal lasts α steps)

Planning

  • hierarchical planning across time scales (α, K)
  • spatio-temporal task decomposition

Tool Use

  • RL
  • GNN message functions (MLP/GCN)

Frameworks

  • RL

Is Agentic

true

Architectures

  • feudal hierarchical policies (manager – sub-manager – worker)
  • message-passing Graph Neural Networks
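One way to prototype a position-based hierarchy is to treat sub-managers as cluster centers and attach each worker to the nearest one. The sketch below uses a few k-means-style iterations in plain Python; the function names and the choice of k-means are assumptions for illustration, and the paper's own partitioning procedure may differ.

```python
import math


def assign_to_nearest(positions, centers):
    """Map each worker index to the index of its closest sub-manager center."""
    groups = {c: [] for c in range(len(centers))}
    for agent, p in enumerate(positions):
        nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        groups[nearest].append(agent)
    return groups


def recenter(positions, groups, centers):
    """Move each center to the mean position of its workers (keep empty ones)."""
    new_centers = []
    for c, members in groups.items():
        if not members:
            new_centers.append(centers[c])
            continue
        dim = len(centers[c])
        new_centers.append(tuple(
            sum(positions[a][d] for a in members) / len(members)
            for d in range(dim)
        ))
    return new_centers


def build_hierarchy(positions, centers, iters=5):
    """Return (worker groups per sub-manager, final centers)."""
    for _ in range(iters):
        groups = assign_to_nearest(positions, centers)
        centers = recenter(positions, groups, centers)
    return assign_to_nearest(positions, centers), centers


positions = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
groups, centers = build_hierarchy(positions, [(0.0, 0.0), (10.0, 10.0)])
```

A static variant would call `build_hierarchy` once at the start of an episode; a dynamic variant would call it again as agents move, with a single top-level manager sitting above all sub-managers in both cases.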

Collaboration

  • local message exchange between neighbors
  • managers aggregate local worker returns

Optimization Features

Infra Optimization

  • experiments run on commodity CPUs; training time varies per env (hours–days)

Model Optimization

  • policy-gradient optimization (PPO) at each level

System Optimization

  • shared message/update functions across nodes at the same level to limit parameter count
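Parameter sharing in message passing means every node at a level calls the same two functions, so the parameter count does not grow with the number of agents. A minimal sketch, assuming mean aggregation and with simple placeholder functions standing in for the learned MLP/GCN modules:

```python
def message_passing_round(features, neighbors, message_fn, update_fn):
    """One synchronous round; message_fn and update_fn are shared by all nodes."""
    new_features = {}
    for node, h in features.items():
        msgs = [message_fn(features[n], h) for n in neighbors.get(node, [])]
        if msgs:
            # Mean aggregation over incoming messages (an assumption here).
            agg = [sum(vals) / len(msgs) for vals in zip(*msgs)]
        else:
            agg = [0.0] * len(h)
        new_features[node] = update_fn(h, agg)
    return new_features


# Placeholder functions; in practice these would be learned networks.
def identity_message(sender, receiver):
    return sender


def average_update(h, agg):
    return [(a + b) / 2 for a, b in zip(h, agg)]


# Path graph 0 - 1 - 2 with one scalar feature per node.
features = {0: [0.0], 1: [2.0], 2: [4.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
out = message_passing_round(features, neighbors, identity_message, average_update)
```

Adding a fourth node to the graph changes only `features` and `neighbors`; the two shared functions stay the same.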

Training Optimization

  • concurrent updates of all levels for sample efficiency
  • GAE for advantage estimation; λ=0.95 for workers and λ=0 for upper levels
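The effect of the two λ settings is easy to see in a small Generalized Advantage Estimation routine. This is a generic textbook-style sketch in plain Python, not the authors' implementation; it ignores episode termination flags for brevity.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory segment.

    rewards[t], values[t]: reward and value estimate at step t.
    last_value: value estimate of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD residual at step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards, values = [1.0, 1.0], [0.5, 0.5]
one_step = gae(rewards, values, last_value=0.0, gamma=1.0, lam=0.0)
full_sum = gae(rewards, values, last_value=0.0, gamma=1.0, lam=1.0)
```

With λ=0 each advantage is just the one-step TD residual, which is the setting listed for upper levels; larger λ, such as the 0.95 listed for workers, mixes in residuals from later steps.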

Inference Optimization

  • decentralized execution: workers act on local observations and received goals

Reproducibility

Data Urls

  • VMAS simulator repo (VMAS)
  • Level-Based Foraging codebase
  • SMACv2 repo

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Upper levels operate at fixed time scales (α, K) in current implementation; no adaptive/asynchronous goal timing.
  • 3-level hierarchies assume cooperative settings; deeper hierarchies not evaluated in mixed-competitive scenarios.
  • Training evaluates trajectories using the initial hierarchy G*0 (training-time static evaluation), which simplifies dynamic topology learning but may hide topology changes during optimization.
  • No public code release stated; reproducing exact configs relies on many implementation details in appendix.

When Not To Use

  • Small fully observable cooperative tasks where independent PPO already suffices (e.g., SMACv2 maps where independent learning performs well).
  • Strictly adversarial competitions where hierarchy assumptions or cooperative partitions break down (unless using a 2-level design).
  • Settings where you cannot tune α, K or cannot afford added architectural complexity.

Failure Modes

  • If hierarchical rewards are misspecified, sub-managers may compete or induce greedy behavior.
  • Fully connecting nodes (complete graph) can make optimization unstable, per ablations.
  • High truncation values for advantage-like rewards complicate credit assignment and can degrade performance.

Core Entities

Models

  • HiMPo
  • HiMPPO
  • GPPO
  • MAPPO
  • IPPO
  • PPO

Metrics

  • average return
  • win rate
  • sample standard deviation

Datasets

  • LBFwS (Level-Based Foraging with Survival)
  • VMAS Sampling
  • SMACv2

Benchmarks

  • LBFwS
  • VMAS Sampling
  • SMACv2

Context Entities

Models

  • RL
  • Graph Neural Networks (GNNs)
  • Generalized Advantage Estimation (GAE)

Metrics

  • training time (hours/days)
  • sample efficiency

Datasets

  • original LBF (Level-Based Foraging)
  • VMAS simulator
  • SMAC / SMACv2

Benchmarks

  • Level-Based Foraging
  • StarCraft Multi-Agent Challenge (SMACv2)