CH-MARL: hierarchical multi-agent RL with real-time constraint enforcement to cut emissions and balance costs in maritime logistics

February 4, 2025 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

1

Authors

Saad Alqithami

Links

Abstract / PDF

Why It Matters For Business

CH-MARL offers a practical way to meet emission caps while coordinating multiple vessels. It can reduce fuel-related emissions and support regulatory compliance at modest engineering cost, but it needs pilot testing and constraint tuning before real deployment.

Summary TLDR

CH-MARL is a hierarchical multi-agent RL system that adds a real-time primal-dual constraint layer and a fairness-aware reward term to coordinate vessels and ports under global emission caps. In a digital twin with 8 ports and 5 vessels, CH-MARL variants that include emission caps and fairness converge to stable policies and reduce fuel use and emissions versus the baseline (total emissions 4.7304 in Run A versus 4.07152 in Run D, about 13.9% lower). The method is a prototype validated only in simulation; it needs real-world pilots and tuning before deployment.

Problem Statement

Maritime logistics must reduce greenhouse gases while preserving throughput and fair cost sharing. Existing MARL methods often ignore system-wide emission caps, fairness across heterogeneous stakeholders, and partial observability. The challenge is to learn coordinated policies that satisfy global constraints in real time, work under noisy/partial data, and avoid disadvantaging smaller operators.

Main Contribution

A CH-MARL framework that layers high-level strategic agents (route, budget, schedule) on low-level operational agents (speed, berthing) to scale learning.

A real-time primal-dual constraint enforcement layer that updates a global Lagrange multiplier to keep aggregate emissions within a cap.

A fairness-aware reward shaping module that penalizes disparity (e.g., via scaled Gini or max-min terms) to protect smaller stakeholders.
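The constraint layer described above can be illustrated with a minimal projected dual-ascent sketch. This is not the authors' implementation: the names `DualState`, `dual_step`, and `penalized_reward`, the default step size, and the per-agent penalty form are all assumptions chosen for illustration. The idea is that a single global multiplier grows while aggregate emissions exceed the cap and shrinks (never below zero) otherwise, and each agent's reward is reduced in proportion to its own emissions.

```python
from dataclasses import dataclass


@dataclass
class DualState:
    """Global Lagrange multiplier for the aggregate emission cap (hypothetical helper)."""
    lam: float = 0.0   # multiplier, kept non-negative
    lr: float = 0.05   # dual step size (assumed value, needs tuning)


def dual_step(state: DualState, total_emissions: float, cap: float) -> float:
    """Projected dual ascent: raise lam when the cap is exceeded, lower it otherwise."""
    violation = total_emissions - cap
    state.lam = max(0.0, state.lam + state.lr * violation)
    return state.lam


def penalized_reward(task_reward: float, agent_emissions: float, lam: float) -> float:
    """Lagrangian-shaped reward an agent's policy update would see."""
    return task_reward - lam * agent_emissions


if __name__ == "__main__":
    state = DualState(lam=0.0, lr=0.1)
    lam = dual_step(state, total_emissions=5.0, cap=4.0)  # cap exceeded, lam rises
    print(lam, penalized_reward(1.0, 2.0, lam))
```

In a training loop, `dual_step` would typically be called once per episode or batch with the measured aggregate emissions, and the updated multiplier would be broadcast to all agents before their next policy update.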

Key Findings

CH-MARL variants delivered lower cumulative emissions in the digital twin compared to the baseline.

Numbers: Run A 4.7304 → Run D 4.07152 (−0.6589, −13.9%)

The primal-dual constraint layer kept aggregate emissions near the enforced cap during training.

Fairness-aware reward shaping reduced inequality across agents without breaking convergence.

Results

Total Emissions (Run A baseline)

Value: 4.7304

Total Emissions (Run D: Cap+Fair+Storms)

Value: 4.07152

Baseline: Run A 4.7304

Reward (Run D)

Value: −11.378

Baseline: Run A −4.7304
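The reported emission totals can be cross-checked with a few lines of arithmetic. The snippet below only recomputes the absolute and relative change between Run A and Run D from the two values listed above; the variable names are illustrative, and the units are whatever the summary's totals use (not stated here).

```python
run_a = 4.7304   # baseline total emissions (Run A, as reported)
run_d = 4.07152  # Run D: Cap+Fair+Storms (as reported)

absolute_change = run_d - run_a
relative_change = absolute_change / run_a

print(f"{absolute_change:.4f}")  # -0.6589
print(f"{relative_change:.1%}")  # -13.9%
```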

Who Should Care

What To Try In 7 Days

Run a small digital-twin pilot with your fleet (few ports, few vessels) to reproduce emissions and throughput KPIs.

Implement a simple primal-dual penalty on aggregate emissions and observe if policies shift toward lower fuel use.

Add a small fairness penalty (scaled Gini) and check whether smaller operators' costs become more balanced.
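For the fairness step, one way to prototype a scaled Gini penalty is sketched below. It is an assumption-laden illustration rather than the paper's exact formulation: the function names, the default `weight`, and the choice to compute the Gini coefficient over per-operator costs are all hypothetical.

```python
from typing import Sequence


def gini(costs: Sequence[float]) -> float:
    """Gini coefficient of non-negative per-operator costs (0 means perfectly equal)."""
    x = sorted(float(c) for c in costs)
    n = len(x)
    total = sum(x)
    if n == 0 or total == 0.0:
        return 0.0
    # Rank-weighted form of the Gini coefficient on sorted values.
    weighted = sum(rank * value for rank, value in enumerate(x, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def fairness_shaped_reward(base_reward: float, costs: Sequence[float], weight: float = 0.1) -> float:
    """Subtract a scaled Gini penalty from the shared reward (weight is an assumed value)."""
    return base_reward - weight * gini(costs)


if __name__ == "__main__":
    print(gini([1.0, 1.0, 1.0, 1.0]))  # equal costs
    print(gini([0.0, 0.0, 0.0, 4.0]))  # one operator bears all the cost
```

Starting with a small `weight` and increasing it gradually makes it easier to see when the penalty begins to trade throughput for equality.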

Agent Features

Memory

  • partial observability handling (local observations)

Planning

  • strategic (route, budget) planning
  • operational (speed, berthing) control

Tool Use

  • digital twin simulation
  • primal-dual Lagrangian constraint layer
  • PPO / actor-critic

Frameworks

  • Constrained Markov Decision Process (CMDP)
  • primal-dual optimization

Is Agentic

true

Architectures

  • hierarchical
  • decentralized multi-agent

Collaboration

  • shared reward shaping
  • global constraint coordination

Optimization Features

Infra Optimization

  • parallel low-level policy updates suggested for multi-core/distributed training

System Optimization

  • hierarchical decomposition to reduce per-agent complexity

Training Optimization

  • policy-gradient / actor-critic
  • PPO with Adam optimizer
  • LoRA

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Experiments use a small synthetic digital twin (8 ports, 5 vessels); results may not scale linearly.
  • Weather and mechanical failures are simplified to a few scenarios and fixed probabilities.
  • The setup focuses on cooperative/semi-cooperative settings; competitive market behaviors are not evaluated.
  • No public code or real-world deployment results reported.

When Not To Use

  • Direct deployment in real operations without pilot tests and constraint tuning.
  • Settings dominated by adversarial or highly competitive agents, unless the reward structure is redesigned.
  • Very large-scale fleets without additional state aggregation or distributed training engineering.

Failure Modes

  • Poorly tuned dual-variable learning rates can cause oscillating constraint violations or overly conservative behavior.
  • Fairness penalties that are too strong can reduce throughput and overall efficiency.
  • Partial observability may hide coordinated violations, requiring stronger monitoring or communication.
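The first failure mode can be reproduced on a toy model. In the sketch below, the "policy" is replaced by an assumed linear response in which emissions fall as the multiplier rises (`e0 - k * lam`); this response model, the function name `simulate`, and all parameter values are illustrative assumptions, not taken from the paper. With a small dual step size the emissions settle at the cap, while a large one makes them bounce between violating the cap and undershooting it.

```python
def simulate(lr: float, steps: int = 200, e0: float = 5.0,
             cap: float = 4.0, k: float = 1.0) -> list[float]:
    """Emissions trace under projected dual ascent on a toy response model."""
    lam = 0.0
    trace = []
    for _ in range(steps):
        # Toy stand-in for the learned policies: higher lam means lower emissions.
        emissions = max(0.0, e0 - k * lam)
        trace.append(emissions)
        # Projected dual ascent on the cap constraint.
        lam = max(0.0, lam + lr * (emissions - cap))
    return trace


stable = simulate(lr=0.2)     # converges toward the cap
unstable = simulate(lr=2.5)   # alternates between violation and over-conservatism

if __name__ == "__main__":
    print(round(stable[-1], 6))
    print(unstable[-4:])
```

In real training the policy response is slower and noisier than this toy model, so the safe range of dual step sizes has to be found empirically.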

Core Entities

Models

  • Proximal Policy Optimization (PPO)
  • Actor-Critic / policy-gradient

Metrics

  • Total Emissions (CO2-equivalent)
  • Fuel Consumption
  • Gini coefficient (fairness)
  • Operational Throughput
  • Constraint Violation Rate
  • Queue Time

Datasets

  • Maritime digital twin (synthetic simulation)