Use a central controller + simple planner and Options to make multi-agent Q‑learning learn faster on grid tasks

February 7, 2023 · 6 min

Overview

Production Readiness

0.3

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

0

Authors

Alakh Aggarwal, Rishita Bansal, Parth Padalkar, Sriraam Natarajan

Why It Matters For Business

Decomposing multi-agent tasks and adding a cheap planner can speed learning with simple algorithms, reducing training time and compute for small structured problems.

Summary TLDR

The authors build a multi-agent reinforcement learning setup in which a central controller assigns tasks and two Options (pickup and drop) decompose the overall task. In a grid 'Bank world' with 2 agents and 3 gems, Q-learning with Options plus a Manhattan-distance planner learns faster and earns a higher average reward than plain Q‑learning or a random policy under a 6k-episode training budget. The approach is a practical, low-compute way to speed up learning in small grid problems, but it is validated only on toy environments with tabular Q-tables.

Problem Statement

Coordinating multiple agents is hard because the joint state space grows exponentially with the number of agents. The paper asks whether a centralized controller, a simple heuristic planner, and the Options framework (temporally extended sub-policies, each with its own termination condition) speed up learning and improve performance compared with plain Q‑learning in a grid pickup-and-drop task.
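To make the growth concrete, here is a rough count of joint agent positions on a square grid. It is a sketch under simplifying assumptions: agents may share a cell, gem and carrying state are ignored, and the 5x5 grid size and the function name are illustrative rather than taken from the paper.

```python
def joint_position_count(grid_side: int, num_agents: int) -> int:
    """Count joint agent-position configurations on a square grid.

    Assumes agents may occupy the same cell and ignores gem and carrying
    state, so the full joint state space is larger than this number.
    """
    cells = grid_side * grid_side
    return cells ** num_agents


# Each extra agent multiplies the count by the number of cells.
for k in (1, 2, 3):
    print(k, joint_position_count(5, k))
```

A single tabular Q-table over the joint state needs one row per configuration, which is why splitting the problem per agent and per subtask matters even at this scale.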

Main Contribution

A centralized controller that maintains Q-tables and assigns actions to agents.

A planner that assigns nearest targets using Manhattan distance to focus agents.

Use of the Options Framework to split the task into pickup and drop subtasks, each with its own Q-table.
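The planner idea can be sketched as a greedy nearest-target assignment. The summary does not give the paper's exact rule, so the exclusive claiming of targets, the agent ordering, and the tie-breaking below are assumptions, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, col) on the grid


def manhattan(a: Cell, b: Cell) -> int:
    """Manhattan (L1) distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def assign_nearest_targets(agents: Dict[str, Cell],
                           targets: List[Cell]) -> Dict[str, Cell]:
    """Greedily give each agent its nearest unclaimed target.

    Assumptions: agents are processed in sorted name order, each target
    goes to at most one agent, and ties go to the earlier target in the list.
    """
    remaining = list(targets)
    assignment: Dict[str, Cell] = {}
    for name in sorted(agents):
        if not remaining:
            break  # more agents than targets: the rest stay unassigned
        nearest = min(remaining, key=lambda t: manhattan(agents[name], t))
        assignment[name] = nearest
        remaining.remove(nearest)
    return assignment


agents = {"agent_0": (0, 0), "agent_1": (4, 4)}
gems = [(3, 4), (0, 1), (2, 2)]
print(assign_nearest_targets(agents, gems))
```

Greedy assignment is cheap but not globally optimal; it only serves to give each agent a concrete goal to learn toward.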

Key Findings

Q‑learning with Options produced the highest average reward in test runs, ahead of plain Q‑learning and a random policy.

Options reduced required training under the authors' budget: plain Q‑learning did not learn well within the 6k-episode budget while Options did.

Numbers: training budget = 6k episodes; episode length = 1k steps

Adding a simple planner (Manhattan-distance assignment) made learning more target-oriented and faster than the same algorithm without the planner.

Results

average reward (test runs)

Value: Q-learning + Options ranked best; Q-learning second; random worst

Baseline: random policy

training efficiency

Value: Options reached better performance within the 6k-episode budget

Baseline: plain Q-learning (6k episodes)

effect of planner

Value: Planner-enabled method learned faster and was more target-directed than the same method without the planner

Baseline: Q-learning + Options without planner

What To Try In 7 Days

Recreate the Bank world toy with 2 agents and 3 static targets to reproduce baseline behavior.

Implement a central controller that maintains shared Q-tables and a Manhattan-distance planner for goal assignment.

Split a task into two options (e.g., pickup and drop) and train separate Q-tables to compare convergence speed.
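For the last step, one possible starting point is a pair of Q-tables keyed by option, with each transition routed to the table of the option that was active. This is a minimal sketch, not the authors' implementation: the class name, the action set, the hyperparameters, and the choice to stop bootstrapping when an option terminates are all assumptions.

```python
from collections import defaultdict

ACTIONS = ("up", "down", "left", "right")  # assumed primitive actions


class OptionQTables:
    """One tabular Q-function per option ('pickup' and 'drop')."""

    def __init__(self, alpha: float = 0.1, gamma: float = 0.9):
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)
        self.tables = {
            "pickup": defaultdict(float),
            "drop": defaultdict(float),
        }

    @staticmethod
    def active_option(carrying: bool) -> str:
        """An agent holding a gem runs 'drop'; otherwise it runs 'pickup'."""
        return "drop" if carrying else "pickup"

    def greedy_action(self, option: str, state) -> str:
        q = self.tables[option]
        return max(ACTIONS, key=lambda a: q[(state, a)])

    def update(self, option, state, action, reward, next_state, terminated):
        """Q-learning update applied only to the active option's table."""
        q = self.tables[option]
        # Assumption: no bootstrapping past the option's termination.
        best_next = 0.0 if terminated else max(q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        q[(state, action)] += self.alpha * (target - q[(state, action)])


opts = OptionQTables(alpha=0.5)
option = opts.active_option(carrying=False)
opts.update(option, (0, 0), "right", 1.0, (0, 1), terminated=True)
print(option, opts.greedy_action(option, (0, 0)))
```

Running the same loop with a single table that ignores the option gives the plain Q-learning baseline to compare convergence against.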

Agent Features

Memory

  • Shared Q-tables as central memory

Planning

  • Manhattan-distance target assignment
  • Central controller assigns goals to free agents

Tool Use

  • Tabular Q-learning
  • TD updates
  • Options Framework
  • D-FOCI state abstraction

Frameworks

  • Options Framework
  • D-FOCI (abstraction statements)

Is Agentic

true

Architectures

  • Centralized controller
  • Options Framework (hierarchical options)
  • Heuristic planner (Manhattan)

Collaboration

  • Centralized coordination (controller assigns tasks)
  • Agents update central Q-tables
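The coordination pattern above might look like the following sketch, in which one controller owns the Q-table, picks an action for every agent, and absorbs every agent's transitions. The epsilon-greedy selection, the hyperparameters, and all names are assumptions; the summary does not specify how the controller explores.

```python
import random
from collections import defaultdict

ACTIONS = ("up", "down", "left", "right")  # assumed primitive actions


class CentralController:
    """Holds one Q-table that every agent reads from and writes to."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(float)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def assign_actions(self, agent_states):
        """Pick an epsilon-greedy action for each agent from the shared table."""
        actions = {}
        for agent, state in agent_states.items():
            if self.rng.random() < self.epsilon:
                actions[agent] = self.rng.choice(ACTIONS)
            else:
                actions[agent] = max(ACTIONS, key=lambda a: self.q[(state, a)])
        return actions

    def record(self, state, action, reward, next_state):
        """Apply one Q-learning update from any agent's transition."""
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        key = (state, action)
        self.q[key] += self.alpha * (reward + self.gamma * best_next - self.q[key])


controller = CentralController(epsilon=0.0)
# A transition experienced by one agent...
controller.record((1, 1), "left", 1.0, (1, 0))
# ...immediately shapes the action chosen for another agent in the same state.
print(controller.assign_actions({"agent_0": (1, 1), "agent_1": (1, 1)}))
```

Because the table is shared, experience gathered by one agent is reused by the others, which is the main appeal of the centralized design at this scale.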

Optimization Features

Training Optimization

  • Task decomposition via Options to reduce effective learning time

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Validated only on a small toy grid ('Bank world') with static gems
  • Experiments use tabular Q-tables; no function approximator tested
  • Results reported for 2 agents and 3 gems only
  • Method depends on grid dimensions and bank-at-center assumption

When Not To Use

  • Large continuous state spaces where tabular Q-tables do not scale
  • Tasks with many moving targets where static Manhattan planner breaks
  • Settings that require function approximation (deep RL) without adaptation

Failure Modes

  • Centralized Q-tables may not scale as number of agents grows
  • Planner assignments can be suboptimal when targets move
  • State abstraction via D-FOCI may omit dependencies between agents

Core Entities

Models

  • Q-learning
  • Q-learning + Options
  • Random policy
  • Options Framework
  • Centralized controller
  • Manhattan-distance planner
  • D-FOCI (domain-specific FOCI statements)

Metrics

  • total reward
  • average reward
  • training episodes to converge

Datasets

  • Bank world (variant of Box world)