Use temporal contrastive embeddings + goal-conditioned policies to transfer multi-agent skills and generate sub-goals

June 3, 2024 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Weihao Zeng, Joseph Campbell, Simon Stepputtis, Katia Sycara

Links

Abstract / PDF

Why It Matters For Business

If you reuse a pre-trained goal-conditioned policy and learn temporal sub-goals, you can cut simulator training cost substantially (the paper reports using 21.7% of the baselines' training samples) while matching or exceeding baseline coordinated performance on new multi-agent tasks.

Summary TLDR

The paper presents a three-stage transfer recipe for multi-agent RL: (1) pre-train a goal-conditioned policy, (2) finetune it in the target environment, and (3) learn a temporal contrastive embedding of rollouts, cluster that embedding into nodes, and build a planning graph over the nodes to produce sub-goals. On Overcooked multi-agent tasks, the approach reaches similar or better final performance than the baselines while using far fewer environment samples (reported average: 4.6× faster convergence, using 21.7% of the baselines' samples). The sub-goals are interpretable (e.g., fetch onion, load oven). Results are limited to simulated Overcooked layouts, and the method requires a single expert demonstration for guidance.

Problem Statement

Multi-agent RL is often too slow to train from scratch in new tasks because the joint state/action space is large and rewards are sparse; we need transfer methods that reuse prior skills, discover useful temporal abstractions automatically, and produce interpretable sub-goals to guide learning.

Main Contribution

A three-stage transfer pipeline: pre-train goal-conditioned RL, finetune in target, then learn temporal embeddings and build a planning graph for sub-goal generation.

A temporal contrastive learning objective (InfoNCE-style) that maps observations to an embedding where geometric distance reflects temporal distance.

A practical execution loop that uses cluster-based planning on the learned graph to pick sub-goals for the finetuned goal-conditioned policy.
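The execution loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `run_with_subgoals` and every callable it takes (`reset`, `step`, `embed`, `nearest_node`, `plan_next`, `policy`) are hypothetical stand-ins for the environment, the contrastive encoder, the cluster assignment, the graph planner, and the finetuned goal-conditioned policy.

```python
def run_with_subgoals(reset, step, embed, nearest_node, plan_next, policy,
                      goal_node, max_steps=100):
    """Roll out one episode, re-planning a sub-goal at every step.

    All callables are hypothetical placeholders:
      reset()                 -> initial observation
      step(action)            -> next observation
      embed(obs)              -> embedding of the observation
      nearest_node(z)         -> id of the cluster node closest to z
      plan_next(node, goal)   -> next node on a path to goal, or None
      policy(obs, subgoal)    -> action from the goal-conditioned policy
    """
    obs = reset()
    chosen_subgoals = []
    for _ in range(max_steps):
        node = nearest_node(embed(obs))
        if node == goal_node:
            break  # the agent has reached the final node of the plan
        subgoal = plan_next(node, goal_node)
        if subgoal is None:
            break  # no path in the graph; a real system would need a fallback
        chosen_subgoals.append(subgoal)
        obs = step(policy(obs, subgoal))
    return chosen_subgoals
```

Re-planning at every step is one possible design choice; it keeps the sketch simple and lets the agent recover if it drifts into an unexpected cluster. The paper may schedule sub-goal updates differently.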

Key Findings

Large sample savings: method reaches convergence faster than baselines on Overcooked transfers.

Numbers: Average 4.6× faster convergence than the fastest baselines (reported)

Final task performance matches or exceeds baselines on evaluated tasks.

Numbers: Max soups delivered on Cilantro: 12.58 (ours) vs 11.22 (best baseline)

Large overall sample reduction reported across experiments.

Numbers: The method uses 21.7% of the training samples used by state-of-the-art baselines

Results

Steps to convergence (Cilantro)

Value: 680.5K (ours)

Baseline: 5.0M (Vanilla RL, best baseline reported)

Max soups delivered (Cilantro)

Value: 12.58 (ours)

Baseline: 11.22 (Fine-tuning, best baseline reported)

Steps to convergence (Small Corridor)

Value: 1.1M (ours)

Baseline: 5.0M (JSRL variant, best baseline reported)

Max soups delivered (Small Corridor)

Value: 4.92 (ours)

Baseline: 0.42 (JSRL, best baseline reported)
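As a sanity check on the step counts above, the per-layout speedup and sample fraction can be computed directly from the reported numbers. The helper name `speedup_and_fraction` is illustrative. Only the two layouts listed here are included, so the output will not reproduce the four-task average exactly.

```python
# Step counts copied from the Results entries above.
reported = {
    "Cilantro": {"ours": 680.5e3, "baseline": 5.0e6},
    "Small Corridor": {"ours": 1.1e6, "baseline": 5.0e6},
}

def speedup_and_fraction(ours_steps, baseline_steps):
    """Return (baseline/ours, ours/baseline) for steps to convergence."""
    return baseline_steps / ours_steps, ours_steps / baseline_steps

for layout, r in reported.items():
    speedup, fraction = speedup_and_fraction(r["ours"], r["baseline"])
    print(f"{layout}: {speedup:.1f}x fewer steps, "
          f"{fraction:.1%} of baseline samples")
```

This gives roughly 7.3× for Cilantro and 4.5× for Small Corridor. The two headline figures are also consistent with each other, since 1 / 4.6 ≈ 21.7%.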

Who Should Care

What To Try In 7 Days

Pre-train a goal-conditioned policy on a source layout (use PPO + UVFA).

Collect rollouts in the target environment, guided by one expert demo, and train an InfoNCE contrastive encoder on pairs of states that are at most T steps apart.

Cluster the embeddings, build a transition graph from the demo, and run the finetuned agent with graph-derived sub-goals.
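The clustering and graph-building step might look like the sketch below. It assumes the cluster centroids have already been computed by some clustering method (for example k-means; the paper's exact choice is not stated here), and that an edge is added whenever consecutive demo states fall into different clusters. The function names `nearest_centroid` and `build_transition_graph` are hypothetical.

```python
def nearest_centroid(z, centroids):
    """Index of the centroid closest to embedding z (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(z, centroids[k])),
    )

def build_transition_graph(demo_embeddings, centroids):
    """Build a directed graph over cluster ids from one demo trajectory.

    Assumption: an edge a -> b exists if the demo moves from cluster a
    to a different cluster b between two consecutive states.
    """
    labels = [nearest_centroid(z, centroids) for z in demo_embeddings]
    graph = {k: set() for k in range(len(centroids))}
    for a, b in zip(labels, labels[1:]):
        if a != b:
            graph[a].add(b)
    return graph, labels

# Toy usage with 2-D embeddings and three centroids on a line.
centroids = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
demo = [[0.1, 0.0], [0.2, 0.0], [0.9, 0.0], [1.1, 0.0], [1.9, 0.0]]
graph, labels = build_transition_graph(demo, centroids)
```

Because the graph comes from a single demo, it only contains transitions the expert actually made, which is one reason a poor demo can produce misleading sub-goals.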

Agent Features

Memory

  • Temporal abstraction via embedding (captures short- to mid-horizon structure)

Planning

  • Graph-based planning over cluster nodes
  • Shortest-path sub-goal selection
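Shortest-path sub-goal selection over cluster nodes can be illustrated with a breadth-first search on an unweighted graph. This is a sketch under assumptions: the function `next_subgoal` is hypothetical, edges are treated as unweighted, and the node names `plate_soup` and `deliver` are invented for the example (only "fetch onion" and "load oven" appear in the summary).

```python
from collections import deque

def next_subgoal(graph, current, goal):
    """Return the first node after `current` on a shortest path to `goal`.

    `graph` maps each node to an iterable of successor nodes.
    Returns `goal` if already there, or None if `goal` is unreachable.
    """
    if current == goal:
        return goal
    parent = {current: None}
    queue = deque([current])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt in parent:
                continue  # already visited via a path that is no longer
            parent[nxt] = node
            if nxt == goal:
                # Walk back until we find the node whose parent is `current`.
                step = nxt
                while parent[step] != current:
                    step = parent[step]
                return step
            queue.append(nxt)
    return None

# Toy graph; node names beyond the first two are illustrative only.
toy_graph = {
    "fetch_onion": ["load_oven"],
    "load_oven": ["plate_soup"],
    "plate_soup": ["deliver"],
}
first_step = next_subgoal(toy_graph, "fetch_onion", "deliver")
```

If the learned graph carried edge costs (for instance, embedding distances between cluster centers), Dijkstra's algorithm would be the natural replacement for breadth-first search.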

Frameworks

  • PPO
  • UVFA
  • InfoNCE contrastive learning
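InfoNCE training needs positive pairs. One plausible reading of "pairs of states within T steps" is to draw an anchor state from a rollout and a positive from at most T steps later in the same rollout. The sketch below does that; the function name `sample_temporal_pairs` and the forward-only sampling direction are assumptions, and negatives (not shown) would typically be other states from the batch.

```python
import random

def sample_temporal_pairs(trajectory, horizon, num_pairs, rng=None):
    """Sample (anchor, positive) state pairs at most `horizon` steps apart.

    `trajectory` is an ordered sequence of states from one rollout.
    The positive always comes after the anchor (an assumption).
    """
    if len(trajectory) < 2:
        raise ValueError("trajectory needs at least two states")
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the sketch reproducible
    last = len(trajectory) - 1
    pairs = []
    for _ in range(num_pairs):
        i = rng.randrange(last)                      # anchor index in [0, last - 1]
        j = rng.randint(i + 1, min(i + horizon, last))  # positive within horizon
        pairs.append((trajectory[i], trajectory[j]))
    return pairs
```

A small horizon makes the embedding sensitive to short-range structure, while a larger one blurs nearby states together, so T is a hyperparameter worth tuning per environment.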

Is Agentic

true

Architectures

  • Goal-conditioned policy (UVFA)
  • Contrastive encoder for temporal embedding
  • Cluster-to-graph planner

Collaboration

  • Multi-agent coordination through shared sub-goals

Optimization Features

Training Optimization

  • Pre-training on source then finetuning on target to reuse skills

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Validated only in simulated Overcooked layouts; real-world dynamics not tested
  • Method requires a single successful demonstration for finetuning and graph construction
  • Learned temporal distances are approximations from noisy rollouts and depend on representative data

When Not To Use

  • Domains where the state observations available at deployment do not match those seen in the training rollouts
  • Tasks where collecting representative rollouts or a reliable demo is impossible
  • Real robots until sim-to-real validation is available

Failure Modes

  • Poor or biased expert demo produces bad graphs and misleading sub-goals
  • Noisy or unrepresentative rollouts cause clusters that do not reflect true bottlenecks
  • Source-policy bias can hinder transfer if the pre-training task differs from the target task in misleading ways

Core Entities

Models

  • Goal-conditioned policy (UVFA)
  • Proximal Policy Optimization (PPO)
  • Temporal contrastive encoder (InfoNCE)
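To make the InfoNCE-style objective behind the temporal contrastive encoder concrete, here is a minimal single-anchor version in plain Python. It is a sketch under assumptions, not the paper's loss: the similarity function (negative squared Euclidean distance), the `temperature` value, and the name `info_nce_loss` are all choices made for illustration.

```python
import math

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor embedding.

    The loss is -log softmax probability of the positive among
    {positive} + negatives. Similarity is negative squared Euclidean
    distance divided by the temperature (an assumption), so minimizing
    the loss pulls temporally close states together in the embedding.
    """
    def sim(u, v):
        return -sum((a - b) ** 2 for a, b in zip(u, v)) / temperature

    logits = [sim(anchor, positive)] + [sim(anchor, n) for n in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

# A nearby positive should give a lower loss than a distant one.
loss_near = info_nce_loss([0.0, 0.0], [0.1, 0.0], [[3.0, 0.0], [0.0, 4.0]])
loss_far = info_nce_loss([0.0, 0.0], [3.0, 0.0], [[0.1, 0.0], [0.0, 4.0]])
```

In practice the loss would be averaged over a batch and backpropagated through the encoder with an autodiff library; the plain-Python form only shows the arithmetic.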

Metrics

  • Soups delivered per episode
  • Steps to convergence (90% of max performance)
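The steps-to-convergence metric can be computed from a learning curve as the first training step at which the score reaches 90% of the maximum. The sketch below assumes the maximum is taken over the same run's curve and that scores are non-negative (as soups delivered are); the function name `steps_to_convergence` and the toy numbers are hypothetical, and the paper may define the reference maximum differently.

```python
def steps_to_convergence(steps, scores, threshold=0.9):
    """First step at which score >= threshold * max(scores).

    `steps` and `scores` are parallel sequences from one training run,
    ordered by training step. Returns None if no score qualifies, which
    can only happen when the maximum score is negative.
    """
    if len(steps) != len(scores) or not scores:
        raise ValueError("steps and scores must be non-empty and equal length")
    target = threshold * max(scores)
    for s, v in zip(steps, scores):
        if v >= target:
            return s
    return None

# Toy learning curve (made-up numbers, not from the paper).
curve_steps = [0, 100, 200, 300, 400]
curve_scores = [0.0, 2.0, 9.5, 10.0, 9.8]
converged_at = steps_to_convergence(curve_steps, curve_scores)
```

Noisy curves can cross the threshold early by chance, so smoothing the scores (for example with a moving average) before applying this metric is a common precaution.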

Datasets

  • Overcooked (simulated) environment
  • Expert demonstration trajectory (single example)

Benchmarks

  • Overcooked transfer tasks: Cilantro, Cilantro Left, Small Corridor, Corridor