Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
If you reuse pre-trained goal-conditioned policies and learn temporal sub-goals, you can cut simulator training costs substantially (the paper reports convergence with 21.7% of the samples on average) while matching or exceeding baseline coordinated performance on new multi-agent tasks.
Summary TLDR
The paper presents a three-stage transfer recipe for multi-agent RL: (1) pre-train a goal-conditioned policy on a source task, (2) finetune it in the target task, and (3) learn a temporal contrastive embedding of rollouts, cluster the embedding into nodes, and build a planning graph that produces sub-goals. On Overcooked multi-agent tasks this approach reaches similar or better final performance than baselines while using far fewer environment samples (reported average: 4.6x faster convergence, using 21.7% of the samples). Sub-goals are interpretable (e.g., fetch onion, load oven). Results are limited to simulated Overcooked layouts, and the method requires a single expert demonstration for guidance.
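The two reported averages appear to be two views of the same ratio. As a quick check, assuming the speed-up is measured in environment samples to convergence (the symbols N_method and N_baseline are introduced here for illustration and are not from the paper):

```latex
\frac{N_{\text{method}}}{N_{\text{baseline}}} = \frac{1}{4.6} \approx 0.217 = 21.7\%
```

Under that assumption, a 4.6x faster convergence and "21.7% of samples" say the same thing.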
Problem Statement
Multi-agent RL is often too slow to train from scratch on new tasks because the joint state/action space is large and rewards are sparse. The paper targets transfer methods that reuse prior skills, discover useful temporal abstractions automatically, and produce interpretable sub-goals to guide learning.
Main Contribution
A three-stage transfer pipeline: pre-train goal-conditioned RL, finetune in target, then learn temporal embeddings and build a planning graph for sub-goal generation.
A temporal contrastive learning objective (InfoNCE-style) that maps observations to an embedding where geometric distance reflects temporal distance.
A practical execution loop that uses cluster-based planning on the learned graph to pick sub-goals for the finetuned goal-conditioned policy.
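To make the temporal contrastive objective in the list above concrete, here is a minimal sketch of an InfoNCE-style loss for a single anchor. It assumes similarity is the negative squared Euclidean distance between embeddings, so that temporally close states are pulled together; the function names, the temperature parameter, and the toy vectors are illustrative assumptions, not the paper's implementation.

```python
import math

def neg_sq_dist(a, b):
    """Similarity score: negative squared Euclidean distance (an assumption)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))

def info_nce_loss(anchor, positive, negatives, temperature=1.0):
    """InfoNCE loss for one anchor: -log softmax score of the positive.

    `positive` is the embedding of a state temporally close to the anchor;
    `negatives` are embeddings of other states.
    """
    logits = [neg_sq_dist(anchor, positive) / temperature]
    logits += [neg_sq_dist(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over positive and negatives.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

# Toy embeddings (hypothetical): `near` stands for a temporally close state.
anchor = [0.0, 0.0]
near = [0.1, 0.0]
far = [[3.0, 0.0], [0.0, 4.0]]

good = info_nce_loss(anchor, near, far)             # positive is close: low loss
bad = info_nce_loss(anchor, far[0], [near, far[1]])  # positive is far: high loss
print(good < bad)
```

Minimising this loss over many sampled pairs pushes embedding distance to track temporal distance, which is the property the planning graph relies on.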
Key Findings
Large sample savings: the method converges faster than baselines on the Overcooked transfer tasks.
Final task performance matches or exceeds baselines on the evaluated tasks.
Reported average across experiments: 4.6x faster convergence, using 21.7% of the samples.
Results
Steps to convergence (Cilantro)
Max soups delivered (Cilantro)
Steps to convergence (Small Corridor)
Max soups delivered (Small Corridor)
Who Should Care
What To Try In 7 Days
Pre-train a goal-conditioned policy on a source layout (use PPO + UVFA).
Collect rollouts in the target environment, guided by one expert demo, and train an InfoNCE contrastive encoder on pairs of states that occur within T steps of each other.
Cluster the embeddings, build a transition graph from the demo, and run the finetuned agent with graph-derived sub-goals.
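The graph-building step above can be sketched as follows. This assumes cluster centroids have already been obtained from some clustering algorithm (the summary does not name one) and that an edge is added whenever consecutive demo states fall into different clusters. All names and the toy data are hypothetical.

```python
def nearest_cluster(z, centroids):
    """Index of the centroid closest to embedding z (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(z, centroids[k])),
    )

def build_transition_graph(demo_embeddings, centroids):
    """Label each demo step with a cluster and link consecutive clusters."""
    labels = [nearest_cluster(z, centroids) for z in demo_embeddings]
    graph = {k: set() for k in range(len(centroids))}
    for src, dst in zip(labels, labels[1:]):
        if src != dst:  # only record moves between different clusters
            graph[src].add(dst)
    return labels, graph

# Toy 2-D embeddings of a single demonstration (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
demo = [[0.1, 0.0], [0.2, 0.0], [0.9, 0.0], [1.1, 0.0], [2.0, 0.1]]

labels, graph = build_transition_graph(demo, centroids)
print(labels)  # [0, 0, 1, 1, 2]
print(graph)   # {0: {1}, 1: {2}, 2: set()}
```

Each node then stands for a candidate sub-goal region, and the directed edges record the order in which the demonstration visited them.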
Agent Features
Memory
- Temporal abstraction via embedding (captures short- to mid-horizon structure)
Planning
- Graph-based planning over cluster nodes
- Shortest-path sub-goal selection
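As an illustration of shortest-path sub-goal selection, the sketch below runs breadth-first search on an unweighted cluster graph and returns the next node on the path as the sub-goal. Treating edges as unweighted is an assumption, and apart from "fetch onion" and "load oven" the node names are hypothetical.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list from start to goal, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(graph.get(node, ())):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

def next_subgoal(graph, current, goal):
    """Next node on the shortest path, to be handed to the policy as its goal."""
    path = shortest_path(graph, current, goal)
    if path is None:
        return None  # goal unreachable from the current cluster
    return goal if len(path) == 1 else path[1]

# Hypothetical cluster graph with human-readable node names.
graph = {
    "fetch_onion": {"load_oven"},
    "load_oven": {"plate_soup"},
    "plate_soup": {"deliver"},
    "deliver": set(),
}
print(next_subgoal(graph, "fetch_onion", "deliver"))  # load_oven
```

In an execution loop, the agent would re-run `next_subgoal` whenever its current observation maps to a new cluster.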
Frameworks
- PPO
- UVFA
- InfoNCE contrastive learning
Is Agentic
true
Architectures
- Goal-conditioned policy (UVFA)
- Contrastive encoder for temporal embedding
- Cluster-to-graph planner
Collaboration
- Multi-agent coordination through shared sub-goals
Optimization Features
Training Optimization
- Pre-training on source then finetuning on target to reuse skills
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Validated only in simulated Overcooked layouts; real-world dynamics not tested
- Method requires a single successful demonstration for finetuning and graph construction
- Learned temporal distances are approximations from noisy rollouts and depend on representative data
When Not To Use
- Domains where state observations are unclear or do not match those seen in the training rollouts
- Tasks where collecting representative rollouts or a reliable demo is impossible
- Real robots until sim-to-real validation is available
Failure Modes
- A poor or biased expert demo produces bad graphs and misleading sub-goals
- Noisy or unrepresentative rollouts produce clusters that do not reflect true bottlenecks
- Source-policy bias can hinder transfer if the pre-training task differs from the target in misleading ways
Core Entities
Models
- Goal-conditioned policy (UVFA)
- Proximal Policy Optimization (PPO)
- Temporal contrastive encoder (InfoNCE)
Metrics
- Soups delivered per episode
- Steps to convergence (90% of max performance)
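One plausible reading of the convergence metric above is the first training step at which the score reaches 90% of the run's maximum. The sketch below implements that reading. Whether the paper uses the run's own maximum or a shared reference value is not stated here, and the learning curve is made-up illustrative data.

```python
def steps_to_convergence(steps, scores, fraction=0.9):
    """First step at which the score reaches `fraction` of the maximum score."""
    threshold = fraction * max(scores)
    for step, score in zip(steps, scores):
        if score >= threshold:
            return step
    return None

# Hypothetical learning curve: environment steps vs. soups delivered per episode.
steps = [0, 100_000, 200_000, 300_000, 400_000]
soups = [0.0, 1.0, 3.5, 4.6, 5.0]

result = steps_to_convergence(steps, soups)
print(result)  # 300000 (threshold is 0.9 * 5.0 = 4.5)
```

Dividing a baseline's value by the method's value under this metric gives the kind of speed-up factor quoted in the summary.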
Datasets
- Overcooked (simulated) environment
- Expert demonstration trajectory (single example)
Benchmarks
- Overcooked transfer tasks: Cilantro, Cilantro Left, Small Corridor, Corridor

