I am trying to set up a cluster of four nodes (all running Fedora 22) with OpenMPI.

On the master node, I've created a password-less key (~/.ssh/id_dsa) and copied ~/.ssh/id_dsa.pub to each of the three slave nodes' ~/.ssh/authorized_keys. So, from the master node, I can run ssh slave1, ssh slave2, or ssh slave3 and successfully get into the corresponding node, without being asked for a password. Same goes for ssh master.

However, I run into permission problems when I try to use mpirun. Here is the command I run:

/usr/lib64/openmpi/bin/mpirun -np 32 --hostfile .mpi_hostfile ./testprogram

and here is the first bit of the output:

Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
ORTE was unable to reliably start one or more daemons.

When I subsequently run ssh slave3, I see the message "There were 2 failed login attempts since the last successful login." So it looks like the ssh authentication that mpirun is trying to do is failing for some reason.

Any ideas why I can do my password-less, key-based authentication just fine with ssh, but not with mpirun?

For the record, here are the contents of .mpi_hostfile:

# Host file for OpenMPI

# Master node, slots = num cores
localhost slots=8

# Slaves
slave1 slots=8
slave2 slots=8
slave3 slots=8
davewy
    Try running sshd with debugging enabled on one of the slave nodes. You can do this by adding a line `LogLevel DEBUG3` to its `/etc/ssh/sshd_config`, and restarting the daemon. The log entries will show up in syslog. If they are not enlightening, post them. – Wouter Verhelst Sep 07 '15 at 21:40

1 Answer

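This is likely because Open MPI defaults to a tree-based launching scheme. For example, mpirun may ssh from the machine where you invoke it to slave1, and slave1 may then ssh to slave2, and so on. Password-less login therefore has to work between the nodes themselves, not only from the master to each slave.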

See http://blogs.cisco.com/performance/tree-based-launch-in-open-mpi and http://blogs.cisco.com/performance/tree-based-launch-in-open-mpi-part-2 for more details.
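
One way to find a broken link is to test every ordered pair of nodes from the master. Below is a minimal sketch of that idea in Python, not something Open MPI provides. The host names are the ones from the question, and the function names `ssh_check_commands` and `run_checks` are made up for illustration. It assumes you can already ssh from the master to every node, including the master itself. `BatchMode=yes` makes ssh fail instead of prompting for a password, which is roughly what happens when mpirun's daemons try to log in.

```python
import itertools
import subprocess

# Host names taken from the question; replace them with your own.
HOSTS = ["master", "slave1", "slave2", "slave3"]

# BatchMode=yes: never prompt, just fail if key-based login does not work.
# ConnectTimeout=5: give up on an unreachable host after five seconds.
SSH = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5"]


def ssh_check_commands(hosts):
    """Return (src, dst, argv) for every ordered pair of distinct hosts."""
    checks = []
    for src, dst in itertools.permutations(hosts, 2):
        # Log in to src, and from there log in to dst and run `true`.
        argv = SSH + [src] + SSH + [dst, "true"]
        checks.append((src, dst, argv))
    return checks


def run_checks(hosts):
    """Run every check and return the (src, dst) pairs that failed."""
    failed = []
    for src, dst, argv in ssh_check_commands(hosts):
        result = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failed.append((src, dst))
    return failed
```

Calling `run_checks(HOSTS)` on the master returns the pairs that did not work, for example `("slave1", "slave2")` if slave1 cannot log in to slave2. A failure can mean a missing entry in `authorized_keys` on the destination, but also that the source node has not yet accepted the destination's host key, since ssh cannot ask for confirmation in batch mode.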

Jeff Squyres
  • Thank you Jeff, this was exactly the info needed to debug this problem. One of the slaves didn't have the ssh key of another slave in authorized_keys, so that one link didn't work. Those two articles are a great resource and explain the issue I was having. – davewy Sep 11 '15 at 20:15
  • Cool. Please upvote the answer so that others know that this was the correct solution. Thanks! – Jeff Squyres Sep 12 '15 at 00:10