
On our cluster we use LMOD to dynamically load specific pre-installed modules (like PyTorch or other scientific packages). On top of that I want to run code with the DeepSpeed framework, which provides optimisations for running distributed code across nodes. Under the hood it uses pdsh. The issue I have is that the ssh sessions it opens of course do not have the modules loaded that I already loaded on the main node, and that leads to problems because needed libraries, such as Python, cannot be found.

As an example, let's say that I request an interactive SLURM job with multiple nodes. On the main node I load the PyTorch (which includes Python) and pdsh modules:

module load PyTorch/1.12.0-foss-2022a-CUDA-11.7.0
module load pdsh/2.34-GCCcore-11.3.0

Then I can run some deepspeed command, which will launch parallel ssh sessions to all the nodes. But because those are new sessions on those nodes, the modules specified above are not loaded there. It was suggested to add these module load commands to my .bashrc, but then they would always be loaded, which I may not want.

I'm therefore looking for a way to detect whether a session was started by pdsh. Does pdsh set some variables that I can use in my .bashrc so that I only module load when that condition is true?
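As far as I know, pdsh does not export a dedicated marker variable into the remote shells it spawns (each one is just a plain non-interactive ssh session), so one proxy is to test for exactly that combination. A sketch for `~/.bashrc`, reusing the module names from the question; the helper name is my own:

```shell
# Heuristic: a pdsh-spawned session is a non-interactive ssh session.
# SSH_CONNECTION is set by sshd; $- contains "i" only in interactive shells.
is_remote_noninteractive() {
    [[ -n "$SSH_CONNECTION" && $- != *i* ]]
}

if is_remote_noninteractive && type module &>/dev/null; then
    # Same modules as loaded on the main node.
    module load PyTorch/1.12.0-foss-2022a-CUDA-11.7.0
    module load pdsh/2.34-GCCcore-11.3.0
fi
```

Note the caveat: this fires for any non-interactive ssh command, not only those launched by pdsh, which may or may not be acceptable.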

Bram Vanroy
  • Would it be too much overhead to have `[[ -n "$SLURM_JOBID" ]] && module load ...` in your `.bashrc`? – doneal24 Jan 20 '23 at 20:36
  • @doneal24 Thanks for the idea. I solved it currently by loading modules depending on the hostname. This works well. – Bram Vanroy Jan 26 '23 at 10:36
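doneal24's suggestion could look like this in `~/.bashrc` (a sketch; the module names are from the question and the function name is my own). SLURM sets `SLURM_JOBID` in shells that are part of a job, so a plain login shell stays unaffected:

```shell
# Only load the cluster modules when this shell runs inside a SLURM job.
load_cluster_modules() {
    [[ -n "$SLURM_JOBID" ]] || return 0   # not inside a SLURM job
    type module &>/dev/null || return 0   # LMOD not initialised in this shell
    module load PyTorch/1.12.0-foss-2022a-CUDA-11.7.0
    module load pdsh/2.34-GCCcore-11.3.0
}
load_cluster_modules
```

One caveat: whether the ssh sessions that pdsh opens actually inherit `SLURM_JOBID` depends on the cluster's ssh/SLURM configuration, so this is worth verifying on the target system.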

1 Answer


Inspired by doneal24's comment, I solved my issue by loading the modules only when the session is on a node whose hostname starts with gpu.

if [[ $(hostname) == gpu* ]]; then
    module load PyTorch/1.12.0-foss-2022a-CUDA-11.7.0
    module load pdsh/2.34-GCCcore-11.3.0
fi
Bram Vanroy