1

Is it possible to run a slurm job on already logged on nodes of a cluster? Suppose I am already logged on nodes:

casade01
casade02
casade03

so that I don't need to wait in a queue. If it matters, I am able to ssh into specific nodes, like `ssh user@casade01`. Can I then, e.g., log in to a node that I designate as a 'head' node and say, 'okay, run this code on casade02 and casade03'?

I was looking at this stackexchange post which mentions the `-w` flag for `sbatch`. But do I need `sbatch`, or something else?

If I log on to an individual node and run something, it will just run only on that node, so I somehow need to invoke slurm to schedule a parallel program across all the currently logged-in nodes.

georg
  • I think you are missing the point of using slurm. – user10489 Oct 03 '22 at 23:24
  • Slurm is the scheduling software which sends jobs to the cluster based on the resources that are requested in order to make sure that they are sent to nodes with enough available resources and that priority is applied to the jobs. There is no way to use `sbatch` to run jobs on nodes where you are already logged in via `ssh`, and you can't designate one of the nodes as a head node as that is determined by the cluster manager. Allowing someone to `ssh` directly into the nodes and run jobs defeats the purpose of all of this so I've no idea why it's configured like that. – Nasir Riley Oct 04 '22 at 00:31
  • So okay slurm is not useful here, but how can one designate code to run on already logged in nodes? I thought that I would get such an answer as 'oh you misunderstand what slurm is for', without addressing the original goal of wanting to run a parallel program on selected nodes, after being logged in. There are many situations where that would be useful without gatekeeping cluster use! – georg Oct 04 '22 at 09:22
  • It's not clear what you're trying to do which is why you didn't get an answer. Why do you need to be logged into nodes? The point of slurm is that it will select nodes for you and start parallel jobs without being logged in, which may or may not be the opposite of what you want. – user10489 Oct 04 '22 at 09:36
  • I am debugging code and/or running interactively on a small number of nodes. If there is a crash I do not need to wait in the queue this way – georg Oct 04 '22 at 09:49
  • I just reread this "why do you need to be logged into nodes" and it's quite irritating. Why are you trying to formulate what I want to do, this is just dismissive and know-it-all-ish. – georg Oct 04 '22 at 17:41
  • We're not trying to "formulate" what you want to do, we're trying to figure out what you are doing so we can provide a solution. Either slurm is the wrong solution, or you are using slurm wrong, can't tell which, because nothing you've asked so far made any sense. – user10489 Oct 05 '22 at 00:05

1 Answer

0

It's not completely clear to me what you are trying to do, but I'll make a few assumptions and attempt an answer. I will take your comment that mentions "debugging code and/or running interactively", as the basis of what you are trying to do (you might want to add that to your question).

If you are willing to wait in the queue for the initial allocation of your job, but then be able to debug interactively once the job is started, then there are SLURM commands that will allow you to do so.

For example, if you need 3 nodes to debug your code, you could use the slurm command `salloc -N 3`, which (depending on your configuration) will allocate you 3 nodes and possibly (again depending on the slurm config) give you a prompt on one of those nodes; you can then use `srun` to run your parallel code. You can keep running `srun` commands until you're done debugging (or until your time runs out).

Now, let's say you want three specific nodes: you can use the same `salloc` command, but add `--nodelist=casade01,casade02,casade03` to it.
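Using the node names from your question, that would look something like this (whether the request can be satisfied still depends on those nodes being free and in a partition you can use):

```shell
# Request an interactive allocation pinned to three named nodes
salloc -N 3 --nodelist=casade01,casade02,casade03
```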

If, however, you were already logged into those three nodes (e.g. with `ssh`, and not within slurm), and you wanted to specifically use those three login sessions to run your commands, then you should be aware that you may be interfering with other jobs that are being scheduled by slurm. Often, slurm configurations are set up so that you cannot log in directly to compute nodes without using slurm commands, but in your setup that does not seem to be the case. The slurm `srun` command is likely (depending on your setup) using some type of MPI to run your parallel code. You could use MPI commands directly to run your code. If you are not familiar with MPI commands for executing code (e.g. `mpiexec`), then I would not take this route, especially if the `salloc` method works.
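For completeness, a direct-MPI launch outside of slurm might look like the sketch below. This assumes an MPI runtime is installed on all three nodes and that passwordless `ssh` between them works; note the hostfile flag varies by implementation (MPICH uses `-f`, Open MPI uses `--hostfile`), and `./my_prog` is again a placeholder:

```shell
# List the ssh-reachable nodes in a hostfile
cat > hosts.txt <<'EOF'
casade01
casade02
casade03
EOF

# Launch 3 ranks, one per node (MPICH-style; Open MPI: mpiexec --hostfile hosts.txt -n 3 ./my_prog)
mpiexec -f hosts.txt -n 3 ./my_prog
```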

DericS