
I am running a large simulation on a computer cluster using 50 compute nodes. The solver uses a data structure that grows on the fly, and grows very differently on each node. I need to make sure the memory used does not exceed each node's memory limit.

So far, I am doing it in the most inefficient way: I keep one terminal tab open per node and run top to check the percentage of memory used.

Is there a way I can do this with a script? The idea would be to ssh into each node, store its memory usage, ssh to the next, and so on...

solalito
  • What will you do if memory is exceeded? Have you considered `ulimit`? – ctrl-alt-delor Apr 02 '19 at 08:11
  • The data structure is a binary tree. I have hardcoded that if a node's tree exceeds a certain number of tree nodes, it is rebuilt from scratch, and that limit is set much lower to ensure I am never close to the memory limit. Also, isn't `ulimit` for the stack? My problem here is the heap – solalito Apr 02 '19 at 08:16
  • What about a loop with `ssh «remote-machine» cat /proc/meminfo`? – ctrl-alt-delor Apr 02 '19 at 11:16
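The loop suggested in the last comment can be sketched as follows. This is a minimal version, assuming passwordless ssh to each node and a hypothetical `hosts.txt` file listing one hostname per line; `mem_used_percent` is a helper name introduced here, not a standard tool. It parses `/proc/meminfo` (the `MemAvailable` field needs Linux 3.14 or later).

```shell
#!/bin/bash
# Poll each node in a host list and print its used-memory percentage.

# Compute the used-memory percentage from /proc/meminfo read on stdin.
mem_used_percent() {
  awk '/^MemTotal:/     {total=$2}
       /^MemAvailable:/ {avail=$2}
       END              {printf "%.1f", (total - avail) * 100 / total}'
}

HOSTLIST=${1:-hosts.txt}    # hypothetical file: one hostname per line
if [ -f "$HOSTLIST" ]; then
  while read -r host; do
    pct=$(ssh "$host" cat /proc/meminfo | mem_used_percent)
    printf '%s\t%s%%\n' "$host" "$pct"
  done < "$HOSTLIST"
fi
```

Run it from the main node (e.g. under `watch` to refresh periodically) and all 50 readings appear in one terminal instead of 50 tabs.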

2 Answers


If you just want to kill a process that gets too big, then `ulimit` is your friend.

From the manual:

  -S        use the `soft' resource limit
  -H        use the `hard' resource limit
  -a        all current limits are reported
  -b        the socket buffer size
  -c        the maximum size of core files created
  -d        the maximum size of a process's data segment
  -e        the maximum scheduling priority (`nice')
  -f        the maximum size of files written by the shell and its children
  -i        the maximum number of pending signals
  -l        the maximum size a process may lock into memory
  -m        the maximum resident set size
  -n        the maximum number of open file descriptors
  -p        the pipe buffer size
  -q        the maximum number of bytes in POSIX message queues
  -r        the maximum real-time scheduling priority
  -s        the maximum stack size
  -t        the maximum amount of cpu time in seconds
  -u        the maximum number of user processes
  -v        the size of virtual memory
  -x        the maximum number of file locks
  -T        the maximum number of threads
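On the asker's stack-versus-heap concern: `-s` limits only the stack, but `-v` (virtual memory) and `-d` (data segment) do cover heap allocations. A minimal sketch of capping a job, where 16 GiB is an example figure and `./solver` is a hypothetical binary standing in for the simulation:

```shell
# Set the cap in a subshell so the limit does not leak into the calling
# shell. ulimit -v takes the value in KiB; 16 GiB here is an example.
(
  ulimit -S -v $((16 * 1024 * 1024))
  echo "soft vmem cap: $(ulimit -S -v) KiB"
  # ./solver args...   # hypothetical solver; malloc() fails once the cap is hit
)
```

With the soft (`-S`) limit, allocations beyond the cap fail rather than the node swapping or the OOM killer picking a victim, so the solver can detect the failure and restart its tree as described in the comments.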
ctrl-alt-delor
  • This was actually not the crucial point of my question. I want to know how I can retrieve this information remotely: from the main node, I run a script which ssh'es into each node on a host list, retrieves that node's memory usage, and so on. The information is then displayed on the main node. – solalito Apr 02 '19 at 10:46

I need to make sure the memory used does not grow beyond each node's memory limit.

Would it make sense to use `--memfree` in GNU Parallel? If the system does not have 2 GB free, the job will not start; if the system gets less than 1 GB free, the job will be killed.

parallel --slf hosts.txt --memfree 2G -j1 job ::: ar gu ments
Ole Tange