4

I have a question for those of you familiar with the Slurm scheduler. Sometimes I get the following error message: "slurmstepd: error: Exceeded step memory limit at some point."

I know it means the memory allocated to my process wasn't enough. Nonetheless, the process isn't killed by the scheduler, and it often seems innocuous: the program runs to completion and the output files look to be in good shape.

Should I always assume that the output is faulty and rerun the program if I get that error message? Why can the allocated memory sometimes be exceeded without the program being killed?

j91

1 Answer

0

If you did not receive a message that the job was killed by SLURM, and sacct shows a COMPLETED state for it, you can reasonably assume that the job completed.
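The sacct check can be scripted. Below is a minimal Python sketch, not an official recipe: it assumes Slurm accounting is enabled on the cluster, and the helper names (fetch_sacct, parse_sacct, completed_cleanly) are hypothetical ones made up for illustration. It asks sacct for pipe-separated output and treats the job as finished only if every step reports State COMPLETED with ExitCode 0:0.

```python
import subprocess

# Accounting fields requested from sacct; all are standard sacct format fields.
FIELDS = ["JobID", "State", "ExitCode", "MaxRSS", "ReqMem"]


def fetch_sacct(job_id):
    """Return raw sacct output for one job (requires a Slurm cluster)."""
    cmd = [
        "sacct", "-j", str(job_id),
        "--format=" + ",".join(FIELDS),
        "--parsable2",   # pipe-separated fields, no trailing delimiter
        "--noheader",    # skip the column header line
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def parse_sacct(text):
    """Turn pipe-separated sacct lines into a list of dicts, one per step."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        values = line.split("|")
        if len(values) != len(FIELDS):
            raise ValueError("unexpected sacct line: " + line)
        rows.append(dict(zip(FIELDS, values)))
    return rows


def completed_cleanly(rows):
    """True only if there is at least one step and all of them completed.

    Assumption: a state other than COMPLETED (for example FAILED or
    OUT_OF_MEMORY) or a non-zero exit code means the output is suspect.
    """
    return bool(rows) and all(
        row["State"] == "COMPLETED" and row["ExitCode"] == "0:0"
        for row in rows
    )
```

On a cluster this would be called as completed_cleanly(parse_sacct(fetch_sacct(job_id))), where job_id is the ID printed by sbatch. Comparing the MaxRSS and ReqMem columns by eye is also a quick way to see how close each step came to its memory limit.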

malex
  • This is great news, but is there an explanation or references to support it? I've heard a lot of confusing information on this topic and I'd love an answer I can rely on. – Nick S Feb 12 '20 at 22:48