I have a question for those of you familiar with the Slurm scheduler. Sometimes I get the following error message: "slurmstepd: error: Exceeded step memory limit at some point."
I know it means my process needed more memory than it was allocated. Nonetheless, the scheduler doesn't kill the process, and the error often seems innocuous: the program runs to completion and the output files look fine.
Should I always assume that the output is faulty and rerun the program whenever I get that error message? And why can the allocated memory sometimes be exceeded without the program being killed?
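
In case it helps, here is a minimal sketch of how I find which runs printed the message. It assumes the job logs use Slurm's default `slurm-<jobid>.out` naming and that stderr ends up in the same file; the function name `logs_with_memory_warning` and the `pattern` parameter are just names I made up for illustration.

```python
from pathlib import Path

# The message text quoted above, without the trailing "at some point."
MESSAGE = "slurmstepd: error: Exceeded step memory limit"


def logs_with_memory_warning(log_dir, pattern="slurm-*.out"):
    """Return the sorted names of log files in log_dir that contain MESSAGE.

    Assumption: logs follow the default slurm-<jobid>.out naming and
    stderr is written to the same file as stdout.
    """
    hits = []
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a log has stray bytes
        text = path.read_text(errors="replace")
        if MESSAGE in text:
            hits.append(path.name)
    return hits


if __name__ == "__main__":
    # Scan the current directory and list the affected logs
    for name in logs_with_memory_warning("."):
        print(name)
```

This only tells me which jobs printed the message, not whether their output can be trusted, which is the part I am asking about.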