
On an HPC cluster I am trying to run multiple bash scripts (permute2.sh) from one bash script using GNU parallel; however, it doesn't complete every job. It finishes some jobs at random while it stays stuck on the others.

permute1.sh:

PROCS=144
permutations=1000
seq 1 $permutations | parallel -j $PROCS sh permute2.sh {}

permute2.sh (takes 100 random lines from a file and performs some actions on them for each permutation)

id=$1
randomlines=100
awk 'BEGIN{srand();} {a[NR]=$0}
END{for(I=1;I<='$randomlines';I++){x=int(rand()*NR)+1;print a[x];}}' \
FILE.txt > results/randomlines.$id.txt

# do stuff with randomlines.$id.txt.. 
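As an aside, if GNU coreutils is available on the cluster, shuf does the same sampling in one step (note the slight semantic difference: shuf -n samples without replacement, while the awk loop above samples with replacement). A minimal sketch mirroring the script above:

```shell
# permute2.sh alternative: draw $randomlines random lines from FILE.txt
id=$1
randomlines=100
shuf -n "$randomlines" FILE.txt > "results/randomlines.$id.txt"
```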

When I run permute1.sh I can see it creates 144 files, for each cpu 1 (randomlines.1.txt - randomlines.144.txt), but most of them are empty and stopped working, and some are completed. What am I doing wrong?

Jeff Schaller
tafelplankje
  • Are you running this job as part of a batch file on your cluster? Are there limitations set by the resource manager that limit the number of processes or files you can execute? – jsbillings Dec 18 '12 at 19:57
  • I tested your programs on my laptop and they finished with no problems, so it is likely a problem limited to your system. My guess would be file handles. Use --joblog logfile to see which jobs failed. Use --retries 3 to try a failed job 3 times. – Ole Tange Dec 20 '12 at 11:42
  • When I just qsub the jobs, they all get completed perfectly, so I got it working now. There must be some limitations indeed, but I'm not getting a response from the helpdesk (it's a PBS system, btw). I'll just not use parallel on these occasions. – tafelplankje Mar 15 '13 at 18:32
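Following Ole Tange's suggestion above, the logging and retry flags slot into permute1.sh like this (--joblog and --retries are real GNU parallel options; the trailing awk line is an illustrative helper that scans the log, whose seventh column holds each job's exit value):

```shell
PROCS=144
permutations=1000
seq 1 "$permutations" | \
  parallel -j "$PROCS" --joblog permute.log --retries 3 sh permute2.sh {}
# list jobs whose exit value (column 7 of the joblog) was non-zero
awk 'NR > 1 && $7 != 0' permute.log
```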

1 Answer


Your ulimit -u (the maximum number of user processes) is less than 144, so not all of parallel's job slots can spawn a process. Have an admin raise it.
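To confirm this on the cluster, check the limit from inside the batch environment (limits there may differ from the login node) and compare it against the -j value handed to parallel:

```shell
# show the per-user process limit for the current shell;
# prints a number, or "unlimited" if no cap is set
ulimit -u
```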

Michael Mrozek
user39122