
I have large batches of bash processes. Each bash script invokes executables whose stdout is redirected to a distinct log file. About 5% of the runs fail with: `sh: [name of log]: Resource temporarily unavailable`. I tried reducing the number of jobs running in parallel, but the error still occurred in some of the bash scripts.

Additional info:

  • Ubuntu 14.04 LTS running on VM using ESXi
  • Happens on a new partition, allocated with gparted and LVM (new logical volume consisting of the entire partition)
  • The LV is exported using nfs-kernel-server
  • The LV is also shared to windows using Samba
  • The LV is formatted using ext4
  • I have admin rights on this machine

More detailed info

  • Everything is run in a cluster, using Sun-Grid-Engine
  • There are 4 virtual machines: m1, m2, m3, m4
  • m1 runs sge master, sge exec, and ldap server
  • m2, m3, m4 run sge exec
  • m3 runs nfs-kernel-server, exporting a home folder sitting in logical volume (using LVM) that uses a partition on a local disk, to m1, m2, m4
  • m3 has a soft link to the home folder
  • m1, m2, m4 mount the home folder through fstab, so all machines end up pointing to the same home folder
  • m3, m2, m4 run ldap clients, connecting to m1
  • All jobs are submitted to the cluster through m1 (configured as a submission host)
  • Jobs fail exclusively on m3 (which exports the disk). Most of the jobs on m3 are passing though. Failures are random, but consistently on m3 alone.
  • m3 also shares the home via samba to windows clients

Any help would be greatly appreciated :) (how to debug, which logs are relevant, how to get more info out of the system, etc...)
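For reference, here is the kind of monitoring I can run while the batch executes; this is only a sketch (the sample count, interval, and output path are arbitrary choices), sampling the system-wide file-handle counters and the per-shell open-file limit:

```shell
#!/bin/sh
# Sample system-wide and per-shell file-descriptor usage a few times while
# the batch is running. Sample count, interval, and output file are arbitrary.
out=/tmp/fd-usage.log
: > "$out"
i=0
while [ "$i" -lt 3 ]; do
    # /proc/sys/fs/file-nr: allocated handles, free handles, system maximum
    printf '%s file-nr: %s\n' "$(date +%T)" "$(cat /proc/sys/fs/file-nr)" >> "$out"
    # per-process soft limit on open files for this shell
    printf '%s ulimit -n: %s\n' "$(date +%T)" "$(ulimit -n)" >> "$out"
    i=$((i + 1))
    sleep 1
done
```

Comparing the sampled numbers against the limits at the moment a job fails should show whether a limit is actually being hit.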

Thank you in advance!

lev haikin
  • I think you have reached the max number of open file descriptors; you can check with `sysctl fs.file-nr` during batch processing. – alexises Dec 31 '14 at 08:02
  • @alexises Thanks for your comment. A dry run (without actually running the jobs) gave me quite large numbers: 8240 0 3272038. Do you think it can really all be used? BTW, similar jobs run in parallel on other machines (using SGE), which access the disk through NFS mounts, and never have this problem. It only happens on jobs allocated to run on the host NFS machine. – lev haikin Dec 31 '14 at 08:19
  • I agree with alexises, could you post the output of ulimit -n – Thorsten Staerk Dec 31 '14 at 08:24
  • @ThorstenStaerk ulimit -n gave 1024. While running the jobs, sysctl fs.file-nr gave numbers as high as 9200. – lev haikin Dec 31 '14 at 08:38
  • that means you can have up to 1024 open files. Run ulimit -a to see. Run lsof to see which files are open right now. Try setting ulimit -n 8096 and see if the situation improves. – Thorsten Staerk Dec 31 '14 at 09:11
  • @ThorstenStaerk Did you mean 8192 (or shouldn't it be a power of 2)? I increased to 8192, and I'm still getting 1 consistent failure (which is an improvement). `lsof | wc -l` returns 49853, and ~50500 at peaks overall. I'm guessing that I'm only interested in the `lsof | grep | wc -l`? That gave ~2500 at peaks. I increased even more, up to 16384 - but I got 4 failures now. What I don't understand is that the peaks are much lower than the limit, so why did it consistently fail when I had a limit of 8192? – lev haikin Dec 31 '14 at 09:33
  • I realized now that `ulimit -n` only affects the current shell and shells spawned from it. That means it didn't have any effect on my jobs, because they are invoked by SGE (Sun-Grid-Engine). I think I need to either add this to /etc/security/limits.conf (need to reboot?), or add a `ulimit -n 16384` to /etc/init.d/gridengine-exec somewhere before it starts. Will try that out. – lev haikin Dec 31 '14 at 10:53
  • no luck. Still one/two jobs fail :( – lev haikin Dec 31 '14 at 11:00
  • yes you need to reboot. Also you checked that the new value is active now, right? – Thorsten Staerk Jan 01 '15 at 14:43
  • http://stackoverflow.com/questions/23803182/bash-fork-retry-resource-temporarily-unavailable suggests you are running out of sub-tasks or out of memory... or out of max allowed locked memory - try ulimit -a to find out about max locked memory. – Thorsten Staerk Jan 01 '15 at 14:49
  • @ThorstenStaerk Thanks. I already increased ulimit -l unlimited. I also saw lots of documentation regarding fork limit. This doesn't seem the case here though, because when it's a fork issue, then it specifically states **fork**: _resource temporarily unavailable_... – lev haikin Jan 01 '15 at 18:46
  • then use sar or vmstat to monitor how much memory is available when the issue happens – Thorsten Staerk Jan 01 '15 at 18:54
  • Can your shell script detect when it has failed to invoke one of these executables, for example by checking `$?` ? If so, can you have the shell run `ls /proc/$$/fd | wc -l` when this happens? – Mark Plotnick Jan 02 '15 at 11:36
  • counting file-descriptors gave low numbers (~6). Also, sar gave results similar to lsof. In addition, I invoked fuser on the logs to perhaps catch contentions, but there was always only one process trying to write to each log. I also tried mount on m3 instead of soft link. Issue persists. – lev haikin Jan 04 '15 at 11:41
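To make the limit change from the comments above persistent, the usual place is /etc/security/limits.conf. A sketch (the `*` scope and the 16384 value, matching the experiment above, are choices, not requirements):

```
# /etc/security/limits.conf -- raise the per-process open-file limit.
# "*" applies to every user; narrow it to the SGE execution user if preferred.
*    soft    nofile    16384
*    hard    nofile    16384
```

Note that limits.conf is applied by pam_limits to new login sessions; a daemon like sge_execd started from an init script may not go through PAM, so the `ulimit -n` line in /etc/init.d/gridengine-exec (as tried above), followed by a daemon restart, may still be needed. The limit actually in force for a job can be checked from inside it with `grep 'open files' /proc/self/limits`.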

1 Answer


Thanks everyone who tried to help!

The issue was solved by mounting the logical volume on m3 over NFS, in exactly the same way as on the rest of the machines (m1/m2/m4, which are NFS clients), instead of having a soft link on m3 to the logical volume. Simply add the following line to /etc/fstab: `<nfs server>:/ /mnt nfs auto 0 0` and then invoke `sudo mount -a`.

The hint was in the fact that failures happened consistently on m3, which is the NFS server, and that automatically resubmitting failed jobs also worked around the issue. There were never failures on m1/m2/m4 (the NFS clients). Remember that m3 is the NFS server and had a simple soft link to the logical volume, while all the clients access the logical volume over NFS.
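Before I found the mount change, the observation that resubmitting failed jobs worked could itself be scripted as a stopgap. A sketch of such a wrapper (my own; the function name and retry count are arbitrary, and note it retries on *any* nonzero exit, not just the redirection failure):

```shell
#!/bin/sh
# Retry a command whose output redirection may transiently fail with
# EAGAIN ("Resource temporarily unavailable"). Not an SGE feature.
run_with_retry() {
    log=$1; shift
    attempt=1
    while [ "$attempt" -le 5 ]; do
        if "$@" > "$log" 2>&1; then
            return 0
        fi
        echo "attempt $attempt failed for $log, retrying" >&2
        attempt=$((attempt + 1))
        sleep 1
    done
    return 1
}
```

Usage would be e.g. `run_with_retry /path/to/job.log my_executable --args` (paths and names are placeholders) in place of `my_executable --args > /path/to/job.log`.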

In the back of my head I had the feeling that NFS probably protects its clients from these issues, but I thought that the filesystem on the logical volume shouldn't fail, and if it does then I have a real problem that I must root-cause. That might still be the case, by the way.

If you have insights about the issue and the solution, please write. I don't want to mask issues if they are real.

lev haikin