I have a bash function that I apply to several folders:

function task(){
  do_thing1
  do_thing2
  do_thing3
  ...
}

I want to run that function in parallel, one call per folder. So far I have been using a small fork trick:

N=4  # number of jobs to run at once (one per core)
for temp_subj in "${raw_dir}"/MRST*
do
  ((i=i%N)); ((i++==0)) && wait
  task "$temp_subj" &
done
wait  # let the last batch finish before moving on

This works well, but I wanted something cleaner and switched to GNU parallel:

ls -d ${raw_dir}/MRST* | parallel task {}

The problem is that it seems to run EVERYTHING in parallel, including the do_thing commands inside my task function, and it inevitably crashes because those have to be executed serially. I tried modifying the call to parallel in many ways, but nothing seems to work. Any ideas?

Orchid
  • @thanasisp I tried many suggestions from that post that use GNU parallel, but it still won't work. – Orchid Nov 20 '20 at 04:12
  • Your current solution executes the function in batches of 4: it starts 4 of them, waits for all of them to finish, then starts the next 4, and so on. If you want to keep at most 4 calls of the function running at any time, use `xargs`. You have to `export` the function and use `xargs -n1 -P4` calling a subshell, as in this [post](https://unix.stackexchange.com/a/158569/216907). If you want to use parallel, the approach is similar, as in this [post](https://unix.stackexchange.com/a/104008/216907). Also, do not build the arguments from `ls` output; use `find` or glob expressions. – thanasisp Nov 20 '20 at 20:21
  • @thanasisp thanks a lot for the quick reply and your suggestions. I tried `xargs -n1 -P4` with a subshell but it has the same issue as `parallel`. All the `do_thing` commands within my function are being executed at the same time as if the function is being treated just as a list of commands instead of "as a whole" just like using a `&` would do. – Orchid Nov 20 '20 at 23:02
  • Of course, `task` is the minimum unit that will be executed in your example. When you say in your description that `task "$temp_subj" &` works well, it means you are ok with that. You run 4 tasks in parallel and each task intentionally is not running any of its commands asynchronously. If you mean to parallelize the calls inside `task` then you have to rephrase the question. – thanasisp Nov 20 '20 at 23:08
  • @thanasisp. " If you mean to parallelize the calls inside task then you have to rephrase the question" No I don't and that is my problem because parallel and xargs are doing it by default, or so it seems. – Orchid Nov 21 '20 at 04:08
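The `export` plus `xargs -n1 -P4` approach suggested in the comments can be sketched as follows. This is a minimal illustration, not the asker's real pipeline: the body of `task`, the `step1`..`step3` lines, the throwaway folders and the `log` variable are all hypothetical stand-ins, and `-P`/`-0` are extensions found in GNU and BSD `xargs`. The point it tries to show is that the unit of parallelism is one whole call to `task`, so the steps inside a single call still run one after another.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real task: three steps that must run in order.
task() {
  local subj=$1
  echo "$subj step1"
  sleep 0.1
  echo "$subj step2"
  sleep 0.1
  echo "$subj step3"
}
export -f task   # make the function visible to the child bash started by xargs

# Demo input (assumption): four throwaway folders named like those in the question.
raw_dir=$(mktemp -d)
mkdir "$raw_dir"/MRST1 "$raw_dir"/MRST2 "$raw_dir"/MRST3 "$raw_dir"/MRST4

# Keep at most 2 tasks running at once; a new one starts as soon as one ends.
# Each task is a single bash process, so its steps stay serial.
log=$(printf '%s\0' "$raw_dir"/MRST* | xargs -0 -n1 -P2 bash -c 'task "$1"' _)
echo "$log"
```

Lines from different folders may interleave in the output, but for any one folder the steps should always appear in the order step1, step2, step3.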

1 Answer

I think your problem is related to do_thingX:

do_thing() { echo Doing "$@"; sleep 1; echo Did "$@"; }
export -f do_thing
do_thing1() { do_thing 1 "$@"; }
do_thing2() { do_thing 2 "$@"; }
do_thing3() { do_thing 3 "$@"; }
# Yes you can name functions ... - it is a bit unconventional, but it works
...() { do_thing ... "$@"; }
export -f do_thing1
export -f do_thing2
export -f do_thing3
export -f ...

function task(){
  do_thing1
  do_thing2
  do_thing3
  ...
}
export -f task
# This should take 4 seconds for a single input
ls -d "${raw_dir}"/MRST* | time parallel task {}

Otherwise, you may be using a different parallel than GNU Parallel. Check that it is GNU Parallel with:

$ parallel --version
GNU parallel 20201122
Copyright (C) 2007-2020 Ole Tange, http://ole.tange.dk and Free Software
Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: https://www.gnu.org/software/parallel

When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --citation'.
Ole Tange
  • Thanks a lot Ole for your reply. The do_thingX in my function are actually commands from another toolbox, not functions I wrote. I ended up moving my task function into a .sh file and calling: `find ${raw_dir} -name "MRST*" | parallel -j+0 bash mytask.sh {/}` and it's working! I just thought I could put everything in one file, e.g. a function plus the parallel command that calls the function. – Orchid Nov 24 '20 at 18:21
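For the single-file variant mentioned in the last comment, a rough sketch might look like the one below. It assumes the script is run with bash and that `parallel` on the PATH is GNU Parallel; the body of `task`, the `thing1`..`thing3` lines, the demo folders and the `out` variable are hypothetical placeholders for the real toolbox commands and data.

```shell
#!/usr/bin/env bash
# Single-file sketch: the function and the GNU Parallel call live together.

# Hypothetical stand-in for the real task; the echo lines replace the
# toolbox commands, which must run one after another.
task() {
  local subj=$1
  echo "$subj: thing1"
  echo "$subj: thing2"
  echo "$subj: thing3"
}
export -f task   # exported so the bash started by GNU Parallel can see it

# Throwaway demo folders (assumption) named like the ones in the question.
raw_dir=$(mktemp -d)
mkdir "$raw_dir"/MRST_a "$raw_dir"/MRST_b "$raw_dir"/MRST_c

# Only run if `parallel` really is GNU Parallel (other tools share the name).
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # {/} is the basename of each input; -k prints results in input order;
  # -j2 runs at most two tasks at once.
  out=$(parallel -k -j2 task {/} ::: "$raw_dir"/MRST*)
  echo "$out"
else
  echo "GNU Parallel not found; skipping" >&2
fi
```

Passing the folders after `:::` avoids parsing `ls` output. Because each job is one call to `task`, its three steps run in sequence, and with `-k` the output of each folder is printed as a block in input order.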