
I have a bunch of PNG images in a directory, and an application called pngout that I run to compress them. pngout is called by a script I wrote. The problem is that this script processes only one file at a time, something like this:

FILES=(./*.png)
for f in "${FILES[@]}"
do
        echo "Processing $f file..."
        # take action on each file. $f stores the current file name
        ./pngout -s0 "$f" "R${f/\.\//}"
done

Processing one file at a time takes a long time, and while the script runs the CPU sits at only about 10%. So I discovered that if I split the files into 4 batches, put each batch in its own directory, and launch the script from four terminal windows, I get four instances processing the images at the same time and the job takes 1/4 of the time.

The second problem is that I waste time splitting the images into batches, copying the script to four directories, opening four terminal windows, and so on.

How can I do this with a single script, without having to split anything up?

I mean two things. First: how do I launch a background process from a bash script (is it just a matter of adding & to the end of the line)? Second: how do I stop sending tasks to the background after the fourth one, and make the script wait until a task ends before sending the next, so that there are always exactly 4 tasks running in parallel? If I don't do that, the loop will fire zillions of tasks into the background and clog the CPU.

Jeff Schaller
Duck

4 Answers


If you have a copy of xargs that supports parallel execution with -P, you can simply do

printf '%s\0' *.png | xargs -0 -I {} -P 4 ./pngout -s0 {} R{}

For other ideas, the Wooledge Bash wiki has a section in the Process Management article describing exactly what you want.

jw013
  • There are also "gnu parallel" and "xjobs" designed for this case. It's mostly a matter of taste which you prefer. – wnoise Apr 01 '12 at 04:41
  • Could you please explain the proposed command? Thanks! – Eugene S Apr 01 '12 at 07:25
  • @EugeneS Could you be a bit more specific about what part? The printf collects all png files and passes them via a pipe to xargs, which collects arguments from standard input and combines them into arguments for the `pngout` command the OP wanted to run. The key option is `-P 4`, which tells xargs to use up to 4 concurrent commands. – jw013 Apr 01 '12 at 10:05
  • Sorry for not being precise. I was specifically interested why did you use `printf` function here rather than just regular `ls .. | grep .. *.png`? Also I was interested in the `xargs` parameters you used (`-0` and `-I{}`). Thanks! – Eugene S Apr 01 '12 at 11:12
  • @EugeneS It's for maximum correctness and robustness. File names are not lines, and [`ls` cannot be used to parse filenames portably and safely](http://mywiki.wooledge.org/ParsingLs). The only safe characters to use to delimit file names are `\0` and `/`, since every other character, including `\n`, can be part of the file name itself. The `printf` uses `\0` to delimit file names, and the `-0` informs `xargs` of this. The `-I{}` tells `xargs` to replace `{}` with the argument. – jw013 Apr 01 '12 at 18:23
  • Great! Thank you for your detailed explanation! – Eugene S Apr 02 '12 at 09:36
  • @jw013 - just one question: this answer you posted will just fire another 4 processes when all the original processes are done or will one new process be fired for each one that is done? – Duck Apr 02 '12 at 13:39
  • @DigitalRobot You can try it for yourself but the `xargs` man page seems to imply it will try to keep 4 processes running at all times. – jw013 Apr 02 '12 at 17:43

In addition to the solutions already proposed, you can create a makefile that describes how to make a compressed file from an uncompressed one, and use make -j 4 to run 4 jobs in parallel. The catch is that you will need to name the compressed and uncompressed files differently, or store them in different directories, otherwise writing a reasonable make rule will be impossible.
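No makefile is shown in the thread; here is a minimal sketch, assuming the compressed copies take an R prefix as in the original script (names and paths are illustrative):

```make
# Compressed files get an R prefix so make can tell
# inputs and outputs apart.
SRC := $(wildcard *.png)
DST := $(addprefix R,$(SRC))

all: $(DST)

# Pattern rule: Rfoo.png is built from foo.png
R%.png: %.png
	./pngout -s0 $< $@
```

Running `make -j4` then keeps four pngout jobs going, starting a new one as soon as any finishes, and as a bonus it skips files that were already compressed on a previous run.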

9000

If you have GNU Parallel http://www.gnu.org/software/parallel/ installed, you can do this:

parallel ./pngout -s0 {} R{} ::: *.png

You can install GNU Parallel simply by:

wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem

Watch the intro videos for GNU Parallel to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Ole Tange

To answer your two questions:

  • yes, adding & at the end of the line will instruct your shell to launch the command as a background process.
  • using the wait command, you can ask the shell to wait for all the processes in the background to finish before proceeding any further.

Here's the script modified so that j keeps track of the number of background processes. When nb_concurrent_processes is reached, the script resets j to 0 and waits for all the background processes to finish before resuming its execution.

files=(./*.png)
nb_concurrent_processes=4
j=0
for f in "${files[@]}"
do
        echo "Processing $f file..."
        # take action on each file. $f store current file name
        ./pngout -s0 "$f" R"${f/\.\//}" &
        ((++j == nb_concurrent_processes)) && { j=0; wait; }
done
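One limitation of this batch approach is that each group of four waits for its slowest member before the next group starts. With bash ≥ 4.3 you can use `wait -n`, which returns as soon as any single background job exits, to keep four jobs running at all times. A sketch, using a stub `compress` function in place of pngout so it runs anywhere:

```shell
#!/usr/bin/env bash
# Stub standing in for ./pngout so the sketch is runnable;
# in the real script the body would be: ./pngout -s0 "$1" "R${1/\.\//}"
tmp=$(mktemp)
compress() { sleep 0.1; echo "done $1" >> "$tmp"; }

max_jobs=4
for f in img1.png img2.png img3.png img4.png img5.png img6.png; do
    compress "$f" &
    # While the slots are full, wait -n (bash >= 4.3) blocks until
    # ANY job exits, so a replacement starts as soon as one frees up
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        wait -n
    done
done
wait   # let the remaining jobs finish
```

This keeps the pool full instead of draining it every four files; on older bash versions you would need to track individual PIDs instead.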
jw013
Frederik Deweerdt
  • This will wait for the last of the four concurrent processes and will then start a set of another four. Perhaps one should build an array of four PIDs and then wait for these specific PIDs? – Nils Mar 31 '12 at 19:59
  • Just to explain my fixes to the code: (1) As a matter of style, avoid all uppercase variable names as they potentially conflict with internal shell variables. (2) Added quoting for `$f` etc. (3) Use `[` for POSIX compatible scripts, but for pure bash `[[` is always preferred. In this case, `((` is more appropriate for the arithmetic. – jw013 Mar 31 '12 at 20:30