Executing piped commands in parallel

Question

Consider the following scenario. I have two programs, A and B. Program A writes lines of text to stdout, while program B processes lines from stdin. The way to use these two programs is of course:

foo@bar:~$ A | B

Now I've noticed that this eats up only one core; hence I am wondering:

Are programs A and B sharing the same computational resources? If so, is there a way to run A and B concurrently?

Another thing that I've noticed is that A runs much, much faster than B, hence I am wondering if I could somehow run more B programs and let them process the lines that A outputs in parallel.

That is, A would output its lines, and there would be N instances of program B that would read these lines (whichever instance reads them first), process them, and write the results to stdout.

So my final question is:

Is there a way to distribute the output of A among several B processes without having to take care of race conditions and other inconsistencies that could potentially arise?

While `A | B | C` is parallel in the sense of separate processes, due to the nature of pipes (B has to wait for output of A, C has to wait for output of B) it may still be linear in some cases. It entirely depends on what kind of output they produce. There aren't many cases where running multiple `B` would help much. It's entirely possible that the parallel wc example is slower than regular `wc`, as splitting may take more resources than counting lines normally. Use with care. — frostschutz, Jun 15 '13 at 11:39

Ole Tange · Accepted Answer · 2021-07-08T18:56:42.040

22

A problem with split --filter is that the output can be mixed up, so you get half a line from process 1 followed by half a line from process 2.

GNU Parallel guarantees there will be no mixup.

So assume you want to do:

 A | B | C

But that B is terribly slow, and thus you want to parallelize that. Then you can do:

A | parallel --pipe B | C

By default, GNU Parallel splits on \n and uses a block size of 1 MB. This can be adjusted with --recend and --block.
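
As a rough illustration of the pipeline above, here is a minimal sketch in which seq stands in for A, wc -l for the slow B, and awk for C. These stand-ins and the 100K block size are assumptions chosen for the example, not part of the original answer. The guard at the top only checks that the installed parallel is GNU Parallel.

```shell
#!/usr/bin/env bash
# Sketch only: seq stands in for A, wc -l for the slow B, awk for C.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # --pipe  : split stdin into blocks and feed each block to its own wc -l
    # --block : approximate block size (1M by default); blocks end at a newline
    # -k      : print the jobs' output in the order of the input blocks
    total=$(seq 1 100000 | parallel --pipe --block 100K -k wc -l |
            awk '{ sum += $1 } END { print sum }')
else
    total="GNU Parallel not installed"
fi
echo "$total"
```

Because each block ends on a record boundary, every wc -l sees only whole lines, so the per-block counts add up to the number of lines that seq produced.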

You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/

You can install GNU Parallel in just 10 seconds with:

$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
   fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 883c667e01eed62f975ad28b6d50e22a
12345678 883c667e 01eed62f 975ad28b 6d50e22a
$ md5sum install.sh | grep cc21b4c943fd03e93ae1ae49e28573c0
cc21b4c9 43fd03e9 3ae1ae49 e28573c0
$ sha512sum install.sh | grep da012ec113b49a54e705f86d51e784ebced224fdf
79945d9d 250b42a4 2067bb00 99da012e c113b49a 54e705f8 6d51e784 ebced224
fdff3f52 ca588d64 e75f6033 61bd543f d631f592 2f87ceb2 ab034149 6df84a35
$ bash install.sh

Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

edited Jul 08 '21 at 18:56

answered Jun 15 '13 at 12:27

Ole Tange


While I strongly disagree on the installation method :-), +1 because your solution solves most of the problems with mine. – LSerni Jun 15 '13 at 12:46
This one is nice indeed. Do you also have any suggestions for the parameters to be used? I know program A will output more than 1 TB of data, at approx. 5 GB per minute. Program B processes data 5 times slower than A outputs it, and I have 5 cores at my disposal for this task. – Jernej Jun 15 '13 at 12:50
GNU Parallel can currently at most handle around 100 MB/s, so you are going to touch that limit. The optimal `--block-size` will depend on the amount of RAM and how fast you can start a new `B`. In your situation I would use `--block 100M` and see how that performed. – Ole Tange Jun 15 '13 at 13:11
@lserni Can you come up with an installation method that is better, which works on most UNIX machines and requires similar amount of work from the user? – Ole Tange Jun 15 '13 at 13:15
Sorry, I did not make myself clear. The installation method - the script passed to `sh` - is great. The problem lies in passing it to sh: *downloading and running executable code from a site*. Mind you, maybe I'm just being too paranoid, since one could object that a custom-made RPM or DEB is basically the same thing, and even posting the code on a page to be copied and pasted would result in people doing so blindly anyway. – LSerni Jun 15 '13 at 13:43
@OleTange After running a program with parallel for some time (10 days) I got the following message: parallel: Warning: No more processes: Decreasing number of running jobs to 1. Raising ulimit -u may help. Now the program is not run in parallel anymore. What is going on? Is there any way to fix this on the fly? – Jernej Aug 13 '13 at 09:46

score 14 · Answer 2 · edited Jun 15 '13 at 13:44

When you write A | B, both processes already run in parallel. If you see them using only one core, that's probably either because of CPU affinity settings (perhaps there is some tool to spawn a process with a different affinity) or because one process isn't enough to keep a whole core busy, and the system "prefers" not to spread out the computing.
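
One way to check this is to time a pipeline whose stages each take a known amount of time. A minimal sketch, assuming bash, where producer and consumer are hypothetical stand-ins for A and B that simply sleep for two seconds:

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for A and B: each one "works" for two seconds.
producer() { sleep 2; echo "line from A"; }
consumer() { sleep 2; cat; }

start=$SECONDS                  # bash's built-in counter of elapsed seconds
out=$(producer | consumer)      # the shell starts both stages together
elapsed=$(( SECONDS - start ))

echo "output:  $out"
echo "elapsed: ${elapsed}s"     # about 2 s if concurrent, about 4 s if serial
```

If the stages ran one after the other, the total would be roughly the sum of the two sleeps. Since they overlap, it stays close to the longer one.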

To run several B's with one A, you need a tool such as split with the --filter option:

A | split [OPTIONS] --filter="B"

This, however, is liable to mess up the order of lines in the output, because the B jobs won't all be running at the same speed. If this is an issue, you might need to redirect the output of the i-th B to an intermediate file and stitch the files together at the end using cat. This, in turn, may require considerable disk space.
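
The intermediate-file approach could look roughly like the sketch below. It assumes GNU split, uses seq as a stand-in for A and a trivial sed as a stand-in for B, and the temporary directory and the part. prefix are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch only: seq stands in for A, and sed (prefixing "B: ") stands in for B.
tmp=$(mktemp -d)

# GNU split exports $FILE to each filter. The single quotes keep the calling
# shell from expanding it, so every worker writes to its own output file.
seq 1 1000 | split -n r/4 --filter='sed -e "s/^/B: /" > "$FILE.out"' - "$tmp/part."

# Stitch the per-worker files together after split (and its workers) have exited.
cat "$tmp"/part.*.out > "$tmp/all.out"
stitched=$(wc -l < "$tmp/all.out")
echo "$stitched lines"
rm -r "$tmp"
```

Each worker's lines stay intact and contiguous this way, but note that cat yields them grouped by worker rather than in A's original order, because r/4 deals the input lines out round-robin.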

Other options exist (e.g. you could limit each instance of B to a single line-buffered output, wait until a whole "round" of B's has finished, run the equivalent of a reduce to split's map, and cat the temporary output together), with varying levels of efficiency. The 'round' option just described for example will wait for the slowest instance of B to finish, so it will be greatly dependent on the available buffering for B; [m]buffer might help, or it might not, depending on what the operations are.

Examples

Generate the first 1000 numbers and count the lines in parallel:

seq 1 1000 | split -n r/10 -u --filter="wc -l"
100
100
100
100
100
100
100
100
100
100

If we were to "mark" the lines, we'd see that each first line is sent to process #1, each fifth line to process #5 and so on. Moreover, in the time it takes split to spawn the second process, the first is already a good way into its quota:

seq 1 1000 | split -n r/10 -u --filter="sed -e 's/^/$RANDOM - /g'" | head -n 10
19190 - 1
19190 - 11
19190 - 21
19190 - 31
19190 - 41
19190 - 51
19190 - 61
19190 - 71
19190 - 81
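
In the command above, the double-quoted $RANDOM is expanded once by the calling shell, so every worker prints the same number and only the line numbers reveal which worker handled a line. A variant that makes the assignment visible is sketched below. It tags each line with $FILE, the chunk name that GNU split passes to each filter, and sorts the result so the output order is stable. The small input size is an assumption made for readability.

```shell
#!/usr/bin/env bash
# Tag every line with the chunk name that GNU split exports to the filter as $FILE.
# Single quotes defer the expansion to the shell that split starts for each worker.
tagged=$(seq 1 8 | split -n r/4 -u --filter='sed -e "s/^/$FILE - /"' | sort -k3,3n)
echo "$tagged"
# xaa - 1
# xab - 2
# xac - 3
# xad - 4
# xaa - 5
# xab - 6
# xac - 7
# xad - 8
```

Lines 1 and 5 carry the tag of the first worker, lines 2 and 6 that of the second, and so on, which is the round-robin distribution described above.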

When executing on a 2-core machine, seq, split and the wc processes share the cores; but looking closer, the system leaves the first two processes on CPU0, and divides CPU1 among the worker processes:

%Cpu0  : 47.2 us, 13.7 sy,  0.0 ni, 38.1 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 15.8 us, 82.9 sy,  0.0 ni,  1.0 id,  0.0 wa,  0.3 hi,  0.0 si,  0.0 st
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 5314 lserni    20   0  4516  568  476 R 23.9  0.0   0:03.30 seq
 5315 lserni    20   0  4580  720  608 R 52.5  0.0   0:07.32 split
 5317 lserni    20   0  4520  576  484 S 13.3  0.0   0:01.86 wc
 5318 lserni    20   0  4520  572  484 S 14.0  0.0   0:01.88 wc
 5319 lserni    20   0  4520  576  484 S 13.6  0.0   0:01.88 wc
 5320 lserni    20   0  4520  576  484 S 13.3  0.0   0:01.85 wc
 5321 lserni    20   0  4520  572  484 S 13.3  0.0   0:01.84 wc
 5322 lserni    20   0  4520  576  484 S 13.3  0.0   0:01.86 wc
 5323 lserni    20   0  4520  576  484 S 13.3  0.0   0:01.86 wc
 5324 lserni    20   0  4520  576  484 S 13.3  0.0   0:01.87 wc

Notice especially that split is eating a considerable amount of CPU. This will decrease in proportion to A's needs; i.e., if A is a heavier process than seq, the relative overhead of split will decrease. But if A is a very lightweight process and B is quite fast (so that you need no more than 2-3 B's to keep up with A), then parallelizing with split (or pipes in general) might well not be worth it.

Interesting that the split found on Ubuntu does not have the --filter option. What kind of OS are you using for this? — Jernej, Jun 15 '13 at 11:15
Linux OpenSuSE 12.3, with coreutils ( http://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html ). I'll try and get hold of an Ubuntu, they might have changed the name to accommodate some similarly-named tool. — LSerni, Jun 15 '13 at 11:20
Are you sure about the `split` `--filter` option missing? On my Ubuntu 12.04-LTS ("wheezy/sid"), it is there, and my examples do work. Could you have installed a different `split` than the one in GNU coreutils? — LSerni, Jun 15 '13 at 11:27
Thanks for this. I had to install a newer version of Coreutils. BTW, I have noticed that if I run program A alone it eats an entire core (100%), but if I run A | B they eat an entire core together, with process A at 15% and process B at 85%. Do you happen to see why this is so? — Jernej, Jun 15 '13 at 12:36
This is likely because of *blocking*. If B is heavier than A, then A can't send its output and is slowed down. Another possibility is A *yielding* to B during its operation (e.g. disk/net). On a different system, you might see B gobbling 100% of CPU1 and A being assigned 18% of CPU0. You probably need 85/15 ~ 5.67 = between 5 and 6 instances of B to get a single A instance to saturate a single core. I/O, if present, might skew these values, though. — LSerni, Jun 15 '13 at 12:45
