
I'm using xargs with the option --max-procs=0 (alternatively -P 0) to run the processes in parallel.

However, the output of the processes is merged into the stdout stream without regard for proper line separation. So I'll often end up with lines such as:

<start-of-line-1><line-2><end-of-line-1>

As I'm running egrep with ^ in my pattern on the whole xargs output, this is messing up my results.
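To see concretely why merged lines defeat the anchor, here is a hypothetical illustration (the `string-a`/`string-b` lines stand in for the real page output): once one line is spliced into the middle of another, neither resulting line starts with the pattern any more.

```shell
# Two clean lines: both match the anchored pattern.
printf 'string-a\nstring-b\n' | grep -c '^string'    # prints 2

# "string-b" spliced into the middle of "string-a": neither resulting
# line begins with "string", so the count drops.
printf 'strstring-b\ning-a\n' | grep -c '^string'    # prints 0
```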

Is there some way to force xargs to write the process outputs in order (any order, as long as the output of one process is contiguous)?

Or some other solution?

Edit: more details about the use case:

I want to download and parse web pages from different hosts. As every page takes about a second to load and there are a few dozen pages I want to parallelize the requests.

My command has the following form:

echo -n $IPs | xargs --max-args=1 -I {} --delimiter ' ' --max-procs=0 \
wget -q -O- http://{}/somepage.html | egrep --count '^string'

I use bash and not something like Perl because the host IPs (the $IPs variable) and some other data comes from an included bash file.
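For reference, the interleaving is easy to reproduce without wget: any writer that emits one logical line in more than one write() call can have another process's output spliced into it. This is a contrived reproduction, using sleeps to force the bad timing (assumes a sleep that accepts fractional seconds, as on GNU coreutils):

```shell
# Writer 1 emits its line in two separate writes, 0.2 s apart.
( printf '<start-of-line-1>'; sleep 0.2; printf '<end-of-line-1>\n' ) &
# Writer 2 emits one complete line in between.
( sleep 0.1; printf '<line-2>\n' ) &
wait
# Typical output:
#   <start-of-line-1><line-2>
#   <end-of-line-1>
```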

Jakuje
Christoph Wurm

2 Answers


GNU Parallel is specifically designed to solve this problem:

echo -n $IPs | parallel -d ' ' -j0 wget -q -O- http://{}/somepage.html | egrep --count '^string'

If your IPs are in a file, it is even prettier:

cat IPs | parallel -j0 wget -q -O- http://{}/somepage.html | egrep --count '^string'

To learn more watch the intro video: http://www.youtube.com/watch?v=OpaiGYxkSuQ
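Worth spelling out why this works: by default GNU parallel buffers each job's output and prints it only when the job finishes (the --group behaviour), so lines from different jobs never mix; adding -k additionally preserves the input order. A small sketch, assuming GNU parallel is installed:

```shell
# Each job's two lines stay together, and -k keeps jobs in input order.
printf '%s\n' a b c | parallel -k 'echo start-{}; echo end-{}'
# start-a
# end-a
# start-b
# end-b
# start-c
# end-c
```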

Ole Tange
    Nice tool! Also, I'm betting that someone will tell you that cat is useless very soon. – Stéphane Gimenez Jul 30 '11 at 19:31
  • 1
    I know. But I find it easier to read, and I usually work on 48 core machines, so the few extra clock cycles for one of the idle cores has yet to be a problem. – Ole Tange Jul 30 '11 at 19:55
  • parallel would be perfect for the job if it was in the Debian repositories. – Christoph Wurm Jul 31 '11 at 10:41
  • 1
    @Legate Debian includes the `parallel` command from [moreutils](http://kitenet.net/~joey/code/moreutils/), which is sufficient here: `parallel -j99 -i sh -c 'wget -q -O- http://{}/somepage.html | egrep -c "^string"' -- $IPs` – Gilles 'SO- stop being evil' Jul 31 '11 at 22:05
  • @Legate checkout https://build.opensuse.org/package/binaries?package=parallel&project=home%3Atange&repository=Debian_5.0 for a .deb file and http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=518696 for the bug to push. – Ole Tange Aug 03 '11 at 21:21
  • @Giles the `parallel` in moreutils is not GNU Parallel - which is the reason you need to do the `-i sh -c` circus. – Ole Tange Aug 03 '11 at 21:23
  • @Ole Thanks for the links. I read through the bug report, hopefully `parallel` will make it into the repositories soon. Unfortunately, I won't be able to use the deb from the OpenSUSE Build Service, as introducing a third-party package on the machine I need this on is not desirable. – Christoph Wurm Aug 04 '11 at 10:26
  • for me moreutils parallel didn't protect from mixing stdout, I had to install GNU version. – MateuszL Feb 24 '21 at 10:41

This should do the trick:

echo -n $IPs | xargs --max-args=1 -I {} --delimiter ' ' --max-procs=0 \
  sh -c "wget -q -O- 'http://{}/somepage.html' | egrep --count '^string'" | \
  { NUM=0; while read i; do NUM=$(($NUM + $i)); done; echo $NUM; }

The idea here is to count matches separately in each process and sum the counts at the end. Summing could go wrong if an individual count were long enough for its output to be interleaved, but each count is a single short write, so that should not happen here.

Stéphane Gimenez