I need to run grep on a couple of million files, so I tried to speed it up using the two approaches mentioned here: `xargs -P -n` and GNU parallel. I tried this on a subset of my files (9,026 of them), and this was the result:
With `xargs -P 8 -n 1000`, very fast:

```
$ time find tex -maxdepth 1 -name "*.json" | \
    xargs -P 8 -n 1000 grep -ohP "'pattern'" > /dev/null

real    0m0.085s
user    0m0.333s
sys     0m0.058s
```

With `parallel`, very slow:

```
$ time find tex -maxdepth 1 -name "*.json" | \
    parallel -j 8 grep -ohP "'pattern'" > /dev/null

real    0m21.566s
user    0m22.021s
sys     0m18.505s
```

Even sequential `xargs` is faster than `parallel`:

```
$ time find tex -maxdepth 1 -name "*.json" | \
    xargs grep -ohP 'pattern' > /dev/null

real    0m0.242s
user    0m0.209s
sys     0m0.040s
```
`xargs -P n` does not work for me because the output from all the processes gets interleaved, which does not happen with `parallel`. So I would like to use `parallel` without incurring this huge slowdown.
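To make the difference concrete, here is a minimal sketch (using the same placeholder `pattern` as above, not my real data): with `xargs -P` the eight grep workers all write to the shared stdout as they go, so their lines come out mixed together, whereas GNU parallel buffers each job's output and prints it as one block (`--group` is its default), which is why its output stays untangled:

```
# xargs -P: workers share stdout, so matches from different batches
# arrive mixed in whatever order the processes produce them
find tex -maxdepth 1 -name "*.json" | xargs -P 8 -n 1000 grep -ohP 'pattern'

# GNU parallel: each job's output is buffered and emitted as one block
# (--group is the default), so jobs never mix their output
find tex -maxdepth 1 -name "*.json" | parallel --group -j 8 grep -ohP 'pattern'
```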
Any ideas?
UPDATE
Following the answer by Ole Tange, I tried `parallel -X` (which, like `xargs -n`, passes many file names to each grep invocation instead of one per job). The results are here, for completeness:

```
$ time find tex -maxdepth 1 -name "*.json" | \
    parallel -X -j 8 grep -ohP "'pattern'" > /dev/null

real    0m0.563s
user    0m0.583s
sys     0m0.110s
```

Fastest solution: following the comment by @cas, I tried grepping with the `-H` option (to force printing the file names) and sorting. Results here:

```
$ time find tex -maxdepth 1 -name '*.json' -print0 | \
    xargs -0r -P 9 -n 500 grep --line-buffered -oHP 'pattern' | \
    sort -t: -k1 | cut -d: -f2- > /dev/null

real    0m0.144s
user    0m0.417s
sys     0m0.095s
```
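For reference, the idea behind that last pipeline, sketched with two hypothetical files: `-H` prefixes every matched line with its file name, so even if the parallel grep workers emit their lines out of order, `sort -t: -k1` regroups them per file and `cut -d: -f2-` strips the prefix off again:

```
$ grep -oHP 'pattern' a.json b.json      # -H forces the file-name prefix
a.json:pattern
b.json:pattern

$ grep -oHP 'pattern' a.json b.json | sort -t: -k1 | cut -d: -f2-
pattern
pattern
```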