With the -T0 option you tell xz to use one worker thread per available core. Multithreaded mode also changes how xz operates: the input is split into blocks, which are buffered in memory and compressed in parallel.
After adding pigz to my tests I analyzed the performance step by step; I have a 100M file f100.
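A test file like f100 can be generated as follows — a sketch only, since the author's actual file contents are unknown; /dev/urandom gives incompressible worst-case data, so real-world timings will differ:

```shell
# Create a 100 MB file of random (incompressible) data named f100.
# Note: random data is the worst case for xz, so compression will be
# slower and the ratio worse than with typical text or binary files.
head -c 104857600 /dev/urandom > f100
ls -l f100
```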
$ time xz -c f100 >/dev/null
real 0m2.658s
user 0m2.573s
sys 0m0.083s
99% of the time is spent compressing on one core. With all four cores activated via -T4 (or -T0):
$ time xz -c -T4 f100 >/dev/null
real 0m0.825s
user 0m2.714s
sys 0m0.284s
Overall result: about 3.2x faster, i.e. nearly linear scaling per core. The "user" value is summed across all threads, which is why it stays near the single-threaded figure -- divide it by 4 for the per-core time. "sys" now shows some threading overhead; real roughly equals a quarter of user plus sys.
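You can verify that multithreaded xz really splits the stream into independent blocks — the property that makes parallel compression possible — by listing the container with xz -l. A small sketch (the default multithreaded block size is much larger than this test file, so an explicit --block-size is forced here):

```shell
# Force 1 MiB blocks on a 4 MiB input, then count the blocks in the
# resulting .xz container. Four independent blocks means four chunks
# that could be compressed (and decompressed) in parallel.
head -c 4194304 /dev/zero > sample.bin
xz -T4 --block-size=1MiB -c sample.bin > mt.xz
# In the --robot listing, the third field of the "file" line is the
# total number of blocks.
xz -l --robot mt.xz | awk '$1 == "file" { print "blocks:", $3 }'
```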
$ time gzip -dc f100.gz >/dev/null
$ time pigz -p4 -dc f100.gz >/dev/null
Decompression takes 0.5 seconds with gzip vs. 0.2 seconds with pigz. Putting it all together:
$ time pigz -dc -p4 f100.gz | xz -c -T4 >out.xz
real 0m0.902s
user 0m3.237s
sys 0m0.363s
...the pipeline takes 0.9 s instead of the 0.8 + 0.2 = 1.0 s a sequential run would need, because decompression and compression overlap.
With multiple files -- but not too many of them -- you can get the highest overall parallelism from plain shell background processes. Here I use four 25M files instead:
for f in f25-?.gz; do time pigz -p4 -dc "$f" | xz -c -T0 >"$f".xz & done
This seems even slightly faster, at about 0.7 s. And it works even without multithreading in xz:
for f in f25-?.gz; do time gzip -dc "$f" | xz -c >"$f".xz & done
Just by setting up four simple quarter-size pipelines with &, you get 0.8 s -- the same as for the 100M file with xz -T4.
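Note that the loops above put time inside each background job, so every pipeline reports its own timing. To measure the whole batch at once, wait for all background jobs and measure around them — a self-contained sketch (the demo-*.gz names and the generated inputs are mine, not the author's files):

```shell
# Generate four small gzip inputs so the snippet runs standalone.
for i in 1 2 3 4; do
  head -c 1048576 /dev/urandom | gzip -c > "demo-$i.gz"
done

# One background pipeline per file; `wait` blocks until all of them
# finish, so the elapsed time covers the entire batch.
start=$(date +%s)
for f in demo-?.gz; do
  gzip -dc "$f" | xz -c > "$f".xz &
done
wait
echo "whole batch: $(( $(date +%s) - start ))s"
```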
In my scenario, activating multithreading in xz is about as important as parallelizing the whole pipeline; if you combine it with pigz and/or multiple files, you can even be slightly faster than a quarter of the sum of the individual steps.
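The combined approach can be condensed into one sketch: a background pipeline per file, pigz for parallel decompression (falling back to gzip when pigz is not installed), and multithreaded xz inside each pipeline. File names here are illustrative, not the author's:

```shell
# Use pigz if available, otherwise plain gzip.
DECOMP=$(command -v pigz || command -v gzip)

# Generate four small gzip inputs so the snippet runs standalone.
for i in 1 2 3 4; do
  head -c 1048576 /dev/urandom | gzip -c > "part-$i.gz"
done

# One background pipeline per file, each with multithreaded xz;
# `wait` blocks until every pipeline has finished.
for f in part-?.gz; do
  "$DECOMP" -dc "$f" | xz -T0 -c > "$f".xz &
done
wait
ls -l part-?.gz.xz
```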