With the -T0 option you tell xz to use one worker thread per available core. Multithreaded mode also changes how xz operates: the input is split into blocks, which are buffered in memory and compressed in parallel.
After adding pigz to my tests I analyzed the performance step by step; I have a 100M file f100.
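A test file like f100 can be generated as follows — a sketch only, since the author's actual file contents are unknown; /dev/urandom gives incompressible worst-case data, so real-world timings will differ:

```shell
# Create a 100 MB file of random (incompressible) data named f100.
# Note: random data is the worst case for xz, so compression will be
# slower and the ratio worse than with typical text or binary files.
head -c 104857600 /dev/urandom > f100
ls -l f100
```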
$ time xz -c f100 >/dev/null
real 0m2.658s
user 0m2.573s
sys 0m0.083s
99% of the time is spent compressing on one core. With all four cores activated via -T4 (or -T0):
$ time xz -c -T4 f100 >/dev/null
real 0m0.825s
user 0m2.714s
sys 0m0.284s
Overall result: about 3.2x faster, i.e. nearly linear scaling per core. The "user" value is summed across all threads, which is why it stays near the single-threaded figure -- divide it by 4 for the per-core time. "sys" now shows some threading overhead; real roughly equals a quarter of user plus sys.
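You can verify that multithreaded xz really splits the stream into independent blocks — the property that makes parallel compression possible — by listing the container with xz -l. A small sketch (the default multithreaded block size is much larger than this test file, so an explicit --block-size is forced here):

```shell
# Force 1 MiB blocks on a 4 MiB input, then count the blocks in the
# resulting .xz container. Four independent blocks means four chunks
# that could be compressed (and decompressed) in parallel.
head -c 4194304 /dev/zero > sample.bin
xz -T4 --block-size=1MiB -c sample.bin > mt.xz
# In the --robot listing, the third field of the "file" line is the
# total number of blocks.
xz -l --robot mt.xz | awk '$1 == "file" { print "blocks:", $3 }'
```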
$ time gzip -dc f100.gz >/dev/null
$ time pigz -p4 -dc f100.gz >/dev/null
Decompression takes 0.5 seconds with gzip vs. 0.2 seconds with pigz. Putting it all together:
$ time pigz -dc -p4 f100.gz | xz -c -T4 >out.xz
real 0m0.902s
user 0m3.237s
sys 0m0.363s
...the pipeline takes 0.9 s instead of the 0.8 + 0.2 = 1.0 s a sequential run would need, because decompression and compression overlap.
With multiple files -- but not too many of them -- you can get the highest overall parallelism from plain shell background processes. Here I use four 25M files instead:
for f in f25-?.gz; do time pigz -p4 -dc "$f" | xz -c -T0 >"$f".xz & done
This seems even slightly faster, at about 0.7 s. And it works even without multithreading in xz:
for f in f25-?.gz; do time gzip -dc "$f" | xz -c >"$f".xz & done
Just by setting up four simple quarter-size pipelines with &, you get 0.8 s -- the same as for the 100M file with xz -T4.
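Note that the loops above put time inside each background job, so every pipeline reports its own timing. To measure the whole batch at once, wait for all background jobs and measure around them — a self-contained sketch (the demo-*.gz names and the generated inputs are mine, not the author's files):

```shell
# Generate four small gzip inputs so the snippet runs standalone.
for i in 1 2 3 4; do
  head -c 1048576 /dev/urandom | gzip -c > "demo-$i.gz"
done

# One background pipeline per file; `wait` blocks until all of them
# finish, so the elapsed time covers the entire batch.
start=$(date +%s)
for f in demo-?.gz; do
  gzip -dc "$f" | xz -c > "$f".xz &
done
wait
echo "whole batch: $(( $(date +%s) - start ))s"
```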
In my scenario, activating multithreading in xz is about as important as parallelizing the whole pipeline; if you combine it with pigz and/or multiple files, you can even be slightly faster than a quarter of the sum of the individual steps.
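The combined approach can be condensed into one sketch: a background pipeline per file, pigz for parallel decompression (falling back to gzip when pigz is not installed), and multithreaded xz inside each pipeline. File names here are illustrative, not the author's:

```shell
# Use pigz if available, otherwise plain gzip.
DECOMP=$(command -v pigz || command -v gzip)

# Generate four small gzip inputs so the snippet runs standalone.
for i in 1 2 3 4; do
  head -c 1048576 /dev/urandom | gzip -c > "part-$i.gz"
done

# One background pipeline per file, each with multithreaded xz;
# `wait` blocks until every pipeline has finished.
for f in part-?.gz; do
  "$DECOMP" -dc "$f" | xz -T0 -c > "$f".xz &
done
wait
ls -l part-?.gz.xz
```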