What I want to know is: what is tar doing at the start, before it begins passing data on to gzip? And can I make it skip that step?
I'm writing a script to run on my Synology NAS box (running DSM 6.2.1-23824 Update 1, with tar version 1.28) to compress copies of virtual machine HDD images. The source files are stored as sparse files on a btrfs filesystem. I'm looking for a little bit of compression, preferably keeping the sparseness, and as much speed as possible.
Although I am only working with one file at a time, the reason for using tar in the first place is its --sparse flag, since gzip alone cannot restore a file as sparse. The central command I'm trying to run is:
GZIP=-1 nice -n 19 tar --keep-old-files --sparse -czf "$destDir/$vmFolder/$file.tar.gz" "$file" 2>>"$log"
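For anyone wanting to reproduce the setup, here is a minimal sketch that makes a throwaway sparse file, round-trips it through tar --sparse, and checks that the sparseness survives (all paths here are temporary stand-ins, not my real image names):

```shell
# Create a 1 GiB sparse test file (allocates almost no disk blocks).
workdir=$(mktemp -d)
truncate -s 1G "$workdir/disk.img"

# Archive it with hole detection, then extract into a second directory.
tar --sparse -cf "$workdir/disk.tar" -C "$workdir" disk.img
mkdir "$workdir/out"
tar -xf "$workdir/disk.tar" -C "$workdir/out"

# Apparent size is 1 GiB, but actual blocks used stay tiny
# if GNU tar recorded and restored the holes.
du -k --apparent-size "$workdir/out/disk.img"
du -k "$workdir/out/disk.img"
```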
However, with HDD images ranging from 2 GB to 120 GB, there are many minutes after tar starts during which it furiously reads the source as fast as it can while gzip is given nothing to work with. The length of this period scales with the size of the source file.
Things I've tried to work around the issue:
- If I just use gzip, the output starts straight away, but I lose the sparse information.
- If I use pipes, as below, the same delay occurs:

nice -n 19 tar --keep-old-files --sparse -cf - "$file" | nice -n 19 gzip --fast > "$destDir/$vmFolder/$file.tar.gz" 2>>"$log"
Admittedly the NAS box only has an Intel Atom D2700, but the tar operation shouldn't be CPU-intensive. I can appreciate that gzip is CPU-intensive and will be a limiting factor, particularly on an old Atom. I was hoping to use lz4 or lzop, but the Synology OS doesn't seem to have them — just gzip, 7z, and xz.
Note that, as part of the script, it can run as many of these commands in parallel as I like, using this semaphore script as a template to utilise all CPU cores even though gzip is single-threaded.
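The semaphore template aside, a minimal alternative sketch of the same idea uses xargs -P to cap concurrency at the core count; the file names and setup below are placeholders, not my real script:

```shell
# Demo setup: two small sparse stand-ins for the real VM images.
workdir=$(mktemp -d)
truncate -s 100M "$workdir/a.img" "$workdir/b.img"
cd "$workdir"

# Run up to one tar+gzip job per CPU core; single-threaded gzip then
# scales across files instead of within one file.
printf '%s\0' ./*.img |
  xargs -0 -P "$(nproc)" -I{} \
    nice -n 19 tar --sparse -czf {}.tar.gz {}
```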
Edit: Testing my script without the --sparse option, but still using tar, does not have this problem: the data flows through to gzip immediately.
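A rough way to see this for yourself is to time how long tar takes to emit its first byte with and without --sparse. This uses a synthetic sparse file, so the gap will be far smaller than on my real images, but the mechanism is the same; on a real image the first timing would include the whole pre-scan:

```shell
# Stand-in for a real VM image: a 1 GiB sparse file.
workdir=$(mktemp -d)
truncate -s 1G "$workdir/disk.img"

# Time until the first byte of archive data reaches the pipe.
# With --sparse, tar maps the file's holes before emitting anything.
time tar --sparse -cf - -C "$workdir" disk.img | head -c 1 >/dev/null

# Without --sparse, output begins essentially immediately.
time tar -cf - -C "$workdir" disk.img | head -c 1 >/dev/null
```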