
I often have large directories that I want to transfer to a local computer from a server. Instead of using recursive scp or rsync on the directory itself, I'll often tar and gzip it first and then transfer it.

Recently, I wanted to check that this actually works, so I ran md5sum on two independently generated tar and gzip archives of the same source directory. To my surprise, the MD5 hashes were different. I did this two more times and got a new value each time. Why am I seeing this result? Shouldn't two tar.gz archives of the same directory, generated in exactly the same way with the same version of GNU tar, be byte-for-byte identical?

For clarity, I have a source directory and a destination directory. In the destination directory I have dir1 and dir2. I'm running:

tar -zcvf /destination/dir1/source.tar.gz source && md5sum /destination/dir1/source.tar.gz >> md5.txt

tar -zcvf /destination/dir2/source.tar.gz source && md5sum /destination/dir2/source.tar.gz >> md5.txt

Each time I do this, I get a different result from md5sum. Tar produces no errors or warnings.

jesse_b
Alon Gelber
  • As you mentioned large directories, there is no excuse not to substitute `lbzip2` or `7z` for `gzip` these days. This does not address your original question, but it would at least speed up compression by using parallel threads. – ajeh Apr 19 '18 at 20:51

2 Answers


From the looks of things you’re probably being bitten by gzip timestamps; to avoid those, run

GZIP=-n tar -zcvf ...
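To see the timestamp in question, you can look at the gzip header directly. This is only a small sketch, assuming a POSIX `od` and a `gzip` that accepts `-n`; the scratch file names are made up for illustration. Per the gzip file format (RFC 1952), bytes 4 to 7 of the header hold the modification time, and `-n` leaves them zeroed:

```shell
# Scratch directory and file names are hypothetical, for illustration only.
tmp=$(mktemp -d)
printf 'example data\n' > "$tmp/file.txt"

# Default: gzip records the input file's name and modification time in the header.
gzip -c "$tmp/file.txt" > "$tmp/with_time.gz"

# With -n: the name and timestamp are left out.
gzip -n -c "$tmp/file.txt" > "$tmp/no_time.gz"

# Dump header bytes 4..7 (the MTIME field) as unsigned decimal values,
# then strip the whitespace that od uses for column padding.
mtime_default=$(od -A n -t u1 -j 4 -N 4 "$tmp/with_time.gz" | tr -d ' \n')
mtime_stripped=$(od -A n -t u1 -j 4 -N 4 "$tmp/no_time.gz" | tr -d ' \n')

echo "MTIME bytes, default: $mtime_default"
echo "MTIME bytes, with -n: $mtime_stripped"
```

If the second line shows only zeros, the header no longer depends on when the archive was made.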

Note that to get fully reproducible tarballs, you should also impose the sort order used by tar:

GZIP=-n tar --sort=name -zcvf ...

If your version of tar doesn’t support --sort, use this instead:

find source -print0 | LC_ALL=C sort -z | GZIP=-n tar --no-recursion --null -T - -zcvf ...
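As a rough way to check the result, the sketch below builds the archive twice from a throwaway tree and compares the two files byte for byte. The directory layout, file names and the `make_archive` helper are all hypothetical; it pipes `tar` into `gzip -n` rather than relying on the `GZIP` variable, and it assumes a `tar` that understands `--no-recursion`, `--null` and `-T -`, plus a `sort` with `-z`:

```shell
# Scratch directory with a tiny stand-in for the real source tree (hypothetical names).
workdir=$(mktemp -d)
mkdir -p "$workdir/source/sub"
printf 'alpha\n' > "$workdir/source/a.txt"
printf 'beta\n' > "$workdir/source/sub/b.txt"

# Hypothetical helper: write an archive with a fixed member order
# and no gzip timestamp to the path given as $1.
make_archive() {
    (
        cd "$workdir" &&
        find source -print0 | LC_ALL=C sort -z |
            tar --no-recursion --null -T - -cf - |
            gzip -n
    ) > "$1"
}

make_archive "$workdir/first.tar.gz"
sleep 1   # let the clock advance so a stored timestamp would differ
make_archive "$workdir/second.tar.gz"

# cmp -s is silent and exits 0 only when the files are byte-for-byte equal.
if cmp -s "$workdir/first.tar.gz" "$workdir/second.tar.gz"; then
    result="identical"
else
    result="different"
fi
echo "archives are $result"
```

Running `md5sum` on both files works just as well as `cmp`; the comparison only holds as long as the files under `source` (contents, modification times, ownership, permissions) don't change between runs, since tar records all of those.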
Stephen Kitt
  • I was trying to reproduce this with `tar -czf - dir | md5sum` but failed to get varying checksums. Was that because I was writing to a pipe? (No it wasn't, it turns out, but because of something else relating to not using Linux presumably) – Kusalananda Apr 17 '18 at 15:19
  • @Kusalananda perhaps OpenBSD `tar` behaves differently... On Debian I get different sums when piping too. – Stephen Kitt Apr 17 '18 at 15:22
  • I used GNU `tar` 1.29, but `gzip` comes from my base system... hmm... but that got the `-n` option as well. Oh well. – Kusalananda Apr 17 '18 at 15:25
  • Can I also add `--mode=a+rwX --owner=0 --group=0 --numeric-owner` to set default permissions and ownership on the files? @Stephen Kitt – alper Mar 22 '20 at 14:59
  • @alper of course, you can add whatever other options you want. – Stephen Kitt Apr 17 '20 at 13:35

On Mac, @stephen-kitt's answer didn't work for me. I'm not exactly sure why, but once I separated the gzip step from the tar command, it started producing the same hash. Here's what I ended up with:

outputpath="$(pwd)/folder_to_zip" 
find "$outputpath" -print0 | LC_ALL=C sort -z | tar -s "#$outputpath/##" --no-recursion --null -T - -cf - | gzip -n > "$outputpath.tar.gz" && md5 "$outputpath.tar.gz"
VFein