How can I get tar (or any other program that is commonly available on Linux, e.g. pax) to hardlink duplicate content on-the-fly during archive creation?

That is, I'd like to avoid hardlinking upfront and instead would like tar to take care of it.

Is that possible? How?
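For context, a minimal sketch of the upfront approach the question wants to avoid: detect duplicate content by checksum and replace it with hardlinks before archiving. The `versions/` layout is a hypothetical stand-in for the per-version folders described in the comments; GNU coreutils (`sha256sum`, `stat`) are assumed.

```shell
set -eu
mkdir -p versions/v1.0 versions/v1.1
printf 'shared payload\n' > versions/v1.0/data.bin
cp versions/v1.0/data.bin versions/v1.1/data.bin

# Relink any file whose sha256 matches an earlier file.
# Sorting groups identical hashes next to each other.
prev_sum='' ; prev_path=''
find versions -type f -exec sha256sum {} + | sort |
while read -r sum path; do
    if [ "$sum" = "$prev_sum" ]; then
        ln -f -- "$prev_path" "$path"   # duplicate content: hardlink it
    else
        prev_sum=$sum ; prev_path=$path
    fi
done

# tar now stores the second name as a hardlink entry, not a second copy.
tar -cf versions.tar versions
```

After this, `tar -tvf versions.tar` shows the second file as a `link to` entry, so the payload is archived only once.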

Anthon
0xC0000022L
  • Could you add some context about this? What is your use case? – Faheem Mitha Apr 12 '15 at 12:32
  • @FaheemMitha: I have a bunch of different versions of a particular program and want to bundle those (unpacked) folders which are named after the version into a single archive. When I tried hardlinking explicitly on disk I had savings up to 90% over using plain `tar` with deflate (gzip). – 0xC0000022L Apr 12 '15 at 18:15
  • Is this particular program your own work? If so, it sounds like you should consider distributed version control, and using multiple branches. – Faheem Mitha Apr 12 '15 at 18:26
  • @FaheemMitha: it is not, and I know VCSs of various kinds and know it would be the right thing to do. However, this is no option as it cannot be automated, like the simple creation of an archive can be. You can add files and so on, but you'll quickly hit limitations when trying to automate renames and the likes. Also moved files/folders (technically renames or vice versa, whichever way you prefer to look at it). – 0xC0000022L Apr 12 '15 at 21:01

1 Answer


This is not possible as of now with GNU tar, but related tools do exist.

Note that hardlinking and deduplication do not have the same semantics. One would need a new kind of tar entry type to represent "duplicate data" so that extraction could properly recreate duplicates as separate (independently-living) files; that would make the archives incompatible with most standard tools (GNU tar, pax, etc.), which would be a bold move.

Let me insist: if you were to conflate duplicate files with hardlinked files, you would face a problem at archive extraction:

  • Are they genuinely hardlinked files, like those in a Git repository? Then they must really be recreated as hardlinks, otherwise the restored Git repo won't work.
  • Or are they merely identical files? If those were restored as hardlinks, the restored tree could suffer major data leaks: imagine the archiver had merged identical /etc/passwd files from several VM images while archiving; restored as a single file with multiple hardlinks, one modification in one VM becomes visible in the others!
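The first case can be demonstrated directly: tar records the first occurrence as a regular file and later names as hardlink entries, and extraction faithfully recreates the link, with the write-through behavior that implies. A small sketch (file names are illustrative only):

```shell
set -eu
mkdir -p demo
printf 'secret\n' > demo/passwd.a
ln demo/passwd.a demo/passwd.b   # a genuine hardlink, as in a Git repo

tar -cf demo.tar -C demo .
mkdir -p restore && tar -xf demo.tar -C restore

# Both restored names share one inode: writing through one name
# is visible through the other.
printf 'changed\n' > restore/passwd.a
cat restore/passwd.b             # also shows "changed"
```

This is exactly why a deduping tar could not simply reuse the existing hardlink entry type: for merely-identical files, that write-through behavior would leak modifications between files that were supposed to be independent.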

Faheem Mitha
zerodeux
  • Any reason why tar cannot dedup itself, and "un-dedup" during restoration? It seems this is just metadata to store inside the tar. – Gregory Sep 11 '22 at 15:59
  • @Gregory: I'm not sure, but I guess the reason is simply that it has not been required/implemented. Tar is a "streaming" tool, which lets it use little and _constant_ memory whatever the size of the archive you're working with. Tracking hardlinks is already somewhat costly in memory and scales less well, since a table of "seen hardlinks" must be maintained. Deduping means checksumming data (and hashing by content), which requires much more CPU and again more memory. I guess going that "heavy" way is not in the spirit of tar... – zerodeux Sep 12 '22 at 19:12
  • I understand, it's the Unix philosophy: "less is more". De-duplication, and micro-optimizing the order of archive members for better compression, are left up to us. – Gregory Sep 13 '22 at 20:46
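The "checksumming by content" step zerodeux describes can be sketched in a few lines: hash every file, then group identical hashes. This is the extra CPU and memory work a deduping tar would have to carry. The `tree/` layout is hypothetical; GNU coreutils (`md5sum`, `uniq --all-repeated`) are assumed.

```shell
set -eu
mkdir -p tree
printf 'aaa\n' > tree/one
printf 'aaa\n' > tree/two     # duplicate content
printf 'bbb\n' > tree/three

# Hash every file by content, then group identical hashes
# (an md5 hash is 32 hex characters wide, hence -w32).
find tree -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate
```

The output lists tree/one and tree/two as one duplicate group; a deduping archiver would keep such a hash table in memory for the whole run, which is precisely the departure from tar's constant-memory streaming model discussed above.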