17

I have no experience with btrfs, but it's advertised to be able to de-duplicate files.

In my application, I'd need to duplicate whole directory trees.

From what I learned, btrfs only de-duplicates in a later scan, not immediately. Even just using cp doesn't seem to trigger any de-duplication (at least, df shows disk usage increased by the size of the copied files).

Can I avoid moving data around altogether and tell btrfs directly to duplicate a file at another location, essentially just cloning its metadata?

In essence, similar to a hardlink, but with independent metadata (permissions, mod. times, ...).

Gilles 'SO- stop being evil'
Udo G
  • 8
    `cp --reflink=always`. – mikeserv Dec 16 '15 at 17:35
  • 4
    Note that this isn't anything like a hardlink. When you `cp --reflink=always`, the result from the user perspective will be two completely independent files in every way. The fact that the underlying file system is abstracting that via copy-on-write is only an implementation detail. You don't get "a hardlink, but with independent metadata.". To my knowledge, btrfs doesn't do any automatic deduplication yet. I think that's a future plan but I'm not positive on that. – ormaaj Dec 16 '15 at 18:32
  • @ormaaj - a hardlink wouldn't have *independent metadata*. and Udo asked for an *implementation detail*. when you do a reflink to a file you *essentially clone its metadata*. its only when the references independently *change* that the files diverge - and that's what deduplication is all about! – mikeserv Dec 16 '15 at 20:24
  • 1
    @mikeserv Er, I'm pretty sure deduplication has a different sense. Deduplication is taking already existing redundant copies of data and re-unifying it. COW is a means of minimizing duplication, it isn't deduplication. – ormaaj Dec 16 '15 at 20:45
  • @ormaaj - i think thats a weird thing to say: *deduplication is not about minimizing duplication.* – mikeserv Dec 16 '15 at 21:10
  • @mikeserv "reducing" would be a better word. You're talking about [lazy copying](https://en.wikipedia.org/wiki/Object_copying#Lazy_copy). It's an optimization in the allocation of resources. In contrast with deduplication, which is an active attempt to recover "wasted" resources. The [KSM facility](https://en.wikipedia.org/wiki/Kernel_same-page_merging) of Linux would be a good example of deduplication. After scanning for and replacing redundant pages with references, it uses an efficient copy-on-write scheme. They're related but distinct concepts. – ormaaj Dec 16 '15 at 22:04
  • I'm afraid my hardlink note was a bit misleading. In my case I'm effectively looking for a way to optimize resources (minimize disk usage and disk writes). Having two independent copies is okay *in my case* as there will be no writes to the file *contents*, but the file may need to show up at three different locations in the file system with different permissions/owners. – Udo G Dec 17 '15 at 07:00

1 Answer

18

There are two options:

  1. `cp --reflink=always`
  2. `cp --reflink=auto`

The second is almost always preferable to the first. With `auto`, cp falls back to a true copy if the file system doesn't support reflinking (for instance, ext4 or copying to an NFS share). With `always`, cp reports an error for each file it cannot clone, does not copy that file, and exits with a failure status.
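To make the difference concrete, here is a minimal sketch that tries `--reflink=always` first and only then falls back to a plain copy, which is roughly what `auto` does for you. The function name `clone_or_copy` and the temporary paths are made up for illustration, and which branch runs depends on the file system the script happens to run on.

```shell
#!/bin/sh
# clone_or_copy SRC DST  (hypothetical helper, not part of coreutils)
# Tries a reflink clone first; if cp refuses, falls back to a regular copy.
clone_or_copy() {
    if cp --reflink=always -- "$1" "$2" 2>/dev/null; then
        echo "cloned"      # the file system shared the data blocks
    else
        # cp failed: no reflink support here (or cp does not know the flag)
        cp -- "$1" "$2" && echo "copied"
    fi
}

# Throwaway directory so the sketch does not touch real data.
workdir=$(mktemp -d)
printf 'example data\n' > "$workdir/original"

result=$(clone_or_copy "$workdir/original" "$workdir/duplicate")
echo "$result"
```

On btrfs this should print `cloned`; on a file system without reflink support it should print `copied`. Either way the duplicate has the same contents as the original.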

If you are using this as part of a script that needs to be robust in the face of non-ideal conditions, `auto` will serve you better.
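For the whole-directory-tree case from the question, a sketch along these lines may help. It assumes GNU coreutils (`cp -a`, `--reflink=auto` and `stat -c` are GNU options), and the directory layout is invented for the example. The copy gets its own inodes, so its permissions can be changed without affecting the original, whether or not the data blocks end up shared.

```shell
#!/bin/sh
# Throwaway directories; the names are placeholders for illustration only.
work=$(mktemp -d)
src="$work/tree"
dst="$work/copy"

mkdir -p "$src/sub"
printf 'shared contents\n' > "$src/sub/file.txt"
chmod 644 "$src/sub/file.txt"

# -a recurses and preserves mode, ownership (where permitted) and timestamps.
# --reflink=auto clones the data where the file system supports it and
# silently falls back to a regular copy where it does not.
cp -a --reflink=auto "$src" "$dst"

# The copy is an independent file: changing its mode leaves the original alone.
chmod 600 "$dst/sub/file.txt"

stat -c '%a %n' "$src/sub/file.txt" "$dst/sub/file.txt"
```

The two `stat` lines should show mode 644 for the original and 600 for the copy, while the file contents stay identical.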

eestrada
  • are you Eric Estrada? – mikeserv Dec 20 '15 at 02:27
  • 4
    @mikeserv Lol, no. My first name is Ethan. That would be funny though; Eric Estrada: actor by day, sysadmin by night. Believe it or not, this is the first time in over a decade of going by the online handle `eestrada` that anyone has ever asked me that. – eestrada Dec 20 '15 at 02:42
  • 4
    sure, Eric. anyway, good answer. – mikeserv Dec 20 '15 at 05:38
  • If anyone is going to answer after 5 years is there any harm in doing this? `alias cp='cp --reflink=auto'` ? – Matt Mar 12 '21 at 16:37
  • 2
    @Matt So long as you know your alias will always be run on a relatively recent version of the GNU userland, I don't personally see any harm in using `alias cp='cp --reflink=auto'`. – eestrada Mar 14 '21 at 03:40