
I'm migrating a server from an Ubuntu Server 18.04 instance ("saturn") to a newly-built Debian 10 (buster) system ("enceladus"). I have copied a complete filesystem across the network using

sudo rsync --progress -au --delete --rsync-path="sudo rsync" /u/ henry@enceladus:/u

I check the number of directories and the number of files on the sending and receiving sides: the counts are identical. I also have a roll-your-own Perl program which traverses the file tree and compares each file in one tree with its counterpart in the other: it finds no differences across 52,190 files. Both filesystems are ext4, with 512-byte logical and 4096-byte physical blocks.

Yet the receiving filesystem holds 103,226,592,508 bytes and the sending one only 62,681,486,428. If the received filesystem were a little smaller I could understand it, because of unreclaimed blocks; but it's the other way round, and the difference is nearly two thirds of the original!
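For future readers: totals like these usually come from `du`, which by default counts allocated blocks rather than logical file length, so identical file contents can still report different sizes. The gap is easy to reproduce with a sparse file (a minimal local sketch; the path is a throwaway):

```shell
# Minimal sketch: allocated size vs. apparent size (throwaway path).
tmp=$(mktemp -d)

# A 10 MiB file with no data ever written: apparent size is 10 MiB,
# but few or no blocks are actually allocated on disk.
truncate -s 10M "$tmp/sparse.img"

du --apparent-size -h "$tmp/sparse.img"   # 10M (logical length)
du -h "$tmp/sparse.img"                   # ~0  (blocks on disk)

rm -r "$tmp"
```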

How can this be? Should I worry about it, as being evidence of some malfunction?

Henry Law
    What is it actually measuring? The most obvious thing would be that some files on the sending end are "sparse", i.e. they have regions where all the data is NUL bytes and these are just noted as such, rather than stored as disk blocks full of zeros. The receiving file could actually have disk blocks allocated. – icarus Jan 04 '21 at 01:42
  • I have always used -axHAWXS for file system cloning – Richie Frame Jan 04 '21 at 16:40
  • @RichieFrame Could you please explain what the relevant difference is to the command OP used since otherwise your comment isn't very useful. – Martijn Heemels Jan 06 '21 at 08:28
  • @MartijnHeemels it is useful if you are comparing those to the options used when looking at the options list. Compared to -au, it adds preservation of hardlinks, ACLs, and extended attributes, it disables the delta transfer, handles sparse files efficiently, and prevents recursion across a filesystem boundary so mount point contents are not copied, as those should be handled as their own filesystem – Richie Frame Jan 06 '21 at 08:52
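For reference, the `-axHAWXS` flag set from the comment above expands as follows. This is a sketch of the commenter's suggestion using the OP's original hostnames and paths, not a one-size-fits-all command; `-A` and `-X` need ACL/xattr support on both ends:

```shell
# -a  --archive          recurse; preserve perms, times, owners, symlinks
# -x  --one-file-system  stay on one filesystem (don't copy mount points)
# -H  --hard-links       preserve hard-link structure
# -A  --acls             preserve POSIX ACLs
# -W  --whole-file       send whole files (no delta-transfer algorithm)
# -X  --xattrs           preserve extended attributes
# -S  --sparse           recreate holes in sparse files
sudo rsync -axHAWXS --progress --delete --rsync-path="sudo rsync" \
    /u/ henry@enceladus:/u
```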

1 Answer


I can think of two things offhand:

  • you didn't use -H, so hardlinks are lost;
  • you didn't use -S, so sparse files may have been expanded.
mattdm
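A local demonstration of what the two missing flags change (a throwaway sketch; the directories are temporary and `rsync` must be installed):

```shell
# Throwaway sketch: -H preserves hard links, -S preserves sparseness.
src=$(mktemp -d); dst=$(mktemp -d)

# A pair of hard links and a 10 MiB sparse file in the source tree.
echo data > "$src/a"
ln "$src/a" "$src/b"
truncate -s 10M "$src/sparse.img"

# Without -H/-S: the links become two independent files and the
# sparse file's holes are written out as real zero blocks.
rsync -a "$src/" "$dst/plain/"
# With them: link structure and holes survive the copy.
rsync -aHS "$src/" "$dst/full/"

stat -c %h "$dst/plain/a"   # 1 -- link broken
stat -c %h "$dst/full/a"    # 2 -- link preserved

rm -r "$src" "$dst"
```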
    Thank you for very helpful suggestions. There are no sparse files, but there are significant numbers of hard links, with large-ish images at the end of them. I'm re-doing the copy (from scratch) with -H and will post the result. – Henry Law Jan 03 '21 at 18:38
    ... and that gave the expected result, to within a few K. Thank you very much. – Henry Law Jan 03 '21 at 19:03
    Yeah, it's kind of surprising (in the UI sense) that `-a` does not include these options. – mattdm Jan 03 '21 at 20:54
    @mattdm the `-a` flag omits `-H` (hard links) because handling this requires holding the entire tree of linked files in memory so that matching inodes can be identified. It omits both `-H` and `-S` because not all filesystems can support these features – roaima Jan 03 '21 at 21:26
    Oh, I know _why_ it doesn't include them. It's just kind of a trap, as evidenced here. – mattdm Jan 03 '21 at 21:31
    For future readers, if making a fresh FS and redoing a copy is inconvenient, you can sparsify your existing files with `find -type f -exec fallocate -d {} \;` (`+` won't work, fallocate only works on one file at once, so maybe use `-size +1M` or something to filter by file size if your small files are not sparse). This is worse than a fresh copy because of fragmentation: the free space is scattered between used blocks. Also, it will sparsify everything, when maybe you'd rather have had some files with unwritten extents (preallocated space). – Peter Cordes Jan 04 '21 at 04:44
    You can identify and hardlink duplicates to each other with tools like `fslint` or other duplicate-file finders. (But if some sets of duplicates should *not* be hardlinks, you'd have to decide manually to hardlink or not. Or on a CoW filesystem like btrfs, to make reflinks that transparently save space for duplicate blocks, without any traditional linking semantics so future changes apply only to the changed file.) – Peter Cordes Jan 04 '21 at 04:46
  • @PeterCordes: Careful with that. I have files with large blocks of zeros that really should be as contiguous as possible. (They're filesystem images.) – Joshua Jan 04 '21 at 04:55
  • @Joshua: Yeah, like I said in the last line of the first comment, it will sparsify everything (which might not be what you want). Turning written zeros into preallocated unwritten extents that keep the same contiguous backing blocks would be safe, but I don't think `fallocate(1)` has a mode for that (it would mean using `FALLOC_FL_ZERO_RANGE` instead of `FALLOC_FL_PUNCH_HOLE` to convert "written" zeros into preallocated unwritten zeros). It's open source and its `-d` loop is pretty simple, though. – Peter Cordes Jan 04 '21 at 05:06
    For future reference, on some other filesystems (for example, ZFS, BTRFS, or XFS), this may also happen as a result of shared extents within files not being copied as shared. The solution there is either to use `cp --reflink=always` to force copying shared extents as shared, or to make a pass over the data with a tool like `duperemove` to re-deduplicate it. – Austin Hemmelgarn Jan 04 '21 at 16:23
  • Additionally, some filesystems support transparent compression (e.g. btrfs supports zlib, lzo and zstd). With some filesystems, under some circumstances, rsync can copy the per-file compression settings with -X. – Remember Monica Jan 05 '21 at 21:31
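The hole-punching approach from Peter Cordes's comment above can be sketched on a single throwaway file (assumes util-linux's `fallocate`; the apparent size stays the same while the allocated blocks shrink):

```shell
# Throwaway sketch: write real zero blocks, then let fallocate -d
# (--dig-holes) deallocate the ranges that are all zeros.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=8 status=none

du -k "$f"            # ~8192 KiB allocated (zeros were really written)
fallocate -d "$f"     # punch holes wherever blocks are all-zero
du -k "$f"            # ~0 KiB allocated; stat -c %s is unchanged

rm "$f"
```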