Let's say:

  • a is a 256 MB file containing random bytes
  • b is the same file except it has one additional leading byte 0

Thanks to this answer, I discovered that rsync is able to compute a "binary diff patch" between these two files:

rsync --only-write-batch=patch b a

In this example, the patch file is only 65 KB, which is very good.

In short, how did rsync detect that so few bytes had changed? I initially thought it would compare:

  • a[0:k] and b[0:k]
  • a[k:2k] and b[k:2k]
  • a[2k:3k] and b[2k:3k]
  • ...
  • a[N-k:N] and b[N-k:N]

for various values of k, e.g. the largest possible power of 2 (2^j), then, if there is no match, 2^(j-1), then 2^(j-2), etc.

But for these files a and b, this would fail completely: since b is just a shifted by one byte, no two aligned chunks would be identical! We would then expect the patch to be about 256 MB.
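To make the failure concrete, here is a minimal sketch of that naive aligned comparison. The helper name `aligned_matches` is made up for illustration, and a 64 KiB pseudo-random buffer stands in for the 256 MB file:

```python
import random

def aligned_matches(a: bytes, b: bytes, k: int) -> int:
    """Count the aligned k-byte chunks that are identical in a and b."""
    n = min(len(a), len(b))
    return sum(a[i:i + k] == b[i:i + k] for i in range(0, n - k + 1, k))

rng = random.Random(0)                                   # fixed seed, reproducible
a = bytes(rng.getrandbits(8) for _ in range(1 << 16))    # small stand-in for the 256 MB file
b = b"\x00" + a                                          # same content, shifted by one byte

for k in (4096, 256, 16):
    print(k, aligned_matches(a, b, k))                   # no chunk matches at any of these sizes
```

Shrinking k does not help: every chunk of b is offset by one byte relative to a, so for random data no aligned pair is equal.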

But rsync clearly does something cleverer. How does the algorithm work in this simple example, where b is a single byte followed by the content of a?

Basj

1 Answer

Perhaps someone who knows this better can post another answer, but after further research, the key to the rsync algorithm seems to be the rolling hash, detailed in the paragraph "Determining which parts of a file have changed".
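As I understand it, the idea can be sketched as follows. This is a simplified illustration, not rsync's actual code: the function names (`weak_sums`, `roll`, `delta`), the 512-byte block size and the use of SHA-256 as the strong hash are all assumptions made for the example. The file the receiver already has (a) is cut into fixed blocks, each indexed by a cheap checksum. The sender then slides a window over b one byte at a time, updating the checksum in O(1) per step, so a block of a can be found at any offset in b.

```python
import hashlib
import random

M = 1 << 16  # each half of the weak checksum is kept modulo 2^16

def weak_sums(block: bytes):
    """Adler-32-style weak checksum: plain sum and position-weighted sum."""
    n = len(block)
    s1 = sum(block) % M
    s2 = sum((n - i) * x for i, x in enumerate(block)) % M
    return s1, s2

def roll(s1, s2, out_byte, in_byte, n):
    """Slide an n-byte window one byte to the right in constant time."""
    s1 = (s1 - out_byte + in_byte) % M
    s2 = (s2 - n * out_byte + s1) % M
    return s1, s2

def delta(old: bytes, new: bytes, n: int):
    """Describe `new` as literal bytes plus references to n-byte blocks of `old`."""
    table = {}  # weak checksum -> [(strong hash, block index), ...]
    for idx in range(len(old) // n):
        block = old[idx * n:(idx + 1) * n]
        entry = (hashlib.sha256(block).digest(), idx)
        table.setdefault(weak_sums(block), []).append(entry)

    ops, literal = [], bytearray()
    pos, sums = 0, None
    while pos + n <= len(new):
        if sums is None:                      # (re)start the window at pos
            sums = weak_sums(new[pos:pos + n])
        match = None
        if sums in table:                     # cheap test first, strong hash to confirm
            strong = hashlib.sha256(new[pos:pos + n]).digest()
            match = next((i for s, i in table[sums] if s == strong), None)
        if match is not None:
            if literal:
                ops.append(("literal", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            pos += n
            sums = None
        else:                                 # no block starts here: emit one byte, slide by one
            literal.append(new[pos])
            if pos + n < len(new):
                sums = roll(*sums, new[pos], new[pos + n], n)
            pos += 1
    literal += new[pos:]                      # trailing bytes shorter than one block
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops

rng = random.Random(0)
a = bytes(rng.getrandbits(8) for _ in range(8192))
b = b"\x00" + a
ops = delta(a, b, 512)
print(ops[0], len(ops))  # one 1-byte literal, followed by 16 block copies
```

For the files in the question, the window fails to match at offset 0, slides by one byte, and from then on every block of a is found in b. The delta is therefore one literal byte plus a list of block references, which is consistent with a patch far smaller than 256 MB.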

Another useful read: https://moinakg.wordpress.com/tag/rolling-hash/

Another useful resource: http://tutorials.jenkov.com/rsync/overview.html

Basj