
I need to know the size (in bytes) of the binary difference between two folders containing hundreds of files. Ideally, I am looking for a tool where something like xtool --recursive /path/to/folderA /path/to/folderB would output something like:

folderA and folderB have X bytes in common and Y bytes difference.

Since this tool probably does not exist (I know of xfdelta, which produces per-file diffs whose sizes I can then check), a good method or script to achieve this goal would be tremendous.

I am thinking of something like rsync -rlptDEuiIn /folderA /folderB --stats, from which I could read the Total transferred file size line. Maybe there is a better alternative?
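As a baseline for the output format described above, here is a minimal Python sketch that walks both folders and compares files with the same relative path byte by byte, position by position. The function name `compare_trees` and the `chunk_size` parameter are illustrative choices, not an existing tool. It assumes that bytes in a file present on only one side all count as difference, and it does no realignment, so a single inserted byte makes most of the rest of that file count as different.

```python
import os


def compare_trees(dir_a, dir_b, chunk_size=1 << 20):
    """Return (common, different) byte counts for two directory trees.

    Files are paired by relative path and compared position by position.
    Assumption: a file that exists on one side only counts fully as difference.
    """
    # Collect the relative paths of all regular files found on either side.
    rel_paths = set()
    for root in (dir_a, dir_b):
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                full = os.path.join(dirpath, name)
                if os.path.isfile(full):
                    rel_paths.add(os.path.relpath(full, root))

    common = different = 0
    for rel in sorted(rel_paths):
        path_a = os.path.join(dir_a, rel)
        path_b = os.path.join(dir_b, rel)
        if not (os.path.isfile(path_a) and os.path.isfile(path_b)):
            # Present on one side only: every byte is a difference.
            present = path_a if os.path.isfile(path_a) else path_b
            different += os.path.getsize(present)
            continue
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                block_a = fa.read(chunk_size)
                block_b = fb.read(chunk_size)
                if not block_a and not block_b:
                    break
                # Count positions where both files hold the same byte.
                same = sum(x == y for x, y in zip(block_a, block_b))
                common += same
                # Mismatched positions plus any tail of the longer file.
                different += max(len(block_a), len(block_b)) - same
    return common, different
```

Calling `common, different = compare_trees("/path/to/folderA", "/path/to/folderB")` and printing both numbers gives the "X bytes in common and Y bytes difference" summary, under the positional definition only.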

Orsiris de Jong
  • I do not fully understand the requirement: consider `abcfekghi` with `abcdeghij`. For byte-to-byte comparison, only `abc` and `e` match. Would you expect `abc` `e` and `ghi` to match by some best-fit realignment mechanism? – Paul_Pedant Aug 21 '22 at 08:14
  • I am doing deduplication backup benchmarks for multiple programs, see https://github.com/deajan/backup-bench, and I'd like to know how many bytes differ between the versions of a folder that are backed up. The rough idea is to say that folder X v1, v2 and v3 have Z bytes of difference, and that program A stored 20% more than Z, whereas program B stored 35% more than Z. – Orsiris de Jong Aug 22 '22 at 13:50
  • That does not answer my question. If you compare bytes one-for-one by position in the file, a single byte insertion makes every following byte out of place. If you compare with some longest-match algorithm (as `diff` does with lines) the task is two orders of magnitude more complex (and arguably, indeterminate). How your files came to be different makes a great difference to how you can best measure the changes. – Paul_Pedant Aug 23 '22 at 06:40
  • Does this answer your question? [Diff of two similar big raw binary files](https://unix.stackexchange.com/questions/565559/diff-of-two-similar-big-raw-binary-files) – Paul_Pedant Aug 23 '22 at 06:46
  • I'd require a realignment mechanism, à la git, that finds where new data has been inserted in a text file. rsync thinks in terms of differing chunks; I'd love to have a solution that thinks in terms of inserted data that shifts the data following it. – Orsiris de Jong Sep 04 '22 at 09:14
  • Text file changes are relatively easy because they are whole-line events, so fairly distinctive. In arbitrary byte data, the resynchronisation problem is to recognise many longest common strings when you do not know the start position of either string, nor the length to match. I think that might be at least Order(n-cubed) and probably worse (where n is the number of bytes, not lines). It might be interesting to prototype this in diff after placing each byte on a separate line (maybe in hex, to avoid special characters like (er..) newline). – Paul_Pedant Sep 09 '22 at 08:29
  • Prototyped the `od | fold | diff` route, and it fails badly. `diff` assumes resync between the data on the first match, with no look-ahead for the length of the match. Works with whole lines, but matching one-byte lines generates spurious matches by the hundred. This really needs a `find_longest_exact_match` function applied recursively, which is simple enough, but exponentially slow. – Paul_Pedant Sep 10 '22 at 12:20
  • Thanks for trying. Even if I compared text-only files, I still could not go the `rsync` route, since `rsync` doesn't check for data shifting. Using a tool like `git` would be helpful, since it's designed for that exact job. I think I have to search in that direction. – Orsiris de Jong Sep 18 '22 at 18:15
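
The recursive longest-match idea discussed in the comments can be tried out on small inputs with Python's standard `difflib.SequenceMatcher`, which finds the longest contiguous matching block and then recurses on the pieces to its left and right. This is only a sketch: `realigned_diff` is a hypothetical helper name, this is not the algorithm used by git or rsync, and the worst-case running time is quadratic, so it is unlikely to be practical for large binary files.

```python
from difflib import SequenceMatcher


def realigned_diff(data_a: bytes, data_b: bytes):
    """Return (common, removed, added) byte counts after realignment.

    common  = bytes covered by matching blocks
    removed = bytes of data_a outside any matching block
    added   = bytes of data_b outside any matching block
    """
    # autojunk=False disables the popularity heuristic, which would otherwise
    # ignore frequently occurring byte values in longer inputs.
    matcher = SequenceMatcher(None, data_a, data_b, autojunk=False)
    common = sum(block.size for block in matcher.get_matching_blocks())
    return common, len(data_a) - common, len(data_b) - common


# The example from the first comment: `abc`, `e` and `ghi` are matched.
print(realigned_diff(b"abcfekghi", b"abcdeghij"))  # → (7, 2, 2)
```

With positional comparison the same two strings share only 4 bytes (`abc` and `e`), so the two definitions of "common" can give quite different totals for the same data.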

0 Answers