
I have a number of folders containing a few million files (amounting to a few TB) in total, and I wish to find duplicates across all of them. The output should ideally be a simple list of dupes - I will process them further with my own scripts.

I know that there is an `fdupes` command which apparently uses "file sizes and MD5 signatures" to compare files.

What is unclear to me is whether files with a unique size are still read (and their hashes computed), which I do not want. Given the sheer amount of data involved, care needs to be taken not to do any more disk I/O than absolutely necessary. The amount of temporary space used should also be minimal.

Ned64

3 Answers


FSlint and its backend findup probably do exactly what you need:

FSlint scans the files and filters out files of different sizes. Any remaining files of the exact same size are then checked to ensure they are not hard linked. A hard linked file could have been created on a previous search should the user have chosen to 'Merge' the findings. Once FSlint is sure the file is not hard linked, it checks various signatures of the file using md5sum. To guard against md5sum collisions, FSlint will re-check signatures of any remaining files using sha1sum checks.

https://booki.flossmanuals.net/fslint/ch004_duplicates.html
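The pipeline described above can be sketched in a few lines of Python. This is not FSlint's actual code, just an illustration of the size-first idea: group by size, drop extra hard links, and only then hash. For brevity the sketch uses SHA-1 alone where FSlint layers md5sum and sha1sum:

```python
import hashlib
import os
import stat
from collections import defaultdict

def find_dupes(roots):
    # Pass 1: group regular files by size. A file with a unique size
    # cannot have a duplicate, so it is never opened or hashed.
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    st = os.lstat(path)
                except OSError:
                    continue
                if stat.S_ISREG(st.st_mode):
                    by_size[st.st_size].append((st.st_dev, st.st_ino, path))

    groups = []
    for entries in by_size.values():
        if len(entries) < 2:
            continue  # unique size: skipped without any read
        # Drop extra hard links to the same inode, as findup does:
        # they share storage and are not real duplicates.
        seen, candidates = set(), []
        for dev, ino, path in entries:
            if (dev, ino) not in seen:
                seen.add((dev, ino))
                candidates.append(path)
        # Pass 2: hash only the same-size survivors.
        by_hash = defaultdict(list)
        for path in candidates:
            h = hashlib.sha1()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    h.update(chunk)
            by_hash[h.digest()].append(path)
        groups.extend(paths for paths in by_hash.values() if len(paths) > 1)
    return groups
```

Files with a unique size never reach the hashing pass, which is exactly the I/O guarantee the question asks about.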

Murphy
  • Thanks. Is there an option to check block-wise, so as not to read (and compare) the whole file? – Ned64 Feb 28 '20 at 19:49
  • What do you mean by "check block-wise, so as not to read [...] the whole file"? And what advantage would that provide? However, the help text you find on the linked doc page is all there is to `findup`, so the answer is most probably "no". – Murphy Feb 28 '20 at 19:58
  • The advantage is not reading the whole file if the first 1 kB or so differs. I have very many files! Apparently `fslint` can do that, though, by using `/usr/share/fslint/fslint/supprt/md5sum_approx` as an option. Will try it and let you know. – Ned64 Feb 28 '20 at 20:04
  • Whether that speeds up the comparison depends on how many large files you have that differ in the first block. If they are equal, the comparison still has to check the complete file to be sure; it may be faster to do that from the start. Also [there seems to be a performance penalty to `md5sum_approx`](https://github.com/pixelb/fslint/blob/master/fslint/supprt/md5sum_approx). – Murphy Feb 28 '20 at 20:30
1

rmlint is a very efficient tool for deduplicating filesystems, and more. It can cache information via xattrs to make follow-up runs even faster, and it provides its findings in JSON format so you can process them in custom ways:

rmlint finds space waste and other broken things on your filesystem and offers to remove it. It is able to find:

Duplicate files & directories.
Nonstripped binaries.
Broken symlinks.
Empty files.
Recursive empty directories.
Files with broken user or group IDs.

From the User manual — rmlint

nealmcb

Yes, I think it will compute a full MD5 if the size matches another file's. This could be wasteful. A more efficient way for large files may be to MD5 the first block, and only look further if they match.

I.e. check the size; if it matches, check the MD5 of the first block (512 k); if that matches, check the MD5 of the next 2 blocks (1024 k), and so on.
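A minimal sketch of this incremental scheme, assuming the two files have already matched on size and using a hypothetical 512 KiB block size:

```python
import hashlib

BLOCK = 512 * 1024  # hypothetical first-block size, as in the answer

def same_content(path_a, path_b):
    """Compare a growing prefix (1 block, then 2 more, then 4, ...)
    and stop at the first mismatch, so large files that differ early
    are mostly left unread. Assumes both files have the same size."""
    blocks = 1
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            a = fa.read(blocks * BLOCK)
            b = fb.read(blocks * BLOCK)
            if hashlib.md5(a).digest() != hashlib.md5(b).digest():
                return False  # prefixes differ; rest never read
            if len(a) < blocks * BLOCK:
                return True   # reached EOF with all prefixes equal
            blocks *= 2       # next round reads twice as much
```

Hashing each prefix mirrors the answer's wording; comparing the raw bytes directly would work just as well here and would save the hashing cost when both files are local.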

ctrl-alt-delor