I'm trying to use fslint to find duplicates, but it takes forever hashing entire multi-gigabyte files. According to this website, I can compare by the following features:
feature summary
compare by file size
compare by hardlinks
compare by md5 (first 4k of a file)
compare by md5 (entire file)
compare by sha1 (entire file)
but I don't see these options in the GUI or the man pages. Is there something I'm missing here?
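For reference, the staged comparison that site describes can be sketched in a few lines of Python. This is a toy illustration of the idea, not fslint's actual code: group candidates by size first, refine the surviving groups by an MD5 of the first 4 KiB, and only full-hash files that still look alike.

```python
import hashlib
import os
from collections import defaultdict

PARTIAL_SIZE = 4096  # the "first 4k" stage from the feature list

def md5_prefix(path, n=PARTIAL_SIZE):
    # hash only the first n bytes -- cheap filter for large files
    with open(path, "rb") as f:
        return hashlib.md5(f.read(n)).hexdigest()

def md5_full(path):
    # hash the whole file incrementally, 64 KiB at a time
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    # stage 1: only files of equal size can be duplicates
    groups = defaultdict(list)
    for p in paths:
        groups[os.path.getsize(p)].append(p)
    # stages 2 and 3: refine surviving groups by partial, then full, hash
    for key_fn in (md5_prefix, md5_full):
        refined = defaultdict(list)
        for key, group in groups.items():
            if len(group) < 2:
                continue  # unique at this stage; cannot be a duplicate
            for p in group:
                # chain the previous key so unrelated groups never merge
                refined[(key, key_fn(p))].append(p)
        groups = refined
    return [g for g in groups.values() if len(g) > 1]
```

Each stage only runs on files that survived the previous one, which is why the size check alone eliminates most of the hashing work.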
Edit: I'm using jdupes instead with the command line:
jdupes -r -T -T --exclude=size-:300m --nohidden
To get this to work, I had to clone the git repository and build from source. (The packaged version is woefully out of date.)
I also had to edit the source code to change every:
#define PARTIAL_HASH_SIZE 4096
to
#define PARTIAL_HASH_SIZE 1048576
and then it actually matched my files correctly. I don't know why it was coded this way, but matching only the first 4096 bytes isn't nearly enough: with -T -T, a partial-hash match is treated as a full match, so files that agree only in their first 4 KiB get reported as duplicates. (A command-line option to set this size would be useful.)
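The failure mode is easy to reproduce. Two files that share their first 4096 bytes but diverge afterwards have identical partial hashes, so any tool that stops at the partial hash will call them duplicates. A toy demonstration (plain Python, not jdupes code):

```python
import hashlib
import os
import tempfile

PARTIAL_HASH_SIZE = 4096  # matches the #define above

def md5(data):
    return hashlib.md5(data).hexdigest()

d = tempfile.mkdtemp()
prefix = os.urandom(PARTIAL_HASH_SIZE)  # shared 4 KiB prefix

# two same-size files that differ only after the first 4 KiB
with open(os.path.join(d, "a"), "wb") as f:
    f.write(prefix + b"ends one way")
with open(os.path.join(d, "b"), "wb") as f:
    f.write(prefix + b"ends another")

a = open(os.path.join(d, "a"), "rb").read()
b = open(os.path.join(d, "b"), "rb").read()

# the partial hashes collide, but the files are not duplicates
assert md5(a[:PARTIAL_HASH_SIZE]) == md5(b[:PARTIAL_HASH_SIZE])
assert md5(a) != md5(b)
```

Files like this (same header, different payload) are common in practice, e.g. media files or database dumps sharing boilerplate headers, which is presumably why the larger partial-hash size was needed here.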