
I have a script like this

find path -type f -exec md5sum {} +

It produces this output

/tmp
❯ find $pwd -type f -exec md5sum {} + 

a7c8252355166214d1f6cd47db917226  ./guess.bash
e1c06d85ae7b8b032bef47e42e4c08f9  ./qprint.bash
8d672b7885d649cb76c17142ee219181  ./uniq.bash
2d547f5b610ad3307fd6f466a74a03d4  ./qpe
523166a51f0afbc89c5615ae78b3d9b0  ./Makefile
57a01f2032cef6492fc77d140b320a32  ./my.c
c5c7b1345f1bcb57f6cf646b3ad0869e  ./my.h
6014bc12ebc66fcac6460d634ec2a508  ./my.exe
0ff50f0e65b0d0a5e1a9b68075b297b8  ./levik/2.txt
5f0650b247a646355dfec2d2610a960c  ./levik/1.txt
5f0650b247a646355dfec2d2610a960c  ./levik/3.txt

I need output like this instead (only the duplicates)

5f0650b247a646355dfec2d2610a960c  ./levik/1.txt
5f0650b247a646355dfec2d2610a960c  ./levik/3.txt
    This looks like a classic [XY problem](https://en.wikipedia.org/wiki/XY_problem). You're obscuring your actual problem (finding duplicates) by asking about a problem you've encountered with your solution. Unless you're looking for MD5 collisions between different files for some reason? – gronostaj Jan 20 '22 at 15:40

2 Answers


If your task is to find duplicate files, you could also use fdupes:

Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.

fdupes -r .
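The size-first strategy described above can be sketched in plain shell, as a toy illustration of why fdupes compares sizes before hashing: only files that share a size can possibly be duplicates. This is a hypothetical demo using a temporary directory; it assumes GNU find and paths without whitespace.

```shell
# Create a scratch directory with two identical files and one different one.
dir=$(mktemp -d)
printf 'same\n' > "$dir/a"
printf 'same\n' > "$dir/b"
printf 'other\n' > "$dir/c"

# First pass: group files by size and report only sizes shared by
# more than one file -- those are the only candidates worth hashing.
find "$dir" -type f -printf '%s %p\n' |
    awk '{count[$1]++; files[$1] = files[$1] ORS $2}
         END {for (s in count) if (count[s] > 1) printf "size %s:%s\n", s, files[s]}'

rm -rf "$dir"
```

Only `a` and `b` (both 5 bytes) are reported; `c` is ruled out by size alone, so its contents never need to be read.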
pLumo

If you’ve got GNU uniq, you can ask it to show all lines duplicating the first 32 characters¹:

find path -type f -exec md5sum {} + | sort | uniq -D -w32

The list needs to be sorted since uniq only spots consecutive duplicates. This also assumes that none of the file paths contain a newline character; to handle that, assuming GNU implementations of all the tools, use:

find . -type f -exec md5sum -z {} + | sort -z | uniq -z -D -w32 | tr '\0' '\n'

(GNU md5sum has its own way of handling special characters in file names, but this produces output which isn’t usable with uniq in the way shown above.)
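As a quick sanity check, the first pipeline can be exercised against a scratch directory (the file names below are illustrative, chosen to mirror the question's `1.txt`/`3.txt` duplicates):

```shell
# Two files with identical content and one that differs.
dir=$(mktemp -d)
printf 'dup\n'  > "$dir/1.txt"
printf 'dup\n'  > "$dir/3.txt"
printf 'uniq\n' > "$dir/2.txt"

# sort groups identical checksums together; uniq -D -w32 prints every
# line whose first 32 characters (the MD5 sum) appear more than once.
find "$dir" -type f -exec md5sum {} + | sort | uniq -D -w32

rm -rf "$dir"
```

Only the two duplicate files are printed; `2.txt` is filtered out because its checksum is unique.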


¹ Technically, in current versions of GNU uniq, it's the first 32 bytes that are considered; for instance, UTF-8 encoded á and é characters would be considered identical by uniq -w1, as their encodings both start with the 0xc3 byte. In the case of the 0-9a-f characters found in hex-encoded MD5 sums, though, that makes no difference, as those characters are always encoded as a single byte.
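The byte-counting behaviour described in the footnote can be demonstrated directly, using printf escapes for the UTF-8 encodings of á (0xc3 0xa1) and é (0xc3 0xa9):

```shell
# Both lines start with the byte 0xc3, so uniq -w1 treats them as
# duplicates even though the first *characters* differ.
printf '\303\241x\n\303\251y\n' | uniq -D -w1
```

Both lines are printed as a duplicate group, confirming that -w counts bytes rather than characters.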

Stephen Kitt