
[EDIT #1 by OP: Turns out this question is quite well answered by exiftool creator/maintainer Phil Harvey in a duplicate thread on the ExifTool Forum]

[EDIT #2 by OP: From ExifTool FAQ: ExifTool is not guaranteed to remove metadata completely from a file when attempting to delete all metadata. See 'Writer Limitations'.]

I'd like to search my old hard drives for photos that are not on my current backup drive. Formats include jpg, png, tif, etc., as well as various raw formats (from different camera models and manufacturers).

I'm only interested in uniqueness of the image itself, not in differences due to, say, the values of exif tags, the presence/absence of a given exif tag, embedded thumbnails, etc.

Even though I don't expect to find any corruption/data-rot between different copies of otherwise identical images, I'd like to detect that, as well as differences due to resizing and color changes.

[Edit #3 by OP: For clarification: A small percentage of false positives is tolerable (a file is concluded to be unique when it isn't) and false negatives are highly undesirable (a file is wrongly concluded to be a duplicate).]

My plan is to identify uniqueness based on md5sums after stripping any and all metadata.

How can I strip the metadata?

Will `exiftool -all= <filename>` suffice?
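Once stripping is sorted out, the comparison step of this plan needs only coreutils. A sketch, assuming `gold/` and `old/` are hypothetical directories holding already-stripped copies from the backup drive and an old drive:

```shell
# Hash the stripped copies on each drive, keep only the checksum column,
# then list checksums that exist on the old drive but not on the backup.
md5sum gold/* | awk '{print $1}' | sort -u > gold.md5
md5sum old/*  | awk '{print $1}' | sort -u > old.md5
comm -13 gold.md5 old.md5   # checksums unique to the old drive
```

`comm -13` suppresses lines unique to the first file and lines common to both, leaving only checksums absent from the backup.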

StarGeek
Jeff
  • JPEG compression libraries compress in different ways; even if you strip all metadata, you may still end up with the same image having a different checksum because it was compressed with a different JPEG implementation. You will need to re-save all images using the same library (which may decrease quality somewhat). Also, how do you plan to find all the images? `file` will fail to discover RAW image formats and `find` will only work on extensions (it may be useful to describe better what you have). – grochmal Sep 27 '16 at 19:34
  • I've been using `find $dir -type f -regextype posix-extended -regex ".*\.(jpg|png|<...>|cr2|raw|raf|orf)"` where `<...>` means a bunch of other suffixes. – Jeff Sep 27 '16 at 19:42
  • Good point about different compression libraries. – Jeff Sep 27 '16 at 19:42
  • You can try whether BMP-normalized images give you appropriate MD5 sums: `convert image.jpg - | md5sum` (ImageMagick). – aventurin Sep 27 '16 at 20:31
  • @aventurin - I like that one! But wouldn't that print exif metadata too? I'd add `-strip` to be sure. (BMP is a lousy standard) – grochmal Sep 27 '16 at 20:38
  • @grochmal Yes, `-strip` is a good idea. – aventurin Sep 27 '16 at 20:51
  • Oops, I forgot to specify the output format: `convert -strip image.jpg bmp:- | md5sum`. – aventurin Sep 27 '16 at 21:11
  • There is a perceptual hashing algorithm called phash that is useful for comparing how perceptually similar two images are. Stack Overflow has a tag for it: http://stackoverflow.com/questions/tagged/phash. Having a tool that compares two files is useful, but it might lead to O(n²) work to find all matches. There are probably workflows that do better, but I do not know one offhand; phash is a breadcrumb that might lead you to one. Apparently ImageMagick has some sort of phash support. – infixed Sep 27 '16 at 22:25
  • `phash` could be really useful for a future project of assembling the family of files that are various edits of a given photo. For the current project, two files are considered unique to each other even if they differ by only one non-metadata bit. – Jeff Sep 27 '16 at 22:38
  • Yes, I often strip exif data in edited versions, but try my best not to alter originals produced by the cameras. If my original is named foo.bar, then I base any derivative's name on foo.bar. For example, an edit of `p1234567.jpg` gets saved as `p1234567.jpg.1.jpg` – Jeff Sep 28 '16 at 19:28
  • Related: https://askubuntu.com/questions/260810/how-can-i-read-and-remove-meta-exif-data-from-my-photos-using-the-command-line – Ciro Santilli OurBigBook.com Sep 06 '19 at 11:26
  • @Jeff You're asking how to strip metadata. Fine, but that's not really needed to achieve what you wanted to do. Use `identify -format "%# %f\n" *.jpg` from ImageMagick to get the signature of image files. Files that share a signature have the same content, even if they have distinct MD5 checksums due to different metadata. – n.r. Apr 27 '22 at 04:43

7 Answers


With the imagemagick package (and not only for JPEGs) you can simply run:

mogrify -strip ./*.jpg

The ./ is to avoid problems with filenames starting with "-".

From manual:

-strip strip the image of any profiles, comments or these PNG chunks: bKGD,cHRM,EXIF,gAMA,iCCP,iTXt,sRGB,tEXt,zCCP,zTXt,date.

Much more info and caveats here.

This is similar to @grochmal's answer, but much more straightforward and simple.

Pablo A
  • As per that thread, better to go with `exiftool -all= *.jpg` to strip jpg data. – Walt W Feb 02 '19 at 20:49
  • Note that this will also remove the "orientation" metadata, which will make some photos appear to be rotated the wrong way. – Flimm Aug 23 '21 at 14:37

jhead has the ability to remove non-image metadata from JPEG files. The man page says:

-dc

Delete comment field from the JPEG header. Note that the comment is not part of the Exif header.

-de

Delete the Exif header entirely. Leaves other metadata sections intact.

-di

Delete the IPTC section, if present. Leaves other metadata sections intact.

-dx

Delete the XMP section, if present. Leaves other metadata sections intact.

-du

Delete sections of jpeg that are not Exif, not comment, and otherwise not contributing to the image either - such as data that photoshop might leave in the image.

-purejpg

Delete all JPEG sections that aren't necessary for rendering the image. Strips any metadata that various applications may have left in the image. A combination of the -de -dc and -du options.

Toby Speight

This is a bit old, but yes, exiftool works very well.

Show metadata of a file

exiftool photo.jpg

Show metadata for all *.jpg files in the current directory

Note: The extension is case-sensitive.

exiftool -ext jpg .

Same as above, but including subdirectories.

exiftool -r -ext jpg .

Remove all metadata

exiftool -all= -overwrite_original photo.jpg

Remove all metadata of all *.jpg files in the current directory

exiftool -all= -overwrite_original -ext jpg .

Same as above, but including subdirectories.

exiftool -all= -r -overwrite_original -ext jpg .

Remove all GPS metadata of *.jpg files in the current directory

exiftool -gps:all= *.jpg
R J
  • Note that `exiftool -all= ` will remove all metadata, including the "orientation" metadata. It may make some photos appear to be rotated the wrong way. – Flimm Aug 23 '21 at 14:39

I would go with ImageMagick for most images. Different library implementations will produce different compressed results, so re-encoding everything with ImageMagick unifies the compression.

Common types are easy because the OS has libraries to read and write them. So:

find . -type f \( -name '*.jp*g' -o -name '*.JP*G' \) \
       -exec mogrify -strip -taint -compress JPEG {} \;

find . -type f \( -name '*.png' -o -name '*.PNG' \) \
       -exec mogrify -strip -taint -compress Lossless {} \;

find . -type f \( -name '*.gif' -o -name '*.GIF' \) \
       -exec mogrify -strip -taint -compress LZW {} \;

This will ensure that you have the images written in the same way. And then you can perform:

find . -type f -regextype posix-extended \
       -regex ".*\.(jpe?g|JPE?G|png|PNG|gif|GIF)" \
       -exec md5sum {} \; > checksums
sort -k 1 checksums |
cut -d ' ' -f 1 |
uniq -d |
while read -r x; do
    grep "$x" checksums
done
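The grep loop above re-scans `checksums` once per duplicated sum; a single awk pass that reads the file twice does the same job, as a sketch:

```shell
# First pass (NR==FNR) counts each checksum (field 1); second pass prints
# every line whose checksum occurred more than once.
awk 'NR==FNR { n[$1]++; next } n[$1] > 1' checksums checksums
```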

For the RAW formats I believe that the only way is to do as Phil says, and therefore:

find . <blah blah> -exec exiftool -all= {} \;

And then the checksumming would be the same. You just need to cross your fingers that the more exotic image formats can be created with a single implementation (or have a rigid file format).

Disclaimer: This will work for comparing the checksums between themselves. If you store the checksums and then re-run the `-strip` after an update of zlib or libjpeg, you may end up with completely different checksums. You need to rebuild the checksums for every image every time. Given concerns about image quality, it is wise to run this only once.

grochmal
  • Correct me if I'm wrong. Suppose two files represent the same image but were compressed with two different libraries. Won't they 'uncompress' into different pixels because jpg is lossy? – Jeff Sep 27 '16 at 20:26
  • Often not: JPEG2000 has a well defined DCT, but that is only the part of transforming the image. The Huffman coding should also be the same. But that is as far as the standard goes; you can then actually compress the result using a compression library. In theory compression libraries (e.g. zlib) will always produce different results (even for the same algorithm), but most jpeg libraries seed the RNG in the same way to keep things sane (e.g. libjpeg does this). – grochmal Sep 27 '16 at 20:33
  • @Jeff The problem is quite natural since lossy means that information is lost. – aventurin Sep 27 '16 at 20:33
  • Of course if you define different compression quality (e.g. `-quality`) all bets are off. – grochmal Sep 27 '16 at 20:34
  • There might be a problem with this answer. JFIF tags, including JFIFVersion, are **inserted** by the imagemagick option `-strip`. To see this, run `exiftool -a -G1 -s ` on files created with `mogrify -strip` and `exiftool -all=`. To confirm, run `exiftool -a -G1 -s | grep JFIF`. Future runs of the script would somehow have to take this into account if the JFIF version were different. – Jeff Sep 27 '16 at 21:31
  • @Jeff - Hmm... that's true. I have added a disclaimer to the answer pointing out more issues like that. In general it is wise to put all the images together and then check for dupes only one time, not store checksums and try to check against them at a later date. – grochmal Sep 27 '16 at 21:50
  • Note that both `mogrify -strip` and `exiftool -all= ` remove the "orientation" metadata, which will make some photos appear to be rotated the wrong way. – Flimm Aug 23 '21 at 14:39

Instead of MD5, use ImageMagick's identify to print the signature of image files, and look for files having the same signature. Files that share a signature have the same content.

For example, files a.png, b.png, and c.png are different, since they have different MD5 checksums:

$ md5sum *
a9ee60d8237a4b3f6cdd6e57c24b1caf  a.png
e8661c4fd7761984a74945e273fd4d09  b.png
21c808d62ff9c7675c1f9ca20d2f6578  c.png

However, they share a signature:

$ identify -format "%#  %f\n" *
1c916332636b91704f212eec504c25383c90ed5d1659975a4a5895c48fe80ab8  a.png
1c916332636b91704f212eec504c25383c90ed5d1659975a4a5895c48fe80ab8  b.png
1c916332636b91704f212eec504c25383c90ed5d1659975a4a5895c48fe80ab8  c.png

Therefore they're duplicates.
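To report duplicate groups directly instead of eyeballing the list, the signature output can be grouped with a small helper (a sketch; `dupe_groups` is a hypothetical name):

```shell
# Read "signature filename" lines (as printed by identify -format "%# %f\n")
# and print one line per signature that occurs more than once, followed by
# the files that share it.
dupe_groups() {
    sort | awk '{ files[$1] = files[$1] " " $2; n[$1]++ }
                END { for (s in n) if (n[s] > 1) print s files[s] }'
}

# identify -format "%# %f\n" ./* | dupe_groups
```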

n.r.

A possible solution that just came to mind. It sidesteps the issue of metadata by assuming that each file ends with the image data itself, i.e. that all the metadata is at the beginning of the file.

Let's refer to the current backup drive as the gold drive.

For images on the gold drive:

  1. Remove any embedded thumbnail.
  2. Chunk up the file starting at its end by tailing off, say, M=100k bytes at a time. Refer to the first tail chunk (the one that contains the end of the file) as the end-chunk.
  3. Compute the md5sums of each chunk and store them in a master list called the goldlist.

For images on the old drives:

  1. Remove any embedded thumbnail.
  2. Tail off the last M bytes of the file.
  3. Compute its md5sum.
  4. CLASS U: If the sum is not in the goldlist, then conclude the file is unique to the gold-drive. Copy it to the gold-drive. Compute md5sums of remaining chunks and add them to the goldlist. Go on to the next file.
  5. Otherwise, tail off the second to last M bytes. But if the remaining bytes are less than, say, N=50k, then don't tail off the M bytes. Instead process the remaining as a slightly oversized chunk. N needs to be larger than the largest space consumed by the header regions (thumbnails excluded).
  6. Compute the chunk's md5sum.
  7. Compare to goldlist, and so on.
  8. CLASS D: If the sums for all the chunks are in the goldlist, then conclude it is a duplicate.
  9. CLASS P: If the sums for all chunks but the last are in the goldlist, then conclude it is probably a duplicate.

Class P will contain images that are on the gold-drive, but have different exifdata, or have corruption/data-rot in the leading bytes of the image.

When done, examine CLASS P interactively, comparing them to their mates on the gold-drive.

See EDIT #3 to OP.

Assignment into CLASS U and D should be 100% accurate.

The size of CLASS P depends on the chunk size M, since the first M+N bytes of a file almost certainly contain some image data (and all the metadata).
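The tail-chunking step of the scheme above can be sketched with coreutils alone; `tail_chunk_md5` is a hypothetical helper, and the N-byte oversized-end handling from step 5 is omitted for brevity:

```shell
# tail_chunk_md5 FILE M: print the md5sum of each M-byte chunk of FILE,
# counted from the END of the file, so the chunk containing the file's
# end (the "end-chunk") comes first.
tail_chunk_md5() {
    f=$1; M=$2
    offset=$(wc -c < "$f")
    while [ "$offset" -gt 0 ]; do
        start=$(( offset > M ? offset - M : 0 ))
        # emit bytes start+1 .. offset of the file
        tail -c +"$((start + 1))" "$f" | head -c "$((offset - start))" | md5sum
        offset=$start
    done
}

# tail_chunk_md5 photo.jpg 102400   # M = 100k, as in the scheme above
```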

Jeff
  • I did some formatting of your post (so it uses markdown enumeration rather than crammed paragraphs). Still, I find it quite esoteric to figure out what you mean by CLASS U, CLASS D, CLASS P... – grochmal Sep 28 '16 at 18:24
  • Assign each image file on an old hard drive to one of three classes: U(nique), D(uplicate), or P(robably duplicate). – Jeff Sep 28 '16 at 18:35

If old drives contain mostly duplicates (including metadata) then use two steps to find the uniques as defined in the OP (which considers two files to be duplicates even if they differ in metadata):

  1. Use md5sums of intact, unstripped files to identify which files on the old drives are unique (in this alternate sense) relative to the current backup drive, assigning them to either CLASS uU (unstripped-Unique) or CLASS D(uplicate). CLASS D will be 100% accurate. CLASS uU ought to be small (by the above assumption) and contain a mix of true duplicates (in the OP sense) and true uniques.

  2. Working with the small, i.e. manageable, set of files in CLASS uU, use md5sums and various stripping techniques to design a method of file comparison that is useful for the purposes laid out in the OP.
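Step 1 can be sketched with coreutils; `gold/` and `old/` are hypothetical directories standing in for the backup drive and an old drive:

```shell
# Tag each old-drive file D (its plain md5sum also appears on the gold
# drive) or uU (it does not), using unstripped files only.
md5sum gold/* | awk '{print $1}' | sort -u > gold.md5
for f in old/*; do
    sum=$(md5sum "$f" | awk '{print $1}')
    if grep -qx "$sum" gold.md5; then
        printf 'D  %s\n' "$f"
    else
        printf 'uU %s\n' "$f"
    fi
done
```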

Jeff