
I have sha1 summed all the image files on my storage server and put the results in a text file in the form of:

sha1sum filename

I've sorted the file and removed all unique sha1sum entries, so what I'm left with is a list of duplicate files. Some have two entries, some three, some even more.

What I want to do is remove only the first entry of each duplicate sha1sum, so I can use the resulting output to delete the duplicate files (and keep only one instance of each).

I don't really care which version gets kept, as I will be moving all the files into some form of directory hierarchy later.

derobert

2 Answers


With GNU utilities, as found on Linux or Cygwin, you can tell uniq to separate each block of files that share a hash (-w 40 compares only the first 40 characters of each line, i.e. the SHA-1 hash). The --all-repeated option also drops unique files from the list in the process.

sha1sum * |
sort | uniq -w 40 --all-repeated=prepend |
sed -e '/^$/ { N; d; }' -e 's/^[^ ]*  //' |
tr '\n' '\0' | xargs -0 rm --
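If you want to see what the pipeline will delete before committing, you can do a dry run: same commands, with the final rm stage left off so it only prints the victims. The scratch directory and file names below are purely illustrative.

```shell
# Dry run of the pipeline above, minus the rm stage. The sed command
# deletes the blank line uniq prepends to each group *and* the line
# after it (the one copy we keep), then strips the hash prefix from
# the remaining duplicates.
cd "$(mktemp -d)"
printf 'a\n' > file1   # duplicate of file2
printf 'a\n' > file2
printf 'b\n' > file3   # unique, so it never appears in the output

sha1sum * |
sort | uniq -w 40 --all-repeated=prepend |
sed -e '/^$/ { N; d; }' -e 's/^[^ ]*  //'
# prints: file2
```

Note that the tr/xargs trick in the full pipeline assumes filenames contain no newlines; a filename with an embedded newline would be split into bogus entries.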

Even so, this isn't worth the effort over the following simple, portable awk script, which prints a line only if its first field is identical to the first field of the previous line. Again, this takes care of removing unique files from the list.

sha1sum * |
sort |
awk '$1==h {print}  {h=$1}' |
tr '\n' '\0' | xargs -0 rm --

Instead of doing this manually, you could call fdupes; its -f (--omitfirst) option omits the first file in each set of duplicates:

fdupes -f .
Gilles 'SO- stop being evil'
  • 807,993
  • 194
  • 1,674
  • 2,175

You could also use awk 'a[$1]++'

$ gsha1sum *
86f7e437faa5a7fce15d1ddcb9eaeaea377667b8  file1
e9d71f5ee7c92d6dc9e92ffdad17b8bd49418f98  file2
86f7e437faa5a7fce15d1ddcb9eaeaea377667b8  file3
86f7e437faa5a7fce15d1ddcb9eaeaea377667b8  file4
$ gsha1sum *|awk 'a[$1]++'
86f7e437faa5a7fce15d1ddcb9eaeaea377667b8  file3
86f7e437faa5a7fce15d1ddcb9eaeaea377667b8  file4

Like the commands posted by Gilles, it also removes lines whose first field appears only once in the input.

a[$1]++ could be replaced with a[$1]++>0 or ++a[$1]>=2.
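The awk output above still carries the hash prefix, so to turn it into actual deletions you could strip that prefix the same way Gilles' pipeline does. A sketch (assuming filenames contain no newlines; sha1sum here is the same tool the answer calls gsha1sum on systems where the GNU coreutils carry a g prefix):

```shell
# Strip the 40-character hash and two-space separator from each
# duplicate line, then hand the bare filenames to rm. Demo files are
# illustrative; awk skips the first occurrence of each hash, so one
# copy of every file survives.
cd "$(mktemp -d)"
printf 'a\n' > file1
printf 'b\n' > file2
printf 'a\n' > file3   # duplicate of file1, so it gets removed

sha1sum * | awk 'a[$1]++' | sed 's/^[^ ]*  //' |
tr '\n' '\0' | xargs -0 rm --

ls   # file1 and file2 remain
```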

Lri