
I have sha1 summed all the image files on my storage server and put the results in a text file in the form of:

sha1sum filename

I've sorted the file and removed all unique sha1sum entries, so what I'm left with is a list of duplicate files. Some have two entries, some three, some even more.

What I want to do is remove only the first entry of each duplicate sha1sum, so I can use the resulting output to delete the duplicate files (and keep only one instance of each).

I don't really care which version gets kept, as I will be moving all the files into some form of directory hierarchy later.

derobert

2 Answers


With GNU utilities, as found on Linux or Cygwin, you can tell uniq to separate each block of files that share a hash (-w 40 compares only the first 40 characters of each line, i.e. the SHA-1 hash). The --all-repeated option also drops unique files from the list in the process.

sha1sum * |
sort | uniq -w 40 --all-repeated=prepend |
sed -e '/^$/ { N; d; }' -e 's/^[^ ]*  //' |
tr '\n' '\0' | xargs -0 rm --
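If you want to see what the pipeline will delete before committing, you can do a dry run: same commands, with the final rm stage left off so it only prints the victims. The scratch directory and file names below are purely illustrative.

```shell
# Dry run of the pipeline above, minus the rm stage. The sed command
# deletes the blank line uniq prepends to each group *and* the line
# after it (the one copy we keep), then strips the hash prefix from
# the remaining duplicates.
cd "$(mktemp -d)"
printf 'a\n' > file1   # duplicate of file2
printf 'a\n' > file2
printf 'b\n' > file3   # unique, so it never appears in the output

sha1sum * |
sort | uniq -w 40 --all-repeated=prepend |
sed -e '/^$/ { N; d; }' -e 's/^[^ ]*  //'
# prints: file2
```

Note that the tr/xargs trick in the full pipeline assumes filenames contain no newlines; a filename with an embedded newline would be split into bogus entries.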

Even so, this isn't worth the effort over the following simple, portable awk script, which prints a line only if its first field is identical to the first field of the previous line. Again, this takes care of removing unique files from the list.

sha1sum * |
sort |
awk '$1==h {print}  {h=$1}' |
tr '\n' '\0' | xargs -0 rm --

Instead of doing this manually, you could call fdupes; its -f (--omitfirst) option omits the first file in each set of duplicates:

fdupes -f .
Gilles 'SO- stop being evil'
  • 807,993
  • 194
  • 1,674
  • 2,175

You could also use awk 'a[$1]++'

$ gsha1sum *
86f7e437faa5a7fce15d1ddcb9eaeaea377667b8  file1
e9d71f5ee7c92d6dc9e92ffdad17b8bd49418f98  file2
86f7e437faa5a7fce15d1ddcb9eaeaea377667b8  file3
86f7e437faa5a7fce15d1ddcb9eaeaea377667b8  file4
$ gsha1sum *|awk 'a[$1]++'
86f7e437faa5a7fce15d1ddcb9eaeaea377667b8  file3
86f7e437faa5a7fce15d1ddcb9eaeaea377667b8  file4

Like the commands posted by Gilles, it also removes lines whose first field appears only once in the input.

a[$1]++ could be replaced with a[$1]++>0 or ++a[$1]>=2.
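The awk output above still carries the hash prefix, so to turn it into actual deletions you could strip that prefix the same way Gilles' pipeline does. A sketch (assuming filenames contain no newlines; sha1sum here is the same tool the answer calls gsha1sum on systems where the GNU coreutils carry a g prefix):

```shell
# Strip the 40-character hash and two-space separator from each
# duplicate line, then hand the bare filenames to rm. Demo files are
# illustrative; awk skips the first occurrence of each hash, so one
# copy of every file survives.
cd "$(mktemp -d)"
printf 'a\n' > file1
printf 'b\n' > file2
printf 'a\n' > file3   # duplicate of file1, so it gets removed

sha1sum * | awk 'a[$1]++' | sed 's/^[^ ]*  //' |
tr '\n' '\0' | xargs -0 rm --

ls   # file1 and file2 remain
```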

Lri