
I have a number of folders containing media (e.g. photos, music) from different points in time. The folders share some content (e.g. a photo might be in two folders), but most files should be unique. Filenames are not a reliable guide across folders: the same photo might be present as A/foo.png and B/bar.png, while A/baz.png and B/baz.png might be different files.

I'm looking for a way to consolidate all of the media into a single, flat folder with duplicates removed. Ideally, I'd also like to track where each file originally came from (e.g. knowing that output/001.png came from A/baz.png), but this isn't strictly necessary. There are a lot of files (1M+), so the faster the better :).

I originally tried simply copying all of the files from the folders into a new folder, but this took a long time and only deduplicates files whose names are identical, which doesn't help here. There might be a way to make the command below faster with xargs -P, but I'm not sure how.

find . -type f -exec cp {} /path/to/output/ \;

A two-stage approach or similar is fine, e.g. first flatten and rename all of the files into a new folder so that they all have unique filenames, and then filter out duplicates. I have the storage space for that; I'm just not sure how to do it.
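For illustration, here is a minimal sketch of how the flatten and deduplicate stages could be combined into one pass by comparing file contents rather than names. It is only a sketch under assumptions: the function names `file_digest` and `consolidate`, the zero-padded counter naming, and the CSV manifest layout are all hypothetical, and the output folder and manifest are assumed to live outside the source folders.

```python
import csv
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def consolidate(sources, output: Path, manifest: Path) -> int:
    """Copy one file per distinct content into `output`.

    Every source path is logged in `manifest` next to the output name it
    maps to. Returns the number of unique files copied.
    Assumption: `output` and `manifest` are outside all source folders.
    """
    output.mkdir(parents=True, exist_ok=True)
    seen = {}  # digest -> name of the copy in `output`
    with manifest.open("w", newline="") as mf:
        writer = csv.writer(mf)
        writer.writerow(["output_name", "source_path"])
        for root in sources:
            for src in sorted(Path(root).rglob("*")):
                if not src.is_file():
                    continue
                digest = file_digest(src)
                if digest not in seen:
                    # Hypothetical naming: running counter plus the
                    # original extension, e.g. 0000001.png.
                    seen[digest] = f"{len(seen) + 1:07d}{src.suffix.lower()}"
                    shutil.copy2(src, output / seen[digest])
                writer.writerow([seen[digest], str(src)])
    return len(seen)
```

It could be called as `consolidate(["A", "B"], Path("output"), Path("manifest.csv"))`. This version reads every file in full to hash it; with 1M+ files, grouping by file size first and hashing only files that share a size would likely save a lot of I/O.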

AdminBee
Vasu
  • Doing it the other way around would probably be better: use `fdupes` to find and delete the duplicates, then move everything to a single directory while taking care of filename collisions. – Kusalananda May 27 '20 at 07:03
  • Or `jdupes`, it seems faster than `fdupes`. – Kamil Maciorowski May 27 '20 at 07:06
  • Thanks for the suggestion. I've made a copy of all of my data and am working through deduplicating it now with `jdupes`: `jdupes copy -Z -r -d`. I might rerun with `jdupes copy -Z -r -d -N` if it turns out there are too many duplicates to go through by hand. Do you have suggestions for how to copy / rename all of the files to a new folder? I could write a quick Python script to do it, but maybe there's a better / faster option. – Vasu May 27 '20 at 07:29
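
As a rough sketch of the "quick Python script" mentioned in the last comment, the following moves every file from an already deduplicated tree into one flat folder and renames on filename collisions. The function name `flatten` and the `_1`, `_2` suffix scheme are hypothetical choices, and the destination is assumed to be outside the source tree.

```python
import shutil
from pathlib import Path


def flatten(source: Path, dest: Path) -> dict:
    """Move every file under `source` into `dest`, renaming on collisions.

    Returns a mapping of new path -> original path, which can serve as a
    record of where each file came from.
    Assumption: `dest` is not inside `source`.
    """
    dest.mkdir(parents=True, exist_ok=True)
    moved = {}
    # Build the full listing first so moving files does not disturb the walk.
    files = sorted(p for p in source.rglob("*") if p.is_file())
    for src in files:
        target = dest / src.name
        counter = 1
        while target.exists():
            # Hypothetical scheme: baz.png -> baz_1.png, baz_2.png, ...
            target = dest / f"{src.stem}_{counter}{src.suffix}"
            counter += 1
        shutil.move(str(src), str(target))
        moved[target] = src
    return moved
```

For example, `flatten(Path("copy"), Path("flat"))` would leave `copy/A/baz.png` as `flat/baz.png` and `copy/B/baz.png` as `flat/baz_1.png`. When source and destination are on the same filesystem, a move is a rename and so avoids copying the data a second time.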
