3

I am working with a CSV data set which looks like the below:

year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK- DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,

It contains a number of duplicates that I need to remove (keeping one instance of the record). Based on Remove duplicate entries from a CSV file I have used sort -u file.csv --o deduped-file.csv which works well for examples like

2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,

but does not capture examples like

2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,

Where the data is incomplete, but is a representation of the same thing.

Is it possible to remove duplicates based on specified fields e.g year, manufacturer, brand, series, variation?

terdon
  • 234,489
  • 66
  • 447
  • 667
Stuart Brown
  • 133
  • 6
  • Does your CSV file actually have those line breaks in it? The ones for "Darren Smith"'s entries? If not, then don't post it like that - it's a red herring distracting potential answerers. If it does, then it's broken - CSV requires multi-line fields to be quoted. – cas Dec 11 '21 at 13:29
  • BTW, your requirements are moving towards the edge of what's possible with simple line-oriented tools. You probably need to use a tool like GNU [datamash](https://savannah.gnu.org/projects/datamash/), Miller aka [mlr](https://github.com/johnkerl/miller), or [csvkit](https://github.com/wireservice/csvkit). Or use a language with CSV libraries like perl ([Text::CSV](https://metacpan.org/release/Text-CSV) module) or python ([csv](https://docs.python.org/3/library/csv.html) library). – cas Dec 11 '21 at 13:36
  • @cas thanks for the comment. I've corrected the CSV structure in the question, the line breaks are not present in the data. Thanks for the links - I'll take a look – Stuart Brown Dec 11 '21 at 13:39
  • 1
    If your data is "simple" CSV (i.e. no quoted delimiters within fields, like `"John Amoth, Jr"`, at least not up to and including the fields that you wish to use as the deduplication key) then you can likely use something as simple as `awk -F, '!seen[$1, $2, $3, $4, $5]++' file.csv`. See [How does awk '!a\[$0\]++' work?](https://unix.stackexchange.com/questions/159695/how-does-awk-a0-work) – steeldriver Dec 11 '21 at 14:07

3 Answers3

6

I would create a "key" of the first 5 fields, and then only print a line if that key is being seen for the first time:

awk -F, '
  {key = $1 FS $2 FS $3 FS $4 FS $5}
  !seen[key]++ 
' file
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,
glenn jackman
  • 84,176
  • 15
  • 116
  • 168
3

Decorating first with awk to retain your header line at the top during the sort then using any awk+sort+cut:

$ awk -v OFS=',' '{print (NR>1), $0}' file | sort -u -t, -k1,6 | cut -d, -f2-
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
Ed Morton
  • 28,789
  • 5
  • 20
  • 47
1

Non-awk solution: one may simple use sort, define fields 1 to 5 as comparision -keys, use , as field separator (-t) and select -unique entries only:

 sort -t, -k1,5 -u infile

However this will make the header line last. Use either -r for reversed sorting or separate the header line via e.g.

{ sed 1q infile ; sed 1d infile | sort -k1,5 -t, -u ; }
FelixJN
  • 12,616
  • 2
  • 27
  • 48