I have a CSV file with ~4000 lines, each containing between 2 and 30 names separated by commas. The names include titles (for example, mr. X Adams or ms. Y Sanders). Some names appear multiple times within the same line, and I would like the duplicates within each line removed. The input is in a file "input.csv", and the result should be written to another file, "output.csv".

Example, I have:

mr. 1,mr. 2,mr. 3,mr. 1,mr. 4
prof. x,prof. y,prof. x
mr. 1,prof y

which should become

mr. 1,mr. 2,mr. 3,mr. 4   (mr. 1 was already mentioned so it should be removed)
prof. x,prof. y           (prof. x was already mentioned so it should be removed)
mr. 1,prof y              (even though both were already mentioned in the same file, they were not mentioned within this line so they may remain)
Jeff Schaller
  • @αғsнιη It's not a dupe of that question. That is much more liberal with matching, e.g. case-insensitive, Persian/Arabic. – Sparhawk Oct 08 '18 at 11:32
  • @αғsнιη But it's clearly different in some cases. That question would treat `Mr X` and `mR x` as duplicates. This one would not. Also, the code is necessarily much more convoluted. – Sparhawk Oct 08 '18 at 11:39
  • Possible duplicate of [remove duplicated pattern/entries within each field in CSV file](https://unix.stackexchange.com/questions/432151/remove-duplicated-pattern-entries-within-each-field-in-csv-file) – Romeo Ninov Oct 13 '18 at 05:01
  • A `duplicated pattern/entries within each **field**` is clearly **not** the same as `duplicated field within each **row**`. –  Oct 14 '18 at 14:29

1 Answer

You can try:

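#!/bin/bash

# Read input.csv line by line and write the de-duplicated lines to output.csv.
while IFS= read -r line; do
    # Put each name on its own line, drop repeats (awk keeps the first
    # occurrence and the original order, unlike sort -u), then rejoin with commas.
    printf '%s\n' "$line" | tr ',' '\n' | awk '!seen[$0]++' | paste -sd, -
done < input.csv > output.csv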
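
The loop above starts several processes for every input line. As a rough alternative, the same per-line de-duplication can be sketched as a single awk pass that keeps the first occurrence of each name in its original position. The function name `dedupe_fields` is made up for this example, and the sketch assumes that no field contains a quoted comma.

```shell
#!/bin/sh
# dedupe_fields is a hypothetical helper name, not part of the answer above.
# It reads comma-separated lines on stdin and prints each line with repeated
# fields removed, keeping the first occurrence and the original order.
# Assumption: no field contains a quoted comma.
dedupe_fields() {
    awk -F, '{
        split("", seen)      # forget the fields seen on the previous line
        n = 0                # number of fields kept so far on this line
        out = ""
        for (i = 1; i <= NF; i++) {
            if (!($i in seen)) {
                seen[$i] = 1
                out = (n++ ? out "," : "") $i
            }
        }
        print out
    }'
}

# Usage: dedupe_fields < input.csv > output.csv
```

Matching is exact, so `Mr X` and `mR x` are treated as different names, and the lookup table is reset for every line, so a name may still appear on several lines.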
  • This will fail if any field contains `,`, as in `one,"some, text in one field",three`, which many CSV files may contain. In short: do not parse CSV with text tools. –  Oct 08 '18 at 11:43
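
To illustrate the quoted-comma case from the comment above, here is a minimal sketch that walks each line character by character and only splits on commas outside double quotes. The name `dedupe_quoted_fields` is hypothetical. The sketch assumes that quoted fields never span several lines and that two fields are duplicates only when their raw text (quotes included) is identical. A real CSV parser remains the safer choice.

```shell
#!/bin/sh
# dedupe_quoted_fields is a hypothetical name used only for this sketch.
# Assumptions: fields are quoted with double quotes, a quoted field stays on
# one line, and fields are compared by their raw text, quotes included.
dedupe_quoted_fields() {
    awk '
    # Append field f to the output line unless it was already seen on this line.
    function keep(f) {
        if (!(f in seen)) {
            seen[f] = 1
            out = (n++ ? out "," : "") f
        }
    }
    {
        split("", seen)      # forget the fields seen on the previous line
        n = 0
        out = ""
        field = ""
        in_quotes = 0
        len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (c == "\"") in_quotes = !in_quotes   # a doubled "" toggles twice
            if (c == "," && !in_quotes) {
                keep(field)                         # unquoted comma ends the field
                field = ""
            } else {
                field = field c
            }
        }
        keep(field)          # the last field has no trailing comma
        print out
    }'
}

# Usage: dedupe_quoted_fields < input.csv > output.csv
```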