Remove all duplicate words from a string using a shell script

Question

I have a string like

"aaa,aaa,aaa,bbb,bbb,ccc,bbb,ccc"

I want to remove the duplicate words from the string, so that the output looks like

"aaa,bbb,ccc"

I tried this code (Source):

$ echo "zebra ant spider spider ant zebra ant" | xargs -n1 | sort -u | xargs

It works fine with that sample value, but when I pass my own variable's value, the output still contains all the duplicate words.

How can I remove the duplicate values?

UPDATE

What I am actually trying to do is collect all the corresponding values into a single string for each user. I have data like this:

   user name    | colour
    AAA         | red
    AAA         | black
    BBB         | red
    BBB         | blue
    AAA         | blue
    AAA         | red
    CCC         | red
    CCC         | red
    AAA         | green
    AAA         | red
    AAA         | black
    BBB         | red
    BBB         | blue
    AAA         | blue
    AAA         | red
    CCC         | red
    CCC         | red
    AAA         | green

In my code I fetch all the distinct users, then concatenate the colour strings for each one, which works. For that I am using this code:

# inside the loop that reads the records

    if [ "$c" == "" ]; then  # $c is defined globally
        c="$colour1"
    else
        c="$c,$colour1" 
    fi

When I print the $c variable, I get this output (for user AAA):

"red,black,blue,red,green,red,black,blue,red,green,"

I want to remove the duplicate colours, so the desired output is

"red,black,blue,green"

To get this output I used the code above:

 echo "zebra ant spider spider ant zebra ant" | xargs -n1 | sort -u | xargs

but it displays the output with the duplicate values still in it, like

"red,black,blue,red,green,red,black,blue,red,green,"

Thanks

Please clarify what is wrong with what you are using. I don't understand what you mean by "when I give my variable value". What value do you give? Where does it fail? — terdon, Mar 23 '17 at 12:57
`echo 'aaa aaa aaa bbb bbb ccc bbb ccc' | xargs -n1 | sort -u | xargs` gives `aaa bbb ccc`, so you need to show the exact code you tried and the output you got, with the string in a variable: `s='aaa aaa aaa bbb bbb ccc bbb ccc'; echo "$s" | xargs -n1 | sort -u | xargs` — Sundeep, Mar 23 '17 at 13:01
The string value comes in dynamically. The code prints the same value back, still containing the duplicates. — Urvashi, Mar 23 '17 at 13:02
yeah, show the code that failed, otherwise how would we know what could've gone wrong? — Sundeep, Mar 23 '17 at 13:02
@JacobVlijm yes, order matters. I updated my question so you can understand it more easily. — Urvashi, Mar 24 '17 at 05:33
@Urvashi your string uses `,` as delimiter while the code you found worked on space as delimiter... why do you expect it to work on your string? all answers attempted will now be invalidated because of that — Sundeep, Mar 24 '17 at 05:38
When I tried that code, I removed the `,` and put a space in its place, but even then I could not get it to work. — Urvashi, Mar 24 '17 at 05:40
again we cannot debug code which you don't show, also your expected output `"red,black,blue,red,green,"` has `red` repeated... and `,` at end of string is required? — Sundeep, Mar 24 '17 at 05:45
@Sundeep `"red,black,blue,green"` is the desired output. It was a typing mistake, which I have corrected. — Urvashi, Mar 24 '17 at 05:48
try a simple `awk` + `paste` command instead of shell scripting, `awk '$1=="AAA" {if(!seen[$3]++) print $3}' input.txt | paste -sd,` where you need to replace `input.txt` with name of your file — Sundeep, Mar 24 '17 at 05:50
Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/55947/discussion-between-urvashi-and-sundeep). — Urvashi, Mar 24 '17 at 05:54

George Vasiliou · Accepted Answer · 2017-03-23T14:20:21.430

18

One more awk, just for fun:

$ a="aaa bbb aaa bbb ccc aaa ddd bbb ccc"
$ echo "$a" | awk '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s%s",$i,FS)}{printf("\n")}'
aaa bbb ccc ddd

By the way, even your solution works fine with variables:

$ b="zebra ant spider spider ant zebra ant" 
$ echo "$b" | xargs -n1 | sort -u | xargs
ant spider zebra
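
The string in the question is comma-delimited rather than space-delimited, so the same awk idea needs a different field separator. The following is only a sketch under that assumption, not part of the original answer: `dedupe_csv` is a hypothetical helper name, the input is assumed to be a single line, and empty fields (such as the one produced by a trailing comma) are skipped.

```shell
# dedupe_csv (hypothetical name): print the comma-separated words of $1
# once each, in first-seen order. Assumes $1 is a single line.
dedupe_csv() {
    printf '%s\n' "$1" | awk -F, '{
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i != "" && !seen[$i]++)        # skip empty fields and repeats
                out = (out == "" ? $i : out "," $i)
        print out
    }'
}

# Example call:
#   dedupe_csv "red,black,blue,red,green,red,black,blue,red,green,"
#   prints: red,black,blue,green
```

Building the result in `out` and printing it once avoids the trailing separator that printing `$i` followed by `FS` would leave behind.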

edited Mar 23 '17 at 14:20

answered Mar 23 '17 at 14:12

George Vasiliou


Neat approach. The only adjustment I had to make was to use `%s` instead of `%s%s`. The reason is that I was looping over the results with a for loop, and the two whitespace characters caused some problems with regex matches. – JeremyCanfield Mar 20 '19 at 04:37
what if there are other word separators, such as dots, involved? – xeruf Jan 09 '22 at 13:58

score 13 · Answer 2 · edited Feb 09 '21 at 16:19


With tr and sort -u (equivalent to sort | uniq)

echo "zebra ant spider spider ant zebra ant" | tr ' ' '\n' | sort -u

or

echo "zebra ant spider spider ant zebra ant" | tr ' ' '\n' | sort -u | xargs

to get one line
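
For the comma-separated string from the question, the same pipeline can be adapted by translating commas instead of spaces and joining the result with `paste`. This is a sketch, not the answerer's code: `dedupe_sorted_csv` is a hypothetical name, `LC_ALL=C` is an added assumption to make the sort order predictable, and the output is sorted rather than kept in input order.

```shell
# dedupe_sorted_csv (hypothetical name): unique comma-separated words of $1,
# sorted in byte order and joined back with commas.
dedupe_sorted_csv() {
    printf '%s\n' "$1" |
        tr ',' '\n' |           # one word per line
        sed '/^$/d' |           # drop empty lines (e.g. from a trailing comma)
        LC_ALL=C sort -u |      # sort and remove duplicates
        paste -sd, -            # join the lines with commas
}

# Example call:
#   dedupe_sorted_csv "red,black,blue,red,green,"
#   prints: black,blue,green,red
```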

edited Feb 09 '21 at 16:19

SuperSandro2000


answered Mar 23 '17 at 12:55

Michael D.

You need to add `| xargs` to join the output to one line again – Philippos Mar 23 '17 at 12:59
Or use `sort -u`. Or even an `awk '!u[$0]++'`. – Benoît Mar 23 '17 at 18:42
@Benoît Wow, I did not know about `sort -u`. I've been using `sort | uniq` all this time. The wasted keystrokes... – gardenhead Mar 24 '17 at 01:25

score 9 · Answer 3 · answered Mar 23 '17 at 15:25


$ echo "zebra ant spider spider ant zebra ant"  | awk -v RS="[ \n]+" '!n[$0]++' 
zebra
ant
spider

answered Mar 23 '17 at 15:25

JJoao

Very clever!!!! – George Vasiliou Mar 24 '17 at 00:54
@GeorgeVasiliou, thank you [or to tell the truth, very lazy :-) ] – JJoao Mar 24 '17 at 08:44

score 3 · Answer 4 · answered Mar 23 '17 at 12:52


With gnu sed:

sed ':s;s/\(\<\S*\>\)\(.*\)\<\1\>/\1\2/g;ts'

You may add ;s/  */ /g to collapse duplicate spaces.

It works like this: if a word occurs a second time in the line, the later occurrence is removed, and the substitution starts over until no duplicates remain.

answered Mar 23 '17 at 12:52

Philippos


What are `\<` and `\>`? – someonewithpc Mar 23 '17 at 20:19
@someonewithpc They match no character, but the beginning and end of a word to prevent substrings from being matched. – Philippos Mar 23 '17 at 21:29
Nice, but is that portable? Also, aren't words separated by whitespace? Seems redundant to match not whitespace followed by the end of a word. – someonewithpc Mar 23 '17 at 21:34
@someonewithpc No, it's not standard, that's why I wrote _gnu sed_. The nice part is that you don't have to handle first and last string separately – Philippos Mar 23 '17 at 21:44

score 2 · Answer 5 · answered Mar 23 '17 at 13:07


perl -lane '$,=$";print grep { ! $h{$_}++ } @F'

answered Mar 23 '17 at 13:07

ilkkachu · Answer 6 · 2017-03-23T13:58:20.180

2

Obligatory awk solution:

$ echo "ant zebra ant spider spider ant zebra ant" | 
   awk -vRS=" " -vORS=" " '!a[$1] {a[$1]++} END{ for (x in a) print x;  } ' ; echo
zebra ant spider

(The final echo is there for the newline)

edited Mar 23 '17 at 13:58

answered Mar 23 '17 at 13:52

ilkkachu


Plus one for the awk! I was also building an awk solution just for fun. There is a slight possibility that the words are printed in random order in the END section, due to the arbitrary way awk iterates over array keys. – George Vasiliou Mar 23 '17 at 14:14
Yes, they will be printed in an essentially random order. The `sort` solution doesn't keep the original order either, though. – ilkkachu Mar 23 '17 at 14:17
Yes, good point! Even sort prints in different order than input. – George Vasiliou Mar 23 '17 at 14:18
@ilkkachu Actually we don't need to wait for the input to end. We can make decision to print or not to print with a slight modification to your code: `awk -vRS=" " -vORS=" " '!a[$1]++ {print $1}' ; echo` This preserves the order. – Mar 23 '17 at 14:31

score 1 · Answer 7 · edited Jun 11 '20 at 14:16

Python

Option 1

#!/usr/bin/env python
# get_unique_words.py

import sys

# Keep each word the first time it appears, preserving input order.
l = []
for w in sys.argv[1].split(','):
  if w not in l:
    l.append(w)
print(','.join(l))

Make executable, then call from Bash:

$ ./get_unique_words.py "aaa,aaa,aaa,bbb,bbb,ccc,bbb,ccc"
aaa,bbb,ccc

Or you could implement it as a Bash function, but the syntax is messy.

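get_unique_words(){
  # Pass the string as an argument instead of pasting it into the Python source.
  python -c "
import sys
l = []
for w in sys.argv[1].split(','):
  if w not in l:
    l.append(w)
print(','.join(l))" "$1"
}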

Option 2

This option can become a one-liner if needed:

#!/usr/bin/env python
# get_unique_words.py

import sys

s_in = sys.argv[1]
l_in = s_in.split(',') # Turn string into a list.
set_out = set(l_in) # Turning a list into a set removes duplicate items (order is not preserved).
s_out = ','.join(set_out)
print(s_out)

In Bash:

get_unique_words(){
  python -c "import sys; print(','.join(set(sys.argv[1].split(','))))" "$1"
}

score 0 · Answer 8 · edited Mar 20 '19 at 20:54


# Requires GNU awk: asorti() is a gawk extension.
awk '{ delete a; for (i=1; i<=NF; i++) a[$i]++; n=asorti(a, b); for (i=1; i<=n; i++) printf "%s ", b[i]; print "" }' filename > newfile

edited Mar 20 '19 at 20:54

George Vasiliou


answered Dec 02 '18 at 04:18

天津神こと

I do not get it – Pierre.Vriens Dec 02 '18 at 07:00
Your code lacks an explanation. Without one, it's difficult to follow what's happening. You also seem to make assumptions about the data that seem wrong (whitespace-delimited fields) and about the particular `awk` implementation being used (`asorti()` is not a standard `awk` function). – Kusalananda Mar 20 '19 at 21:15

score 0 · Answer 9 · answered Mar 20 '19 at 21:40

Using the original tabular data in the file called file:

sed '1d' file | sort -u |
awk '{ color[$1] = ( color[$1] == "" ? $3 : color[$1] "," $3 ) }
     END { for (user in color) print user, color[user] }'

This generates

CCC red
BBB blue,red
AAA black,blue,green,red

The three steps of the pipeline:

The sed command removes the first line which is a header that we don't want to read.

The sort command gives us unique lines. The sample data after the sort looks like

AAA         | black
AAA         | blue
AAA         | green
AAA         | red
BBB         | blue
BBB         | red
CCC         | red

The awk command takes this data and produces a comma-delimited string for each user in the array color (where the username is the key into the array). At the end (in the END block), all collected data is outputted.
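
Because the pipeline above sorts its input, the colours come out alphabetically and the users in arbitrary order, while the question's comments say that order matters. The following is a hedged sketch of an order-preserving variant, not the answerer's code: `colours_by_user` is a hypothetical name, and it assumes the layout shown in the question (one header line, then `user | colour` rows).

```shell
# colours_by_user (hypothetical name): print each user with its unique
# colours, both in first-seen order. Reads the named files, or stdin.
colours_by_user() {
    awk -F'|' '
        NR == 1 { next }                        # skip the header line
        {
            user = $1; colour = $2
            gsub(/^[ \t]+|[ \t]+$/, "", user)   # trim surrounding blanks
            gsub(/^[ \t]+|[ \t]+$/, "", colour)
            if (user == "" || colour == "") next
            if (seen[user, colour]++) next      # colour already listed for this user
            if (user in list) {
                list[user] = list[user] "," colour
            } else {
                order[++n] = user               # remember first-seen user order
                list[user] = colour
            }
        }
        END { for (i = 1; i <= n; i++) print order[i], list[order[i]] }
    ' "$@"
}

# Example call:
#   colours_by_user file
```

With the sample table from the question this should list AAA, BBB and CCC in that order, with `red,black,blue,green` for AAA.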

score 0 · Answer 10 · answered Jan 09 '22 at 14:10

So, I had a weird issue where each file title was doubled, and sought a way to fix that:

❯ ls
01 Miracle Miracle.mp3
02 Spirit of Life Spirit of Life.mp3
03 Let It Be (feat. Veela).mp3
04 Embrace Embrace.mp3
05 Don't Let Me Down (feat. Cat Martin) Don't Let Me Down (feat. Cat Martin).mp3
06 My Love My Love.mp3
07 The Drift The Drift.mp3
08 Lucid Truth Lucid Truth.mp3
09 Love At Heart Love At Heart.mp3
10 Sarajevo (Blackmill Remix) Sarajevo (Blackmill Remix).mp3
11 Fortune Soul Fortune Soul.mp3

I adapted the sed script of Philippos:

❯ ls | sed ':s;s/ \([^ ]\+\) \(.*\)\1/ \1 \2/g;ts'
01 Miracle .mp3
02 Spirit of Life   .mp3
03 Let It Be (feat. Veela).mp3
04 Embrace .mp3
05 Don't Let Me Down (feat. Cat Martin)       .mp3
06 My Love  .mp3
07 The Drift  .mp3
08 Lucid Truth  .mp3
09 Love At Heart   .mp3
10 Sarajevo (Blackmill Remix)   .mp3
11 Fortune Soul  .mp3

And used it to fix the filenames:

❯ find -type f -exec sh -c 'mv -v "{}" "$(echo "{}" | sed ":s;s/ \([^ ]\+\) \(.*\)\1/ \1 \2/g;ts" | sed "s/ \+\././")"' \;
mv: './03 Let It Be (feat. Veela).mp3' and './03 Let It Be (feat. Veela).mp3' are the same file
renamed './06 My Love My Love.mp3' -> './06 My Love.mp3'
renamed './10 Sarajevo (Blackmill Remix) Sarajevo (Blackmill Remix).mp3' -> './10 Sarajevo (Blackmill Remix).mp3'
renamed './07 The Drift The Drift.mp3' -> './07 The Drift.mp3'
renamed './04 Embrace Embrace.mp3' -> './04 Embrace.mp3'
renamed './05 Don'\''t Let Me Down (feat. Cat Martin) Don'\''t Let Me Down (feat. Cat Martin).mp3' -> './05 Don'\''t Let Me Down (feat. Cat Martin).mp3'
renamed './02 Spirit of Life Spirit of Life.mp3' -> './02 Spirit of Life.mp3'
renamed './09 Love At Heart Love At Heart.mp3' -> './09 Love At Heart.mp3'
renamed './01 Miracle Miracle.mp3' -> './01 Miracle.mp3'
renamed './11 Fortune Soul Fortune Soul.mp3' -> './11 Fortune Soul.mp3'
renamed './08 Lucid Truth Lucid Truth.mp3' -> './08 Lucid Truth.mp3'

score 0 · Answer 11 · answered Aug 03 '22 at 23:27

Using Raku (formerly known as Perl_6)

raku -ne 'state %h; next if $++ == 0; .split(/ "|" | \s+ /, :skip-empty) andthen %h.=append($_.[0] => $_.[1]); END say(.key => .value.unique) for %h.sort;'

Sample Input:

   user name    | colour
    AAA         | red
    AAA         | black
    BBB         | red
    BBB         | blue
    AAA         | blue
    AAA         | red
    CCC         | red
    CCC         | red
    AAA         | green
    AAA         | red
    AAA         | black
    BBB         | red
    BBB         | blue
    AAA         | blue
    AAA         | red
    CCC         | red
    CCC         | red
    AAA         | green

Sample Output:

AAA => (red black blue green)
BBB => (red blue)
CCC => (red)

Briefly, input is read using Raku's -ne "linewise, non-autoprinting" commandline flags. A hash %h is stated (once). The statement next if $++ == 0; is used to skip the header line. Each line is split on either | pipe or \s+ whitespace, empty elements are skipped. Resultant username and colour elements are appended to the %h hash in a username => colour (key => value) relationship. Because keys must be unique in the %h hash, .values accrue to the appropriate .key.

At the END (after all lines are read), the code says keys and their unique values. To get output where key/value columns are separated by \t tabs, change say into put.

score -1 · Answer 12 · edited Jul 31 '22 at 07:06


a="aaa aaa aaa bbb bbb ccc bbb ccc"
for item in $a
do
   echo $item
done | sort -u | (while read i; do ans="$ans $i"; done ; echo $ans)

Explanation:

The "for" loop splits the original line into words and prints each word on its own line, which gives us a list of the original words.
Then we sort the list obtained in the first step and remove all duplicates using "sort -u".
In the last step we read the deduplicated list and join it back into a single string.
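
The three steps above sort the words, so the original order is lost. As an alternative, here is a minimal sketch that keeps the first-seen order using only POSIX shell built-ins. It is not the answerer's code: `dedupe_words` is a hypothetical name, words are assumed to be whitespace-separated and free of glob characters such as `*`, and the `result` variable is global because POSIX sh has no `local`.

```shell
# dedupe_words (hypothetical name): print the whitespace-separated words
# of $1 once each, in first-seen order.
dedupe_words() {
    result=
    for word in $1; do                   # unquoted on purpose: split into words
        case " $result " in
            *" $word "*) ;;              # already in the result, skip it
            *) result="${result:+$result }$word" ;;
        esac
    done
    printf '%s\n' "$result"
}

# Example call:
#   dedupe_words "zebra ant spider spider ant zebra ant"
#   prints: zebra ant spider
```

Padding both the result and the candidate word with spaces in the `case` pattern keeps a word like `ant` from matching inside a longer word such as `antelope`.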

edited Jul 31 '22 at 07:06

Prokhozhii


answered Mar 24 '17 at 00:18

Tododo Fly


Please add an explanation on how your code works and why you did this and that. – xhienne Mar 24 '17 at 01:37
