15

I have a one-column file of names, each repeated several times in a row. I want to collapse each run of identical adjacent names into a single line, while keeping later occurrences of the same name that are not adjacent to the earlier run.

For example, I want to turn the left column into the right column:

Golgb1    Golgb1    
Golgb1    Akna
Golgb1    Spata20
Golgb1    Golgb1
Golgb1    Akna
Akna
Akna
Akna
Spata20
Spata20
Spata20
Golgb1
Golgb1
Golgb1
Akna
Akna
Akna

This is what I've been using:

perl -ne 'print if ++$k{$_}==1' file.txt > file2.txt

However, this keeps only the first occurrence of each name from the left column (i.e. Golgb1 and Akna are not repeated in later blocks).

Is there a way to keep unique names for each block, while keeping names that repeat in multiple, non-adjacent blocks?

Age87

5 Answers

25

uniq will do this for you:

$ uniq inputfile
Golgb1
Akna
Spata20
Golgb1
Akna
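
To see why `uniq` fits here, note that it compares each line only with the line directly before it, so a name that returns in a later block is printed again. The sketch below contrasts that with a "first occurrence only" filter; the awk expression is an assumed equivalent of the Perl one-liner from the question, and the shortened input and variable names are made up for illustration.

```shell
#!/bin/sh
# Shortened, hypothetical input: blocks of repeated names, two of which
# come back in later, non-adjacent blocks.
input=$(printf '%s\n' Golgb1 Golgb1 Akna Akna Spata20 Spata20 Golgb1 Golgb1 Akna)

# uniq drops a line only when it equals the line directly before it.
adjacent=$(printf '%s\n' "$input" | uniq)

# Keeps only the first occurrence of each line anywhere in the input
# (same idea as the hash-based Perl one-liner in the question).
global=$(printf '%s\n' "$input" | awk '!seen[$0]++')

printf 'uniq:\n%s\n\nfirst occurrence only:\n%s\n' "$adjacent" "$global"
```

The first result keeps the second Golgb1 and Akna blocks; the second does not.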
DopeGhoti
10

Awk solution:

awk '$1 != name{ print }{ name = $1 }' file.txt

The output:

Golgb1
Akna
Spata20
Golgb1
Akna
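
The command above compares only the first whitespace-separated field (`$1`), which amounts to comparing whole lines in a one-column file. As a possible variant, the sketch below compares the entire line (`$0`) and always prints the first line; the latter matters if the file starts with a blank line, because an unset awk variable compares equal to the empty string. The variable name `prev` and the input are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical input: a blank first line, then blocks of repeats.
input=$(printf '\nGolgb1\nGolgb1\nAkna\nAkna\nGolgb1\n')

# NR == 1 always prints the first line; after that, a line is printed
# only when it differs from the previous whole line ($0).
result=$(printf '%s\n' "$input" |
  awk 'NR == 1 || $0 != prev { print } { prev = $0 }')

printf '%s\n' "$result"
```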
RomanPerekhrest
6

Try this: save the previous line and compare it against the current line.

$ perl -ne 'print if $p ne $_; $p=$_' ip.txt
Golgb1
Akna
Spata20
Golgb1
Akna

You've tagged uniq as well - did you try it?

$ uniq ip.txt
Golgb1
Akna
Spata20
Golgb1
Akna
Sundeep
1

With sed it can be done as follows:

sed -e '$!N;/^\(.*\)\n\1$/!P;D' input_file

Here the pattern space holds two lines at any time. When the two lines differ, we print the first one; in either case we then delete it from the front, go back, and append the next line to the pattern space. Rinse and repeat.
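
As a small illustration of that two-line window, the sketch below runs the same command on a shortened, made-up input, with each sed command annotated in the comments.

```shell
#!/bin/sh
# $!N               append the next input line (skipped on the last line)
# /^\(.*\)\n\1$/!P  if the two lines are NOT identical, print the first one
# D                 delete the first line and restart the cycle with the rest
result=$(printf '%s\n' Golgb1 Golgb1 Golgb1 Akna Akna Golgb1 |
  sed -e '$!N;/^\(.*\)\n\1$/!P;D')

printf '%s\n' "$result"
```

On this input the three blocks collapse to Golgb1, Akna, Golgb1, one line each.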

Using Perl in slurp mode (-0777), we treat the whole file as one long string; the regex, applied to that string, does the comparison for you.

perl -0777pe 's//$1/ while /^(.*\n)\1+/gm' input_file
Rakesh Sharma
0

Question about Rakesh Sharma's sed solution.

What if you have an input file such as:

-126.1 48.206
-126.106 48.21
-126.11 48.212
-126.114 48.214
-126.116 48.216
-126.118 48.216
-126.128 48.222
-126.136 48.226

And you want an output file to be:

-126.1 48.206
-126.106 48.21
-126.11 48.212
-126.114 48.214
-126.116 48.216
-126.128 48.222
-126.136 48.226

Note the missing:

-126.118 48.216

I know the command I want is similar to your solution:

sed -e '$!N;/^\(.*\)\n\1$/!P;D' input_file

I cannot work out how to alter it so that it prints both columns but drops a line only when its column 2 value repeats that of the previous line. Any tips?

MattS

`sed -e '$!N' -e '/.*\.\([0-9]*\)\n.*\.\1$/!{P;D;}' -e 's/\n.*//;s/^/\n/;D'` will delete the subsequent repeating elements. Note: This requires `GNU sed`. For `POSIX` behavior, it needs slight alteration. – Rakesh Sharma Jun 28 '18 at 08:02
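
For the two-column case, adapting the awk approach from the earlier answer may be simpler than reworking the sed command. The following is only a sketch under that assumption: it prints a line unless its second field equals the second field of the previous line, using the sample data from the question. The variable name `prev` is made up for illustration.

```shell
#!/bin/sh
# Sample coordinates from the question; two consecutive lines share 48.216.
input='-126.1 48.206
-126.106 48.21
-126.11 48.212
-126.114 48.214
-126.116 48.216
-126.118 48.216
-126.128 48.222
-126.136 48.226'

# Print the first line unconditionally, then any line whose column 2
# differs from column 2 of the line before it.
result=$(printf '%s\n' "$input" |
  awk 'NR == 1 || $2 != prev { print } { prev = $2 }')

printf '%s\n' "$result"
```

Only the `-126.118 48.216` line is dropped, since its second column repeats the one directly above it; both columns of every kept line are printed unchanged.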