1

I have the following file (note that the ======== are actually present in the file):

start ======== id: 5713
start ======== id: 5911
start ======== id: 5911
end ========= id: 5911
start ======== id: 6111
end ========= id: 5713
start ======== id: 31117

I want to remove every pair of lines that share the same id, where one line contains start and the other contains end.

Based on the above example, the output should be:

start ======== id: 5911
start ======== id: 6111
start ======== id: 31117

How can I do this with bash, awk, sed, ...?

AdminBee
MOHAMED
  • @Kusalananda, I think it is not a duplicate, at least not of this. First field can be different. (my mistake...) – pLumo Sep 21 '21 at 15:45
  • @MOHAMED also, in the output `start ======== id: 5713` is missing – pLumo Sep 21 '21 at 15:46
  • It's a duplicate of this ---> [Remove lines based on duplicates within one column without sort](https://unix.stackexchange.com/questions/171091/remove-lines-based-on-duplicates-within-one-column-without-sort) – pLumo Sep 21 '21 at 15:47
  • @pLumo I see. It's reopened. ... and closed against the other dupe. – Kusalananda Sep 21 '21 at 15:49
  • I don't think it's a dupe at all. Here, we only remove a line if it is present twice BUT once with a `start` and once with an `end`. The solutions in the dupe don't handle this. @Kusalananda – terdon Sep 21 '21 at 15:53
  • this is the better answer to my question: https://unix.stackexchange.com/questions/412215/remove-lines-having-the-same-value-in-a-given-column – MOHAMED Sep 21 '21 at 16:40
  • you've shown `start` lines without a matching (and equal number of) `end` line(s) but is it also possible for the input to have an `end` line without a matching `start` line? – cas Sep 21 '21 at 17:05

2 Answers

5

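Using any awk in any shell on every Unix box, this prints every unpaired start and/or end line in your input. It keeps a running count per id (+1 for each start, -1 for each end) and prints as many lines as that count differs from zero: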

$ cat tst.awk
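$1 == "start" { beg[$NF] = $0; delta =  1 }   # remember the start line for this id
$1 == "end"   { end[$NF] = $0; delta = -1 }   # remember the end line for this id
{ cnt[$NF] += delta; delta = 0 }              # net count per id; other lines add 0
END {
    for ( key in cnt ) {
        # cnt > 0: that many unpaired start lines
        for (i=1; i<=cnt[key]; i++) {
            print beg[key]
        }
        # cnt < 0: that many unpaired end lines
        for (i=-1; i>=cnt[key]; i--) {
            print end[key]
        }
    }
}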

$ awk -f tst.awk file
start ======== id: 5911
start ======== id: 6111
start ======== id: 31117

To better demonstrate using more comprehensive sample input:

$ cat file
start ======== id: 5713
start ======== id: 5911
start ======== id: 5911
start ======== id: 5911
end ========= id: 5911
start ======== id: 6111
end ========= id: 5713
end ========= id: 5713
start ======== id: 31117

$ awk -f tst.awk file
end ========= id: 5713
start ======== id: 5911
start ======== id: 5911
start ======== id: 6111
start ======== id: 31117
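
Note that `for ( key in cnt )` visits ids in an unspecified order, so the output order above can differ between awk implementations. If the original line order matters, a two-pass variant is one option. This is only a minimal sketch, not part of the script above: the function name `unpaired_in_order` is made up for illustration, it assumes the input is a regular file (it is read twice, so it cannot come from a pipe), and it keeps the earliest surplus lines for each id.

```shell
# Hypothetical helper: print unpaired start/end lines in input order.
# Assumes $1 is a regular file, because awk reads it twice.
unpaired_in_order() {
    awk '
        # Pass 1: net count per id (number of starts minus number of ends).
        NR == FNR {
            if ($1 == "start")    net[$NF]++
            else if ($1 == "end") net[$NF]--
            next
        }
        # Pass 2: print surplus lines as they are met, until the surplus is used up.
        $1 == "start" && net[$NF] > 0 { print; net[$NF]-- }
        $1 == "end"   && net[$NF] < 0 { print; net[$NF]++ }
    ' "$1" "$1"
}
```

It would be called as `unpaired_in_order file`.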
Ed Morton
  • +1: nice and well analyzed; you essentially print all the hits that contribute to `cnt[key]` not being zero. Adding that word of explanation might go a long way to explain the algorithmic approach you chose. Also replacing the `END` block with: `END{for ( key in cnt ) {mult=cnt[key]>=0?1:-1; for (i=mult; mult*i<=mult*cnt[key]; i+=mult) {print cnt[key]>=0?beg[key]:end[key]} }}` collapses the 2 inner `for` loops into 1. It works but unfortunately it's not really readable. – Cbhihe Sep 22 '21 at 07:18
  • Anyway. It's great solution. – K-attila- Sep 22 '21 at 13:24
0

Using just sed, nl, and sort:

nl  <filename> -s ":"|sort -t ":" -k 3 -k 2 | sed  -n ":x s/\n[0-9 ]*$//;/end[^\n]*$/{N;bx};s/\(.*\)[ 0-9]*:end .*id:\( [0-9]*\).*\n.*start.*id:\2[^0-9]*$/\1/;tx;s/\n$//;/start/{P;D};/^[ 0-9]*:end[^\n]*/{s/\n[0-9:]*$/$/;N;bx};/start/P;/end/P;" | sort -n| sed "s/[ 0-9]*://"

nl  tt -s ":"|sort -t ":" -k 3 -k 2 | sed  -n ":x s/\n[0-9 ]*$//;/end[^\n]*$/{N;bx};s/\(.*\)[ 0-9]*:end .*id:\( [0-9]*\).*\n.*start.*id:\2[^0-9]*$/\1/;tx;s/\n$//;/start/{P;D};/^[ 0-9]*:end[^\n]*/{s/\n[0-9:]*$/$/;N;bx};/start/P;/end/P;" | sort -n| sed "s/[ 0-9]*://"
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5911
start ======== id: 6111
end ======== id: 31117

If the order is not important (and every end line has a matching start line):

sort <filename> -t ":" -k 2 | sed -e '/end/{N;d;}'

start ======== id: 31117 
start ======== id: 5911 
start ======== id: 6111 
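
The short pipeline above relies on the sort placing each end line directly before a start line with the same id, so that `N;d` removes the two together. A minimal sketch of the same idea as a function, with the locale pinned so the ordering is predictable; the name `drop_pairs_sorted` and the use of `LC_ALL=C` are my assumptions, not part of the original command:

```shell
# Hypothetical wrapper around the sort + sed pipeline above.
# LC_ALL=C makes sort compare bytes, so for equal ids "end" sorts before "start".
drop_pairs_sorted() {
    LC_ALL=C sort -t ":" -k 2 "$1" |
        sed -e '/end/{N;d;}'    # on an end line, pull in the next line and delete both
}
```

It would be called as `drop_pairs_sorted <filename>`. As with the original, it gives wrong results if an end line has no matching start line after it.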

This version is better (it still needs some cleanup, but it works):

sort <filename> -t ":" -k 2 | sed  -n ":x ;/end[^\n]*$/{N;bx};s/\(.*\)end .*id:\( [0-9]*\).*start.*id:\2[^0-9]*$/\1/;tx;s/\n$//;/start/{P;D};/^end[^\n]*/{s/\n$/$/;N;bx};/start/P;/end/P"

cat tt
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
dggdgtfZZ
start ======== id: 5713
start ======== id: 5713
start ======== id: 5911
start ======== id: 5911
end ========= id: 5911
start ======== id: 6111
end ========= id: 5713
end ========= id: 5713
end ========= id: 5713
end ========= id: 5713
end ========= id: 5713
start ======== id: 31117
end ======== id: 31117
end ======== id: 31117



sort -t ":" -k 2 tt| sed  -n ":x ;/end[^\n]*$/{N;bx};s/\(.*\)end .*id:\( [0-9]*\).*start.*id:\2[^0-9]*$/\1/;tx;s/\n$//;/start/{P;D};/^end[^\n]*/{s/\n$/$/;N;bx};/start/P;/end/P" 
end ======== id: 31117
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5911
start ======== id: 6111
K-attila-