How to remove lines that have the same id string

Question

I have the following file (note that the ======== are actually present in the file):

start ======== id: 5713
start ======== id: 5911
start ======== id: 5911
end ========= id: 5911
start ======== id: 6111
end ========= id: 5713
start ======== id: 31117

I want to remove every pair of lines that share the same id, where one line contains start and the other contains end.

Based on the above example, the output should be:

start ======== id: 5911
start ======== id: 6111
start ======== id: 31117

How can I do this with bash, awk, sed, etc.?

@Kusalananda, I think it is not a duplicate, at least not of this. First field can be different. (my mistake...) — pLumo, Sep 21 '21 at 15:45
@MOHAMED also, in the output `start ======== id: 5713` is missing — pLumo, Sep 21 '21 at 15:46
It's a duplicate of this ---> [Remove lines based on duplicates within one column without sort](https://unix.stackexchange.com/questions/171091/remove-lines-based-on-duplicates-within-one-column-without-sort) — pLumo, Sep 21 '21 at 15:47
@pLumo I see. It's reopened. ... and closed against the other dupe. — Kusalananda, Sep 21 '21 at 15:49
I don't think it's a dupe at all. Here, we only remove a line if it is present twice BUT once with a `start` and once with an `end`. The solutions in the dupe don't handle this. @Kusalananda — terdon, Sep 21 '21 at 15:53
this is the better answer to my question: https://unix.stackexchange.com/questions/412215/remove-lines-having-the-same-value-in-a-given-column — MOHAMED, Sep 21 '21 at 16:40
you've shown `start` lines without a matching (and equal number of) `end` line(s) but is it also possible for the input to have an `end` line without a matching `start` line? — cas, Sep 21 '21 at 17:05

score 5 · Accepted Answer · answered Sep 21 '21 at 17:14

Using any awk in any shell on every Unix box, this will print as many start and/or end lines as are left unpaired in your input:

$ cat tst.awk
$1 == "start" { beg[$NF] = $0; delta =  1 }
$1 == "end"   { end[$NF] = $0; delta = -1 }
{ cnt[$NF] += delta }
END {
    for ( key in cnt ) {
        for (i=1; i<=cnt[key]; i++) {
            print beg[key]
        }
        for (i=-1; i>=cnt[key]; i--) {
            print end[key]
        }
    }
}

$ awk -f tst.awk file
start ======== id: 5911
start ======== id: 6111
start ======== id: 31117

To better demonstrate using more comprehensive sample input:

$ cat file
start ======== id: 5713
start ======== id: 5911
start ======== id: 5911
start ======== id: 5911
end ========= id: 5911
start ======== id: 6111
end ========= id: 5713
end ========= id: 5713
start ======== id: 31117

$ awk -f tst.awk file
end ========= id: 5713
start ======== id: 5911
start ======== id: 5911
start ======== id: 6111
start ======== id: 31117
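One caveat worth illustrating: awk leaves the iteration order of `for (key in cnt)` unspecified, so the order of the lines printed by the END block is not guaranteed. The following is a minimal sketch of the same counting idea that remembers ids in first-seen order. The `order` array, the counter `n`, and the extra guard that ignores lines starting with neither start nor end are assumptions added for illustration; they are not part of the original script.

```shell
#!/bin/sh
# Sample input from the question.
input='start ======== id: 5713
start ======== id: 5911
start ======== id: 5911
end ========= id: 5911
start ======== id: 6111
end ========= id: 5713
start ======== id: 31117'

# Same counting idea as tst.awk, but ids are stored in first-seen order
# so the output does not depend on the order of "for (key in array)".
result=$(printf '%s\n' "$input" | awk '
$1 == "start" { beg[$NF] = $0; delta =  1 }
$1 == "end"   { end[$NF] = $0; delta = -1 }
$1 == "start" || $1 == "end" {
    if (!($NF in cnt)) order[++n] = $NF   # record each id once
    cnt[$NF] += delta                     # net count: starts minus ends
}
END {
    for (k = 1; k <= n; k++) {
        key = order[k]
        for (i = 1;  i <= cnt[key]; i++) print beg[key]   # surplus starts
        for (i = -1; i >= cnt[key]; i--) print end[key]   # surplus ends
    }
}')

printf '%s\n' "$result"
```

As in the original, the line printed for an id is the last start (or end) line seen for that id, so this assumes all start lines for one id are textually identical.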

+1: nice and well analyzed; you essentially print all the hits that contribute to `cnt[key]` not being zero. Adding that word of explanation might go a long way to explain the algorithmic approach you chose. Also replacing the `END` block with: `END{for ( key in cnt ) {mult=cnt[key]>=0?1:-1; for (i=mult; mult*i<=mult*cnt[key]; i+=mult) {print cnt[key]>=0?beg[key]:end[key]} }}` collapses the 2 inner `for` loops into 1. It works but unfortunately it's not really readable. — Cbhihe, Sep 22 '21 at 07:18
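The single-loop idea from the comment above can probably be written in a more readable way by first deciding which kind of line is in surplus and how many copies to print. This is only a sketch under that reading of the comment; the variable names `n` and `line` are made up here for illustration.

```shell
#!/bin/sh
# Extended sample input from the answer.
input='start ======== id: 5713
start ======== id: 5911
start ======== id: 5911
start ======== id: 5911
end ========= id: 5911
start ======== id: 6111
end ========= id: 5713
end ========= id: 5713
start ======== id: 31117'

result=$(printf '%s\n' "$input" | awk '
$1 == "start" { beg[$NF] = $0; cnt[$NF]++ }
$1 == "end"   { end[$NF] = $0; cnt[$NF]-- }
END {
    for (key in cnt) {
        n    = cnt[key]
        line = (n > 0) ? beg[key] : end[key]   # which kind is in surplus
        if (n < 0) n = -n                      # how many are unpaired
        for (i = 1; i <= n; i++) print line
    }
}')

# The order of the printed ids depends on the awk implementation.
printf '%s\n' "$result"
```

Ids whose net count is zero fall through the loop without printing anything, which is what removes the matched start/end pairs.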

K-attila- · Answer 2 · 2021-09-24T09:04:19.670

Using just sed, nl and sort:

nl  <filename> -s ":"|sort -t ":" -k 3 -k 2 | sed  -n ":x s/\n[0-9 ]*$//;/end[^\n]*$/{N;bx};s/\(.*\)[ 0-9]*:end .*id:\( [0-9]*\).*\n.*start.*id:\2[^0-9]*$/\1/;tx;s/\n$//;/start/{P;D};/^[ 0-9]*:end[^\n]*/{s/\n[0-9:]*$/$/;N;bx};/start/P;/end/P;" | sort -n| sed "s/[ 0-9]*://"

nl  tt -s ":"|sort -t ":" -k 3 -k 2 | sed  -n ":x s/\n[0-9 ]*$//;/end[^\n]*$/{N;bx};s/\(.*\)[ 0-9]*:end .*id:\( [0-9]*\).*\n.*start.*id:\2[^0-9]*$/\1/;tx;s/\n$//;/start/{P;D};/^[ 0-9]*:end[^\n]*/{s/\n[0-9:]*$/$/;N;bx};/start/P;/end/P;" | sort -n| sed "s/[ 0-9]*://"
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5911
start ======== id: 6111
end ======== id: 31117

If the order is not important (and every end line has a matching start line):

sort <filename> -t ":" -k 2 | sed -e '/end/{N;d;}'

start ======== id: 31117
start ======== id: 5911
start ======== id: 6111
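To see why the sort-then-sed approach works, it can help to look at the intermediate sorted stream. Sorting on the text after the colon groups lines by id (compared as strings, so 31117 sorts before 5713), and for lines with equal keys sort falls back to comparing the whole line, which puts end before start. The sed command then deletes each end line together with the line that follows it. The sketch below sets `LC_ALL=C` as an assumption, to make the byte-wise ordering predictable; the variable names are illustrative only.

```shell
#!/bin/sh
# Sample input from the question.
input='start ======== id: 5713
start ======== id: 5911
start ======== id: 5911
end ========= id: 5911
start ======== id: 6111
end ========= id: 5713
start ======== id: 31117'

# Step 1: group by id; within one id, "end" lines sort before "start" lines.
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort -t ":" -k 2)
printf '%s\n' "$sorted"

echo '--- after sed ---'

# Step 2: on each "end" line, append the next line (N) and delete both (d).
result=$(printf '%s\n' "$sorted" | sed -e '/end/{N;d;}')
printf '%s\n' "$result"
```

If an id has more end lines than start lines, `N` pulls in a second end line instead of a start line, which is why this shortcut only holds when every end has its start.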

This is better (it still needs some repair, but it works):

sort <filename> -t ":" -k 2 | sed  -n ":x ;/end[^\n]*$/{N;bx};s/\(.*\)end .*id:\( [0-9]*\).*start.*id:\2[^0-9]*$/\1/;tx;s/\n$//;/start/{P;D};/^end[^\n]*/{s/\n$/$/;N;bx};/start/P;/end/P"

cat tt
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
dggdgtfZZ
start ======== id: 5713
start ======== id: 5713
start ======== id: 5911
start ======== id: 5911
end ========= id: 5911
start ======== id: 6111
end ========= id: 5713
end ========= id: 5713
end ========= id: 5713
end ========= id: 5713
end ========= id: 5713
start ======== id: 31117
end ======== id: 31117
end ======== id: 31117

sort -t ":" -k 2 tt| sed  -n ":x ;/end[^\n]*$/{N;bx};s/\(.*\)end .*id:\( [0-9]*\).*start.*id:\2[^0-9]*$/\1/;tx;s/\n$//;/start/{P;D};/^end[^\n]*/{s/\n$/$/;N;bx};/start/P;/end/P" 
end ======== id: 31117
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5713
start ======== id: 5911
start ======== id: 6111

Indeed. Too many restrictions. — K-attila-, Sep 23 '21 at 07:29
I repaired it; it works now and in the correct order. — K-attila-, Sep 24 '21 at 09:20
