
I have two files, fileA and fileB.

fileA has a lot of IPs and fileB has fewer IPs. How can we do

fileA - fileB = fileC (a file without the common IPs)

fileA

1.1.1.1
2.2.2.2
3.3.3.3
4.4.4.4
5.5.5.5

fileB

4.4.4.4
1.1.1.1

fileC

2.2.2.2
3.3.3.3
5.5.5.5

I found a lot of options on Google but couldn't get anything relevant.

ph3ro
    You should have included in your example an IP in fileB that doesn't exist in fileA so we could see if that should appear in the output or not. Everyones making different assumptions... – Ed Morton Sep 02 '22 at 13:40
  • Do you want the **intersection** or actually the **complement** ? Please read the descriptions (definitions) in https://catonmat.net/set-operations-in-unix-shell – QuartzCristal Sep 02 '22 at 23:55

4 Answers

8

The comm tool may be useful here, particularly if you don't care that the results are sorted in alphanumerical order:

comm -23 <( sort -u fileA ) <( sort -u fileB ) >fileC

See man sort and man comm for reference details of their use.
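With the sample data from the question, that pipeline looks like this (run under bash, since the `<( … )` process substitution is not available in plain sh):

```shell
# Recreate the sample files from the question.
printf '%s\n' 1.1.1.1 2.2.2.2 3.3.3.3 4.4.4.4 5.5.5.5 > fileA
printf '%s\n' 4.4.4.4 1.1.1.1 > fileB

# -2 suppresses lines unique to fileB, -3 suppresses lines common
# to both files, so only the lines unique to fileA remain.
comm -23 <(sort -u fileA) <(sort -u fileB) > fileC

cat fileC
# 2.2.2.2
# 3.3.3.3
# 5.5.5.5
```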

roaima
  • That is assuming that you want the complement of fileA (subtract fileB from fileA, `fileA - fileB`). If what is wanted is the *Symmetric Difference* (elements that are in set1 or in set2 but not both) you could use `sort <(sort -u fileA) <(sort -u fileB) | uniq -u` (or `comm -3`). – QuartzCristal Sep 03 '22 at 00:53
  • 1
    @QuartzCristal yes. I read the question as wanting to use FileB to eliminate IP addresses from FileA. – roaima Sep 03 '22 at 07:24
  • Yes! I agree, that is my perception as well. But then I read a sentence like *(File without common IPs)* and I was left wondering whether what is meant is to eliminate repeated IPs anywhere they appear. Well, never mind, there is nothing you should do with your answer, at least not until the OP clarifies what he means to say. – QuartzCristal Sep 03 '22 at 08:04
3

To do fileA - fileB, you can use awk (note that this will not output IPs that appear only in fileB):

awk 'NR==FNR{a[$0];next}!($0 in a)' fileB fileA

NR refers to the total record number across all input files and FNR refers to the record number (typically the line number) in the current file. So if a line exists in the first file (fileB), it won't be printed when reading the second (fileA).


If you also need to get rid of duplicate lines in fileA, use:

awk 'NR==FNR{a[$0]++;next}!a[$0]++' fileB fileA
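With the sample data from the question, the first command produces the expected fileC, and unlike comm-based approaches it preserves the original order of fileA:

```shell
# Recreate the sample files from the question.
printf '%s\n' 1.1.1.1 2.2.2.2 3.3.3.3 4.4.4.4 5.5.5.5 > fileA
printf '%s\n' 4.4.4.4 1.1.1.1 > fileB

# First pass (fileB): store each line as a key of array a.
# Second pass (fileA): print only the lines that are not keys of a.
awk 'NR==FNR{a[$0];next}!($0 in a)' fileB fileA > fileC

cat fileC
# 2.2.2.2
# 3.3.3.3
# 5.5.5.5
```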
Marius_Couet
3

The question can be interpreted in a couple of different ways.

I will assume that each line is unique in the file it occurs in.

  • Assuming you want to remove the entries from fileA that are also found in fileB.

    This removes the IP addresses found in fileB from the ones in fileA:

    grep -v -Fx -f fileB fileA >fileC
    

    The options used with grep here ensure that the patterns (lines read from fileB using -f) are treated as fixed strings rather than as regular expressions (-F), and that we match whole lines rather than substrings (-x). We also invert the sense of the match with -v to output all lines from fileA that do not match any of the lines in fileB.

  • Assuming you want to get all entries that are unique to fileA or that are unique to fileB:

    The following outputs the lines that are not duplicated across the files. It uses -u with uniq to output only the lines that are not repeated in the sorted, combined input.

    sort fileA fileB | uniq -u >fileC
    
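Both commands can be checked against the sample data from the question. With those particular inputs, where every line of fileB also occurs in fileA, the two interpretations happen to give the same result:

```shell
# Recreate the sample files from the question.
printf '%s\n' 1.1.1.1 2.2.2.2 3.3.3.3 4.4.4.4 5.5.5.5 > fileA
printf '%s\n' 4.4.4.4 1.1.1.1 > fileB

# Interpretation 1: remove fileB's lines from fileA.
grep -v -Fx -f fileB fileA
# 2.2.2.2
# 3.3.3.3
# 5.5.5.5

# Interpretation 2: lines unique to either file.
sort fileA fileB | uniq -u
# 2.2.2.2
# 3.3.3.3
# 5.5.5.5
```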
Kusalananda
  • `sort fileA fileB fileB | uniq -u` Need to duplicate the common IPs. Edit the answer pls. – K-attila- Sep 02 '22 at 10:42
  • @K-att- That is correct, but only if fileB contains values that don't exist in fileA. That is not what the OP supplied as examples, but yes, `sort fileA fileB fileB | uniq -u` is the correct solution. – QuartzCristal Sep 02 '22 at 11:59
  • @K-att- My intention was to list the entries that were not repeated between the files. If I use `fileB` twice, then I would get no entry from that file. If `fileB` contains an entry that is not in `fileA`, then I was assuming that the user wanted to see it. This is what I meant by there being different interpretations of the problem. – Kusalananda Sep 02 '22 at 12:04
  • fileA - fileB = fileC (File without common IPs) – K-attila- Sep 02 '22 at 12:12
  • @K-att- Exactly. If you use `fileB` twice in the call to `sort`, then you will get _no_ elements from `fileB`, even if the file contains elements that are not common to both files. – Kusalananda Sep 02 '22 at 12:14
  • @Kusalananda You are interpreting the question as getting the intersection of both files and printing what is not common to both of them. But the procedure you decided to use (with `sort`) requires that there are no repeated values in `fileA`. If you add a value `6.6.6.6` twice to `fileA`, it will not be printed in the output. – QuartzCristal Sep 02 '22 at 23:41
  • @Kusalananda An alternative interpretation of `fileA-fileB` is that of subtraction (removal of values in `fileB` from `fileA`). That is what you did with the `grep` example and the `sort` command is **not** an equivalent. – QuartzCristal Sep 02 '22 at 23:41
  • @Kusalananda The description in https://catonmat.net/set-operations-in-unix-shell might be of help. And the solutions already exist in https://unix.stackexchange.com/questions/11343/linux-tools-to-treat-files-as-sets-and-perform-set-operations-on-them – QuartzCristal Sep 02 '22 at 23:57
  • 1
    @QuartzCristal Yes, I assume that each file contains lines that are unique within that file. This is an assumption that is already part of my answer. I also say that there are different ways of interpreting the question, then I give two different answers, together with two different interpretation. This is also part of the text already. – Kusalananda Sep 03 '22 at 00:13
  • Yes, that is true. But plain rejection is not better than trying to include some/any of the helpful points: (1) you could note that your assumption that each file is a set (no repeated lines) could be enforced with `<(sort -u fileA)`. (2) That the equivalent of your grep solution using `sort` is `sort fileA fileB fileB | uniq -u`. (3) That a seemingly cleaner equivalent of `sort fileA fileB | uniq -u` which enforces the assumptions is `comm -13 <(sort -u fileB) <(sort -u fileA)`. (4) And that there is another existing Unix answer that might be of help. – QuartzCristal Sep 03 '22 at 00:42
1

If preservation of order is not important, you can first remove duplicates in the first file, concatenate the output with the second file twice (to remove anything unique to it), and then print only the non-duplicated lines with uniq -u.

sort -u fileA | cat - fileB fileB | sort | uniq -u
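With the sample files from the question, the pipeline yields the expected result:

```shell
# Recreate the sample files from the question.
printf '%s\n' 1.1.1.1 2.2.2.2 3.3.3.3 4.4.4.4 5.5.5.5 > fileA
printf '%s\n' 4.4.4.4 1.1.1.1 > fileB

# fileB is concatenated twice, so any line that occurs in it
# appears at least twice in the combined stream and is therefore
# dropped by uniq -u.
sort -u fileA | cat - fileB fileB | sort | uniq -u
# 2.2.2.2
# 3.3.3.3
# 5.5.5.5
```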