
If I have two files (each with a single column), one like so (file1)

34
67
89
92
102
180
blue2
3454

And the second file (file2)

23
56
67
69
102
200

How do I find elements that are common in both files (intersection)? The expected output in this example is

67
102

Note that the number of items (lines) in each file differs. Numbers and strings may be mixed, and the files are not necessarily sorted. Each item appears only once per file.
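A quick way to sanity-check the expected output is to recreate the two sample files and intersect them with `grep` (`-F` treats the patterns as fixed strings, `-x` matches whole lines only; this is one of the approaches covered in the answers below):

```shell
# Recreate the sample data from the question
printf '%s\n' 34 67 89 92 102 180 blue2 3454 > file1
printf '%s\n' 23 56 67 69 102 200 > file2

# Print the lines of file2 that also occur (as whole lines) in file1
grep -Fxf file1 file2
# prints:
# 67
# 102
```

The output keeps file2's order, since grep scans file2 line by line.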

UPDATE:

Timing comparison based on some of the answers below.

# generate some data
>shuf -n2000000 -i1-2352452 > file1
>shuf -n2000000 -i1-2352452 > file2

#@ilkkachu
>time (join <(sort "file1") <(sort "file2") > out1)
real    0m15.391s
user    0m14.896s
sys     0m0.205s

>head out1
1
10
100
1000
1000001

#@Hauke
>time (grep -Fxf "file1" "file2" > out2)
real    0m7.652s
user    0m7.131s
sys     0m0.316s

>head out2
1047867
872652
1370463
189072
1807745

#@Roman
>time (comm -12 <(sort "file1") <(sort "file2") > out3)
real    0m13.533s
user    0m13.140s
sys     0m0.195s

>head out3
1
10
100
1000
1000001

#@ilkkachu
>time (awk 'NR==FNR { lines[$0]=1; next } $0 in lines' "file1" "file2" > out4)
real    0m4.587s
user    0m4.262s
sys     0m0.195s

>head out4
1047867
872652
1370463
189072
1807745

#@Cyrus   
>time (sort file1 file2 | uniq -d > out8)
real    0m16.106s
user    0m15.629s
sys     0m0.225s

>head out8
1
10
100
1000
1000001


#@Sundeep
>time (awk 'BEGIN{while( (getline k < "file1")>0 ){a[k]}} $0 in a' file2 > out5)
real    0m4.213s
user    0m3.936s
sys     0m0.179s

>head out5
1047867
872652
1370463
189072
1807745

#@Sundeep
>time (perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <file1 file2 > out6)
real    0m3.467s
user    0m3.180s
sys     0m0.175s

>head out6
1047867
872652
1370463
189072
1807745

The perl version was the fastest, followed by awk. All output files had the same number of rows.

For the sake of comparison, I have sorted the outputs numerically so that they are identical.

#@ilkkachu
>time (join <(sort "file1") <(sort "file2") | sort -k1n > out1)
real    0m17.953s
user    0m5.306s
sys     0m0.138s

#@Hauke
>time (grep -Fxf "file1" "file2" | sort -k1n > out2)
real    0m12.477s
user    0m11.725s
sys     0m0.419s

#@Roman
>time (comm -12 <(sort "file1") <(sort "file2") | sort -k1n > out3)
real    0m16.273s
user    0m3.572s
sys     0m0.102s

#@ilkkachu
>time (awk 'NR==FNR { lines[$0]=1; next } $0 in lines' "file1" "file2" | sort -k1n > out4)
real    0m8.732s
user    0m8.320s
sys     0m0.261s

#@Cyrus   
>time (sort file1 file2 | uniq -d > out8)
real    0m19.382s
user    0m18.726s
sys     0m0.295s

#@Sundeep
>time (awk 'BEGIN{while( (getline k < "file1")>0 ){a[k]}} $0 in a' file2 | sort -k1n > out5)
real    0m8.758s
user    0m8.315s
sys     0m0.255s

#@Sundeep
>time (perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <file1 file2 | sort -k1n > out6)
real    0m7.732s
user    0m7.300s
sys     0m0.310s

>head out1
1
2
3
4
5

All outputs are now identical.

mindlessgreen
  • Regarding your timing results: You should also time the `join` running on pre-sorted files. – Kusalananda Jan 20 '18 at 11:39
  • True. They are not exactly comparable since one sorts and the other doesn't. I need to be going. I will do that when I am back. – mindlessgreen Jan 20 '18 at 11:46
  • In your performance comparison you leave out the one answer which has been written with performance in mind? – Hauke Laging Jan 20 '18 at 11:50
  • @Hauke I realised that after I ran the test. I will redo it with changes a bit later. – mindlessgreen Jan 20 '18 at 12:20
  • Updated time tests. @Hauke Your awk solution took too long to complete; I cancelled it after 8 minutes. Perhaps there is something wrong somewhere, so I didn't include it in the updated timings. – mindlessgreen Jan 20 '18 at 17:43
  • I made [another benchmark](https://transang.me/files-intersection-benchmark/) of all the methods with small changes in the conditions, and the results are also different: `comm` is the fastest, while `join` gives incorrect results due to locale settings, `grep` runs forever, etc. – Sang Nov 14 '19 at 14:27

5 Answers


Simple comm + sort solution:

comm -12 <(sort file1) <(sort file2)
  • -12: suppress columns 1 and 2 (lines unique to FILE1 and FILE2, respectively), thus outputting only the lines that appear in both files
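On the question's sample data (recreated here with `printf`), this looks like the following; note that the `<(…)` process substitution requires bash or a similar shell:

```shell
# Recreate the sample data from the question
printf '%s\n' 34 67 89 92 102 180 blue2 3454 > file1
printf '%s\n' 23 56 67 69 102 200 > file2

# comm expects sorted input, so sort both files on the fly
comm -12 <(sort file1) <(sort file2)
# prints (in lexical sort order):
# 102
# 67
```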
RomanPerekhrest
  • My first thought, too. Given that `comm` is _the_ purpose-built tool for intersection, I find it quite annoying that the `awk` solution is twice as fast! – bishop Jan 21 '18 at 04:08
  • @bishop Interestingly, the `join` command takes 75% of the time that `comm` takes, although it does not perform exactly the same task. – Michael Goldshteyn Nov 05 '20 at 14:53

In awk, this loads the first file fully in memory:

$ awk 'NR==FNR { lines[$0]=1; next } $0 in lines' file1 file2 
67
102

Or, if you want to keep track of how many times a given line appears:

$ awk 'NR==FNR { lines[$0] += 1; next } lines[$0] {print; lines[$0] -= 1}' file1 file2

join can also do this, but it requires the input files to be sorted, so you need to sort them first, and sorting loses the original ordering:

$ join <(sort file1) <(sort file2)
102
67
ilkkachu

awk

awk 'NR==FNR { p[NR]=$0; next; }
   { for(val in p) if($0==p[val]) { delete p[val]; print; } }' file1 file2

This should be the fastest solution for large files because it avoids both printing the same entry more than once and checking an entry again after it has been matched.

grep

grep -Fxf file1 file2

This would output the same entry several times if it occurs more than once in file2.
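If that is a concern, one option (my addition, not part of the original answer) is to drop repeats while preserving file2's order, using the common `awk '!seen[$0]++'` idiom:

```shell
# Illustrative data (not from the question): "b" is duplicated in file2
printf '%s\n' a b c > file1
printf '%s\n' b b x a > file2

# Without the awk filter, "b" would be printed twice
grep -Fxf file1 file2 | awk '!seen[$0]++'
# prints:
# b
# a
```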

sort

For fun (should be much slower than grep):

sort -u file1 >t1
sort -u file2 >t2
sort t1 t2 | uniq -d
Hauke Laging

With GNU uniq:

sort file1 file2 | uniq -d

Output:

102
67
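Note that this relies on the question's guarantee that each item appears only once per file. A small sketch (my own illustrative data) of what goes wrong when a file contains an internal duplicate, and the `sort -u` fix:

```shell
# "a" appears twice in f1 but never in f2 (hypothetical data)
printf '%s\n' a a b > f1
printf '%s\n' c d > f2

sort f1 f2 | uniq -d            # wrongly prints "a"

# Deduplicate each file first, so a repeat within one file cannot fool uniq -d
sort -u f1 > s1
sort -u f2 > s2
sort s1 s2 | uniq -d            # prints nothing
```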
Cyrus

A slightly different awk version, and an equivalent perl version.

Time reported for three consecutive runs:

$ # just realized shuf -n2000000 -i1-2352452 can be used too ;)
$ shuf -i1-2352452 | head -n2000000 > f1
$ shuf -i1-2352452 | head -n2000000 > f2

$ time awk 'NR==FNR{a[$1]; next} $0 in a' f1 f2 > t1
real    0m3.322s
real    0m3.094s
real    0m3.029s

$ time awk 'BEGIN{while( (getline k < "f1")>0 ){a[k]}} $0 in a' f2 > t2
real    0m2.731s
real    0m2.777s
real    0m2.801s

$ time perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <f1 f2 > t3
real    0m2.643s
real    0m2.690s
real    0m2.630s

$ diff -s t1 t2
Files t1 and t2 are identical
$ diff -s t1 t3
Files t1 and t3 are identical

$ du -h f1 f2 t1
15M f1
15M f2
13M t1
Sundeep