
If I have two files (each with a single column), one like so (file1)

34
67
89
92
102
180
blue2
3454

And the second file (file2)

23
56
67
69
102
200

How do I find elements that are common in both files (intersection)? The expected output in this example is

67
102

Note that the number of items (lines) in each file differs. Numbers and strings may be mixed, and the files are not necessarily sorted. Each item appears only once per file.
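A quick way to sanity-check the expected output is to recreate the two sample files and intersect them with `grep` (`-F` treats the patterns as fixed strings, `-x` matches whole lines only; this is one of the approaches covered in the answers below):

```shell
# Recreate the sample data from the question
printf '%s\n' 34 67 89 92 102 180 blue2 3454 > file1
printf '%s\n' 23 56 67 69 102 200 > file2

# Print the lines of file2 that also occur (as whole lines) in file1
grep -Fxf file1 file2
# prints:
# 67
# 102
```

The output keeps file2's order, since grep scans file2 line by line.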

UPDATE:

Timing comparison based on some of the answers below.

# generate some data
>shuf -n2000000 -i1-2352452 > file1
>shuf -n2000000 -i1-2352452 > file2

#@ilkkachu
>time (join <(sort "file1") <(sort "file2") > out1)
real    0m15.391s
user    0m14.896s
sys     0m0.205s

>head out1
1
10
100
1000
1000001

#@Hauke
>time (grep -Fxf "file1" "file2" > out2)
real    0m7.652s
user    0m7.131s
sys     0m0.316s

>head out2
1047867
872652
1370463
189072
1807745

#@Roman
>time (comm -12 <(sort "file1") <(sort "file2") > out3)
real    0m13.533s
user    0m13.140s
sys     0m0.195s

>head out3
1
10
100
1000
1000001

#@ilkkachu
>time (awk 'NR==FNR { lines[$0]=1; next } $0 in lines' "file1" "file2" > out4)
real    0m4.587s
user    0m4.262s
sys     0m0.195s

>head out4
1047867
872652
1370463
189072
1807745

#@Cyrus   
>time (sort file1 file2 | uniq -d > out8)
real    0m16.106s
user    0m15.629s
sys     0m0.225s

>head out8
1
10
100
1000
1000001


#@Sundeep
>time (awk 'BEGIN{while( (getline k < "file1")>0 ){a[k]}} $0 in a' file2 > out5)
real    0m4.213s
user    0m3.936s
sys     0m0.179s

>head out5
1047867
872652
1370463
189072
1807745

#@Sundeep
>time (perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <file1 file2 > out6)
real    0m3.467s
user    0m3.180s
sys     0m0.175s

>head out6
1047867
872652
1370463
189072
1807745

The perl version was the fastest, followed by awk. All output files had the same number of rows.

For the sake of comparison, I have sorted the outputs numerically so that they are identical.

#@ilkkachu
>time (join <(sort "file1") <(sort "file2") | sort -k1n > out1)
real    0m17.953s
user    0m5.306s
sys     0m0.138s

#@Hauke
>time (grep -Fxf "file1" "file2" | sort -k1n > out2)
real    0m12.477s
user    0m11.725s
sys     0m0.419s

#@Roman
>time (comm -12 <(sort "file1") <(sort "file2") | sort -k1n > out3)
real    0m16.273s
user    0m3.572s
sys     0m0.102s

#@ilkkachu
>time (awk 'NR==FNR { lines[$0]=1; next } $0 in lines' "file1" "file2" | sort -k1n > out4)
real    0m8.732s
user    0m8.320s
sys     0m0.261s

#@Cyrus   
>time (sort file1 file2 | uniq -d > out8)
real    0m19.382s
user    0m18.726s
sys     0m0.295s

#@Sundeep
>time (awk 'BEGIN{while( (getline k < "file1")>0 ){a[k]}} $0 in a' file2 | sort -k1n > out5)
real    0m8.758s
user    0m8.315s
sys     0m0.255s

#@Sundeep
>time (perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <file1 file2 | sort -k1n > out6)
real    0m7.732s
user    0m7.300s
sys     0m0.310s

>head out1
1
2
3
4
5

All outputs are now identical.

mindlessgreen
  • Regarding your timing results: You should also time the `join` running on pre-sorted files. – Kusalananda Jan 20 '18 at 11:39
  • True. They are not exactly comparable since one sorts and the other doesn't. I need to be going. I will do that when I am back. – mindlessgreen Jan 20 '18 at 11:46
  • In your performance comparison you leave out the one answer which has been written with performance in mind? – Hauke Laging Jan 20 '18 at 11:50
  • @Hauke I realised that after I ran the test. I will redo it with changes a bit later. – mindlessgreen Jan 20 '18 at 12:20
  • Updated time tests. @Hauke Your awk solution took too long to complete; I cancelled it after 8 minutes. Perhaps there is something wrong somewhere, so I didn't include it in the updated timings. – mindlessgreen Jan 20 '18 at 17:43
  • I made [another benchmark](https://transang.me/files-intersection-benchmark/) of all the methods with small changes in the conditions, and the results are also different: `comm` is the fastest, while `join` gives incorrect results due to locale settings, `grep` runs forever, etc. – Sang Nov 14 '19 at 14:27

5 Answers


Simple comm + sort solution:

comm -12 <(sort file1) <(sort file2)
  • -12: suppress columns 1 and 2 (lines unique to FILE1 and FILE2, respectively), thus outputting only the lines that appear in both files
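On the question's sample data (recreated here with `printf`), this looks like the following; note that the `<(…)` process substitution requires bash or a similar shell:

```shell
# Recreate the sample data from the question
printf '%s\n' 34 67 89 92 102 180 blue2 3454 > file1
printf '%s\n' 23 56 67 69 102 200 > file2

# comm expects sorted input, so sort both files on the fly
comm -12 <(sort file1) <(sort file2)
# prints (in lexical sort order):
# 102
# 67
```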
RomanPerekhrest
  • My first thought, too. Given that `comm` is _the_ purpose-built tool for intersection, I find it quite annoying that the `awk` solution is twice as fast! – bishop Jan 21 '18 at 04:08
  • @bishop Interestingly, the `join` command takes 75% of the time that `comm` takes, although it does not perform exactly the same task. – Michael Goldshteyn Nov 05 '20 at 14:53

In awk, this loads the first file fully in memory:

$ awk 'NR==FNR { lines[$0]=1; next } $0 in lines' file1 file2 
67
102

Or, if you want to keep track of how many times a given line appears:

$ awk 'NR==FNR { lines[$0] += 1; next } lines[$0] {print; lines[$0] -= 1}' file1 file2

join can also do this, but it requires the input files to be sorted, so you need to sort them first, and sorting loses the original ordering:

$ join <(sort file1) <(sort file2)
102
67
ilkkachu

awk

awk 'NR==FNR { p[NR]=$0; next; }
   { for(val in p) if($0==p[val]) { delete p[val]; print; } }' file1 file2

This should be the fastest solution for large files because it avoids both printing the same entry more than once and checking an entry again after it has been matched.

grep

grep -Fxf file1 file2

This would output the same entry several times if it occurs more than once in file2.
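If that is a concern, one option (my addition, not part of the original answer) is to drop repeats while preserving file2's order, using the common `awk '!seen[$0]++'` idiom:

```shell
# Illustrative data (not from the question): "b" is duplicated in file2
printf '%s\n' a b c > file1
printf '%s\n' b b x a > file2

# Without the awk filter, "b" would be printed twice
grep -Fxf file1 file2 | awk '!seen[$0]++'
# prints:
# b
# a
```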

sort

For fun (should be much slower than grep):

sort -u file1 >t1
sort -u file2 >t2
sort t1 t2 | uniq -d
Hauke Laging

With GNU uniq:

sort file1 file2 | uniq -d

Output:

102
67
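Note that this relies on the question's guarantee that each item appears only once per file. A small sketch (my own illustrative data) of what goes wrong when a file contains an internal duplicate, and the `sort -u` fix:

```shell
# "a" appears twice in f1 but never in f2 (hypothetical data)
printf '%s\n' a a b > f1
printf '%s\n' c d > f2

sort f1 f2 | uniq -d            # wrongly prints "a"

# Deduplicate each file first, so a repeat within one file cannot fool uniq -d
sort -u f1 > s1
sort -u f2 > s2
sort s1 s2 | uniq -d            # prints nothing
```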
Cyrus

A slightly different awk version, and an equivalent perl version.

Time reported for three consecutive runs:

$ # just realized shuf -n2000000 -i1-2352452 can be used too ;)
$ shuf -i1-2352452 | head -n2000000 > f1
$ shuf -i1-2352452 | head -n2000000 > f2

$ time awk 'NR==FNR{a[$1]; next} $0 in a' f1 f2 > t1
real    0m3.322s
real    0m3.094s
real    0m3.029s

$ time awk 'BEGIN{while( (getline k < "f1")>0 ){a[k]}} $0 in a' f2 > t2
real    0m2.731s
real    0m2.777s
real    0m2.801s

$ time perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <f1 f2 > t3
real    0m2.643s
real    0m2.690s
real    0m2.630s

$ diff -s t1 t2
Files t1 and t2 are identical
$ diff -s t1 t3
Files t1 and t3 are identical

$ du -h f1 f2 t1
15M f1
15M f2
13M t1
Sundeep