
I have a list of IDs (sorted) in two files and I ran the comm command to compare them, but it seems to miss some lines that are common to both files. Why is that?

File1:

1
2
3
4
5
6
7
8
9
11
12
13
15
16
17
18
19
20
21
22

File2:

16
18
21
23
705
707
709
711
712
826
827
839
846
847
848
872
873
874
875
891

Comm output: $> comm file1 file2

1
    16  //exists in both files
    18  //exists in both files
2
    21
    23
3
4
5
6
7
    705
    707
    709
    711
    712
8
    826
    827
    839
    846
    847
    848
    872
    873
    874
    875
    891
9
11
12
13
15
16 //it's here!
17 
18 //...and here!
19
20
21
22

The files are both sorted. However, my guess is that comm doesn't do numeric comparison and only looks at entries lexicographically? If so, what are some alternatives that I can try for this?

Gilles 'SO- stop being evil'
PhD

1 Answer


comm should tell you that one of the files isn’t sorted:

comm: file 1 is not in sorted order

It expects the files to be sorted using the current locale’s collation order (as determined by LC_COLLATE); it won’t accept numerical order.
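The difference between the two orderings can be checked directly. The following is a minimal sketch, assuming a small hypothetical sample of IDs in a temporary file (the file name `ids` and the variable names are illustrative only); it pins `LC_ALL=C` so the collation order is plain byte order and the result is reproducible:

```shell
# Hypothetical sample: a few IDs in numeric order, as in the question.
tmpdir=$(mktemp -d)
printf '%s\n' 9 11 16 18 > "$tmpdir/ids"

# sort -c only checks the order; it exits non-zero if the input is unsorted.
# Under byte-order collation "9" sorts after "11", so this check fails.
if LC_ALL=C sort -c "$tmpdir/ids" 2>/dev/null; then
    lexical_check=sorted
else
    lexical_check=unsorted
fi

# The same check with -n (numeric comparison) succeeds.
if sort -n -c "$tmpdir/ids" 2>/dev/null; then
    numeric_check=sorted
else
    numeric_check=unsorted
fi

# The order comm expects to see, joined onto one line for display.
lexical_order=$(LC_ALL=C sort "$tmpdir/ids" | paste -s -d ' ' -)

echo "lexical check: $lexical_check"
echo "numeric check: $numeric_check"
echo "lexical order: $lexical_order"

rm -rf "$tmpdir"
```

In other locales the exact collation rules may differ, but digits strings are still compared as text rather than as numbers.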

To compare the files, you can pre-sort them (lexicographically as you point out):

comm <(sort file1) <(sort file2)

If you want the result to be sorted numerically, sort it again:

comm <(sort file1) <(sort file2) | sort -n

This produces

1
2
3
4
5
6
7
8
9
11
12
13
15
        16
17
        18
19
20
        21
22
    23
    705
    707
    709
    711
    712
    826
    827
    839
    846
    847
    848
    872
    873
    874
    875
    891
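If only the IDs common to both files are of interest, comm's `-1` and `-2` options suppress the first two columns. A minimal sketch, assuming shortened copies of the two files from the question; it writes the sorted copies to temporary files instead of using `<(...)` so that it also runs in shells without process substitution, and it pins `LC_ALL=C` for both sort and comm so they agree on the collation order:

```shell
tmpdir=$(mktemp -d)

# Shortened sample data based on file1 and file2 from the question.
printf '%s\n' 1 2 3 9 11 15 16 17 18 19 20 21 22 > "$tmpdir/file1"
printf '%s\n' 16 18 21 23 705 891 > "$tmpdir/file2"

# comm needs its inputs in collation order, not numeric order.
LC_ALL=C sort "$tmpdir/file1" > "$tmpdir/file1.sorted"
LC_ALL=C sort "$tmpdir/file2" > "$tmpdir/file2.sorted"

# -12 drops the lines unique to file1 and to file2, leaving only the
# common lines; sort -n then restores numeric order for reading.
common=$(LC_ALL=C comm -12 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted" | sort -n)

printf '%s\n' "$common"

rm -rf "$tmpdir"
```

With this sample data the common IDs are 16, 18 and 21, matching the third column of the output above.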
Stephen Kitt
  • The files ARE in sorted order...have a look at the snippets. – PhD Apr 13 '17 at 21:17
  • As far as `comm` is concerned, `file1` *isn’t* sorted: it expects files to be sorted in lexicographical order, not in numerical order. – Stephen Kitt Apr 13 '17 at 21:21
  • I *had* run the commands I gave you, as I always do. I’ve added the output. Note you missed 21 which is also common to both files. – Stephen Kitt Apr 13 '17 at 21:24
  • I see - so even if they are sorted numerically, comm needs them in lexicographical order to do its job correctly. Interesting nuance :) – PhD Apr 13 '17 at 21:25
  • Yes, as I mentioned in my comment above; I’ll add that to my answer, for clarity. – Stephen Kitt Apr 13 '17 at 21:26