
I am looking for comm's functionality for n, i.e. more than two, files.

man comm reads:

COMM(1)

NAME
       comm - compare two sorted files line by line

SYNOPSIS
       comm [OPTION]... FILE1 FILE2

DESCRIPTION
       Compare sorted files FILE1 and FILE2 line by line.

       With no options, produce three-column output.
       Column one contains lines unique to FILE1,
       column two contains lines unique to FILE2,
       and column three contains lines common to both files.

A first non-optimized and differently formatted approach in bash to illustrate the idea:

user@host MINGW64 dir
$ ls
abc  ac  ad  bca  bcd

user@host MINGW64 dir
$ tail -n +1 *
==> abc <==
a
b
c

==> ac <==
a
c

==> ad <==
a
d

==> bca <==
b
c
a

==> bcd <==
b
c
d

user@host MINGW64 dir
$ bat otherdir/ncomm.sh
───────┬───────────────────────────────────────────────────────────────────────
       │ File: otherdir/ncomm.sh
───────┼───────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env bash
   2   │ ALLENTRIES=$(sort -u "$@")
   3   │ echo "all $*" | tr " " "\t"
   4   │
   5   │ for entry in $ALLENTRIES; do
   6   │     >&2 echo -en "${entry}\t"
   7   │     for file in "$@"; do
   8   │         foundentry=$(grep "$entry" "$file")
   9   │         echo -en "${foundentry}\t"
  10   │     done
  11   │     echo -en "\n"
  12   │ done
───────┴───────────────────────────────────────────────────────────────────────

user@host MINGW64 dir
$ time otherdir/ncomm.sh *
all     abc     ac      ad      bca     bcd
a       a       a       a       a
b       b                       b       b
c       c       c               c       c
d                       d               d

real    0m12.921s
user    0m0.579s
sys     0m4.586s

user@host MINGW64 dir
$

This displays a row of column headers, a first column "all" (written to stderr) listing every entry found in any of the files, sorted, and then one column per file from the parameter list with that file's entries in the respective rows. Since grep is invoked once for each cell outside of the first column and first row, this is really slow.

As with comm, this output is only suitable for short lines/entries like IDs. A more concise version could output an x (or similar) for each entry found in columns 2+.
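That more concise variant could be sketched as follows. It keeps the same structure (and the same one-grep-per-cell cost) as the script above; `grep -Fxq` is used for exact whole-line matching, and the `ncomm_marks` function name and the inline sample data are just for illustration:

```shell
#!/usr/bin/env bash
# Sketch: like ncomm.sh, but print an "x" marker instead of the value.
# Still one grep invocation per cell, so just as slow as the original.
ncomm_marks() {
    local entry file
    echo "all $*" | tr ' ' '\t'            # header row
    for entry in $(sort -u "$@"); do       # first column: all entries, sorted
        printf '%s' "$entry"
        for file in "$@"; do
            # -F fixed string, -x whole line, -q exit status only
            if grep -Fxq -- "$entry" "$file"; then
                printf '\tx'
            else
                printf '\t'
            fi
        done
        printf '\n'
    done
}

# demo with two of the sample files from above
dir=$(mktemp -d)
printf 'a\nb\nc\n' > "$dir/abc"
printf 'a\nc\n'    > "$dir/ac"
cd "$dir"
ncomm_marks abc ac
```

For the abc/ac pair this prints the header row `all abc ac` and then rows `a`, `b`, `c`, with an `x` in each column whose file contains the entry.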

This should work on Git for Windows' MSYS2 and on RHEL.

How can this be achieved in a more performant manner?

muru
Julia
  • How do you expect the compare to behave when there are identical lines in the file? As in `abab`? `aaba`? Or `cdab`? Or can you expect all files to be "sorted" wrt. some kind of order (needn't be alphabetical)? (If they are sorted, an efficient algorithm is easy.) – dirkt Jul 29 '21 at 04:56
  • You will likely have to write your own program. As you include more files, the number of possible combinations increases geometrically, as does the processing, greatly impacting performance. – C. M. Jul 29 '21 at 09:17

2 Answers


meld (which, however, is a graphical program) can still manage three-way comparisons between files (i.e. n=3), but anything larger becomes computationally more and more complex, so I don't know whether a truly "generalized diff (or comm) tool" is feasible at all.

AdminBee

You could try the approach below. It has these characteristics:

  • Output follows exactly your example
  • Values get sorted during processing
    • => pre-sorting can be skipped
    • => original order is NOT preserved
  • Input file names get sorted.
  • Duplicate values are cleaned up and consolidated to only one occurrence (thereby also fixing a bug in your script, which shows strange behavior for duplicates)
  • requires a recent GNU Awk (gawk), as it uses gawk's built-in array-sorting feature (PROCINFO["sorted_in"])
  • tailored to UNIX line endings; mixing different line-ending styles will lead to strange effects (to the program, "a" and "a\r" are different things!)

Just save the code into a text file and give it execute permissions to use it as a drop-in replacement for your shell script. Your processing should gain a considerable speedup (actually, it is faster by several orders of magnitude). :)

#!/usr/bin/gawk -f
{
    all[$0]
    filenames[FILENAME]
    input[$0,FILENAME]=$0
    # to only mark existence instead of storing the value,
    # use the following assignment instead:
    # input[$0,FILENAME]="*"
}

END {
    PROCINFO["sorted_in"]="@ind_str_asc"
    printf "all"
    for (i in filenames) {
        printf("\t%s",i)
    }
    for (i in all) {
        printf("\n%s",i)
        for (j in filenames) {
            printf("\t%s",input[i,j])
        }
    }
    print ""
}
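Where a sufficiently recent gawk is not available, roughly the same single-pass idea can be sketched in portable POSIX awk by leaving the row ordering to sort(1). This is only a sketch under that assumption; the sample files are the ones from the question:

```shell
# recreate the question's sample files in a scratch directory
dir=$(mktemp -d); cd "$dir"
printf 'a\nb\nc\n'  > abc
printf 'a\nc\n'     > ac
printf 'a\nd\n'     > ad
printf 'b\nc\na\n'  > bca
printf 'b\nc\nd\n'  > bcd

# header printed separately, so sort(1) only ever sees the data rows
printf 'all\tabc\tac\tad\tbca\tbcd\n'
awk '
    { seen[$0]; cell[$0, FILENAME] = $0 }   # single pass over all files
    END {
        for (v in seen) {                   # iteration order is arbitrary here;
            row = v                         # the sort(1) below fixes it up
            for (f = 1; f < ARGC; f++)      # ARGV[1..ARGC-1] are the file names
                row = row "\t" cell[v, ARGV[f]]
            print row
        }
    }' abc ac ad bca bcd | sort
```

The output should match the table in the question (modulo trailing tabs). Like the gawk version, this does one pass over the input plus one sort, instead of one grep per cell.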
jf1