2

I want to perform what some data analysis software calls an anti-join: remove from one list the lines that match lines in another list. Here is some toy data and the expected output:

$ echo -e "a\nb\nc\nd" > list1
$ echo -e "c\nd\ne\nf" > list2
$ antijoincommand list1 list2
a
b
terdon
Josh
  • 1
    Relating https://unix.stackexchange.com/q/11343/117549 – Jeff Schaller May 24 '20 at 15:30
  • 2
    Does this answer your question? [Is there a tool to get the lines in one file that are not in another?](https://unix.stackexchange.com/questions/28158/is-there-a-tool-to-get-the-lines-in-one-file-that-are-not-in-another) – muru May 25 '20 at 07:16
  • @Muru, yes, that post provides the solutions presented in Terdon's answer. However, when I was searching for "bash anti-join" (the terminology I associate with this kind of process), I didn't find anything useful. My OP (which others have edited) stated that my explicit purpose in asking this question was to associate the term "anti-join" with the solutions, so that searching this term yields these solutions. Thanks. – Josh May 25 '20 at 15:08

3 Answers

9

I wouldn't use join for this because join requires input to be sorted, which is an unnecessary complication for such a simple job. You could instead use grep:

$ grep -vxFf list2 list1
a
b
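
Each flag does a specific job here: `-f list2` reads the patterns from list2, `-F` treats them as fixed strings rather than regular expressions, `-x` requires a whole-line match, and `-v` inverts the selection. As a small sketch of why `-x` matters, the following uses two hypothetical files, `demo1` and `demo2`, which are not part of the question and are chosen so that one pattern is a substring of a line:

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1

# Hypothetical demo data: the pattern "c" is a substring of the line "abc".
printf '%s\n' abc b c > demo1
printf '%s\n' c > demo2

# Without -x, "c" also matches inside "abc", so that line is dropped as well.
without_x=$(grep -vFf demo2 demo1)

# With -x, only lines that are exactly "c" are removed.
with_x=$(grep -vxFf demo2 demo1)

printf 'without -x:\n%s\n' "$without_x"
printf 'with -x:\n%s\n' "$with_x"
```

Dropping `-F` would cause a similar surprise whenever list2 contains regular-expression metacharacters such as `.` or `*`.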

Or awk:

$ awk 'NR==FNR{++a[$0]} !a[$0]' list2 list1
a
b

If the files are already sorted, an alternative to join -v 1 would be comm -23:

$ comm -23 list1 list2
a
b
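
When the main file has several columns, the awk approach can key on a single field instead of the whole line. Here is a minimal sketch, assuming hypothetical whitespace-separated files `data` (key in column 2) and `keys` (key in column 1). The file names, contents and column positions are illustrative only, and the script is a variant of the one-liner above that uses `next` and `in`:

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1

# Hypothetical inputs: data is keyed on column 2, keys on column 1.
printf '%s\n' 'x1 a 10' 'x2 b 20' 'x3 c 30' 'x4 d 40' > data
printf '%s\n' 'c foo' 'd bar' > keys

# While reading the first file (keys), NR==FNR holds: remember column 1 and move on.
# While reading the second file (data), print lines whose column 2 was never seen.
result=$(awk 'NR==FNR { seen[$1]; next } !($2 in seen)' keys data)
printf '%s\n' "$result"
```

One caveat with this idiom: if `keys` is empty, `NR==FNR` stays true while `data` is being read, so nothing is printed.
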
terdon
  • Avoiding `sort` with `grep` is great for the toy data I provided. Thanks! In the real world, my file1 often has multiple columns of data, one of which is being used for the join. A modified version of your `awk` code would address this use case. – Josh May 24 '20 at 13:46
  • 1
    @Josh yes, just replace the `$0` with `$N`, where `N` is the number of the field you are joining on. – terdon May 24 '20 at 13:47
  • 1
    This works even if the column numbers in file1 and file2 are different: like awk 'NR==FNR{++a[$2]} !a[$5]' list2 list1; quite usual for the tag file to be a different format to the main data. – Paul_Pedant May 24 '20 at 14:14
  • upvoted for the `comm -23` command – user2297550 Jan 30 '22 at 08:17
3

One way to do this with the join utility is:

$ join -v 1 list1 list2
a
b

From the manpage:

-a FILENUM

: also print unpairable lines from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2

-v FILENUM

: like -a FILENUM, but suppress joined output lines
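
Because join expects both inputs to be sorted on the join field, unsorted lists need a sort pass first. Here is a sketch under that assumption, using hypothetical unsorted files `u1` and `u2` and made-up names for the sorted intermediates:

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1

# Hypothetical unsorted inputs.
printf '%s\n' d a c b > u1
printf '%s\n' f c e d > u2

# Sort both inputs under the same collation that join will use.
LC_ALL=C sort u1 > u1.sorted
LC_ALL=C sort u2 > u2.sorted

# -v 1 prints only the lines of the first file that have no match in the second.
result=$(LC_ALL=C join -v 1 u1.sorted u2.sorted)
printf '%s\n' "$result"
```

Note that the output comes back in sorted order, not in the original order of `u1`.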

Geremia
Josh
0

Using Raku (formerly known as Perl 6)

Raku has Set object types, and you can read individual files to create Sets from lines:

~$ raku -e 'my $a = Set.new: "list1".IO.lines; 
            my $b = Set.new: "list2".IO.lines; 
            say "list1 = ", $a;
            say "list2 = ", $b;'
list1 = Set(a b c d)
list2 = Set(c d e f)

You can compute asymmetric Set differences with either the ASCII infix (-) or the Unicode infix ∖:

~$ raku -e 'my $a = Set.new: "list1".IO.lines; 
            my $b = Set.new: "list2".IO.lines; 
            say $a (-) $b;'
Set(a b)
~$ raku -e 'my $a = Set.new: "list1".IO.lines; 
            my $b = Set.new: "list2".IO.lines; 
            say $b (-) $a;'
Set(e f)

OTOH, sometimes you need a symmetric Set difference, and Raku has you covered. Use either the ASCII infix (^) or the Unicode infix ⊖:

~$ raku -e 'my $a = Set.new: "list1".IO.lines; 
            my $b = Set.new: "list2".IO.lines; 
            say $a (^) $b;'
Set(a b e f)

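Finally, you can get linewise output by changing the final line to .keys.put for … (one key per line).
The final example below shows the symmetric Set difference again, this time with the Unicode infix operator. Sets are unordered, so the order of the lines may vary: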

~$ raku -e 'my $a = Set.new: "list1".IO.lines;
            my $b = Set.new: "list2".IO.lines;
            .keys.put for $a ⊖ $b;'
f
e
a
b

https://docs.raku.org/type/Set
https://docs.raku.org/language/setbagmix#Operators_with_set_semantics
https://raku.org

jubilatious1