7

I'm trying to match and merge two sets of sorted data, one set per file. Each file contains two columns: the key field and the associated value. The resulting output should contain three columns: the key field, the value (if any) from the first file, and the value (if any) from the second file. I need to include lines of data that are not matched.

First file "john"

apple,green
cherry,red
orange,orange

Second file "jane"

apple,red
banana,yellow
cherry,yellow
kiwi,green

Desired result

apple,green,red
banana,,yellow
cherry,red,yellow
kiwi,,green
orange,orange,

I thought initially that this was a trivial job for join

LC_ALL=C join -j1 -a1 -a2 -t',' john jane

But the result of the -a1 -a2 puts the unmatched value always in the second column:

apple,green,red
banana,yellow
cherry,red,yellow
kiwi,green
orange,orange

I need to be able to see from which source file the unmatched value originates, ideally by having those values in the appropriate second or third column of the result file, but I cannot work out a simple way of achieving this without descending into awk ... getline() type constructs.

Any suggestions, please?

Wildcard
  • 35,316
  • 26
  • 130
  • 258
roaima
  • 107,089
  • 14
  • 139
  • 261

4 Answers4

10

You want -o auto:

join -t, -j 1 -a 1 -a 2 -o auto john jane

From man join:

-o FORMAT

    obey FORMAT while constructing output line
     ︙
If FORMAT is the keyword 'auto', then the first line of each file determines the number of fields output for each line.

Or better explained from GNU Coreutils: join invocation (follow the link into General options in join):

-o auto

If the keyword ‘auto’ is specified, infer the output format from the first line in each file. This is the same as the default output format but also ensures the same number of fields are output for each line. Missing fields are replaced with the -e option and extra fields are discarded.

% cat john 
apple,green
cherry,red
orange,orange
% cat jane 
apple,red
banana,yellow
cherry,yellow
kiwi,green
% join -t, -j 1 -a 1 -a 2 -o auto john jane
apple,green,red
banana,,yellow
cherry,red,yellow
kiwi,,green
orange,orange,
kos
  • 2,827
  • 1
  • 11
  • 19
  • On first (and second) reading I had still misunderstood the `FORMAT` construct, which states « _Default FORMAT outputs the join field, the remaining fields from FILE1, the remaining fields from FILE2, all separated by CHAR._ » without making it obvious that _the remaining fields_ might be empty if there was no match. Thank you for a working example (and a solution). – roaima Apr 05 '16 at 22:30
  • 1
    @roaima I completely agree `man join` is not clear there. Actually the only way that could be inferred is from the explanation of `-o 'auto'` at [GNU Coreutils: join invocation](http://www.gnu.org/software/coreutils/manual/html_node/join-invocation.html#join-invocation), where what happens when using `-o 'auto'` is compared to what happens when not specifying `-o FORMAT` at all. – kos Apr 05 '16 at 22:46
3

You could explicitly specify the output format

LC_ALL=C join -o0,1.2,2.2 -j1 -a1 -a2 -t',' john jane

which produces

apple,green,red
banana,,yellow
cherry,red,yellow
kiwi,,green
orange,orange,

The key thing here is that the join field can also be referenced within the output format using 0, which is quite useful in the context of unpairable lines

iruvar
  • 16,515
  • 8
  • 49
  • 81
  • In this particular case am I correct believing that your explicit format is the same as that generated by `-o auto`? Either way I'm pleased to have learned about the `-o` option but a little disappointed I can't accept your answer too. – roaima Apr 05 '16 at 23:14
  • @roaima, no worries. I learned something new today with -o auto. – iruvar Apr 05 '16 at 23:47
  • 1
    @roaima Specifying the exact file / field pairs is always the way to go in case the fields to print are something different from the joint field, the rest of the fields found in the first line of the first file and the rest of the fields found in the first line of the second file (which is your case, but it isn't always like that). This in general is way more flexible than `auto`; in this specific case it does the same as `auto` but explicitly. – kos Apr 05 '16 at 23:49
1

This command almost does it; it omits a trailing comma if the key only appears in file1. Don't have time to fully debug now:

awk -F, 'BEGIN{OFS=","} FNR==NR{val[$1]=$2;next} {val[$1]=val[$1] "," $2}END{for (key in val) {print key, val[key]}}' john jane

Output:

apple,green,red
banana,,yellow
cherry,red,yellow
kiwi,,green
orange,orange
Wildcard
  • 35,316
  • 26
  • 130
  • 258
  • I was hoping to avoid the "read it all and then process" approach, but thank you for the suggestion. – roaima Apr 05 '16 at 22:26
  • 1
    @roaima, I see now that your input files *appear* sorted, but you didn't actually specify that they are sorted before processing. I wasn't looking that closely at the example files themselves. :) But yes, assuming sorted input, it is better to do this without slurping. – Wildcard Apr 05 '16 at 22:58
  • Ah. That comes of my providing a cut down example to illustrate the problem at hand. (Yes, they would have been sorted, also with `LC_ALL=C`.) – roaima Apr 05 '16 at 23:12
-1

Using the R-programming language

The two data files are read into the R REPL:

> john <- read.csv("/Users/admin/john", header=FALSE, stringsAsFactors=FALSE)
> john
      V1     V2
1  apple  green
2 cherry    red
3 orange orange
>
> jane <- read.csv("/Users/admin/jane", header=FALSE, stringsAsFactors=FALSE)
> jane
      V1     V2
1  apple    red
2 banana yellow
3 cherry yellow
4   kiwi  green

To merge the data files in the R REPL:

> merge(john, jane, by = c("V1"), all=TRUE)
      V1   V2.x   V2.y
1  apple  green    red
2 banana   <NA> yellow
3 cherry    red yellow
4   kiwi   <NA>  green
5 orange orange   <NA>
> 

Output to file (using write.table):

> write.table( merge(john, jane, by = c("V1"), all=TRUE), "john_jane.csv", sep=",", quote=F, row.names=F, col.names=F, na="")

Resulting file ("john_jane.csv"):

apple,green,red
banana,,yellow
cherry,red,yellow
kiwi,,green
orange,orange,

As you can see from the write.table parameters, <NA> values are set to the empty string with na="". For further help, while in the R REPL type the command at the prompt preceded by a question-mark, as in ?getwd(), ?setwd(), ?read.csv() or ?merge().

[Note: including the parameter stringsAsFactors=FALSE to each of your read.csv() function calls may be superfluous depending on the age of your R-installation].

https://www.r-project.org/
https://cran.r-project.org/index.html

jubilatious1
  • 2,385
  • 8
  • 16
  • A downvote indicates there's room for improvement. Correspondingly I edited this post to add the `write.table` code, which produces the exact output requested by @roaima . If further improvements are required, please let me know. Thanks. – jubilatious1 Nov 07 '21 at 16:39