Merging files based on potentially incomplete keys

Question

I want to merge two files like How to merge two files based on the matching of two columns? but one file may not have all results. So for example

file1

1 dog
2 cat
3 fox
4 cow

file2

1 woof
2 meow
4 mooh

wanted output

1 dog woof
2 cat meow
3 fox
4 cow mooh

then what's wrong with those answers there applying on your files? — αғsнιη, Jan 30 '22 at 16:12
@αғsнιη I don't understand the first one and cannot make it work on my example. The awk answer ignores the second file, and the paste + awk command gives 1 dog 1 2 cat 2 3 fox 4 4 cow as output — sadfasdfasdfddd, Jan 30 '22 at 16:22

Ed Morton · Answer 1 · 2022-01-30T18:32:17.427

With GNU awk for arrays of arrays:

$ awk '{a[$1][(NR>FNR)]=$2} END{for (i in a) print i, a[i][0], a[i][1]}' file{1,2}
1 dog woof
2 cat meow
3 fox
4 cow mooh

or with any awk:

$ awk '{keys[$1]; a[$1,(NR>FNR)]=$2} END{for (i in keys) print i, a[i,0], a[i,1]}' file{1,2}
1 dog woof
2 cat meow
3 fox
4 cow mooh

Although the above produces the output in numerically ascending order of the first field that's just luck/coincidence - the order of the output lines is actually "random" (typically hash order) courtesy of the "in" operator. Pipe the output to sort -k1,1n (or set PROCINFO["sorted_in"]="@ind_num_asc" at the start of the END section in GNU awk) if you care about that.

The significant differences between this and a join solution are that:

This will work even if the input is not sorted while join requires input sorted on the key field(s) and,
If there's a line in file2 with a key not present in file1 (or vice-versa) this will display it spaced in a way that you can tell which file that unique line came from (unlike adding -a2 to a join command).

Here's some more comprehensive sample input/output to test with:

$ head file{1,2}
==> file1 <==
1 dog
2 cat
4 cow
5 bear

==> file2 <==
1 woof
2 meow
3 growl
4 mooh

which we can then run either of the above awk scripts on to get the same output:

$ awk '{a[$1][(NR>FNR)]=$2} END{for (i in a) print i, a[i][0], a[i][1]}' file{1,2}
1 dog woof
2 cat meow
3  growl
4 cow mooh
5 bear

and note that 3 growl has an extra blank before growl so you know that was a unique line from file2 as opposed to using join:

$ join -a1 -a2 file1 file2
1 dog woof
2 cat meow
3 growl
4 cow mooh
5 bear

where you can't tell a unique line from file1 (e.g. 5 bear) from a unique line from file2 (e.g. 3 growl).

score 4 · Answer 2 · answered Jan 30 '22 at 17:06

Assuming that both file1 and file2 are sorted, join by default will only join lines where both files have their key. So in you case, file2 doesn't have a line with a key of "3", so this line is not joined. However, you can change this behavior.

In the man pages of join:

   -a FILENUM
          also print unpairable lines from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2

So if you add the -a1 flag to your join command, any lines in file1 that doesn't have a matching key in file2 will also be printed.

# join -a1 file1 file2
1 dog woof
2 cat meow
3 fox
4 cow mooh

Please notice this will not handle unpairable lines from file2, so if in file2 you have another line like:

5 quack

This line will not be printed. You can also add -a2 flag to your join command which will print the line from file2, but this will just add some confusion since you won't know if this line came from file1 or from file2.

If you make it `-o0,1.2,2.2 -a1 -a2`, you'll be able to tell which is missing and remove that confusion. — Stéphane Chazelas, Jan 31 '22 at 14:38
Note that `file1` and `file2` need to be sorted lexically (not numerically) *on the joining key*, so with `sort -bk 1,1` here. — Stéphane Chazelas, Jan 31 '22 at 14:40
In relational DB parlance this is called a "left outer join" or "right outer join". Including unpairable lines from both files would be a "full outer join". — Barmar, Jan 31 '22 at 17:28

score 2 · Answer 3 · answered Jan 30 '22 at 16:28

2

Put your keys and values into an associative array and print it into file3:

declare -A arr

while read key value
do
    if [ -z ${arr[$key]} ]; then
        arr[$key]=$value
    else
        arr[$key]="${arr[$key]} $value"
    fi
done < <(cat file1 file2)

echo -n > file3

for key in "${!arr[@]}"
do
    echo "$key ${arr[$key]}" >> file3
done

answered Jan 30 '22 at 16:28

Juxtaposed

161
7

That would be orders of magnitude slower than an equivalent awk solution and, like the join solution, wouldn't differentiate in the output between keys unique to file1 vs unique to file2. It'd also produce the output in alphabetic instead of numeric order of the array indices (i.e. `10` would come before `3`) so it might need a pipe to `sort -k1,1n` if that matters. – Ed Morton Jan 30 '22 at 17:55

Merging files based on potentially incomplete keys

3 Answers3