46

I have file1 likes:

0   AFFX-SNP-000541  NA
0   AFFX-SNP-002255  NA
1   rs12103          0.6401
1   rs12103_1247494  0.696
1   rs12142199       0.7672

And a file2:

0   AFFX-SNP-000541   1
0   AFFX-SNP-002255   1
1   rs12103           0.5596
1   rs12103_1247494   0.5581
1   rs12142199        0.4931

And would like a file3 such that:

0   AFFX-SNP-000541     NA       1
0   AFFX-SNP-002255     NA       1
1   rs12103             0.6401   0.5596
1   rs12103_1247494     0.696    0.5581
1   rs12142199          0.7672   0.4931

Which means to put the 4th column of file2 to file1 by the name of the 2nd column.

Yogesh D
  • 243
  • 2
  • 9
Dadong Zhang
  • 669
  • 1
  • 7
  • 10

3 Answers3

66

This should do it:

join -j 2 -o 1.1,1.2,1.3,2.3 file1 file2

Important: this assumes your files are sorted (as in your example) according to the SNP name. If they are not, sort them first:

join -j 2 -o 1.1,1.2,1.3,2.3 <(sort -k2 file1) <(sort -k2 file2)

Output:

0 AFFX-SNP-000541 NA 1
0 AFFX-SNP-002255 NA 1
1 rs12103 0.6401 0.5596
1 rs12103_1247494 0.696 0.5581
1 rs12142199 0.7672 0.4931

Explanation (from info join):

`join' writes to standard output a line for each pair of input lines that have identical join fields.

`-1 FIELD'
     Join on field FIELD (a positive integer) of file 1.

`-2 FIELD'
     Join on field FIELD (a positive integer) of file 2.

`-j FIELD'
     Equivalent to `-1 FIELD -2 FIELD'.

`-o FIELD-LIST'

 Otherwise, construct each output line according to the format in
 FIELD-LIST.  Each element in FIELD-LIST is either the single
 character `0' or has the form M.N where the file number, M, is `1'
 or `2' and N is a positive field number.

So, the command above joins the files on the second field and prints the 1st,2nd and 3rd field of file one, followed by the 3rd field of file2.

terdon
  • 234,489
  • 66
  • 447
  • 667
20

You could use awk:

$ awk 'NR==FNR {h[$2] = $3; next} {print $1,$2,$3,h[$2]}' file2 file1 > file3

output:

$ cat file3
0 AFFX-SNP-000541 NA 1
0 AFFX-SNP-002255 NA 1
1 rs12103 0.6401 0.5596
1 rs12103_1247494 0.696 0.5581
1 rs12142199 0.7672 0.4931

Explanation:

Walk through file2 (NR==FNR is only true for the first file argument). Save column 3 in hash-array using column 2 as key: h[$2] = $3. Then walk through file1 and output all three columns $1,$2,$3, appending the corresponding saved column from hash-array h[$2].

grebneke
  • 4,621
  • 25
  • 20
  • Thanks a lot. Just wondering, what does the 'h[$2]=$3' mean? Actually I need to match exactly the file1$2==file2$2 in my complex cases (that are not necessary in same order). – Dadong Zhang Feb 06 '14 at 20:43
  • 2
    `h[$2] = $3` is a hash assignment. It save `$3` as the value and `$2` as the key. Example: `h["name"] = "Dadong"`. Now, `print h["name"]` outputs `Dadong`. It does what you want, it matches exactly the second column from both files. – grebneke Feb 06 '14 at 20:46
7

If you don't need any ordering, than a simple solution would be

paste file{1,2} | awk '{print $1,$2,$3,$6}' > file3

This presumes that all rows have three entries, and column 1 and 2 of both files are the same (as in your example data)

Bernhard
  • 11,992
  • 4
  • 59
  • 69
  • 1
    +1 for great use of `paste` – grebneke Feb 06 '14 at 20:42
  • 1
    @grebneke and Bernhard, since you seem to be fans of `paste` can you figure out a way to answer [this](http://unix.stackexchange.com/q/113879/22222) with coreutils? – terdon Feb 06 '14 at 20:52
  • @terdon - a humble attempt: http://unix.stackexchange.com/a/113909/32165 – grebneke Feb 06 '14 at 21:04
  • 1
    @terdon I'd advise the to reconsider the program that is outputting this s*** – Bernhard Feb 06 '14 at 21:09
  • Nothing wrong with the format, perfectly decent tab separated files. In any case, with this kind of data you usually have no choice as to the format, it comes out of another program. – terdon Feb 06 '14 at 21:10
  • @terdon Yeah, that was my way of saying: to complicated for me ;) – Bernhard Feb 06 '14 at 21:11
  • You could golf it shorter with: `paste file1 <(cut -f3 file2) > file3` (assuming tab delimited columns) – don_crissti Apr 27 '15 at 18:53