Handling 3 Files using awk

Question

Consider following files:

file1:

boo,8,1024
foo,7,2048

file2:

foo,0,24,154
noo,0,10,561

file3:

24,154,7,1024,0

What I need is to go to File1 and check if $2==7; if true, take $1, $2 and $3 from File1; now I have to compare if $1 from File1 equal to $1 from File2; if true, I have to take $3 and $4 from File2 which not exist in File1, then I have to go to File3 and check if $1 from File3 is equal to $3 from File2, and $2 from File3 is equal to $4 from File2; if yes, then I have to check if $2 from File1 is equal to $3 from File3, then if this condition is true, I have to compare $3 from File1 with $4 from File3, if $3 from File1 is more than $4 from File3.

I tried the following script:

cat [file1] [file2] [file3] | 
awk -F, 
'{if(NF==3)
    {if($2==7){a[$1]=$1; b[$1]=$2; c[$1]=$3}
    }else
        {if(NF==4){if(a[$1]==$1){d[$3]=$3; e[$4]=$4}
                  }else
                        {if(NF==5){if(d[$1]==$1 && e[$2]==$2){print a[$1], b[$1], c[$1], d[$1]}}
                        }
                  }

  }'

The desired output is:

foo,7,2048,24,154,1024

score 9 · Accepted Answer · answered Sep 08 '15 at 10:10

9

That worked for me:

awk -F, 'FNR==1{++f} \
  f==1 && $2==7 {a1[$1]++; a2[$2]=$3; o=$0} \
  f==2 && a1[$1] {o=o","$3","$4; a3[$3]=$4} \
  f==3 && a3[$1] && $2==a3[$1] && a2[$3] && $4<a2[$3] {print o}' \
file1 file2 file3

Explanation:

The first line (FNR==1{++f}) increments the file index to later determine in which file we are 1-3.
file1: if $2 equals 7
- fill an array a1 with $1 as index and a2 with $2 as index and $3 as value
- write down the o variable (output) with the first 3 fields
file2: if $1 of file2 equals $1 of file1 (prevously written in a1)
- append $3 and $4 to the output variable o.
- fill an array a3 with $3 as index and $4 as value.
file3: if:
- $1 equals file2s $3 (index of a3)
- $2 equals file2s $4 (value of a3)
- $3 equals file1s $2 (index of a2)
- $4 is lower than file1s $3 (value of a2)
then:
- print the value of o.

answered Sep 08 '15 at 10:10

chaos

47,463
11
118
144

Is there a need for backslash (apart the last) ? how about BEGINFILE (instead of FNR==1 ) ? – Archemar Sep 08 '15 at 11:46
@Archemar BEGINFILE and ENDFILE are gawk extensions and the backslashes can all be removed, I inseted them, for better readablility: you can write that whole thing in one single line, but it would not look nice – chaos Sep 08 '15 at 11:51
@chaos, thank you, but unfortunately it always returns null. – Eng7 Sep 08 '15 at 15:33
@Azizieh7 I tested it with mawk and gawk with your 3 input example files. For me it worked. Do you use different input files or encodings/linebreaks? – chaos Sep 08 '15 at 15:51
@chaos, there is different linebreaks in file3, but I use tr -d '\015' to overcome this. – Eng7 Sep 08 '15 at 16:00
@Azizieh7 There must be simethign else, have you tried with your given example files? Without special encoding? I think there is something wrong with this file3. – chaos Sep 08 '15 at 16:07
@chaos `o` in your code will save the last record in `file1` only – Eng7 Sep 09 '15 at 09:37
@Azizieh7 `o` contains the whole line of file1 `o=$0` and later the field 3 and 4 of file2 `o=o","$3","$4`. I'm sure, your files have special encodings, can you upload the 3 demo files and give a link, so I can try with *your* files? What os are you working on? – chaos Sep 09 '15 at 09:42

Kaz · Answer 2 · 2016-09-13T05:30:48.503

TXR solution:

@(repeat)
@id,@val0,@val1
@  (next)
@  (skip)
@id,@nil,@val2,@val3
@  (next)
@val2,@val3,@val0,@val4,@val5
@  (require (< (int-str val4) (int-str val1)))
@  (output)
@id,@val0,@val1,@val2,@val3,@val4
@  (end)
@(end)

Run:

$ txr join.txr file1 file2 file3
foo,7,2048,24,154,1024

But the astute observer will note that the 7 hasn't been specified anywhere in the code, appearing just in the output! That is because code is actually marching through all records in file1 and prints all the combinations that meet the matches and constraint. The only one in the sample data is the one with val0 being 7.

If there were more combinations found, it could be constrained to just the 7 one like this:

$ txr -Dval0=7 join.txr file1 file2 file3
foo,7,2048,24,154,1024

# how about 6?
$ txr -Dval0=6 join.txr file1 file2 file3
# no output

The TXR pattern extraction language provides here one big pattern match with implicit back-references through the repetition of variable names, spanning multiple files, with multi-line extraction patterns, and non-textual constraints, plus embedded side effects like output, and so on.

Accepted Awk solution carefully translated the TXR Lisp awk macro:

(awk (:begin (set fs "," ofs ","))
     (:let o (a1 (hash :equal-based)) (a2 (hash)) (a3 (hash)))
     (t (mf [orf int-str identity])) ;; map those fields to integers, which can be
     ((and (= arg 1) (= [f 1] 7)) (inc [a1 [f 0] 0])
                                  (set [a2 [f 1]] [f 2])
                                  (set o rec))
     ((and (= arg 2) [a1 [f 0]]) (set o `@o,@[f 2],@[f 3]`)
                                 (set [a3 [f 2]] [f 3]))
     ((and (= arg 3)
           [a3 [f 0]]
           (= [f 1] [a3 [f 0]])
           [a2 [f 2]]
           (< [f 3] [a2 [f 2]])) (prn o)))

Run:

$ txr awkit.tl file1 file2 file3
foo,7,2048,24,154

The ,1024 required part in the output is missing; the original "Awk Classic" has this behavior.

Handling 3 Files using awk

2 Answers2