
I have tons of CSV files with similar content. The values are usually comma-separated and look like this:

product_a,  domestic,   500
product_a,  abroad,     15
product_b,  domestic,   313
product_b,  abroad,     35
product_c,  domestic,   411
product_c,  abroad,     84
product_d,  domestic,   25
product_d,  abroad,     2
...

What I'm trying to accomplish with AWK (because I believe sed is not the right tool for such operations, but I'm a relatively new Linux user...) is the sum of each product (column 1), inserted as column 2. I can do something like this

awk -F, '{a[$1]+=$3;}END{for (i in a)print i", "a[i];}' filename

to get those values (sums)

product_a,  515
product_b,  348
product_c,  495
product_d,  27
...

but I still have no idea how to insert them as a second column in the original file, in a form like this:

product_a,  515, domestic,  500
product_a,  515, abroad,    15
product_b,  348, domestic,  313
product_b,  348, abroad,    35
product_c,  495, domestic,  411
product_c,  495, abroad,    84
product_d,  27,  domestic,  25
product_d,  27,  abroad,    2
...

I've been using a little bit of sed and awk lately, but my attempts usually get me an error (like: attempt to use a scalar value as array).

The order of the rows is not my concern, but I assume that I'll be able to use the answer in a batch command.

$ for f in *.csv; do
  That Shiny Enigmatic Command > tmp && mv tmp $f
  done

EDIT
Thanks to @KM. I've gotten to the point where I can do what I want in 3 steps.

1st step:

$ for f in *.csv; do 
awk -F, '{a[$1]+=$3;}END{for (i in a)print i", "a[i];}' $f | sort > sum$f
done

2nd step:

$ for f in *.csv; do
  case $f in sum*) continue ;; esac
  join -t ',' "$f" "sum$f" | awk -F, '{print $1"," $4"," $2"," $3}' > tmp && mv tmp "$f";
done

Eventually I just rm sum*.*. Is there a way to execute it as one command from the terminal? Or outside of it?
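For what it's worth, the three steps can be collapsed into one two-pass awk invocation: the first pass over the file accumulates the per-product sums, the second pass prints each line with its sum as the new column 2. A minimal sketch, assuming the comma-plus-spaces layout shown above; the sample data is written inline so it can be run as-is, and in the real loop `sample.csv` would be `"$f"`:

```shell
# inline sample in the question's format (assumption: comma followed by spaces)
cat > sample.csv <<'EOF'
product_a,  domestic,   500
product_a,  abroad,     15
product_b,  domestic,   313
product_b,  abroad,     35
EOF

# NR==FNR holds only while reading the first copy of the file:
# pass 1 accumulates the sums, pass 2 prints the sum as column 2
awk -F', *' -v OFS=', ' '
  NR == FNR { sum[$1] += $3; next }
  { print $1, sum[$1], $2, $3 }
' sample.csv sample.csv > out.csv

cat out.csv
```

Inside the `for f in *.csv` loop this becomes `awk ... "$f" "$f" > tmp && mv tmp "$f"` (GNU awk 4.1.0+ can also write in place with `-i inplace`, as garethTheRed notes in the comments).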

Je.dno
    `awk` from version 4.1.0 has an `inplace` option. You need to check with `awk -V`. – garethTheRed May 16 '16 at 15:02
  • That's a very unusual file format - every product line would have to have the sum column recalculated whenever any matching product line is added or updated. **What** are you going to use that file for after the sum column has been added? And **how** are you going to use it? Are you sure that that's actually the best way to do whatever it is that you're really trying to do? – cas May 16 '16 at 21:51
  • Lines won't be added or updated, but the names in columns A and B can be completely different, which I cannot catch with any regexp. Those files are an exported result of daily processes. I need to make a few batch changes to them and then pass them to a client, which later maps them. So I'm kind of forced into this. – Je.dno May 17 '16 at 07:39
  • Of course, I'm forced to get that kind of output, not the means. – Je.dno May 17 '16 at 13:44
  • If you want one command, make this into a proper shell script. Don't get too hung up on reducing the number of lines, characters, steps, etc., lest you lose readability and maintainability. Step-by-step code is also easier to debug. FWIW, YMMV. – KM. May 17 '16 at 18:31
  • Ok. For now, I'm limited by my knowledge - I don't know much about shell scripts. Nonetheless, I think what you said is the way to go. Just the next thing to learn - could you point me to a place to start? – Je.dno May 17 '16 at 18:55
  • http://www.tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html ... or ... your local library (-: – KM. May 18 '16 at 14:25

1 Answer


Save the sums to a file called sum, sorted:

awk -F, '{a[$1]+=$3;}END{for (i in a)print i", "a[i];}' filename | sort > sum

cat sum 
product_a, 515
product_b, 348
product_c, 495
product_d, 27

Join the two files, first column of the first file with the first column of the second (think "keys"); pipe it to awk and print the reordered columns, using , as the input field separator (-F) and as the output field separator (OFS, set with -v):

join -t ','  -1 1 -2 1 filename sum | awk -F, -v OFS=, '{print $1,$4,$2,$3}'

product_a,  515,  domestic,  500
product_a,  515,  abroad,    15
product_b,  348,  domestic,  313
product_b,  348,  abroad,    35
product_c,  495,  domestic,  411
product_c,  495,  abroad,    84
product_d,  27,   domestic,  25
product_d,  27,   abroad,    2
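For reference, the intermediate sum file can be avoided entirely with bash process substitution: the sums are computed and sorted on the fly and fed straight to join. A sketch assuming bash, with an inline sample standing in for filename (nothing here is specific to this data beyond the comma-plus-spaces layout):

```shell
# inline sample standing in for "filename" (assumption: question's layout)
cat > filename <<'EOF'
product_a,  domestic,   500
product_a,  abroad,     15
product_b,  domestic,   313
product_b,  abroad,     35
EOF

# <(...) is bash process substitution: both join inputs arrive sorted,
# and no sum file is left behind to clean up afterwards
join -t, <(sort filename) \
         <(awk -F', *' '{a[$1]+=$3} END{for (i in a) print i","a[i]}' filename | sort) \
  | awk -F', *' -v OFS=', ' '{print $1, $4, $2, $3}' > joined.csv

cat joined.csv
```

The sorting happens inside the substitutions, so the original file on disk is left untouched, which matters for the batch loop in the question.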
KM.
  • Ok, it worked for me when the input file was sorted. If not, I added a sort in the join - `join -t ',' <(sort -t ',' filename) sum | awk -F, '{print $1"," $4"," $2"," $3}'`. BTW, what command will allow me to sort the original file (and overwrite it)? Because `sort -t ',' > filename` won't work - the terminal hangs... – Je.dno May 17 '16 at 17:38
  • GNU `sort` has the `-o` option; look at the man pages of your `sort` command to be sure; if you do, the input and output file can be the same. – KM. May 17 '16 at 18:29
  • Thank you, I've actually found out that few minutes ago and it works :)! – Je.dno May 17 '16 at 18:31