5

I have a table data like below

abc 1   1   1
bcd 2   2   4
bcd 12  23  3
cde 3   5   5
cde 3   4   5
cde 14  2   25

I want the sum of values in each column based on variables in first column and desired result is like below:

abc 1   1   1
bcd 14  25  7
cde 20  11  35

I used awk command like this

awk -F"\t" '{for(n=2;n<=NF; ++n)a[$1]+=$n}END{for(i in a ) print i, a[i] }' tablefilepath

and I got a result below:

abc 3
bcd 46
cde 66

I think the end of my code is wrong but don't know how to fix it. I need some directions to fix the code.

terdon
  • 234,489
  • 66
  • 447
  • 667
awkprob
  • 53
  • 3

3 Answers3

4

So long as your file is tab-delimited, datamash is a good fit for this.

$ datamash groupby 1 sum 2 sum 3 sum 4 < tablefilepath
abc     1       1       1
bcd     14      25      7
cde     20      11      35

Datamash can also work with non-tabs, if you specify -t <delimiter>. But tabs seem closest to the example input you have provided.

Datamash won't work if your input is delimited by arbitrary whitespace (i.e. possible multiple spaces intended to "look like" a tab). Still, even if that is what your data looks like, it is easily munged into the form expected by datamash:

sed -i 's/ \+/\t/g' tablefilepath
cryptarch
  • 1,255
  • 5
  • 23
  • 1
    At least in recent versions, there's a `-W` (`--whitespace`) option that should allow arbitrary whitespace delimiters – steeldriver Nov 27 '18 at 06:17
4

You were fairly close.  You see what you were doing wrong, don't you?  You were keeping one total for each column 1 value, when you should have been keeping three.

This is similar to Inian's answer, but trivially extendable to handle any number of columns:

awk -F"\t" '{for(n=2;n<=NF; ++n) a[$1][n]+=$n}
        END {for(i in a) {
                printf "%s", i
                for (n=2; n<=4; ++n) printf "\t%s", a[i][n]
                printf "\n"
             }
        }'

Rather than keep three arrays, like Inian's answer, it keeps a two-dimensional array.

  • Why limit it at all? Why not `awk '{for(n=2;n<=NF; ++n){a[$1][n]+=$n}}END{for(i in a){ printf "%s ", i; for(k in a[i]){printf "%s ",a[i][k]} print ""}}'`? I mean, why use `for (n=2; n<=4; ++n)` in the `END{}` block instead of just iterating over the array so you don't need to keep track of its size? – terdon Nov 27 '18 at 11:46
  • @terdon: Thanks for dropping by.  "`for (variable in array)` [which] shall iterate, assigning each index of array to variable ***in an unspecified order.***" — [POSIX](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html)  Inian and I failed to mention that our answers produce output in random order (specifically, I get `bcd`, `abc`, `cde`); but that can be fixed by piping `awk` into `sort`.  Your enhancement would output the columns in random order, with no way to fix it by post-processing. – Scott - Слава Україні Nov 27 '18 at 19:10
  • Ah, yes indeed. Fair point. – terdon Nov 27 '18 at 19:23
  • @Scott: Thanks for the direction!. Now I can see what was wrong with my code. But when I try your code, I get syntax error message "awk: line 1: syntax error at or near [ ". Is this caused by variables expansion problem or escaping problem? It's difficult to find the reason. – awkprob Nov 28 '18 at 01:55
  • @Scott: I'm running Linux ubuntu 14.04 and after gnu awk installation, 'awk --version' say GNU Awk 3.1.8. But still have syntax error – awkprob Nov 28 '18 at 04:54
  • OK, according to [this](https://unix.stackexchange.com/q/96100/23408 "Multidimensional arrays in awk"), multidimensional arrays were added to GNU awk in version 4.0.0.  Can you upgrade to a newer version?   If not, I'll try to fix this to work another way, but I might not be able to get to it today. – Scott - Слава Україні Nov 28 '18 at 05:33
  • @Scott: I upgraded gawk and, now it works!! Thanks for your advice and help!!! :) – awkprob Nov 28 '18 at 06:18
  • @awkprob: We should probably delete these comments now.  Move your pointer to the end of each of your comments, look for a white X in a red circle, and click on it. (You may need to register your account before you can do this.) – Scott - Слава Україні Nov 29 '18 at 23:16
2

Using awk summing up the columns 2-4 based on 1.

awk -v FS="\t" -v OFS="\t" '{ col1[$1]+=$2; col2[$1]+=$3; col3[$1]+=$4; next } END { for ( i in col1) print i, col1[i], col2[i], col3[i]  }' file
Inian
  • 12,472
  • 1
  • 35
  • 52