9

I have a CSV file (in which the field separator is indeed comma) with 8 columns and a few million rows. Here's a sample:

1000024447,38111220,201705,181359,0,12,1,3090
1064458324,38009543,201507,9,0,1,1,1298
1064458324,38009543,201508,9,0,2,1,90017

What's the fastest way to print the sum of all numbers in a given column, as well as the number of lines read? Can you explain what makes it faster?

Jeff Schaller
Elifarley
  • Don't forget that sed can add, too! https://unix.stackexchange.com/questions/36949/addition-with-sed – Jeff Schaller Jul 02 '18 at 13:26
  • Do you mean literally fastest as in performance? Or fastest to type on the fly, i.e. simplest? (Fastest for the programmer or fastest for the computer?) How long is your file? – Wildcard Jul 02 '18 at 19:48
  • Good question. I was thinking about how fast it can run. It's about a few million rows. – Elifarley Jul 02 '18 at 20:47

3 Answers

18

GNU datamash

$ datamash -t, count 3 sum 3 < file
3,604720

Some testing

$ time gawk -F',' '{ sum += $3 } END{ print sum, NR }' longfile
604720000000 3000000

real    0m2.851s
user    0m2.784s
sys     0m0.068s

$ time mawk -F',' '{ sum += $3 } END{ print sum, NR }' longfile
6.0472e+11 3000000

real    0m0.967s
user    0m0.920s
sys     0m0.048s

$ time perl -F, -nle '$sum += $F[2] }{ print "$.,$sum"' longfile
3000000,604720000000

real    0m3.394s
user    0m3.364s
sys     0m0.036s

$ time { cut -d, -f3 <longfile |paste -s -d+ - |bc ; }
604720000000

real    0m1.679s
user    0m1.416s
sys     0m0.248s

$ time datamash -t, count 3 sum 3 < longfile
3000000,604720000000

real    0m0.815s
user    0m0.716s
sys     0m0.036s

So mawk and datamash appear to be the pick of the bunch.
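For reference, a test file of this shape can be generated by repeating the three sample rows; a minimal sketch (the names `file` and `longfile` are assumptions matching the timings above, which sum to 604720000000 over 3000000 lines):

```shell
# Repeat the 3-line sample until we have 3,000,000 lines.
# yes(1) re-emits its (multi-line) argument forever; head truncates.
printf '%s\n' \
  '1000024447,38111220,201705,181359,0,12,1,3090' \
  '1064458324,38009543,201507,9,0,1,1,1298' \
  '1064458324,38009543,201508,9,0,2,1,90017' > file
yes "$(cat file)" | head -n 3000000 > longfile
```

Each repetition contributes 201705 + 201507 + 201508 = 604720 to the column-3 sum, so a million repetitions give the 604720000000 seen in the timings.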

Stéphane Chazelas
steeldriver
  • To be fair, I would suggest skipping the `printf` call in the `awk` timings above, since the other tools don't produce formatted text output (they can't). Just `awk -F',' '{ sum += $3 } END{ print sum, NR }' file` – RomanPerekhrest Jul 02 '18 at 14:38
  • @RomanPerekhrest updated – steeldriver Jul 02 '18 at 14:53
  • You might also want to try gawk with the `-M` option, based on Sundeep's comment on Roman's answer. This is an excellent answer. Thank you for introducing me to datamash. – David Conrad Jul 02 '18 at 22:18
  • 1
    With LC_ALL=C gawk gives comparable timings to mawk. mawk doesn't support multi-byte characters. – Stéphane Chazelas Jul 03 '18 at 07:37
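The LC_ALL=C variant from the comment above is easy to try; a sketch on the question's three sample rows (shown with plain awk, since the same invocation works for gawk or mawk):

```shell
# Recreate the question's sample, then sum column 3 in the C locale.
# LC_ALL=C disables multi-byte character handling, which is where most of
# gawk's slowdown relative to mawk comes from on ASCII-only data.
printf '%s\n' \
  '1000024447,38111220,201705,181359,0,12,1,3090' \
  '1064458324,38009543,201507,9,0,1,1,1298' \
  '1064458324,38009543,201508,9,0,2,1,90017' > file
LC_ALL=C awk -F',' '{ sum += $3 } END{ print sum, NR }' file
# → 604720 3
```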
7

Awk is a fast, capable tool for processing text files.

awk -F',' '{ sum += $3 }
           END{ printf "Sum of 3rd field: %d. Total number of lines: %d\n", sum, NR }' file

Sample output:

Sum of 3rd field: 604720. Total number of lines: 3

Conceptual note:
I must note that all those non-awk alternatives can run faster only on such "ideal" numeric columns. Make the format only slightly more complex (for example, with some extra decoration to be stripped before the calculation, as in <1064458324:a,<38009543:b,<201507:c,<9:d,<0:e,<1:f,<1:g,1298) and the speed advantage is gone, not to mention that some of those tools won't be able to perform the needed processing at all.
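To illustrate, handling the decorated format above costs awk only one extra gsub() to strip the wrapping before summing; a sketch using a hypothetical `messy` file in that format:

```shell
# Each field looks like <value:tag (the last one is bare); strip the
# leading "<" and the ":tag" suffix from field 3 before summing.
printf '%s\n' \
  '<1000024447:a,<38111220:b,<201705:c,<181359:d,<0:e,<12:f,<1:g,3090' \
  '<1064458324:a,<38009543:b,<201507:c,<9:d,<0:e,<1:f,<1:g,1298' \
  '<1064458324:a,<38009543:b,<201508:c,<9:d,<0:e,<2:f,<1:g,90017' > messy
awk -F',' '{ gsub(/^<|:.*$/, "", $3); sum += $3 }
           END{ printf "Sum of 3rd field: %d. Total number of lines: %d\n", sum, NR }' messy
```

For this sample it prints `Sum of 3rd field: 604720. Total number of lines: 3`, the same result as on the clean input.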

RomanPerekhrest
1

You could use cut to extract the field, paste to join the numbers with + signs, bc to sum them, and wc to count the lines.

I'm not certain how this performs over millions of lines, though. The golden rule of performance is: don't guess, measure. Profile any solution to see whether it provides the performance you need, and to determine whether a given change helps or hurts, and by how much.

Here is a solution that sums a given field and prints the number of lines:

echo -n "Sum: "
cut -d, -f3 <file |paste -s -d+ |bc
echo -n "Lines: "
wc -l <file

Output:

Sum: 604720
Lines: 3

The field is specified by the -f# parameter to cut, here cut -f3.

David Conrad