I'm trying to calculate the geometric mean of a file full of numbers (1 column).

The basic formula for the geometric mean is to average the natural log (or log base 10) of all the values and then raise e (or 10) to that average.
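
Written out for n positive values x_1, ..., x_n, the definition and the log form described above are the same quantity (the symbol names here are chosen only for illustration):

```latex
\[
\mathrm{GM}(x_1,\dots,x_n)
  = \left(\prod_{i=1}^{n} x_i\right)^{1/n}
  = \exp\!\left(\frac{1}{n}\sum_{i=1}^{n}\ln x_i\right)
\]
```

The log form is the practical one for a long list of values, because the running sum of logs stays small where the raw product would overflow.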

My current bash-only script looks like this:

# Geometric Mean
count=0;
total=0; 

for i in $( awk '{ print $1; }' input.txt )
  do
    if (( $(echo " "$i" > "0" " | bc -l) )); then
        total="$(echo " "$total" + l("$i") " | bc -l )"
        ((count++))
    else
      total="$total"
    fi
  done

Geometric_Mean="$( printf "%.2f" "$(echo "scale=3; e( "$total" / "$count" )" | bc -l )" )"
echo "$Geometric_Mean"

Essentially:

  1. Check every entry in the input file to make sure it is larger than 0, calling bc every time
  2. If the entry is > 0, I take the natural log (l) of that value and add it to the running total, calling bc every time
  3. If the entry is <=0, I do nothing
  4. Calculate the Geometric Mean

This works perfectly fine for a small data set. Unfortunately, I am trying to use this on a large data set (input.txt has 250,000 values). While I believe this will eventually work, it is extremely slow. I've never been patient enough to let it finish (45+ minutes).

I need a way of processing this file more efficiently.

There are alternative ways, such as using Python:

# Import the library you need for math
import numpy as np

# Open the file
# Load the lines into a list of float objects
# Close the file
infile = open('time_trial.txt', 'r')
x = [float(line) for line in infile.readlines()]
infile.close()

# Define a function called geo_mean
# Use numpy to create a variable "a" with the ln of all the values
# Use numpy to EXP() the mean of a (the sum of a divided by the count of a)
# Note ... this will break if you have values <=0
def geo_mean(x):
    a = np.log(x)
    return np.exp(a.sum()/len(a))

print("The Geometric Mean is: ", geo_mean(x))

I would like to avoid using Python, Ruby, Perl ... etc.

Any suggestions on how to write my bash script more efficiently?

Paulo Tomé
Matt
  • 2
    You are running two subshells and two bc external processes per input value, so around a million processes in total. awk will deal with the whole input in a single process, probably in under 30 seconds. – Paul_Pedant Mar 02 '20 at 18:48
  • 2
    If you need efficiency, forget bash (or any other shell). The shell is not designed for this sort of thing and will _always_ be the slowest and least efficient solution possible. – terdon Mar 02 '20 at 18:49
  • 1
    Since you're already using awk, use awk throughout: `awk '$1 > 0 {n++; s += log($1)} END{if(n)print exp(s/n)}' your_file`. Use `-v OFMT=%.16g` if you want more digits. –  Mar 02 '20 at 18:58
  • Paul_Pedant & mosvy thank you so much. awk was able to perform all the calculations in < 5 sec. I clearly need to do some awk homework. I really appreciate your help! – Matt Mar 02 '20 at 19:00
  • Also, if your awk is GNU awk you may be able to do the calculation with arbitrary precision numbers instead of doubles by using the `-M` or `--bignum` option (check with `gawk --version` if it was compiled with gmp/mpfr support). –  Mar 02 '20 at 19:10
  • @mosvy thanks so much again for the help, sincerely appreciate it. For reference the wall clock time on the awk code you provided was 0.122 sec compared to the 0.208 sec for the Python script. Additionally, if we combine the 'user' and 'sys' time awk completed it in 0.125 sec while Python took 1.125 sec. Thanks again! – Matt Mar 02 '20 at 19:10
  • @mosvy please don't answer questions in comments. That circumvents the normal quality control procedures of the site since comments cannot be downvoted and also mean that the question isn't marked as answered. – terdon Mar 02 '20 at 19:14
  • You can always delete my comments instead of downvoting them. BTW, could you explain the purpose of `E=exp(1)` .. `E^m` instead of just `exp(m)` in your answer? –  Mar 02 '20 at 19:20
  • Well, I'd rather you post an answer so I can upvote it instead, @mosvy. As for the `E=exp(1)`, that was just the first way that I found to get the value of `e` so that I could then raise it to the power returned by the `tot/c`. Your `exp()` approach seems much better, but I didn't know about it. Yet another reason why posting answers is better :). – terdon Mar 02 '20 at 19:22
  • 1
    Why the extra quotes and spaces in `" "$i" > "0" "`? Could be `"$i > 0"` – user253751 Mar 03 '20 at 11:36
  • I agree with user253751 — it’s rare that we criticize a post for having *too many* quotes, but this is such a case.   Shell variables should always be put inside quotes, unless you have a good reason not to, and you’re sure you know what you’re doing — see [this](https://unix.stackexchange.com/q/131766/80216 "Why does my shell script choke on whitespace or other special characters?") and [this](https://unix.stackexchange.com/q/171346/80216 "Security implications of forgetting to quote a variable in bash/POSIX shells") for details.   For example, the next line should be `echo "$total + l($i)"`. – G-Man Says 'Reinstate Monica' Mar 04 '20 at 04:16
  • Related references: [How to do integer & float calculations, in bash or other languages/frameworks?](https://unix.stackexchange.com/q/40786/80216)  and  [Doing simple math on the command line …](https://unix.stackexchange.com/q/30478/80216) – G-Man Says 'Reinstate Monica' Mar 04 '20 at 04:16
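
The awk one-liner from the comments can be wrapped in a small function to try out the `OFMT` suggestion. This is only a minimal sketch: the function name `geomean_fmt` and its optional second argument are made up for illustration and do not appear in the thread.

```shell
# Hypothetical helper: geomean_fmt FILE [FORMAT]
# Prints the geometric mean of the positive values in column 1 of FILE.
# FORMAT is passed to awk as OFMT, which controls how "print" renders
# non-integer numbers; awk's default OFMT is %.6g.
geomean_fmt() {
    awk -v OFMT="${2:-%.6g}" '
        $1 > 0 { n++; s += log($1) }        # count and sum logs of positive values
        END    { if (n) print exp(s / n) }  # print nothing if no value qualified
    ' "$1"
}
```

Calling `geomean_fmt input.txt` uses the default six significant digits, while `geomean_fmt input.txt %.16g` asks for sixteen. Because of the `if (n)` guard, a file with no positive values produces no output instead of a division error.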

2 Answers

Please don't do this in the shell. There is no amount of tweaking that would ever make it remotely efficient. Shell loops are slow and using the shell to parse text is just bad practice. Your whole script can be replaced by this simple awk one-liner which will be orders of magnitude faster:

awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++} END{m=tot/c; printf "%.2f\n", E^m}' file

For example, if I run that on a file containing the numbers from 1 to 100, I get:

$ seq 100 > file
$ awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++} END{m=tot/c; printf "%.2f\n", E^m}' file
37.99

In terms of speed, I tested your shell solution, your Python solution and the awk I gave above on a file containing the numbers from 1 to 10000:

### Shell
$ time foo.sh
3677.54

real    1m0.720s
user    0m48.720s
sys     0m24.733s

### Python
$ time foo.py
The Geometric Mean is:  3680.827182220091

real    0m0.149s
user    0m0.121s
sys     0m0.027s


### Awk
$ time awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++} END{m=tot/c; printf "%.2f\n", E^m}' input.txt
3680.83

real    0m0.011s
user    0m0.010s
sys     0m0.001s

As you can see, the awk version is even faster than the Python one and far simpler to write. You can also make it into a "shell" script, if you like. Either like this:

#!/bin/awk -f

BEGIN{
    E = exp(1);
} 
$1>0{
    tot+=log($1);
    c++;
}
 
END{
    m=tot/c; printf "%.2f\n", E^m
}

or by saving the command in a shell script:

#!/bin/sh
awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++;} END{m=tot/c; printf "%.2f\n", E^m}' "$1"
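
One edge case is worth a note: if the file contains no positive values, `c` is never set and `tot/c` divides by zero, which typically makes awk abort with an error. A minimal sketch of a guarded variant follows. The function name `geomean` and the error message are illustrative choices, it calls `exp()` directly instead of `E^m`, and it assumes an awk that understands the `/dev/stderr` file name (gawk and mawk do).

```shell
#!/bin/sh
# Hypothetical helper: geomean FILE
# Prints the geometric mean (2 decimals) of the positive values in column 1.
# Returns a non-zero status if FILE has no positive values.
geomean() {
    awk '
        $1 > 0 { tot += log($1); c++ }   # accumulate logs of positive values
        END {
            if (c == 0) {                # nothing usable: avoid dividing by zero
                print "no positive values found" > "/dev/stderr"
                exit 1
            }
            printf "%.2f\n", exp(tot / c)
        }
    ' "$1"
}
```

With this in a script, `geomean input.txt || echo "nothing to average"` lets the caller tell the difference between a result and an empty data set.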
terdon

Here are some suggestions. I can't test them without knowing exactly what is in your file but I hope this helps. There are always different, better ways to do things so this is not at all exhaustive.


Change the if condition

if (( $(echo " "$i" > "0" " | bc -l) )); then

Change it to:

if [[ "$i" -gt 0 ]]; then

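The first line starts a subshell and a bc process for every value even though it is just doing a simple comparison. A solution is to use the [[ shell keyword, which is evaluated by the shell itself. Note that -gt compares integers only, so this works only if the file contains no decimal values.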
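
If the file may contain decimal values such as 0.5, a pattern match can stand in for the numeric comparison without starting any process. The sketch below is an assumption-laden illustration, not a general number parser: the function name `is_positive` is made up, and it only accepts plain decimal notation (no exponent such as 1e3, no leading + sign).

```shell
# Hypothetical helper: is_positive VALUE
# Succeeds if VALUE is a plain decimal number greater than zero.
# Uses only the shell's built-in case statement, so no bc or subshell runs.
is_positive() {
    case $1 in
        ''|-*|*[!0-9.]*|*.*.*) return 1 ;;  # empty, negative, non-numeric, or two dots
        *[1-9]*)               return 0 ;;  # at least one non-zero digit
        *)                     return 1 ;;  # only zeros (0, 0.0, ...)
    esac
}
```

In the loop it would be used as `if is_positive "$i"; then`. The logarithm itself still needs bc or awk, so this removes only one of the two external calls per value.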


Remove unneeded code

else
  total="$total"

This is basically a way to explicitly waste time doing nothing :). These 2 lines can be removed outright.

JamesL