5

I have a program that sums a column in a file:

awk -v col=2 '{sum+=$col}END{print sum}' input-file

However, it has a problem: If you give it a file that doesn't have numeric data, (or if one number is missing) it will interpret it as zero.

I want it to produce an error if one of the fields cannot be parsed as a number.

Here's an example input:

bob 1
dave 2
alice 3.5
foo bar

I want it to produce an error because 'bar' is not a number, rather than ignoring the error.

Nick ODell
  • 2,498
  • 2
  • 20
  • 28

5 Answers5

8

A reasonable way to test would be to compare the field using tests similar to strtod, which is the method that awk uses to convert strings to numbers:

$2 !~ / *[+-]?[[:digit:]]/ { print "NAN: " $2; exit 1; }

The above differs from strtod in that it does not consider INFINITY or NAN to be "numbers". The leading space requirement could be relaxed under awk's default field-splitting behavior -- meaning the fields would never contain leading space:

$2 !~ /[+-]?[[:digit:]]/ { print "NAN: " $2; exit 1; }

A further refinement, thanks to Stéphane's comment and answer here:

$2 !~ /^[+-]?([[:digit:]]*\.?[[:digit:]]*([eE][-+]?[[:digit:]]+)?|0[xX][[:xdigit:]]*\.?[[:xdigit:]]*([pP][-+]?[[:digit:]]+)?)$/ { print "NAN: " $2; exit 1; }

Broken out for slightly better legibility, that regex is:

/^[+-]?([[:digit:]]*\.?[[:digit:]]*([eE][-+]?[[:digit:]]+)?|\
        0[xX][[:xdigit:]]*\.?[[:xdigit:]]*([pP][-+]?[[:digit:]]+)?)$/

... where the intention is to allow a possible leading + or -, then either a floating point number or hexadecimal number. The floating point number has optional leading digits, an option separator (here fixed to be a period .), followed by some number of digits, optionally followed by an exponent. The hex number must start with 0x or 0X, followed by hex digits, a separator, more hex digits, and optionally followed by a "power" (exponent). The entire second field must match one of those formats (as anchored by ^ and $). Omitted here, for the purposes of this question, are the NAN and INFINITY options.

Another option would be to force a numeric conversion, then compare it to zero and then further compare the original input to something that would convert to zero; more specifically, does it start with an optional + or -, then is it followed by zeros, or followed by a period and zeros:

{ number=0 + $2;
  if (!number && $2 !~ /^[+-]?(0+)|\.0+/)
    print "NAN: "$2;
}
Jeff Schaller
  • 66,199
  • 35
  • 114
  • 250
  • 1
    You'd need to anchor the regexp (with `^`), otherwise, it's just a test whether the second field contains a decimal digit. You'd also need to take care of numbers like `.123`. There's also whether you want to honour user's decimal radix (comma vs period). The last one would let strings like `.`, `+`, `+.` – Stéphane Chazelas Jan 16 '19 at 21:49
  • Great points, Stéphane; thank you. I will correct the simpler aspects shortly. – Jeff Schaller Jan 17 '19 at 00:43
  • notice that when converting strings to numbers awk will ignore any trailing junk (see my answer) and `/^[+-]?0*\.?0*$/` is far from being a thorough pattern matching strings that will be *completely* parsed to `0` by `strtod` (what about `0e13`, `.0e-0`, `0x0`, `0x.0p+13`, etc?) –  Jan 17 '19 at 01:58
  • Point well taken, @UncleBilly, as also noted by Stéphane above. I'm working on the "zero" regex right now... – Jeff Schaller Jan 17 '19 at 02:00
6

I ended up with this:

awk -v col=$col '
typeof($col) != "strnum" {
    print "Error on line " NR ": " $col " is not numeric"
    noprint=1
    exit 1
}
{
    sum+=$col
}
END {
    if(!noprint)
        print sum
}' $file

This uses typeof, which is a GNU awk extension. typeof($col) returns 'strnum' if $col is a valid number, and 'string' or 'unassigned' if it is not.

See Can I determine type of an awk variable?

Nick ODell
  • 2,498
  • 2
  • 20
  • 28
3
awk -v col=2 '
    $col+0==0 && $col!~/^[+-]?0/ { print "bad number " $col > "/dev/stderr" } 
    {sum+=$col}
    END{print sum}' input-file

It's up to you to complicate it if you want it to also handle .0 or .0e+33 as valid representations of 0; notice that awk will ignore trailing junk when converting strings to numbers ("1.4e1e3"+0, "1.4e1.e7"+0 or "14+13"+0 will be all equal to 14).

2

If you give it a file that doesn't have numeric data,

$col ~ /[^-.[:digit:]]/ {print "Error, non-numeric :"; print $col; exit 1};

Explanation just use a RegEx to check for the presence of characters which are not digits nor floating point, sign, etc.

(or if one number is missing)

add

|| ($col == "")

or

|| (length($col) == 0)

to the rule.

Or you could use a comparison to NF if it's the last column like in your example.

DrYak
  • 336
  • 1
  • 6
0

A little trick I use is trying to get the square root from the string. awk will do its utmost to make it right so if trailing 0 is not a number this method is not for you.

if ( sqrt(_varname) ) {
 print "this " _varname " is a (positive) number"
} else {
 print "this " _varname " is not a number..."
}
AdminBee
  • 21,637
  • 21
  • 47
  • 71