6

I need to take the sum of a field which is tilde (~) delimited. Problem I have is that my data is also having delimiter escaped.

Example

1~CEO~ashok\~kumar~1000

As we see in 3rd field above we have escaped delimitter which I want to avoid. I'm running the below command which doesn't handle this.

$ cat test.out|awk -F'~' 'BEGIN {sum=0} {sum+=$4} END{print sum}'

Assume the test.out data as:

1~CEO~ashok\~kumar~1000
2~CFO~Ranjan~2000
3~CEO~kumar~1000

So my output should be 4000. But currently with my command I get only 3000!

Gilles 'SO- stop being evil'
  • 807,993
  • 194
  • 1,674
  • 2,175
Ranjan
  • 61
  • 1
  • 2
  • 1
    Since it's the last column, you could simply say: `awk -F'~' 'BEGIN {sum=0} {sum+=$NF} END{print sum}' test.out` – devnull May 09 '14 at 05:21

6 Answers6

1

Just change the escaped delimiter to something else before processing with awk. This can be done with sed:

$ cat test.out| sed 's/\\~/=/g' | \
    awk -F'~' 'BEGIN {sum=0} {sum+=$4} END{print sum}'
4000

And, as is often the case, the cat is not needed:

$ sed 's/\\~/=/g' test.out | awk -F'~' 'BEGIN {sum=0} {sum+=$4} END{print sum}'
slm
  • 363,520
  • 117
  • 767
  • 871
John1024
  • 73,527
  • 11
  • 167
  • 163
1

Here's an alternative that doesn't use awk:

$ sed 's/\\~/=/g' test.out | cut -d"~" -f4 | paste -sd+ | bc
4000

The above uses sed to swap out the escaped tilde \~ in the 3rd column. Afterwards we can make use of cut to select the 4th column of numbers, and then reconstruct them so that they're separated by plus signs, (+).

$ sed 's/\\~/=/g' test.out | cut -d"~" -f4 | paste -sd+
1000+2000+1000

This string is then given to the binary calculator, bc which sums them up.

slm
  • 363,520
  • 117
  • 767
  • 871
1

It seems that you want to sum up all value in the last column of each line. so try this:

$ awk -F'~' '{sum += $NF};END{print sum}' file
4000

If you can use perl:

$ perl -F'~' -anle '$sum += $F[-1]; END {print $sum}' file
4000

or:

$ perl -nle '$sum += $1 if /(\d+$)/; END {print $sum}' file
4000
cuonglm
  • 150,973
  • 38
  • 327
  • 406
1

To deal with escaping, a general method is to use perl or PCRE and their alternation regexp operator combined with the no-backtrack operator. Here with GNU grep:

grep -Po '(?>(?:\\.|.)*?~){3}\K(?:\\.|[^~])*' << \EOF
1~CEO~ashok\~kumar~1000
2~CFO~Ranjan~2000
3~CEO~kumar~1000
4~field2~field3\\~10000~field5-note-the-escaped-backslash-not-tilde
5~a\~b\~c\~no-4th-field-here
EOF

which gives:

1000
2000
1000
10000

(which you can sum with your usual awk '{s+=$0};END{print s}').

With GNU sed, you can also do it with:

sed -rn 's/((\\.|[^\~])*~){3}((\\.|[^~])*).*/\3/p'

With GNU awk, you could use FPAT to define fields as sequences of either escaped characters or non-tilde-nor-backslash characters:

awk -v FPAT='(\\\\.|[^\\\\~])*' '{print $4}'
Stéphane Chazelas
  • 522,931
  • 91
  • 1,010
  • 1,501
0

This is a bit clumsy in awk (unless you can preprocess your source to change the delimiter, but that requires knowing another character or character sequence that cannot appear in the input). One thing you can do is read a whole line, then massage the line to get newlines as separators (newlines are the one thing that cannot possibly appear in a line).

awk 'BEGIN {FS="\n"}
{
    gsub("~", "\n");
    gsub("\\\n", "~");
    gsub("\\\\", "\\");
    $0 = $0;
    print $4;
}'
Gilles 'SO- stop being evil'
  • 807,993
  • 194
  • 1,674
  • 2,175
0
for n in $(cat <<\DATA
    1~CEO~ashok\~kumar~1000
    2~CFO~Ranjan~2000
    3~CEO~kumar~1000
DATA
)
    do r=$((r+${n##*~})) ; done
echo $r

###OUTPUT
4000

Given the data you show a simple for loop can do it, but if there's a chance of <tab> or <space> in the data you might want to limit the splitting like:

IFS='
'

...first.

Or with sed and bc:

sed 's/.*[^0-9]/r+=/;$aprint r' <<\DATA | bc
    1~CEO~ashok\~kumar~1000
    2~CFO~Ranjan~2000
    3~CEO~kumar~1000
DATA

OUTPUT:

4000

Both methods rely on your target data being the last field in each line.

The first does simple shell iterable $((arithmetic)) on $((r)) after stripping everything up to and including your last delimiter from $n.

The second replaces all that removed in the first method with the string 'r+=' and |pipes the result to bc which then does r=r+${last_field}. On the $last line of input sed also appends another line:

print r

...and |pipes it to bc which then does as instructed and outputs as shown.

mikeserv
  • 57,448
  • 9
  • 113
  • 229