How can I avoid escaped delimiter characters in a Unix command?

Question

I need to take the sum of a field which is tilde (~) delimited. Problem I have is that my data is also having delimiter escaped.

Example

1~CEO~ashok\~kumar~1000

As we see in 3rd field above we have escaped delimitter which I want to avoid. I'm running the below command which doesn't handle this.

$ cat test.out|awk -F'~' 'BEGIN {sum=0} {sum+=$4} END{print sum}'

Assume the test.out data as:

1~CEO~ashok\~kumar~1000
2~CFO~Ranjan~2000
3~CEO~kumar~1000

So my output should be 4000. But currently with my command I get only 3000!

Since it's the last column, you could simply say: `awk -F'~' 'BEGIN {sum=0} {sum+=$NF} END{print sum}' test.out` — devnull, May 09 '14 at 05:21

score 1 · Answer 1 · edited May 09 '14 at 05:23

1

Just change the escaped delimiter to something else before processing with awk. This can be done with sed:

$ cat test.out| sed 's/\\~/=/g' | \
    awk -F'~' 'BEGIN {sum=0} {sum+=$4} END{print sum}'
4000

And, as is often the case, the cat is not needed:

$ sed 's/\\~/=/g' test.out | awk -F'~' 'BEGIN {sum=0} {sum+=$4} END{print sum}'

edited May 09 '14 at 05:23

slm

363,520
117
767
871

answered May 09 '14 at 05:06

John1024

73,527
11
167
163

That assumes the backslashes themselves are not escaped (things like `field1~field2\\~field\~3` are not found in the input) – Stéphane Chazelas May 14 '14 at 14:45

score 1 · Answer 2 · answered May 09 '14 at 05:37

Here's an alternative that doesn't use awk:

$ sed 's/\\~/=/g' test.out | cut -d"~" -f4 | paste -sd+ | bc
4000

The above uses sed to swap out the escaped tilde \~ in the 3rd column. Afterwards we can make use of cut to select the 4th column of numbers, and then reconstruct them so that they're separated by plus signs, (+).

$ sed 's/\\~/=/g' test.out | cut -d"~" -f4 | paste -sd+
1000+2000+1000

This string is then given to the binary calculator, bc which sums them up.

cuonglm · Answer 3 · 2015-04-28T19:57:28.923

1

It seems that you want to sum up all value in the last column of each line. so try this:

$ awk -F'~' '{sum += $NF};END{print sum}' file
4000

If you can use perl:

$ perl -F'~' -anle '$sum += $F[-1]; END {print $sum}' file
4000

or:

$ perl -nle '$sum += $1 if /(\d+$)/; END {print $sum}' file
4000

edited Apr 28 '15 at 19:57

answered May 10 '14 at 06:00

cuonglm

150,973
38
327
406

Stéphane Chazelas · Answer 4 · 2014-06-06T21:23:02.327

To deal with escaping, a general method is to use perl or PCRE and their alternation regexp operator combined with the no-backtrack operator. Here with GNU grep:

grep -Po '(?>(?:\\.|.)*?~){3}\K(?:\\.|[^~])*' << \EOF
1~CEO~ashok\~kumar~1000
2~CFO~Ranjan~2000
3~CEO~kumar~1000
4~field2~field3\\~10000~field5-note-the-escaped-backslash-not-tilde
5~a\~b\~c\~no-4th-field-here
EOF

which gives:

(which you can sum with your usual awk '{s+=$0};END{print s}').

With GNU sed, you can also do it with:

sed -rn 's/((\\.|[^\~])*~){3}((\\.|[^~])*).*/\3/p'

With GNU awk, you could use FPAT to define fields as sequences of either escaped characters or non-tilde-nor-backslash characters:

awk -v FPAT='(\\\\.|[^\\\\~])*' '{print $4}'

score 0 · Answer 5 · answered May 10 '14 at 12:15

This is a bit clumsy in awk (unless you can preprocess your source to change the delimiter, but that requires knowing another character or character sequence that cannot appear in the input). One thing you can do is read a whole line, then massage the line to get newlines as separators (newlines are the one thing that cannot possibly appear in a line).

awk 'BEGIN {FS="\n"}
{
    gsub("~", "\n");
    gsub("\\\n", "~");
    gsub("\\\\", "\\");
    $0 = $0;
    print $4;
}'

mikeserv · Answer 6 · 2014-06-07T01:18:10.937

for n in $(cat <<\DATA
    1~CEO~ashok\~kumar~1000
    2~CFO~Ranjan~2000
    3~CEO~kumar~1000
DATA
)
    do r=$((r+${n##*~})) ; done
echo $r

###OUTPUT
4000

Given the data you show a simple for loop can do it, but if there's a chance of <tab> or <space> in the data you might want to limit the splitting like:

IFS='
'

...first.

Or with sed and bc:

sed 's/.*[^0-9]/r+=/;$aprint r' <<\DATA | bc
    1~CEO~ashok\~kumar~1000
    2~CFO~Ranjan~2000
    3~CEO~kumar~1000
DATA

OUTPUT:

Both methods rely on your target data being the last field in each line.

The first does simple shell iterable $((arithmetic)) on $((r)) after stripping everything up to and including your last delimiter from $n.

The second replaces all that removed in the first method with the string 'r+=' and |pipes the result to bc which then does r=r+${last_field}. On the $last line of input sed also appends another line:

print r

...and |pipes it to bc which then does as instructed and outputs as shown.

How can I avoid escaped delimiter characters in a Unix command?

Example

6 Answers6

OUTPUT:

Linked