group and count by a regex

Question

I have dozens of values in a file such as

(1608926678.237962) vcan0 123#0000000158
(1608926678.251533) vcan0 456#0000000186

I want to count how many of each there are, based on the number before the hash symbol (the hash can be included in the output).

I have tried the following but keep getting zero:

 grep -o '\b\d+#\b' ./file.log | wc -l

Any ideas? For the above example I would want:

123# 1
456# 1

Neither `\d` nor the `+` quantifier is supported by BRE grep - see for example [Why does my regular expression work in X but not in Y?](https://unix.stackexchange.com/questions/119905/why-does-my-regular-expression-work-in-x-but-not-in-y) – steeldriver, Dec 21 '20 at 19:40
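To illustrate the comment above, here is a minimal sketch of two ways to write "one or more digits followed by #" without `\d`, assuming a grep that supports `-o` (GNU and BSD grep do). The temporary file and the variable names are made up for the example; the two lines are the sample from the question. Note also that piping the matches into `wc -l` only gives one total, not a count per ID.

```shell
# Sample data copied from the question; the temp file is hypothetical.
log=$(mktemp)
printf '%s\n' \
  '(1608926678.237962) vcan0 123#0000000158' \
  '(1608926678.251533) vcan0 456#0000000186' > "$log"

# BRE (plain grep): spell out "one or more digits" as [0-9][0-9]*
bre_matches=$(grep -o '[0-9][0-9]*#' "$log")

# ERE (grep -E): the + quantifier is available, but \d still is not
ere_matches=$(grep -Eo '[0-9]+#' "$log")

printf '%s\n' "$bre_matches"
rm -f "$log"
```

Both variables should hold the same two matches, one per line.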

jesse_b · Answer 1 · 2020-12-21T19:36:27.527

4

This isn't exactly the output you described, but it can be massaged into that format if that is a hard requirement:

awk -F'[ #]' '{print $3}' input | sort -n | uniq -c

The awk command extracts the number before the # and passes it to sort and uniq. sort puts equal values next to each other, which uniq needs, and uniq -c prints a count for each distinct value.

To get your output format:

awk -F'[ #]' '{print $3}' input | sort -n | uniq -c | awk '{print $2"#",$1}'
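To see what each stage of that pipeline contributes, here is a rough sketch that runs it step by step. The temporary file and the variable names are made up for the example, and the first sample line is repeated so that one ID occurs twice.

```shell
# Sample lines from the question, with the first one repeated (hypothetical input).
input=$(mktemp)
printf '%s\n' \
  '(1608926678.237962) vcan0 123#0000000158' \
  '(1608926678.251533) vcan0 456#0000000186' \
  '(1608926678.237962) vcan0 123#0000000158' > "$input"

# Stage 1: split on a space or '#', keep the third field (the ID)
ids=$(awk -F'[ #]' '{print $3}' "$input")

# Stage 2: sort so equal IDs are adjacent, then count each run
counts=$(printf '%s\n' "$ids" | sort -n | uniq -c)

# Stage 3: swap the columns and put the '#' back
result=$(printf '%s\n' "$counts" | awk '{print $2"#", $1}')

printf '%s\n' "$result"
rm -f "$input"
```

uniq -c pads the count with leading spaces, which is harmless here because awk's default field splitting ignores them.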

edited Dec 21 '20 at 19:36

answered Dec 21 '20 at 19:31

jesse_b


score 4 · Answer 2 · answered Dec 21 '20 at 19:33

grep + Bash:

$ grep -Eo '\b[0-9]+#\b' ./file.log  | sort | uniq -c  | while read -r a b; do echo "$b" "$a"; done
123# 1
456# 1

answered Dec 21 '20 at 19:33

Arkadiusz Drabczyk


That while loop is just `awk '{print $2, $1}`, and I’m sure there are options with other tools. Why write a loop you don’t need? – D. Ben Knoble Dec 22 '20 at 14:24
First, you forgot `'` and second - why not? There are many ways to do what OP requested. This solution uses Bash, other solutions use `awk` which was added to the list of tags after OP asked the question - see https://unix.stackexchange.com/posts/625570/revisions – Arkadiusz Drabczyk Dec 22 '20 at 14:27
Well, if we’re nit-picking to that level, your answer isn’t grep + bash either, since you use sort and uniq. I am in favor of not using a while-read loop in bash where possible—they tend to be slower than the equivalent approach using a dedicated tool. And since you already used a few other tools, as mentioned, there’s no harm in throwing another (awk) into the mix for the field re-writing. – D. Ben Knoble Dec 22 '20 at 14:29
Yes, I'm the one who's *nit-picking* :) Have a nice day. – Arkadiusz Drabczyk Dec 22 '20 at 14:32
The while loop is much slower than an awk equivalent, but more importantly, I don't understand why you would want it. What does it offer that `uniq -c` doesn't do already? If you just want to change `1 123#` to `123# 1`, then using a shell loop is probably the most inefficient and slow way of doing it, so it seems like an odd choice. – terdon Dec 22 '20 at 15:23
As for ["why not?"](https://unix.stackexchange.com/questions/625570/group-and-count-by-a-regex#comment1170807_625572) - see [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) – Ed Morton Dec 22 '20 at 15:25
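For what the comments are debating, here is a small sketch that runs both the `while read` loop and the suggested `awk '{print $2, $1}'` replacement on the sample from the question, so the two outputs can be compared. The temporary file and the variable names are made up for the example, and the `\b` anchors from the answer are left out because they are an extension rather than part of POSIX ERE.

```shell
# Sample data copied from the question; the temp file is hypothetical.
log=$(mktemp)
printf '%s\n' \
  '(1608926678.237962) vcan0 123#0000000158' \
  '(1608926678.251533) vcan0 456#0000000186' > "$log"

# Column swap done with a shell loop, as in the answer
loop_out=$(grep -Eo '[0-9]+#' "$log" | sort | uniq -c |
  while read -r a b; do echo "$b" "$a"; done)

# Same column swap done with awk, as suggested in the comments
awk_out=$(grep -Eo '[0-9]+#' "$log" | sort | uniq -c | awk '{print $2, $1}')

printf '%s\n' "$awk_out"
rm -f "$log"
```

On input this small the two are indistinguishable; the speed difference the commenters mention would only show up on large files.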

αғsнιη · Answer 3 · 2020-12-21T20:37:32.013

With GNU awk:

awk -v FPAT=' [0-9]+#' '{ c[$1]++; }; END{ for(x in c) print x, c[x]; }' infile
 123# 1
 456# 1

This assumes the pattern " [0-9]+#" matches exactly once per line, as in your sample input.

To strip the whitespace from the result, and to cope with runs of whitespace in input like:

(1608926678.237962) vcan0        123#0000000158
(1608926678.251533) vcan0 456#0000000186
(1608926678.237962) vcan0    123#0000000158
(1608926678.251533) vcan0 456#0000000186
(1608926678.237962) vcan0      123#0000000158
(1608926678.251533) vcan0                       456#0000000186
(1608926678.237962) vcan0 123#0000000158

awk -v FPAT='[ \t][0-9]+#' '{
    filter=$1; sub(/[ \t]/, "", filter);
    c[filter]++;
};
END{ for(x in c) print x, c[x]; }' infile
456# 3
123# 4

For input where the pattern " [0-9]+#" can match several times on a line, you would do:

awk -v FPAT='[ \t][0-9]+#' '{
    for (i=1; i<=NF; i++){ 
        filter=$i; sub(/[ \t]/, "", filter); c[filter]++;
    };
};
END{ for(x in c) print x, c[x]; }' infile
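FPAT is specific to GNU awk. As a hedged alternative for other awks, the same counting can be sketched with the POSIX `match()` function and its `RSTART`/`RLENGTH` variables. This is a swapped-in technique, not the method from the answer; the temporary file and the variable names are made up, and the input is the multi-space sample shown above.

```shell
# Multi-space sample from the answer above; the temp file is hypothetical.
infile=$(mktemp)
printf '%s\n' \
  '(1608926678.237962) vcan0        123#0000000158' \
  '(1608926678.251533) vcan0 456#0000000186' \
  '(1608926678.237962) vcan0    123#0000000158' \
  '(1608926678.251533) vcan0 456#0000000186' \
  '(1608926678.237962) vcan0      123#0000000158' \
  '(1608926678.251533) vcan0                       456#0000000186' \
  '(1608926678.237962) vcan0 123#0000000158' > "$infile"

result=$(awk '{
    line = $0
    # Find every " digits#" match on the line, left to right
    while (match(line, /[ \t][0-9]+#/)) {
        id = substr(line, RSTART + 1, RLENGTH - 1)  # drop the leading blank
        c[id]++
        line = substr(line, RSTART + RLENGTH)       # continue after the match
    }
}
END { for (x in c) print x, c[x] }' "$infile" | sort)

printf '%s\n' "$result"
rm -f "$infile"
```

The trailing sort is only there to make the order of the output lines predictable.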

score 2 · Answer 4 · answered Dec 21 '20 at 20:57

With any awk in any shell on every Unix box:

$ awk -F'[ #]' '{cnt[$3]++} END{for (val in cnt) print val"#", cnt[val]}' file
123# 1
456# 1
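One thing to keep in mind with `for (val in cnt)`: awk does not specify the order in which the keys are visited, so the lines can come out in any order. A minimal sketch that pipes the result through sort for stable output follows; the temporary file and the variable name are made up, and one sample line is repeated so that the counts differ.

```shell
# Sample lines from the question, with the 456 line repeated (hypothetical input).
file=$(mktemp)
printf '%s\n' \
  '(1608926678.251533) vcan0 456#0000000186' \
  '(1608926678.237962) vcan0 123#0000000158' \
  '(1608926678.251533) vcan0 456#0000000186' > "$file"

# Count in awk, then sort numerically by ID so the order is deterministic
sorted=$(awk -F'[ #]' '{cnt[$3]++} END{for (val in cnt) print val"#", cnt[val]}' "$file" |
  sort -n)

printf '%s\n' "$sorted"
rm -f "$file"
```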

answered Dec 21 '20 at 20:57

Ed Morton


score 0 · Answer 5 · answered Dec 22 '20 at 18:38

awk '{for(i=1;i<=NF;i++){if($i ~ /#/){print $i}}}' filename | awk -F "#" '{c[$1"#"]++} END{for(k in c) print k, c[k]}'

output

123# 1
456# 1

answered Dec 22 '20 at 18:38

Praveen Kumar BS

