PCRE-regex Use grep to exclude a capturing group

Question

I am using GNU grep with the -P PCRE Regex support for matching strings from a file. The input file has lines containing strings like:

FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343

I want to capture the numbers 2 and 0.45654343 from the above line. I used this regex:

grep -Po ".zoo.\K[\d+](.*):\ (.*)$" file

But this produces the following result:

2.someString:More-RandomString (string here too): 0.45654343

I am able to get the first number from the first capturing group as 2, and also to match a capturing group at the end of the line. But I am not able to skip the words/lines between two capturing groups.

I know for a fact that I have a group (.*) that is capturing those words in the middle. What I've tried to do is include another \K to ignore it as

grep -Po ".zoo.\K[\d+](.*):\K (.*)$" file

But that gave me only the second capture group, 0.45654343.

I also tried what I took to be a non-capturing group with the (?:) syntax (the attempt below actually uses the look-ahead form (?=)):

grep -Po ".zoo.\K[\d+](?=.someString:More-RandomString (string here too)):\ (.*)$"

But this gave me nothing. What am I missing here?

You're missing basic understanding of how Perl regexps are supposed to work. You're also missing basic sense of not trying to do this with a single `grep` command. — Satō Katsura, Nov 28 '16 at 09:45
@SatoKatsura: I wanted to use a single `grep` and I hoped it would be possible. And the reason for `You're missing basic understanding of how Perl regexps are supposed to work`? I did a decent attempt to solving the issue — Inian, Nov 28 '16 at 09:50
`\K` doesn't do what you seem to think it does. Neither does `[\d+]`. — Satō Katsura, Nov 28 '16 at 09:51
@SatoKatsura: Why do you think that? Can you point me how is it incorrect? — Inian, Nov 28 '16 at 09:53
Because (1) it doesn't make sense to have more than one `\K` in the same regexp, and (2) how do you explain the output of something like this: `echo 1+2 | grep -Po '[\d+]'`? — Satō Katsura, Nov 28 '16 at 09:57
@SatoKatsura: Appreciate your comments. Will learn more about PCRE syntaxes. — Inian, Nov 28 '16 at 10:05
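
The point about `[\d+]` raised in the comments can be checked directly. This is a minimal sketch, assuming a GNU grep built with `-P` support; the variable names `class` and `run` are only illustrative.

```shell
# Inside brackets, \d+ is not "one or more digits": [\d+] matches ONE
# character that is either a digit or a literal plus sign.
class=$(echo '1+2' | grep -Po '[\d+]')

# Outside brackets, \d+ matches a whole run of digits.
run=$(echo '12+345' | grep -Po '\d+')

printf '%s\n' "$class" "$run"
# class holds three one-character matches: 1, + and 2
# run holds two matches: 12 and 345
```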

Stéphane Chazelas · Accepted Answer · 2016-11-28T14:32:27.040

grep's name comes from the g/re/p command of ed. Its primary purpose is to print the lines that match a regexp. It's not its role to edit the content of those lines. You have sed (the stream editor) or awk for that.

Now, some grep implementations, starting with GNU grep, added a -o option to print the matched portion of each line (what is matched by the regexp, not its capture groups). Some grep implementations, again like GNU's (with -P) or pcregrep, also support PCREs for their regexps.
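
To make the difference between the matched portion and a capture group concrete, here is a small sketch, assuming GNU grep with `-P`. The input `id=42;` and the variable names are made up for illustration.

```shell
# -o prints everything the regexp matched; the parentheses do not narrow it.
whole=$(echo 'id=42;' | grep -Po 'id=(\d+)')

# \K drops whatever was matched before it from the reported match.
tail=$(echo 'id=42;' | grep -Po 'id=\K\d+')

printf '%s\n' "$whole" "$tail"
# whole is "id=42", tail is "42"
```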

pcregrep actually added a -o<n> option to print the content of a capture group. So you could do:

pcregrep -o1 -o2 --om-separator=' ' '.zoo.(\d+).*:\s+(.*)'

But here, the obvious standard solution is to use sed:

sed -n 's/^.*\.zoo\.\([0-9]\{1,\}\).*:[[:space:]]\{1,\}/\1 /p'

Or if you want perl regexps, use perl:

perl -lne 'print "$1 $2" if /\.zoo\.(\d+).*:\s+(.*)/'

With GNU grep, if you don't mind the matches to appear on different lines, you can do:

$ grep -Po '\.zoo\.\K\d+|:\s+\K.*' < file
2
0.45654343

Note that while \K resets the start of the matched portion, that doesn't mean you can get away with the two parts of the alternation overlapping.

grep -Po '.zoo.(\K\d+|.*: \K.*)'

would not work, just like echo foobar | grep -Po 'foo|foob' wouldn't work (at printing both foo and foob). foo|foob first matches foo, and then grep looks for other potential matches in the input after the foo, that is, starting at the b of bar, so it can't find any more after that.
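
The foo|foob behaviour can be observed directly. A minimal sketch, assuming GNU grep with `-P`; the variable names are illustrative.

```shell
# Alternatives are tried left to right, so "foo" wins and the next search
# resumes at "bar", where neither alternative can match.
first=$(echo foobar | grep -Po 'foo|foob')

# Swapping the alternatives yields the longer match, but still only one.
longer=$(echo foobar | grep -Po 'foob|foo')

printf '%s\n' "$first" "$longer"
# first is "foo", longer is "foob"
```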

Above with grep -Po '\.zoo\.\K\d+|:\s+\K.*', we only look for :<spaces><anything> in the second part of the alternation. That does match in the part that is after .zoo.<digits> but that also means it would find those :<spaces><anything> anywhere in the input, not only when they follow .zoo.<digits>.
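
The drawback described above shows up on a line that contains no .zoo. at all. A small sketch, assuming GNU grep with `-P`; the input line is made up for illustration.

```shell
# No ".zoo.<digits>" here, yet the second alternative still matches.
stray=$(echo 'unrelated: text' | grep -Po '\.zoo\.\K\d+|:\s+\K.*')

printf '%s\n' "$stray"
# stray is "text"
```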

There is a way to work around that though, using another PCRE special operator: \G. \G matches at the start of the subject. For a single match, that's equivalent to ^, but with multiple matches (think of sed/perl's g flag in s/.../.../g) like with -o where grep tries to find all the matches in the line, that also matches after the end of the previous match. So if you make it:

grep -Po '\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'

Here (?!^) is a negative look-ahead operator that means "not at the beginning of the line", so \G will only match after a previous successful (non-empty) match. As a result, .*:\s+\K.* will only match if it follows a previous successful match, and that can only be the .zoo.<digits> one, since the other part of the alternation matches until the end of the line.
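
A short sketch of the effect, assuming GNU grep with `-P`. The variable names and the "unrelated" input line are illustrative; `|| true` only keeps a no-match exit status from aborting scripts run under `set -e`.

```shell
re='\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'

# Without a preceding ".zoo.<digits>" match, the \G branch never fires.
none=$(echo 'unrelated: text' | grep -Po "$re" || true)

# On the sample line from the question, both values are still found.
both=$(echo 'FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343' |
  grep -Po "$re")

printf '[%s]\n' "$none"
printf '%s\n' "$both"
# none is empty; both holds 2 and 0.45654343 on separate lines
```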

On an input like:

.zoo.1.zoo.2 tar: blah

That would output:

1
2
blah

If you did not want every .zoo.<digits> occurrence to be reported like that, you'd also want the first part of the alternation to only match at the beginning of the line. Something like

grep -Po '^.*?\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'
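
Running the anchored regexp on a line with two .zoo. occurrences illustrates the difference. A sketch, assuming GNU grep with `-P`; the variable name is illustrative.

```shell
# Only the first ".zoo.<digits>" is reported, because the first alternative
# is now tied to the beginning of the line.
anchored=$(echo '.zoo.1.zoo.2 tar: blah' |
  grep -Po '^.*?\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*')

printf '%s\n' "$anchored"
# anchored holds 1 and blah on separate lines
```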

The anchored regexp still outputs 2 on an input like .zoo.2 no colon character or .zoo.2 blah:. You could work around that with a look-ahead operator in the first part of the alternation, looking for at least one non-space after :<spaces> (and also using $ to avoid issues with non-characters):

grep -Po '^.*?\.zoo\.\K\d+(?=.*:\s+\S.*$)|(?!^)\G.*:\s+\K\S.*$'
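
The final regexp can be exercised on the sample line and on the two problem inputs mentioned above. This is a sketch, assuming GNU grep with `-P`; the `extract` function and the variable names are hypothetical helpers, and the `paste` step is just one possible way to put both values back on one line.

```shell
# Hypothetical wrapper around the regexp; "|| true" hides the no-match status.
extract() {
  grep -Po '^.*?\.zoo\.\K\d+(?=.*:\s+\S.*$)|(?!^)\G.*:\s+\K\S.*$' || true
}

full=$(echo 'FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343' | extract)
no_colon=$(echo '.zoo.2 no colon character' | extract)
no_value=$(echo '.zoo.2 blah:' | extract)

# Each matching line now yields exactly two output lines, so they can be paired.
pairs=$(echo 'FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343' |
  extract | paste -d ' ' - -)

printf '%s\n' "$full"
printf '[%s] [%s]\n' "$no_colon" "$no_value"
printf '%s\n' "$pairs"
# full holds 2 and 0.45654343 on separate lines; no_colon and no_value are
# empty; pairs is "2 0.45654343"
```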

You'd probably need a few pages of comments to explain that regexp, so I would still go for the straightforward sed/perl solutions...

Appreciate your answer, did I miss something in my question. Do you mean that I simply can't do what I intended to do with a single `grep`? — Inian, Nov 28 '16 at 09:51
@Inian, You can't easily with a single invocation of the current version of GNU `grep` (the one I suppose you're trying to use as it seems it supports `-P` and `-o` though that could also be the one of FreeBSD/OS/X that are rewrites of GNU grep). You can with other `grep` implementations like `pcregrep`. But I argue you're picking the wrong tool for the task. Use `sed` to edit streams. — Stéphane Chazelas, Nov 28 '16 at 09:54
I am quite easily able to do this only using bash native regex as `[[ "$string" =~ .zoo.([[:digit:]]+).*:\ (.*)$ ]]` and print as `printf "%s\t%s\n" "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]//[[:blank:]]}"` — Inian, Nov 28 '16 at 09:59
Thought `grep` could do this in someway. Anyway am accepting this answer agreeing it can't be done with a single invocation and some useful stuff on `pcregrep` which I haven't used before. — Inian, Nov 28 '16 at 10:00
Actually, the syntax `grep -Po '\.zoo\.\K\d+|: \K.*'` worked fine for me? But is there a way you can tell me to remove the whitespaces in the 2nd capturing group? It is currently printing it with a space in a new line. — Inian, Nov 28 '16 at 10:17
See edit. Replaced one space with `\s+` as I suppose you had more than one space after the `:`. Also added a way to make sure the `:\s+.*` only matches if `.zoo.` has been found beforehand. — Stéphane Chazelas, Nov 28 '16 at 10:42
