pcregrep excluding multiple lines regexp eats one more line than needed

Question

I want to filter out every line that starts with banana, together with the space-indented lines that immediately follow it. I am using pcregrep. Consider the following file, fruits.txt:

apple
banana starts matching
 this line should match
 this too
 and this
mango
pomelo

pcregrep happily finds what I want:

ars@ars-thinkpad ~/tmp/tmp $ pcregrep -M  'banana.*\n(\s.*\n)*' fruits.txt 
banana starts matching
 this line should match
 this too
 and this

However, if I try to exclude these lines, pcregrep eats mango too, which is not good:

ars@ars-thinkpad ~/tmp/tmp $ pcregrep -M -v 'banana.*\n(\s.*\n)*' fruits.txt 
apple
pomelo

Why?

score 1 · Answer 1 · answered Aug 01 '17 at 02:58

Your use of \s in the regex means the expression can consume newlines, because \s matches \n as well as spaces and tabs. I'm not familiar enough with how -v is implemented in pcregrep to say why the result isn't the exact inverse, but I'm fairly sure that's the cause.

If you change your file to be:

apple
banana starts matching
 this line should match
 this too
 and this

mango

pomelo

Then, even without -v, the match is clearly not what you intend:

$ pcregrep  -M 'banana.*\n(\s.*\n)*' fruits.txt
banana starts matching
 this line should match
 this too
 and this

mango

pomelo

If only a space at the beginning of the line should match, I suggest replacing the \s with one or more literal spaces, " +".

When I change the regex to 'banana.*\n( +.*\n)*', it matches (both regular and inverse) in a way that I think is more correct. Use [ \t]+ instead if tabs are allowed as well.

BowlOfRed


You are right that I should use tabs or spaces instead of `\s`, but the general answer is not quite true: my `pcregrep` 8.39 still eats mango with `pcregrep -M -v 'banana.*\n( +.*\n)*'`. Yours doesn't? – ars Aug 01 '17 at 09:22
@ars maybe you have `\r\n` line endings instead of just `\n`... check with the `cat -A` or `file` commands – Sundeep Aug 01 '17 at 11:12
While late to the party, I like `[^\S\r\n]` for "any kind of whitespace that isn't `\r` or `\n`". I was able to formulate a regex with and without the **-v** flag: `pcregrep -Mv 'banana.*(?:\n[^\S\r\n]+.*)*'` – OnlineCop Jul 25 '18 at 20:03
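For the line-ending check suggested in the comments, a small portable sketch for systems where `cat -A` is not available. The helper name has_cr is hypothetical; it reports any carriage return in the file, not only `\r\n` pairs.

```shell
# has_cr FILE: succeed if FILE contains at least one carriage return.
# Strips "\r" with tr and compares the result against the original;
# cmp -s is silent and returns 0 only when both inputs are identical.
has_cr() {
    ! tr -d '\r' < "$1" | cmp -s - "$1"
}

# Example use on a throwaway file with CRLF line endings.
crlf_file=$(mktemp)
printf 'banana\r\n indented\r\nmango\r\n' > "$crlf_file"

if has_cr "$crlf_file"; then
    echo "carriage returns found"
else
    echo "plain \\n line endings"
fi
```

If carriage returns turn up, `tr -d '\r' < fruits.txt > fruits.unix.txt` writes a cleaned copy (the output file name is just an example).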

score 0 · Answer 2 · answered Aug 01 '17 at 04:42

Such tasks are better suited for awk imo

$ awk '!/^ /{f=0} /^banana/{f=1} f' fruits.txt 
banana starts matching
 this line should match
 this too
 and this
$ awk '!/^ /{f=0} /^banana/{f=1} !f' fruits.txt 
apple
mango
pomelo

The order in which the flag is set makes it easy to either print or exclude the lines being searched, because the !/^ / condition is also satisfied by the line starting with banana:
!/^ /{f=0} clears the flag if the line doesn't start with a space
/^banana/{f=1} sets the flag if the line starts with banana
f prints the lines for which the flag is set, while !f prints the lines for which it is not
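The same flag logic, spelled out as a commented sketch. The wrapper function filter_block, its keep argument and the variable name in_block are illustrative names, not part of the original one-liners.

```shell
# Recreate the sample file from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' 'apple' 'banana starts matching' ' this line should match' \
    ' this too' ' and this' 'mango' 'pomelo' > "$sample"

# filter_block KEEP FILE
#   KEEP=1 prints the banana block, KEEP=0 prints everything else.
filter_block() {
    awk -v keep="$1" '
        !/^ /     { in_block = 0 }   # any non-indented line ends the block
        /^banana/ { in_block = 1 }   # a banana line starts it again
        in_block == keep             # print when the flag equals the requested state
    ' "$2"
}

filter_block 1 "$sample"   # banana line plus its indented lines
echo '---'
filter_block 0 "$sample"   # apple, mango, pomelo
```

The rule order matters: the banana line itself does not start with a space, so the first rule clears the flag and the second one sets it again before the print test runs.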

Sundeep


"Better suited" is a very personal opinion, I think (-; `pcregrep` is a specialized tool for such tasks, whereas `awk` is a general tool and needs a script that is longer and more complex than the expression needed for `pcregrep`. – Philippos Aug 01 '17 at 06:32
yeah, hence the `imo`... but I disagree about this being `complex` especially when pcre features like lookarounds are not needed here.. and it is definitely much better suited when it comes to dealing with various related cases https://stackoverflow.com/questions/17908555/printing-with-sed-or-awk-a-line-following-a-matching-pattern – Sundeep Aug 01 '17 at 06:45
Surely not complex, but at least more complex than a simple regex (a state machine and three conditional branches are hidden in that script). And of course: When cases get more complicated, tools like `sed` or `awk` or at some point `python` are better suited. Not yet in this case, IMHO. – Philippos Aug 01 '17 at 07:29
Unfortunately, this is a simplified example and I do need lookaround feature of pcregrep. Thanks anyway. – ars Aug 01 '17 at 10:44
