pcregrep newline lookbehind assertion bug?

Question

I am trying to use pcregrep to print the first line after a blank line. For example, given a file with these contents:

first line

second line

I need "second line" to be printed. Here are a few tests, all using the same regular expression:

With Python 2.7

python -c "import re; print re.search(r'(?<=\n\n).*?$',\
    open('file').read(), re.MULTILINE).group()"
second line

With GNU grep 2.16

grep -oPz  '(?<=\n\n).*?$' file
second line

With pcregrep version 8.12

pcregrep -Mo  '(?<=\n\n).*?$' file
(no output)

Based on a few tests, pcregrep supports lookbehind assertions in general but does not seem to be able to deal with \n within lookbehind assertions in particular. \n within lookahead assertions presents no problem.
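For comparison, here is a minimal Python 3 sketch of the behavior I expect, run against an in-memory string rather than the file. The `sample` variable is an assumed stand-in for the file contents shown above, and the lookahead pattern is only an illustration of `\n` inside a lookahead, not one of the original tests.

```python
import re

# Assumed stand-in for the file above: two lines separated by a blank line.
sample = "first line\n\nsecond line\n"

# Lookbehind containing \n: the match starts right after two consecutive newlines.
lookbehind = re.search(r"(?<=\n\n).*?$", sample, re.MULTILINE)

# Lookahead containing \n: the line that is immediately followed by a blank line.
lookahead = re.search(r"^.*(?=\n\n)", sample, re.MULTILINE)

print(lookbehind.group())  # second line
print(lookahead.group())   # first line
```

Both assertions behave symmetrically here, which is what makes the pcregrep result surprising.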

Tested on RHEL as well as Ubuntu. Any ideas?

Fedora 19's version, `pcregrep version 8.32 2012-11-30` does the same thing. — slm, May 02 '14 at 00:54

slm · Answer 1 · 2014-05-02T01:49:18.000

Apparently you can tell pcregrep which type of newline to look for. The -N switch does this when using PCRE mode.

-N newline-type, --newline=newline-type

The PCRE library supports five different conventions for indicating the ends of lines. They are the single-character sequences CR (carriage return) and LF (linefeed), the two-character sequence CRLF, an "anycrlf" convention, which recognizes any of the preceding three types, and an "any" convention, in which any Unicode line ending sequence is assumed to end a line. The Unicode sequences are the three just mentioned, plus VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS (paragraph separator, U+2029).

When the PCRE library is built, a default line-ending sequence is specified. This is normally the standard sequence for the operating system. Unless otherwise specified by this option, pcregrep uses the library's default. The possible values for this option are CR, LF, CRLF, ANYCRLF, or ANY. This makes it possible to use pcregrep to scan files that have come from other environments without having to modify their line endings. If the data that is being scanned does not agree with the convention set by this option, pcregrep may behave in strange ways. Note that this option does not apply to files specified by the -f, --exclude-from, or --include-from options, which are expected to use the operating system's standard newline sequence.

Example

$ pcregrep -Mo  -N CRLF '(?<=\n\n).*?$' sample.txt 
second line

$

Other odd behavior

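Interestingly, changing the lookbehind to `(?>...)` yields results. Note that `(?>...)` is an atomic group rather than a lookahead (which would be `(?=...)`), so the two newlines are consumed and printed as part of the match: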

$ pcregrep -Mo  '(?>\n\n).*?$' sample.txt 


second line
$

slm, great observation, this seems to work. But funnily enough, `pcregrep` can handle `\n` in look-ahead assertions without the need for `-N CRLF`! Additionally, the newlines in my file are `LF` not `CRLF` which makes the apparent success of this technique all the more puzzling! — iruvar, May 02 '14 at 01:21
@slm also doesn't appear to explain why `pcregrep -Mo '\n\n\K.*$' file` _does_ appear to work (at least on my Ubuntu 12.04 box - pcregrep version 8.12) — steeldriver, May 02 '14 at 01:28
@1_CR - it's interesting that it doesn't appear to work with ANYCRLF or ANY either. — slm, May 02 '14 at 01:39
Source for pcregrep.c is here: http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/pcregrep_8c_source.html — slm, May 02 '14 at 01:42
@steeldriver - the wikipedia page had this to say about the `\K`. _Since version 7.2, \K can be used in a pattern to reset the start of the current whole match. This provides a flexible alternative approach to look-behind assertions because the discarded part of the match (the part that precedes \K) need not be fixed in length._ https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions — slm, May 02 '14 at 01:51
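As a rough illustration of the `\K` idea quoted above: Python's built-in `re` module has no `\K`, so the closest equivalent is to consume the prefix and capture the rest. This is only a sketch of the concept, not of pcregrep's behavior; the `sample` and `gappy` strings are assumed test inputs.

```python
import re

# Assumed stand-in for the file in the question.
sample = "first line\n\nsecond line\n"

# Consume the two newlines, then capture the line that follows.
# Group 1 plays the role of "everything after \K".
fixed = re.search(r"\n\n(.*)$", sample, re.MULTILINE)
print(fixed.group(1))  # second line

# The consumed prefix need not be fixed in length, unlike a lookbehind in `re`.
gappy = "a\n\n\nb\n"
variable = re.search(r"\n{2,}(.*)$", gappy, re.MULTILINE)
print(variable.group(1))  # b
```

The second search shows the flexibility the quote describes: a prefix of two or more newlines could not be written as a lookbehind in `re`, which requires a fixed width.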
