Why does this grep statement do the opposite of what I expect?

Question

I have a file with some HTML and some text mixed in; I just wanted to use the text lines.

I was fooling around with grep, trying to find a way to exclude the lines that begin with an HTML tag, including lines that have whitespace before the tag.

Somehow this works for me, but I don't think it should:

grep '^\<' file.htm

It just shows me the lines without HTML. Can you explain why? I thought I would need grep -v and some .* somewhere to make this work.

Remember when fooling around that in the general case, when usage gets serious, arbitrary HTML cannot be parsed with regular expressions. https://stackoverflow.com/a/1758162/340790 — JdeBP, Jun 02 '20 at 06:42

score 15 · Answer 1 · edited Jun 11 '20 at 14:16

From the GNU grep manual:

\<
Match the empty string at the beginning of word.

\>
Match the empty string at the end of word.

This is also relevant [emphasis mine]:

-w
--word-regexp
Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word constituent characters are letters, digits, and the underscore. […]

Because the -w option can match a substring that does not begin and end with word constituents, it differs from surrounding a regular expression with \< and \>. For example, although grep -w @ matches a line containing only @, grep '\<@\>' cannot match any line because @ is not a word constituent. […]
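The difference the manual describes can be checked directly. This is a minimal sketch, assuming GNU grep (\< and \> are GNU extensions); the variable names are only for illustration.

```shell
# Feed a line that contains only "@" to both forms and count matching lines.
# -w accepts it: the match sits at both the start and the end of the line.
w_count=$(printf '@\n' | grep -c -w '@')

# \<@\> cannot match: \< needs a word character right after it, and @ is not one.
# grep exits with status 1 when nothing matches, hence the "|| true".
b_count=$(printf '@\n' | grep -c '\<@\>' || true)

printf 'grep -w matched %s line(s)\n' "$w_count"
printf 'word-boundary pattern matched %s line(s)\n' "$b_count"
```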

And for completeness:

The caret ^ and the dollar sign $ are meta-characters that respectively match the empty string at the beginning and end of a line. They are termed anchors, since they force the match to be “anchored” to beginning or end of a line, respectively.

The pattern you used (^\<) matches the beginning of a line only when it is immediately followed by a word constituent character. Neither the < character nor whitespace is a word constituent, so a line that starts with a tag does not match.

Note that whitespace at the beginning of a line prevents a match, regardless of whether a tag or something else comes right after it. Some characters that are valid at the start of a text line will not trigger a match either (e.g. an opening parenthesis).
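To see these cases side by side, here is a small sketch, assuming GNU grep and a made-up five-line sample file (the file contents and variable names are hypothetical). It also shows one possible way to express the original goal with grep -v; it is not the only way, and it is no substitute for a real HTML parser.

```shell
# Build a throwaway sample file with tag lines and text lines.
sample=$(mktemp)
printf '%s\n' \
  '<p>tag line</p>' \
  '  <div>indented tag</div>' \
  'plain text' \
  '  indented text' \
  '(parenthesised text)' > "$sample"

# The pattern from the question: start of line immediately followed by a word character.
# Only lines starting with a letter, digit, or underscore are selected.
matched=$(grep '^\<' "$sample")

# One way to state the original intent: drop lines whose first
# non-whitespace character is "<", and keep everything else.
wanted=$(grep -v '^[[:space:]]*<' "$sample")

printf 'Pattern from the question selects:\n%s\n\n' "$matched"
printf 'grep -v with optional leading whitespace keeps:\n%s\n' "$wanted"

rm -f "$sample"
```

With this sample, the question's pattern skips the indented text line and the line starting with a parenthesis, while the grep -v form keeps both.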

@pLumo, don't even need the brackets, just `grep '<'` is enough as `<` isn't special in itself. Aside: Perl is nicer here, in that backslashing a non-letter always makes it match literally, and all the magic matches are backslash-letter. — ilkkachu, Jun 02 '20 at 06:05
@ilkkachu The Perl rule is much better in this regard, but you've omitted half of it: Any escaped special char is literal, and any un-escaped alphanumeric is literal. Don't forget that `+` and `(` are non-literal in Perl. — jpaugh, Jun 02 '20 at 18:35
