
So, it took me ages, but I finally learned to think in terms of regular expressions, thanks to using them in kwrite.

But I still don't know how to translate that knowledge to grep. I love my grep, when I know what I'm doing with it, but the manual has always given me a headache.

I'd like to match stuff like the following lines:

CAPITALSFOLLOWING anewline.
CAPI
TALSFOLL owing
ANEW line.

That is, lines that begin with two or more capital letters. But I can't figure out how.

In kwrite, I would match these lines using:

\n[A-Z][A-Z]+

But grep... hmm. I have a feeling it's something like:

me@ROOROO:~/$ grep "^[A-Z]something" filename

but

me@ROOROO:~/$ grep "^[A-Z][A-Z]+" filename

doesn't work (it returns no matches). A Google search for the term 'grep match one or more occurrence' led me to believe that

me@ROOROO:~/$ grep "^[A-Z][A-Z]*" filename

was the right syntax. But, alas, that doesn't do the trick.

ixtmixilix
  • In the old days, each tool had its own regexp syntax. By default, `grep` uses its traditional syntax; use `grep -E` to have a more habitual syntax where a backslash followed by a non-alphanumeric character is never special. – Gilles 'SO- stop being evil' Feb 10 '12 at 23:47

3 Answers


You're using the right syntax in your first attempt (the one with +); the problem is that + is only considered special when using "extended" regular expressions. From the man page of the GNU implementation of grep:

Basic vs Extended Regular Expressions

In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

(\?, \+, and \| are non-standard GNU extensions though).

So, you either need to escape the + (assuming GNU grep or compatible):

$ grep "^[A-Z][A-Z]\+" filename

Or use the standard \{1,\} equivalent of GNU's \+:

$ grep '^[A-Z][A-Z]\{1,\}' filename

or, more simply in this case:

$ grep '^[A-Z]\{2,\}' filename

Or turn on extended regular expressions, by passing grep the -E flag or just running egrep (egrep is the command that introduced those extended regular expressions in the late 70s):

$ grep -E "^[A-Z][A-Z]+" filename
$ egrep "^[A-Z][A-Z]+" filename
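
To see why the unescaped + found nothing, here is a minimal sketch that counts matching lines for each variant. The sample file is made up for illustration (its first four lines are the ones from the question), and LC_ALL=C is my own assumption, added so that [A-Z] means exactly the ASCII capitals regardless of locale.

```shell
# Hypothetical sample file; the first four lines come from the question.
tmp=$(mktemp)
printf '%s\n' \
  'CAPITALSFOLLOWING anewline.' \
  'CAPI' \
  'TALSFOLL owing' \
  'ANEW line.' \
  'A single capital' \
  'no capitals here' > "$tmp"

# Assumption: LC_ALL=C, so that [A-Z] covers only ASCII A-Z.
# BRE: a bare + is a literal plus sign, so nothing in this file matches.
bre_plus_count=$(LC_ALL=C grep -c '^[A-Z][A-Z]+' "$tmp")
# BRE: the standard interval form for "one or more".
bre_interval_count=$(LC_ALL=C grep -c '^[A-Z][A-Z]\{1,\}' "$tmp")
# ERE: + means "one or more".
ere_plus_count=$(LC_ALL=C grep -cE '^[A-Z][A-Z]+' "$tmp")

echo "BRE, bare +:     $bre_plus_count"
echo "BRE, \\{1,\\}:     $bre_interval_count"
echo "ERE, +:          $ere_plus_count"

rm -f "$tmp"
```

With this file, the bare + in basic mode should count no lines, while the other two forms should both count the four lines that start with at least two capitals.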

In any case, since grep selects a line as soon as any part of it matches, all of those select the same lines as:

$ grep '^[A-Z][A-Z]' filename

So you don't even need the + operator.
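
A quick way to check that claim is sketched below, again with a made-up sample file and LC_ALL=C as my own assumption. It also uses -o (print only the matched text), which is not in POSIX but is available in GNU and BSD grep, to show where the two patterns do differ: the text they match, not the lines they select.

```shell
# Hypothetical sample file for illustration.
tmp=$(mktemp)
printf '%s\n' \
  'CAPITALSFOLLOWING anewline.' \
  'CAPI' \
  'A single capital' \
  'lowercase' > "$tmp"

# Same lines selected, with or without the + operator.
short=$(LC_ALL=C grep '^[A-Z][A-Z]' "$tmp")
long=$(LC_ALL=C grep -E '^[A-Z][A-Z]+' "$tmp")
if [ "$short" = "$long" ]; then same_lines=yes; else same_lines=no; fi

# The matched text differs; -o is a non-POSIX option (assumed available).
short_match=$(LC_ALL=C grep -o '^[A-Z][A-Z]' "$tmp" | head -n 1)
long_match=$(LC_ALL=C grep -oE '^[A-Z][A-Z]+' "$tmp" | head -n 1)

echo "same lines selected: $same_lines"
echo "match without +:     $short_match"
echo "match with +:        $long_match"

rm -f "$tmp"
```

So the + only matters if you care about the extent of the match, for example with -o or --color.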

In your other example you tried:

$ grep "^[A-Z][A-Z]*" filename

* works in basic regular expressions, but it matches 0 or more times, not 1 or more, so that pattern also selects lines that start with a single capital. The solution in your answer works because it says "match a capital, then another capital, then 0 or more capitals". The method in the question says "match a capital, then 1 or more capitals", which is the same. You can also use {min,max} to say exactly how many repetitions you want; if you leave out max, there is no upper limit (the unescaped braces also require extended regular expressions):

$ egrep "^[A-Z]{2,}" filename

(As a historical note, egrep didn't support {min,max} initially, and still doesn't in Solaris 11's /bin/egrep, for instance. \{min,max\} support was added to grep before {min,max} was added to egrep, which in the case of egrep did break backward compatibility.)
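
The difference between "0 or more" and "2 or more" is easy to see on a tiny file. This is only a sketch: the three sample lines are invented, and LC_ALL=C is my own assumption to keep [A-Z] limited to ASCII capitals.

```shell
# Hypothetical three-line sample.
tmp=$(mktemp)
printf '%s\n' \
  'AB two capitals' \
  'A one capital' \
  'no capital' > "$tmp"

# One capital, then zero or more: also selects 'A one capital'.
star_one=$(LC_ALL=C grep -c '^[A-Z][A-Z]*' "$tmp")
# Two capitals, then zero or more: only 'AB two capitals'.
star_two=$(LC_ALL=C grep -c '^[A-Z][A-Z][A-Z]*' "$tmp")
# ERE interval: at least two capitals.
interval=$(LC_ALL=C grep -cE '^[A-Z]{2,}' "$tmp")

echo "^[A-Z][A-Z]*      selects $star_one line(s)"
echo "^[A-Z][A-Z][A-Z]* selects $star_two line(s)"
echo "^[A-Z]{2,}        selects $interval line(s)"

rm -f "$tmp"
```

The first pattern should select two lines, which is why it "doesn't do the trick" in the question; the other two should select only the line starting with AB.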

Michael Mrozek

You just need to add an extra [A-Z]. So, it's

me@ROOROO:~/$ grep "^[A-Z][A-Z][A-Z]*" filename
ixtmixilix

Looks like you need regexp support from Perl. From man grep:

   -P, --perl-regexp
          Interpret  PATTERN  as  a Perl regular expression.  This is highly experimental
          and grep -P may warn of unimplemented features.

So grep -P "^[A-Z][A-Z]+" could be more helpful.
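
For what it's worth, here is a hedged sketch comparing -P with -E on this pattern. -P is a GNU grep option and is only present when grep was built with PCRE support, so the sketch probes for it first. The sample lines are invented, and LC_ALL=C is my own assumption.

```shell
# Hypothetical sample input.
sample=$(printf '%s\n' 'CAPI' 'A single capital' 'lowercase')

# Extended regular expressions: + means "one or more".
ere=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[A-Z][A-Z]+')

# Probe for -P support before relying on it (GNU grep with PCRE only).
if printf 'AB\n' | LC_ALL=C grep -qP '^[A-Z]{2}' 2>/dev/null; then
  pcre_status=supported
  pcre=$(printf '%s\n' "$sample" | LC_ALL=C grep -P '^[A-Z][A-Z]+')
else
  pcre_status=unsupported
  pcre=''
fi

echo "grep -E selected: $ere"
echo "grep -P is $pcre_status; selected: $pcre"
```

Where -P is available, both modes should select the same line for this pattern, since + means "one or more" in both syntaxes.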