
I have a mkdocs instance and am writing a script to print internal links in a page. I cannot get grep to print only the matches if there are multiple per line.

This is what I currently have:

$ grep -Eon '\[([[:alpha:]]|[[:digit:]]|[[:space:]])*\]\((\/|\.).*\)' /path/to/file.md
10:[foo](../../relative_path/foobar.md) is the path to another file, also see [bar](/absolute/path/foobar.md)

I would like the output to look like this:

10:[foo](../../relative_path/foobar.md)
10:[bar](/absolute/path/foobar.md)

Is there a way to do this in grep or even another command like awk or sed?

Jake

4 Answers

6
grep -Pno "\[[[:alnum:]]*\]\(.*?\)" /path/to/file.md

Or, even better (this also matches link text with several words, such as [foo anotherword]):

grep -Pno "\[([[:alnum:]]*[[:space:]]*)*?\]\(.*?\)"

-P selects Perl-compatible regular expressions, which support non-greedy (lazy) matching by appending ? to a quantifier.

Or, if the link text should not be limited to alphanumerics and spaces but may contain any character:

 grep -Pno "\[.*?\]\(.*?\)"
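Since -P is a GNU grep extension and the question also asks about awk, here is a minimal awk sketch of the same idea. It is not the command from this answer: awk's regexes have no lazy quantifiers, so it swaps in negated bracket expressions ([^]]* and [^)]*) for .*?. The sample file, its contents, and the print_links helper name are assumptions made for illustration.

```shell
# Hypothetical sample file; the contents are for illustration only.
sample=$(mktemp)
printf '%s\n' 'intro line' \
  '[foo](../../relative_path/foobar.md) is the path to another file, also see [bar](/absolute/path/foobar.md)' \
  > "$sample"

# print_links FILE: print every [text](target) as LINE_NUMBER:MATCH,
# one match per output line (similar in spirit to grep -on).
print_links() {
  awk '{
    line = $0
    # match() sets RSTART and RLENGTH for the leftmost match;
    # keep searching in whatever follows the previous match.
    while (match(line, /\[[^]]*\]\([^)]*\)/)) {
      print NR ":" substr(line, RSTART, RLENGTH)
      line = substr(line, RSTART + RLENGTH)
    }
  }' "$1"
}

links=$(print_links "$sample")
printf '%s\n' "$links"
rm -f "$sample"
```

With the sample above, both links on the second line should be printed separately, each prefixed with 2:.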
4
\[([[:alpha:]]|[[:digit:]]|[[:space:]])*\]

would match [foo], that is OK. The mistake is that after it comes:

\((\/|\.).*\)

Be careful whenever you include .* in a regex, because it is greedy: it matches as much as it possibly can. Here it matches (../../relative_path/foobar.md) is the path to another file, also see [bar](/absolute/path/foobar.md), that is, everything up to the last closing parenthesis on the line. Put together with the first part, the whole line is matched.

You should go for

grep -Eon '\[([[:alnum:]]|[[:space:]])*\]\((\.|\/)[^)]*\)'

The key was to replace .* with [^)]*, which forces the match to stop at the first closing parenthesis it meets. Also, I've applied this change:

  • [[:alpha:]]|[[:digit:]] can be collapsed into [[:alnum:]]

Output:

1:[foo](../../relative_path/foobar.md)
1:[bar](/absolute/path/foobar.md)

(I have 1: instead of 10: because it is the first line in my file.)
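The difference between the two patterns can be reproduced with a small sketch like the one below. The temporary file and the variable names are assumptions for illustration, and the patterns are slightly shortened versions (using [./] and a single bracket expression) rather than the exact commands above.

```shell
# Hypothetical one-line sample mirroring the line from the question.
sample=$(mktemp)
printf '%s\n' \
  '[foo](../../relative_path/foobar.md) is the path to another file, also see [bar](/absolute/path/foobar.md)' \
  > "$sample"

# Greedy .* runs on to the last ")" of the line: a single match covering everything.
greedy=$(grep -Eo '\[[[:alnum:][:space:]]*\]\([./].*\)' "$sample")

# [^)]* cannot cross a ")", so each link is matched on its own.
bounded=$(grep -Eo '\[[[:alnum:][:space:]]*\]\([./][^)]*\)' "$sample")

printf 'greedy:\n%s\n\n' "$greedy"
printf 'bounded:\n%s\n' "$bounded"
rm -f "$sample"
```

The greedy variable should hold the whole line as one match, while bounded should hold the two links on separate lines.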

Quasímodo
3
grep -on '\[[^]]*\]([^)]*)'

May just be enough in your case. Do you really need to restrict what characters may occur within [...] and (...)?

If you want to require the part inside [...] to only be made of alnums or whitespace and the part inside (...) to start with either a / or a ., that would simply be:

grep -on '\[[[:alnum:][:space:]]*\]([./][^)]*)'

In any case, note the [^)]*) instead of .*): a greedy .* would swallow the first closing ) and everything up to the right-most ) on the line.

There is no need for -E's | alternation operator here. To match a single character, you can use a [set] bracket expression, where the set can include several characters or character classes (here [:alnum:], which is short for [:alpha:][:digit:], together with [:space:]).
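As a quick check of that claim, the sketch below runs an alternation-based ERE and the bracket-expression BRE from this answer over the same input and compares the results. The input line and variable names are made up for illustration, and the third link is there only to show that targets not starting with . or / are skipped.

```shell
# Hypothetical input: two internal links and one external link.
line='[foo 1](./a.md) and [bar](/b.md) and [x](http://example.com)'

# ERE with | alternation, in the style of the question.
with_alternation=$(printf '%s\n' "$line" |
  grep -Eo '\[([[:alnum:]]|[[:space:]])*\]\((\.|/)[^)]*\)')

# BRE with one bracket expression per character position, as in this answer.
with_bracket=$(printf '%s\n' "$line" |
  grep -o '\[[[:alnum:][:space:]]*\]([./][^)]*)')

printf '%s\n' "$with_bracket"

if [ "$with_alternation" = "$with_bracket" ]; then
  echo "both patterns select the same links"
fi
```

Note that without -E, ( and ) are ordinary characters, which is why the second pattern does not escape them.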

Stéphane Chazelas
2

You need a non-greedy match.

I added ? after .* in the second part of the pattern, giving \((\/|\.).*?\):

grep -Pon '\[([[:alpha:]]|[[:digit:]]|[[:space:]])*\]\((\/|\.).*?\)' /path/to/file.md

10:[foo](../../relative_path/foobar.md)
10:[bar](/absolute/path/foobar.md)
  • -P enables non-greedy matching; with it, the regex is interpreted using Perl syntax.
binarysta