How can I extract the text between two strings in a log file?

Question

I think my question is very similar to this one, but that one was closed as unclear, so I'm asking a new question.

I've got a log file that contains one-line entries with multiple details.

For example:

Mon Jan 22 12:12:12 2012 foo=blah   foo2=blah2  foo3=Some longer sentence that can contain spaces and numbers   somethingelse=blarg   foo5=abcdefg
Mon Jan 22 12:13:12 2012 foo=blah   foo2=blah3  foo3=another long sentence that could be the same or different that the prior log entry   somethingelse=blarg   foo5=112345abcdefg
Mon Jan 22 12:14:12 2012 foo=blah   foo2=blah2  foo3=Foo923847923874Some longer sentence that can contain spaces and numbers   somethingelse=blarg   foo5=abcdefg
Mon Jan 22 12:15:12 2012 foo=blah   foo2=blah2  foo3=Fooo02394802398402384Some longer sentence that can contain spaces and numbers   somethingelse=blarg   foo5=abcdefg

I want to extract just the value of foo3. In other words, I want to see everything right after foo3= and right before somethingelse=.

I was thinking I could do something like grep -oP 'foo3=[\s\S]*somethingelse=', but the regex is too greedy and eventually results in an "Aborted (core dumped)" error. Is there a more efficient way of doing this?

Additional notes:

This log file is large and has 40,000+ lines in it.

Costas · Accepted Answer · 2015-03-28T00:04:18.853

4

If there is only one foo3 per line:

sed -n '/foo3=/{s/.*foo3=//;s/\S*=.*//;p}' file.txt

The -n option suppresses automatic printing, so only lines explicitly printed by p are output. For lines that contain foo3=:

Replace everything up to and including foo3= (.*foo3=) with nothing (//).
Remove everything from a run of zero or more (*) non-space characters (\S) followed by = through the end of the line.
Print what remains after the two substitutions (p).
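To see the two substitutions at work, here is a minimal sketch that feeds one sample line from the question through the same command. It assumes GNU sed (for \S). The function name extract_foo3 and the extra substitution that trims trailing whitespace are additions for illustration only, not part of the command above.

```shell
#!/bin/sh
# Hypothetical helper wrapping the command above, plus a trailing-whitespace trim.
extract_foo3() {
    sed -n '/foo3=/{
        # drop everything up to and including foo3=
        s/.*foo3=//
        # drop the next key= marker and everything after it
        s/\S*=.*//
        # added for illustration: trim the whitespace left before that marker
        s/[[:space:]]*$//
        p
    }'
}

# First sample entry from the question.
line='Mon Jan 22 12:12:12 2012 foo=blah   foo2=blah2  foo3=Some longer sentence that can contain spaces and numbers   somethingelse=blarg   foo5=abcdefg'
result=$(printf '%s\n' "$line" | extract_foo3)
printf '%s\n' "$result"
```

Without the trim step, the value keeps the spaces that separated it from the next key, which may or may not matter for your use.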

Alternatively:

sed -n 's/.*foo3=\([^=]*\)\s\+\S*=.*/\1/p' file.txt

Replace the whole line with the captured group (\1), that is, the text matched inside \(...\): any characters other than = that follow foo3= and precede one or more (\+) whitespace characters (\s) and then a run of non-space characters ending in =. The p flag prints only the lines where this substitution was actually made.
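As a rough check of the capture-group form, the sketch below runs it over a small sample. It assumes GNU sed (for \s, \S and \+). The sample is made up for illustration: two entries abbreviated from the question plus one hypothetical line without foo3=, and the second sed that trims trailing whitespace is likewise an addition.

```shell
#!/bin/sh
# Abbreviated sample entries; the middle line is hypothetical and has no foo3=.
sample='Mon Jan 22 12:13:12 2012 foo=blah   foo2=blah3  foo3=another long sentence   somethingelse=blarg   foo5=112345abcdefg
Mon Jan 22 12:13:30 2012 foo=blah   foo2=blah3
Mon Jan 22 12:14:12 2012 foo=blah   foo2=blah2  foo3=Foo923847923874Some longer sentence   somethingelse=blarg   foo5=abcdefg'

# The first sed is the command above; the second only trims trailing whitespace.
captured=$(printf '%s\n' "$sample" |
    sed -n 's/.*foo3=\([^=]*\)\s\+\S*=.*/\1/p' |
    sed 's/[[:space:]]*$//')
printf '%s\n' "$captured"
```

The line without foo3= produces no output, because the substitution never matches there and -n suppresses the default print.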

edited Mar 28 '15 at 00:04

answered Mar 27 '15 at 23:26

Costas


Bingo. That works. – Mike B Mar 27 '15 at 23:29
If I may be so bold, could you consider explaining the logic of the command (or pointing me to a reference for `sed` expression scripts)? I want to understand what this is doing so I can avoid asking similar questions in the future. – Mike B Mar 27 '15 at 23:41
@mikeserv Nice look. But `/.*foo3=/{s///;s/[^ ]*=.*//;p}` – Costas Mar 28 '15 at 00:10

mikeserv · Answer 2 · 2015-03-27T23:14:10.683

2

sed '/^foo3=/P;/\n/!s/[^ ]\{1,\}=/\n&/g;D' <infile >outfile

You may have to use a literal newline in place of the \n above, but this will print only the foo3= field, i.e. everything from foo3= up to the next key= marker.

For faster processing, get more explicit about it:

sed '/\n/s/ [^ ]*=.*//p;/\n/!s/foo3=/\n\n&/;D' | grep .

Or with an extra grep the top can be much faster as well:

sed 's/[^ ]\{1,\}=/\n&/g' | grep '^foo3='
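For illustration, here is a minimal sketch of that last pipeline applied to one sample line from the question. It assumes GNU sed, where \n in the replacement inserts a newline. The variable names and the final step that strips the foo3= prefix and trailing whitespace are additions, since the pipeline itself keeps the key in its output.

```shell
#!/bin/sh
# First sample entry from the question.
line='Mon Jan 22 12:12:12 2012 foo=blah   foo2=blah2  foo3=Some longer sentence that can contain spaces and numbers   somethingelse=blarg   foo5=abcdefg'

# Put every key=value pair on its own line, then keep only the foo3 line.
field=$(printf '%s\n' "$line" |
    sed 's/[^ ]\{1,\}=/\n&/g' |
    grep '^foo3=')

# Added for illustration: strip the key and any trailing whitespace.
value=$(printf '%s\n' "$field" | sed 's/^foo3=//; s/[[:space:]]*$//')
printf '%s\n' "$value"
```

Note that this splits before every run of non-space characters ending in =, so a value that itself contains = would be cut at that point.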

edited Mar 27 '15 at 23:14

answered Mar 27 '15 at 22:27

mikeserv


That looks promising but the names of the string markers are quite different from one another so I don't think the `[34]` would work. I'll revise my example above to make that more clear. – Mike B Mar 27 '15 at 22:42
@MikeB - ok, dropped the check for foo4 entirely - it will just sandwich the first occurrence of `foo3=` plus any/all following characters which are not a space between two newlines, the `D`elete up to the first occurring newline, and, when the cycle renews, `P`rint up to the first occurring newline *(if there is one at all)*. nevermind - foo4 can have a space. – mikeserv Mar 27 '15 at 22:48
@MikeB - ok, it no longer cares about spaces except that some sequence of not-spaces is followed by an equals sign. Now it tacks a newline on before every occurrence of some sequence of not-spaces followed by an = sign, then does the `D`elete as before. It only `P`rints when `foo3=` is at the head of the line - so it will have to do a couple of `D`eletes before getting there, but it won't do the substitution if there's a newline already on the line. – mikeserv Mar 27 '15 at 22:56
I'm getting an error: `sed: -e expression #1, char 34: Unmatched \{` – Mike B Mar 27 '15 at 22:57
@MikeB - oops - it's because I didn't match it. Now I have done. If you're ever unfortunate enough to get as familiar with `sed` as I have, you'll find you can just tell automatically how it will work without testing, but that also means you might overlook the very simple stuff *(like a missing backslash)* when writing it out. – mikeserv Mar 27 '15 at 22:58
Hmm... I suspect that will work. sed is now churning away at 25% CPU (and has been for the past 10 minutes). I'll let it do its thing and report back when it finishes. – Mike B Mar 27 '15 at 23:11
Ooh. That's no good. I can make it *much* faster. Try the bottom one. Add a `100q` or something to its head when testing so you don't get mired just in case. I don't know why the other one would get caught? 10 mins doesn't seem right - not for 40k lines. – mikeserv Mar 27 '15 at 23:12
I'm marking @Costas's answer as the "accepted solution" but want to thank you again for your time, efforts, and especially patience. It was greatly appreciated. – Mike B Mar 27 '15 at 23:33
@MikeB - that's completely fine with me. It's a good answer - though if you're using anything other than a GNU `sed` you'll want to do some syntax edits in order for it to work. – mikeserv Mar 27 '15 at 23:34
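As a sketch of what such syntax edits might look like for the accepted answer's first command: \S is a GNU extension, so it is swapped here for the POSIX bracket expression [^[:space:]], and the closing } sits on its own line (on a single line, a ; would be needed before it). The function name and the trailing-whitespace trim are illustrative additions, and this has not been checked against every sed implementation.

```shell
#!/bin/sh
# Hypothetical portable variant using only POSIX character classes.
extract_foo3_portable() {
    sed -n '/foo3=/{
        s/.*foo3=//
        s/[^[:space:]]*=.*//
        s/[[:space:]]*$//
        p
    }'
}

# Fourth sample entry from the question.
line='Mon Jan 22 12:15:12 2012 foo=blah   foo2=blah2  foo3=Fooo02394802398402384Some longer sentence that can contain spaces and numbers   somethingelse=blarg   foo5=abcdefg'
portable=$(printf '%s\n' "$line" | extract_foo3_portable)
printf '%s\n' "$portable"
```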

score 1 · Answer 3 · answered Mar 27 '15 at 22:20


Try this:

$ grep -Po "(?<=foo3\=).*(?=\s*foo4)" file.txt

answered Mar 27 '15 at 22:20

heemayl


That captures the type of text (as well as leaving out the marker names - which is great) but still eventually results in `Aborted (core dumped)`. Thanks anyway though. – Mike B Mar 27 '15 at 22:23
@MikeB: How about `grep -Po "(?<=foo3\=).*?(?=\s*foo4)" file.txt` ? – heemayl Mar 27 '15 at 22:28
Same error unfortunately. – Mike B Mar 27 '15 at 22:43
@MikeB Try `grep -oP "foo3=\K[^=]+(?=\b\w+=)" file.txt` – Costas Mar 27 '15 at 23:13
@Costas Immediately went to error for that one. – Mike B Mar 27 '15 at 23:40
@MikeB Very strange. What kind of error? Try `grep -oP 'foo3=\K[^=]+(?=\s+\S+=)' file.txt` It does not differ from the sed regexp above. What version of `grep` do you use? Check with `grep -V` – Costas Mar 27 '15 at 23:45
@Costas `Aborted (core dumped)`. – Mike B Mar 27 '15 at 23:48
