3

I think I have a very similar question to this one but I see it was closed due to being unclear so I'll create a new question.

I've got a log file that contains one-line entries with multiple details.

For example:

Mon Jan 22 12:12:12 2012 foo=blah   foo2=blah2  foo3=Some longer sentence that can contain spaces and numbers   somethingelse=blarg   foo5=abcdefg
Mon Jan 22 12:13:12 2012 foo=blah   foo2=blah3  foo3=another long sentence that could be the same or different that the prior log entry   somethingelse=blarg   foo5=112345abcdefg
Mon Jan 22 12:14:12 2012 foo=blah   foo2=blah2  foo3=Foo923847923874Some longer sentence that can contain spaces and numbers   somethingelse=blarg   foo5=abcdefg
Mon Jan 22 12:15:12 2012 foo=blah   foo2=blah2  foo3=Fooo02394802398402384Some longer sentence that can contain spaces and numbers   somethingelse=blarg   foo5=abcdefg

I want to extract just the value of foo3. In other words, I want to see everything right after foo3= but right before somethingelse=.

I was thinking I could do something like grep -oP 'foo3=[\s\S]*somethingelse=' but the regex is too greedy and eventually results in an "Aborted (core dumped)" error. Is there a more efficient way of doing this?

Additional notes:

  • This log file is large and has 40,000+ lines in it.
Mike B

3 Answers

4

If there is only one foo3 per line:

sed -n '/foo3=/{s/.*foo3=//;s/\S*=.*//;p}' file.txt

The -n option suppresses automatic printing, so only lines explicitly printed by p are output. For lines that contain foo3=:

  1. Replace everything up to and including foo3= (.*foo3=) with nothing (//).
  2. Remove everything from the next run of zero or more (*) non-space characters (\S) followed by = through the end of the line.
  3. Print what remains after the two substitutions (p).
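To see what each substitution contributes, the steps above can be run one at a time on a single line. This is a minimal sketch assuming GNU sed (for \S); the sample line is shortened from the question and the variable names are only illustrative.

```shell
# Sample line modeled on the log format in the question (shortened).
line='Mon Jan 22 12:12:12 2012 foo=blah   foo2=blah2  foo3=Some longer sentence   somethingelse=blarg   foo5=abcdefg'

# Step 1 only: drop everything up to and including "foo3=".
step1=$(printf '%s\n' "$line" | sed -n '/foo3=/{s/.*foo3=//;p}')

# Steps 1 and 2: also drop the next "name=" token and everything after it.
step2=$(printf '%s\n' "$line" | sed -n '/foo3=/{s/.*foo3=//;s/\S*=.*//;p}')

# Brackets make any leading or trailing whitespace visible.
printf '[%s]\n' "$step1" "$step2"
```

The result keeps the spaces that preceded somethingelse=. If those are unwanted, changing the second substitution to s/\s*\S*=.*// should drop them as well (again assuming GNU sed).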

Another option:

sed -n 's/.*foo3=\([^=]*\)\s\+\S*=.*/\1/p' file.txt

This replaces the whole line with the captured group (\1), the part in parentheses (\(...\)): the characters other than = that follow foo3= and precede one or more (\+) whitespace characters (\s) and then a run of non-space characters ending in =. The p flag prints only the lines where the substitution was made.
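As a quick check of the capture-group version, here is a small sketch that runs it on one line and on a line without foo3=. It assumes GNU sed (for \s, \S and \+); the sample lines and variable names are made up for illustration.

```shell
# One line that contains foo3= and one that does not.
hit='Mon Jan 22 12:12:12 2012 foo=blah   foo2=blah2  foo3=Some longer sentence   somethingelse=blarg   foo5=abcdefg'
miss='Mon Jan 22 12:12:13 2012 foo=blah   foo2=blah2'

# The substitution keeps only the captured group; lines where it
# does not apply are not printed because of -n and the p flag.
value=$(printf '%s\n' "$hit" "$miss" | sed -n 's/.*foo3=\([^=]*\)\s\+\S*=.*/\1/p')

# Brackets make any trailing whitespace in the capture visible.
printf '[%s]\n' "$value"
```

Because the group is [^=]*, this form assumes the foo3 value itself never contains an = sign.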

Costas
  • Bingo. That works. – Mike B Mar 27 '15 at 23:29
  • If I may be so bold, could you consider explaining the logic of the command (or pointing me to a reference for `sed` expression scripts)? I want to understand what this is doing so I can avoid asking similar questions in the future. – Mike B Mar 27 '15 at 23:41
  • @mikeserv Nice look. But `/.*foo3=/{s///;s/[^ ]*=.*//;p}` – Costas Mar 28 '15 at 00:10
2
sed '/^foo3=/P;/\n/!s/[^ ]\{1,\}=/\n&/g;D' <infile >outfile

You may have to use a literal newline in place of the \n above, but this will print only the contents between foo3 and foo4.

For faster processing, get more explicit about it:

sed '/\n/s/ [^ ]*=.*//p;/\n/!s/foo3=/\n\n&/;D' | grep .

Or with an extra grep the top can be much faster as well:

sed 's/[^ ]\{1,\}=/\n&/g' | grep '^foo3='
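To illustrate what the split-then-filter pipeline produces, here is a sketch on a single line. It assumes GNU sed (which accepts \n in the replacement); the sample line is shortened from the question and the variable names are illustrative.

```shell
# Sample line modeled on the log format in the question (shortened).
line='Mon Jan 22 12:12:12 2012 foo=blah   foo2=blah2  foo3=Some longer sentence   somethingelse=blarg   foo5=abcdefg'

# Put every "name=" token at the start of its own line,
# then keep only the line that starts with foo3=.
out=$(printf '%s\n' "$line" | sed 's/[^ ]\{1,\}=/\n&/g' | grep '^foo3=')

# Brackets make the trailing whitespace visible.
printf '[%s]\n' "$out"
```

Note that the key itself stays attached to the value here; a further sed 's/^foo3=//' would be one way to strip it.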
mikeserv
  • That looks promising but the names of the string markers are quite different from one another so I don't think the `[34]` would work. I'll revise my example above to make that more clear. – Mike B Mar 27 '15 at 22:42
  • @MikeB - ok, dropped the check for foo4 entirely - it will just sandwich the first occurrence of `foo3=` plus any/all following characters which are not a space between two newlines, the `D`elete up to the first occurring newline, and, when the cycle renews, `P`rint up to the first occurring newline *(if there is one at all)*. nevermind - foo4 can have a space. – mikeserv Mar 27 '15 at 22:48
  • @MikeB - ok, it no longer cares about spaces except that some sequence of not-spaces is followed by an equals sign. Now it tacks a newline on before every occurrence of some sequence of not spaces followed by an = sign, then does the `D`elete as before. It only `P`rints when `foo3=` is at the head of the line - so it will have to do a couple of `D`eletes before getting there, but it won't do the substitution if there's a newline already on the line. – mikeserv Mar 27 '15 at 22:56
  • I'm getting an error: `sed: -e expression #1, char 34: Unmatched \{` – Mike B Mar 27 '15 at 22:57
  • @MikeB - oops - it's because I didn't match it. Now I have done. If you're ever unfortunate enough to get as familiar with `sed` as I have, you'll find you can just tell automatically how it will work without testing, but that also means you might overlook the very simple stuff *(like a missing backslash)* when writing it out. – mikeserv Mar 27 '15 at 22:58
  • Hmm... I suspect that will work. sed is now churning away at 25% cpu (and has been for the past 10 minutes). I'll let it do its thing and report back when it finishes. – Mike B Mar 27 '15 at 23:11
  • Ooh. That's no good. I can make it *much* faster. Try the bottom one. Add a `100q` or something to its head when testing so you don't get mired just in case. I don't know why the other one would get caught? 10 mins doesn't seem right - not for 40k lines. – mikeserv Mar 27 '15 at 23:12
  • I'm marking @Costas's answer as the "accepted solution" but want to thank you again for your time, efforts, and especially patience. It was greatly appreciated. – Mike B Mar 27 '15 at 23:33
  • @MikeB - that's completely fine with me. It's a good answer - though if you're using anything other than a GNU `sed` you'll want to do some syntax edits in order for it to work. – mikeserv Mar 27 '15 at 23:34
1

Try this:

$ grep -Po "(?<=foo3\=).*(?=\s*foo4)" file.txt
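The sample in the question now ends the value at somethingelse= rather than foo4, so the same lookaround idea could be adapted along these lines. This is a sketch that assumes GNU grep built with PCRE support (-P); the lazy .*? and the \s+ in the lookahead are choices made here to stop at the first following key and to leave out the padding spaces.

```shell
# Sample line modeled on the log format in the question (shortened).
line='Mon Jan 22 12:12:12 2012 foo=blah   foo2=blah2  foo3=Some longer sentence   somethingelse=blarg   foo5=abcdefg'

# Lookbehind anchors the match right after "foo3=";
# the lazy .*? stops at the first run of whitespace followed by "somethingelse=".
value=$(printf '%s\n' "$line" | grep -oP '(?<=foo3=).*?(?=\s+somethingelse=)')

printf '%s\n' "$value"
```

Since . does not cross line boundaries here, each match stays within its own log line, unlike the [\s\S]* pattern from the question.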
heemayl