
I need some help or advice wrt awk and its use of regular expressions. I have a data input file with an irregular structure. To parse this file correctly I need to recognize a line of the following form:

@ 8/1/17, 10:04 PM  

A line with this pattern marks the end of a complete transaction. It's simply a date & time stamp preceded by a space and the @ character.

I've cobbled together a regular expression that seems to match in most tools:

\W\@\W[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{1,2}\,\W[0-9]{1,2}\:[0-9]{2}\W[AP]M  

However, it does not seem to match when used in the following awk statement:

$ awk 'match($0, /\W\@\W[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{1,2}\,\W[0-9]{1,2}\:[0-9]{2}\W[AP]M/) {print $0}' testfile2.txt

My system (macOS Mojave) has an old version of awk: `awk version 20070501`.

I've also found:

  • grep -e fails to match this pattern to any line in testfile2.txt, but egrep and grep -E do match the lines I expected them to match.

  • awk 'match($0, /\@/) {print $0}' testfile2.txt does match (& print) the expected lines, but I can't rely on a single character!

Here's testfile2.txt:

+13054261988: Forwarding data to primary repository
@ 1/7/18, 4:21 PM
+16744774911: Use this URL: https://www.repo-prime.ga/
@ 1/7/18, 4:22 PM
+13054261988: Will do. Passwords OK?
@ 1/7/18, 6:12 PM
+16744774911: No, use 2FA for all transactions
@ 1/7/18, 8:56 PM
+13054261988: Using Google's authenticator?

If so, will need additional information.
@ 1/7/18, 9:36 PM
+13054261988: RSVP ASAP, I have transactions that need to be uploaded.
@ 1/7/18, 9:46 PM

Is my regular expression failing to match in awk usage due to an error I can't see in my awk statement, or is it due to the regex itself, a combination of both, etc?

roaima
Seamus
  • Related: [Why does my regular expression work in X but not in Y?](https://unix.stackexchange.com/questions/119905/why-does-my-regular-expression-work-in-x-but-not-in-y) – steeldriver Oct 31 '19 at 18:11
  • What does "seems to match in "most" usage" mean? How did you test it? – ilkkachu Oct 31 '19 at 18:47
  • @ilkkachu: It works in `egrep` and `grep -E`, but not `grep -e` on macos. Also works in 'BBEdit v 13'. – Seamus Oct 31 '19 at 18:54
  • @steeldriver: That seems to be it. `macos` has a different set of expressions than my Linux distro (where something similar worked). Actually, it seems that `macos` may be inconsistent between (for example) `awk` and `egrep`/`grep -E`. Making progress now! – Seamus Oct 31 '19 at 19:03
  • `awk '$1=="@" && $4 ~/^[AP]M$/'` or `awk '/@.*\<[AP]M\>/'` could probably be enough... – JJoao Oct 31 '19 at 19:50

2 Answers

  • Why require \W (a non-word character) before @? In your text file, @ is at the start of the line.
  • There is no need to escape @, the comma, or the colon as \@, \, and \: because they are not special characters.
  • Calling match() is redundant if you only need to test whether a line matches a pattern.

$ awk '/^@ [0-9]{1,2}\/[0-9]{1,2}\/[0-9]{1,2}, [0-9]{1,2}:[0-9]{2} [AP]M/' file
@ 1/7/18, 4:21 PM
@ 1/7/18, 4:22 PM
@ 1/7/18, 6:12 PM
@ 1/7/18, 8:56 PM
@ 1/7/18, 9:36 PM
@ 1/7/18, 9:46 PM
RomanPerekhrest
  • I need this to work with the `match` stmt as it's part of a larger script. – Seamus Oct 31 '19 at 18:41
  • Doesn't work on macos: $ awk 'match($0, //^@ [0-9]{1,2}\/[0-9]{1,2}\/[0-9]{1,2}, [0-9]{1,2}:[0-9]{2} [AP]M/'/) {print $0}' testfile2.txt -bash: syntax error near unexpected token `)' – Seamus Nov 01 '19 at 01:21
  • Also doesn't work on macos: $ awk '/^@ [0-9]{1,2}\/[0-9]{1,2}\/[0-9]{1,2}, [0-9]{1,2}:[0-9]{2} [AP]M/' testfile2.txt gives no output at all – Seamus Nov 01 '19 at 01:21

It seems that very old versions of awk do not support the {…} interval syntax.
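A quick way to check a particular awk is to feed it a two-character string and test it against an interval. This is only a sketch: it assumes that an awk without interval support treats the braces as literal characters (so `aa` fails to match) rather than raising an error, and the variable name `interval_support` is just for illustration.

```shell
# Probe: does this awk read a{2} as "exactly two a's"?
# An awk without interval support is assumed to treat the braces as
# literal characters, so the input "aa" would not match.
interval_support=$(printf 'aa\n' | awk '{ if ($0 ~ /^a{2}$/) print "yes"; else print "no" }')
echo "interval expressions supported: $interval_support"
```

If this reports "no", regexes that rely on {1,2} will quietly match nothing, which is consistent with the empty output reported in the comments above.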

This older regex syntax should match in any awk:

awk '/@ [0-9][0-9]?\/[0-9][0-9]?\/[0-9][0-9]?, [1-2]?[0-9]:[0-6][0-9] [AP]M/' file

If your awk supports POSIX character classes such as [[:blank:]], the regex can be made a little more flexible:

awk '/@[[:blank:]][0-9][0-9]?\/[0-9][0-9]?\/[0-9][0-9]?,[[:blank:]][1-2]?[0-9]:[0-6][0-9][[:blank:]][AP]M/' file

If matching one (or more) digits is good enough (I can't see why not), you can use the shorter regex:

awk '/@ [0-9]+\/[0-9]+\/[0-9]+, [1-2]?[0-9]:[0-6][0-9] [AP]M/' file

You can also add the start anchor ^ and the end anchor $ to make the regex more restrictive, if needed.
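To illustrate what the anchors change, here is a small sketch using a made-up three-line sample (not the asker's file) in which the middle line carries a timestamp inside other text. The variable names are hypothetical.

```shell
# Hypothetical sample: line 2 has a timestamp in the middle of other text.
sample='@ 1/7/18, 4:21 PM
see note @ 1/7/18, 4:22 PM for details
@ 1/7/18, 9:46 PM'

# Unanchored: the pattern may match anywhere in the line, so all three lines print.
unanchored=$(printf '%s\n' "$sample" | awk '/@ [0-9]+\/[0-9]+\/[0-9]+, [1-2]?[0-9]:[0-6][0-9] [AP]M/')

# Anchored with ^ and $: the whole line must be the timestamp, so line 2 is dropped.
anchored=$(printf '%s\n' "$sample" | awk '/^@ [0-9]+\/[0-9]+\/[0-9]+, [1-2]?[0-9]:[0-6][0-9] [AP]M$/')

printf '%s\n' "$anchored"
```

Note that the $ anchor will also reject lines that end in trailing whitespace, so it may be safer to use only ^ if the input is not clean.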

I am not using match() for such simple line matching, but the same regex works perfectly well with that function.
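Since the asker needs match() as part of a larger script, here is a minimal sketch of the same interval-free regex inside match(). It relies only on the standard RSTART and RLENGTH variables that match() sets; the two-line sample and the variable name `stamps` are made up for illustration.

```shell
# Hypothetical two-line sample in the layout of testfile2.txt.
sample='+13054261988: Will do. Passwords OK?
@ 1/7/18, 6:12 PM'

# match() returns the 1-based start of the match (0 if none) and sets
# RSTART and RLENGTH, so the matched text can be cut out with substr().
stamps=$(printf '%s\n' "$sample" | awk '
match($0, /@ [0-9][0-9]?\/[0-9][0-9]?\/[0-9][0-9]?, [1-2]?[0-9]:[0-6][0-9] [AP]M/) {
    # Skip the leading "@ " (2 characters) to keep only the date and time.
    print substr($0, RSTART + 2, RLENGTH - 2)
}')
printf '%s\n' "$stamps"
```

This prints only the date and time from the marker line; replacing the substr() call with `print $0` gives the whole-line behaviour of the original command.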

  • This works on mac os. Also [see this answer](https://stackoverflow.com/a/28256343/5395338), and this [regex tutorial](https://www.regular-expressions.info/posixbrackets.html) for a more detailed explanation of @Isaac use of the `[[:blank:]]` character class, and the related `[[:space:]]` class. Note POSIX compliance :) – Seamus Nov 01 '19 at 01:18