
Hello, I have the awk command below in my script. The regex pattern is not working correctly for me. I want to validate an email address, which can contain the characters [a-z], [0-9], [.] and @.

code

Here are the sample email patterns in the input file:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

The pattern is extracted from a metadata file and passed as a script parameter. Here is the metadata line that defines the pattern for email ID validation:

1~4~~~char~Y~\"\@\.com\"~100

Output of the `sh -x` run of the script:

val=$(
     awk -F , 
         -v n=4
         -v 'm="*@*.com"'
         -v count=0 
         'NR!=1 && $n !~ "^" m "$"
                      {
                         printf "%s:%s:%s\n", FILENAME, FNR, $0 > "/dev/stderr"
                         count++
                       }
                       END {print count}' BNC.csv

The script code as shown in vi:

val=$(awk -F "$sep"
        -v n="$col_pos" 
        -v m="$col_patt" 
        -v count=0 
        'NR!=1 && $n !~ "^" m "$" 
                       {
                         printf "%s:%s:%s\n", FILENAME, FNR, $0 > "/dev/stderr"
                         count++
                       }
                       END {print count}' $input_file 
  • Although `*@*.com` is a valid regex, it probably doesn't do what you expect. See [How do regular expressions differ from wildcards used to filter files](https://unix.stackexchange.com/questions/57957/how-do-regular-expressions-differ-from-wildcards-used-to-filter-files) – steeldriver May 20 '21 at 18:29
  • @steeldriver Thank you, it worked. – daturm girl May 20 '21 at 18:36
  • Including this one, you have now asked 5 questions about the same thing (validating email), each with some small change or a slightly new requirement. What is your final goal?! – αғsнιη May 20 '21 at 18:41
  • As well, note that the task is much harder than it at first appears - see for example [How to validate an email address using a regular expression?](https://stackoverflow.com/a/201378/4440445) and the previous discussion on this site [Why this regex pattern for email is so popular when it does not even take in to consideration for lower case letters?](https://unix.stackexchange.com/questions/609471/why-this-regex-pattern-for-email-is-so-popular-when-it-does-not-even-take-in-to) – steeldriver May 20 '21 at 19:42
  • You can edit questions if you feel that they need to be improved. – FairOPShotgun May 20 '21 at 20:46
  • @αғsнιη Sorry for the multiple questions :). My final goal is to develop a generic flat-file validation script, which has several different requirements. The problems I posted were all related to column datatype validation. I have a metadata file that defines the details of the columns in a file. I need to read the metadata file and validate incoming files against it, e.g. if I define the 4th column of file1.txt as email text then it should match that regex, and if I define it as a number it should match a number. I have completed my coding, so hopefully no more questions. – daturm girl May 20 '21 at 20:59
  • There you go with the weird code layout again but this time it's actually breaking your code as newlines between a condition (`NR!=1 && $n !~ "^" m "$"`) and associated action (`{ printf ... }`) matter. Please see the example [I provided for you previously](https://unix.stackexchange.com/q/650333/133219) for one of the few common, legible ways to format your code or, again, just run it through `gawk -o-` and gawk will format it for you. Also, copy/paste your code into http://shellcheck.net and it'll tell you about the shell errors in it. – Ed Morton May 20 '21 at 23:08
  • Does this answer your question? [Why this regex pattern for email is so popular when it does not even take in to consideration for lower case letters?](https://unix.stackexchange.com/questions/609471/why-this-regex-pattern-for-email-is-so-popular-when-it-does-not-even-take-in-to) – AdminBee May 21 '21 at 09:31
  • Yes this article answered my question – daturm girl May 21 '21 at 13:30
  • @EdMorton Thanks for the formatting tip. I used your tip heavily in the script. When I copy/pasted the same code into the chat it looked very weird, so I ended up formatting it manually. But I will follow what you said in future listings. – daturm girl May 21 '21 at 13:36
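
For illustration, here is a minimal sketch of the two fixes raised in the comments above: the pattern is written as a regex (`.*@.*[.]com`) rather than a glob, and the opening brace stays on the same line as its condition. The sample file, its made-up addresses, and the names `tmp` and `count` are assumptions for this example only. `[.]` is used instead of `\.` because awk applies escape-sequence processing to values passed with `-v`.

```shell
# Hypothetical stand-in for BNC.csv; the rows and addresses are invented.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
id,name,dept,email
1,ann,ops,ann.lee@example.com
2,bob,ops,bob@example
3,cy,dev,cy9@mail.example.com
EOF

count=$(
    awk -F ',' -v n=4 -v m='.*@.*[.]com' -v count=0 '
        # The "{" must be on the same line as the condition; otherwise awk
        # treats the condition and the block as two separate rules.
        NR != 1 && $n !~ ("^" m "$") {
            printf "%s:%s:%s\n", FILENAME, FNR, $0 > "/dev/stderr"
            count++
        }
        END { print count }
    ' "$tmp"
)

echo "rows failing the pattern: $count"
rm -f "$tmp"
```

Only the row whose 4th field does not end in `.com` is reported on stderr and counted; the header row is skipped by `NR != 1`.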

1 Answer

If you're looking for a way to validate email addresses, FWIW this is what I have in an old awk script I have lying around that does that:

    # valid addrs regexp from http://www.regular-expressions.info/email.html
    # Specifically do NOT want to use [:alpha:] to drop Asian characters etc
    # Added a check that we have at least 2 consecutive alphabetic characters
    # both before and after the "@" to get rid of [email protected] etc. garbage
    (addr ~ /^[0-9a-zA-Z._%+-]+@[0-9a-zA-Z.-]+\.[a-zA-Z]{2,}$/) &&
    (addr ~ /^.*[a-zA-Z]{2}.*@.*[a-zA-Z]{2}.*\.[a-zA-Z]{2,}$/)

I'm sure that could be consolidated into 1 regexp but I don't care enough to do it and the end result would probably be less clear anyway.
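
As a rough sketch of how the two checks above could be wrapped for use from a shell script, assuming a hypothetical helper named `is_valid_addr` and invented test addresses. The `{2,}` intervals are spelled out as `[a-zA-Z][a-zA-Z]+` and `\.` as `[.]`, which should be equivalent here, because some older awk implementations (older mawk releases, for example) do not support interval expressions.

```shell
# is_valid_addr ADDR: exit status 0 if ADDR passes both checks, 1 otherwise.
# The function name is an assumption made for this sketch.
is_valid_addr() {
    printf '%s\n' "$1" | awk '
        {
            addr = $0
            # Check 1: overall shape local@domain.tld
            # Check 2: at least 2 consecutive letters on each side of "@"
            ok = (addr ~ /^[0-9a-zA-Z._%+-]+@[0-9a-zA-Z.-]+[.][a-zA-Z][a-zA-Z]+$/) &&
                 (addr ~ /^.*[a-zA-Z][a-zA-Z].*@.*[a-zA-Z][a-zA-Z].*[.][a-zA-Z][a-zA-Z]+$/)
            exit !ok
        }'
}

# Usage with made-up addresses.
for addr in jane.doe@example.com a@b.com jane@example; do
    if is_valid_addr "$addr"; then
        echo "$addr: accepted"
    else
        echo "$addr: rejected"
    fi
done
```

The second address passes the first regexp but fails the second one, since neither side of the `@` has two consecutive letters.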

Ed Morton
  • I was trying to understand why the pattern *@*.com didn't work in awk. Then steeldriver showed me a good example of file globbing versus regex patterns. But the surprise for me is that email ID validation goes far beyond what I expected :) – daturm girl May 21 '21 at 18:14
  • Well, I don't know what other issues you experienced but a couple of obvious things are that `.` is a regexp metachar meaning any character so `@.com` would match `@xcom`, not just `@.com`, but more importantly valid email addresses don't contain `@.com`, they contain `@<something>.com`. If you still have a question, though, then please update your question to contain concise, testable sample input and expected output that demonstrates the problem and reasonably formatted code plus a clear statement of what exactly the problem is, not just "The regex pattern is not working correctly for me". – Ed Morton May 21 '21 at 18:33
  • And yes, the set of valid email addresses is hard to validate (see https://stackoverflow.com/a/201378/1745001) but what I have in my answer is adequate for most purposes for people using the Roman alphabet. – Ed Morton May 21 '21 at 18:38
  • Also, when you talk about a regexp that matches an email address there's a couple of use-cases: a) finding email addresses in a block of arbitrary text, or b) validating that a specific string that should be an email address actually is one. You would write different code with slightly different regexps for each case. – Ed Morton May 21 '21 at 18:49
  • And there's the additional consideration of the text matching a regexp but it not actually being a valid email address, e.g. "k8e7klo9@a562nd71jd651j5l.foo" should match an email regexp but chances are there is no ".foo" TLD and if there is "a562nd71jd651j5l.foo" probably isn't a real domain at that TLD and if it is "k8e7klo9@" probably isn't a real username at that domain. – Ed Morton May 21 '21 at 18:51
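
To make two of the points in the comments above concrete, here is a small sketch under stated assumptions: the sample strings and the variable names `dot_demo`, `found` and `verdict` are invented for illustration, and the regexp is the first one from the answer with `{2,}` written out as `[a-zA-Z][a-zA-Z]+`. It shows that an unescaped `.` matches any character, and that finding addresses in free text uses an unanchored regexp with `match()`, while validating a single string anchors the regexp with `^` and `$`.

```shell
# An unescaped "." matches any character; "[.]" matches only a literal dot.
dot_demo=$(
    printf '%s\n' 'user@xcom' |
    awk '{ loose = ($0 ~ /@.com/); strict = ($0 ~ /@[.]com/); print loose, strict }'
)

# (a) Finding: unanchored regexp, pull every address-like substring out of text.
found=$(
    printf '%s\n' 'Mail jane.doe@example.com or bob@mail.example.org today.' |
    awk '{
        s = $0
        while (match(s, /[0-9a-zA-Z._%+-]+@[0-9a-zA-Z.-]+[.][a-zA-Z][a-zA-Z]+/)) {
            print substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + RLENGTH)   # continue after this match
        }
    }'
)

# (b) Validating: anchored regexp, the whole string must be an address.
verdict=$(
    printf '%s\n' 'see jane.doe@example.com' |
    awk '{
        print ($0 ~ /^[0-9a-zA-Z._%+-]+@[0-9a-zA-Z.-]+[.][a-zA-Z][a-zA-Z]+$/ ? "valid" : "invalid")
    }'
)

printf 'dot demo: %s\nfound:\n%s\nverdict: %s\n' "$dot_demo" "$found" "$verdict"
```

The same input line that yields an address in case (a) is rejected in case (b), because the leading "see " is not part of an address.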