
I have a very large dataset that is supposed to consist of email addresses. However, it contains a large number of invalid addresses that need to be removed from the file completely.

Here are some examples:

89 is @msn .com
[email protected]
89%@yahoo.com
89%[email protected]
89&#39:[email protected]
89'[email protected]
89'[email protected]
89&[email protected]
89+475asdjkl:[email protected]
89+475asdjkl;[email protected]
[email protected]

Is there a simple approach available to remove lines which contain invalid emails from the file?

Jeff Schaller

1 Answer


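EDIT: As pointed out by @ivanivan, we could just use this regex with grep -E (extended regular expressions, needed for +, ? and the unescaped groups) instead of scripting anything. The pattern is single-quoted so that an interactive shell leaves $, ! and the backtick alone; each '\'' inserts a literal apostrophe:

grep -E '^[a-z0-9!#$%&'\''*+/=?^_`{|}~-]+(\.[a-z0-9!#$%&'\''*+/=?^_`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?$' my_email_list.txt >> my_valid_emails.txt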
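Because the pattern contains an apostrophe, a backtick, and dollar signs, quoting it inline is easy to get wrong. One way around that, shown here as a minimal sketch, is to read the pattern from a quoted here-document, where the shell performs no expansion at all. The function name filter_valid_emails is hypothetical, and the -i flag (accepting upper-case letters) and LC_ALL=C are assumptions that go beyond the original regex; drop them to keep the strict lower-case behaviour.

```shell
# Hypothetical helper: print only the lines of file "$1" that match the pattern.
# The body runs in a subshell, so its variables do not leak to the caller.
filter_valid_emails() (
    # Quoted delimiter ('EOF'): the pattern is read verbatim, no escaping needed.
    IFS= read -r pattern <<'EOF'
^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?$
EOF
    # -E: extended regex; -i: ignore case (assumption, not in the original);
    # LC_ALL=C keeps the a-z range limited to ASCII letters.
    LC_ALL=C grep -E -i -e "$pattern" -- "$1"
)
```

It could then be called as filter_valid_emails my_email_list.txt > my_valid_emails.txt.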

A simple script can sort this for you. As commented above by @ilkkachu and @Mark Plotnick, some of those examples are perfectly valid email addresses.

email_validate.sh:

#!/bin/bash

# email regex check
email_valid="^[a-z0-9!#\$%&'*+/=?^_\`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_\`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?\$"

# read the file line by line; IFS= and -r keep whitespace and backslashes intact
while IFS= read -r line; do
    # check the line against the regex above
    if [[ $line =~ $email_valid ]]; then
        echo "$line is valid"
    else
        echo "$line is invalid"
    fi
done < my_email_list.txt

Example output:

┌─[root@Fedora]─[~]─[03:27 pm]
└─[$]› ./email_validate.sh
89 is @msn .com is invalid
[email protected] is valid
89%@yahoo.com is valid
89%[email protected] is valid
89&#39:[email protected] is invalid
89'[email protected] is invalid
89'[email protected] is invalid
89&[email protected] is valid
89+475asdjkl:[email protected] is invalid
89+475asdjkl;[email protected] is invalid
[email protected] is valid

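If you need the invalid lines deleted from the original file as the script runs, you could add a sed call to the else branch. Note that sed '/$line/d' will not work as written: single quotes stop the shell from expanding $line, and the address would be treated as a regex rather than as literal text. I would personally recommend writing the valid emails to a new file instead, in case you need to refer to the original: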

    if [[ $line =~ $email_valid ]]; then
        echo "$line is valid"
        echo "$line" >> my_valid_emails.txt
    else
        echo "$line is invalid - deleting"
    fi

Which will return something like this:

┌─[root@Fedora]─[~]─[03:34 pm]
└─[$]› cat my_valid_emails.txt
[email protected]
89%@yahoo.com
89%[email protected]
89&[email protected]
[email protected]
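If the original file really does have to be cleaned in place, a rough sketch along the following lines avoids both the loop and sed. It writes the matching lines to a temporary file, keeps the rejected lines for review, and only then replaces the original. The function name clean_email_file and the .rejected suffix are hypothetical, and -i is an assumption that is not part of the regex above.

```shell
# Hypothetical helper: rewrite file "$1" so that only matching lines remain,
# saving the rejected lines to "$1.rejected" for later review.
clean_email_file() (
    file=$1
    IFS= read -r pattern <<'EOF'
^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?$
EOF
    tmp=$(mktemp "${file}.XXXXXX") || exit 1

    # -v inverts the match: these are the lines that get dropped.
    # grep exits with 1 when nothing matches, which is not an error here.
    LC_ALL=C grep -E -i -v -e "$pattern" -- "$file" > "${file}.rejected" || [ "$?" -eq 1 ] || exit 1
    LC_ALL=C grep -E -i -e "$pattern" -- "$file" > "$tmp" || [ "$?" -eq 1 ] || exit 1

    # Replace the original only after both passes succeeded.
    # Note: the file takes on the temporary file's permissions.
    mv -- "$tmp" "$file"
)
```

Calling clean_email_file my_email_list.txt would leave the valid addresses in my_email_list.txt and the rest in my_email_list.txt.rejected.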
Rui F Ribeiro
RobotJohnny
  • Be easier to use your regex and a `grep` statement to extract all matching lines and redirect output to a new file. `grep ^[a-z0-9!#\$%&'*+/=?^_\`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_\`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?\$ filename >> valid_emails` – ivanivan Jan 15 '18 at 16:15
  • ha! good point! I'm always over-complicating things.. will edit and credit :) – RobotJohnny Jan 15 '18 at 16:20
  • I think the local part can include "double-quoted strings" to avoid problems with odd characters. See https://tools.ietf.org/html/rfc5322#section-3.4.1 – Guy Jan 15 '18 at 22:57
  • I definitely think this is over-complicated, and incorrect in some cases. The general way to iterate through lines is `while IFS= read -r line`. Using `for` [isn't recommended](http://mywiki.wooledge.org/DontReadLinesWithFor) (and there's also a useless `cat` in your code). In any case, given that the question just asks to delete the lines, there's no reason to nest the `sed` in the `if`. Just use plain `sed` (or `grep -v`). – Sparhawk Sep 07 '18 at 23:56
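
The limitation raised in the comments about quoted local parts can be checked directly. The sketch below feeds two made-up sample addresses through the answer's pattern using the while/read idiom; matches_pattern and check_samples are hypothetical names, and the sample addresses are illustrative only.

```shell
# Hypothetical helper: succeed if the string "$1" matches the answer's pattern.
matches_pattern() (
    IFS= read -r pattern <<'EOF'
^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?$
EOF
    # -q: no output, only the exit status matters
    printf '%s\n' "$1" | LC_ALL=C grep -E -q -e "$pattern"
)

# Read the sample addresses line by line and report how each one fares.
check_samples() {
    while IFS= read -r addr; do
        if matches_pattern "$addr"; then
            printf '%s: accepted\n' "$addr"
        else
            printf '%s: rejected\n' "$addr"
        fi
    done <<'EOF'
user@example.com
"john doe"@example.com
EOF
}
```

Running check_samples reports the first address as accepted and the quoted one as rejected, since neither the double quote nor the space is in the pattern's character class. Whether that matters depends on whether such addresses are expected in the data.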