How can I fix this large email dataset?

Question

I have a very large dataset that is supposed to consist of email addresses. However, it contains a large number of invalid addresses, and the lines holding them need to be removed from the file completely.

Here are some examples:

89 is @msn .com
[email protected]
89%@yahoo.com
89%[email protected]
89&#39:[email protected]
89&#39;[email protected]
89&#39;[email protected]
89&[email protected]
89+475asdjkl:[email protected]
89+475asdjkl;[email protected]
[email protected]

Is there a simple approach available to remove lines which contain invalid emails from the file?

That last one definitely is not invalid. I'm not exactly sure about all the others. — ilkkachu, Jan 15 '18 at 13:36
Fascinated to see that "[How to do nothing forever...](https://unix.stackexchange.com/questions/42901/how-to-do-nothing-forever-in-an-elegant-way)" has appeared in the Related list on the sidebar :-) — roaima, Jan 15 '18 at 13:47
I have made many attempts, but completely failed, so thought I would ask here. — user270600, Jan 15 '18 at 13:58
Judging by Wikipedia and what I can gather from the RFC, the regex behind that link isn't even correct, since it doesn't accept local-parts that are only partially quoted... — ilkkachu, Jan 15 '18 at 13:59
+ and & and # and % are allowed. User%[email protected] is pretty common. — Mark Plotnick, Jan 15 '18 at 14:19
Simple approach: send an email to each address and wait for a bounce. Bounce = invalid; no bounce = valid — Jeff Schaller, Jan 15 '18 at 15:22
@JeffSchaller as long as it's the right sort of bounce (i.e. not a _temporary_ rejection during the SMTP conversation) — roaima, Jan 16 '18 at 00:12

score 0 · Answer 1 · edited Sep 07 '18 at 23:38

EDIT: As pointed out by @Ivanivan, we could just use this regex with grep instead of scripting anything. Note that it needs `grep -E`, since the `+`, `?` and `( )` groups are extended regex syntax:

grep -E "^[a-z0-9!#\$%&'*+/=?^_\`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_\`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?\$" my_email_list.txt >> my_valid_emails.txt
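Quoting that pattern on the command line is fiddly (in an interactive bash session the `!` inside double quotes can trigger history expansion), so here is a minimal sketch that keeps the pattern in a file and passes it with `grep -f`. The sample addresses, the temporary directory and the `email.regex` file name are made up for illustration; `-i` is an optional addition so that uppercase letters are accepted too.

```shell
#!/bin/sh
# Hypothetical sample data, only for illustration.
workdir=$(mktemp -d)
printf '%s\n' \
    'alice@example.com' \
    '89 is @msn .com' \
    'bob+tag@mail.example.org' \
    'Carol@Example.COM' > "$workdir/my_email_list.txt"

# The quoted here-document ('EOF') stops the shell from touching
# !, $ or the backtick, so the pattern is stored exactly as written.
cat > "$workdir/email.regex" <<'EOF'
^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?$
EOF

# -E: extended regex, -i: ignore case, -f: read the pattern from a file.
# '>' overwrites the output file, so re-running does not duplicate lines.
grep -E -i -f "$workdir/email.regex" "$workdir/my_email_list.txt" \
    > "$workdir/my_valid_emails.txt"

cat "$workdir/my_valid_emails.txt"
```

Drop `-i` if you want the same lowercase-only behaviour as the original regex.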

A simple script can sort this for you. As commented above by @ilkkachu and @Mark Plotnick, some of those examples are perfectly valid email addresses.

email_validate.sh:

#!/bin/bash

# email regex check
email_valid="^[a-z0-9!#\$%&'*+/=?^_\`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_\`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?\$"

# read the file line by line; IFS= and -r keep whitespace and backslashes intact
while IFS= read -r line; do
    if [[ $line =~ $email_valid ]]; then
        echo "$line is valid"
    else
        echo "$line is invalid"
    fi
done < my_email_list.txt

example output:

┌─[root@Fedora]─[~]─[03:27 pm]
└─[$]› ./email_validate.sh
89 is @msn .com is invalid
[email protected] is valid
89%@yahoo.com is valid
89%[email protected] is valid
89&#39:[email protected] is invalid
89&#39;[email protected] is invalid
89&#39;[email protected] is invalid
89&[email protected] is valid
89+475asdjkl:[email protected] is invalid
89+475asdjkl;[email protected] is invalid
[email protected] is valid

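If you need the invalid lines removed from the file as the script runs, adding a sed '/$line/d' to the if statement will not do it as written: the single quotes stop $line from being expanded, and the address would be treated as a regex rather than a literal string. I would personally recommend writing the valid emails to a new file instead, in case you need to refer to the old one: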

    if [[ $line =~ $email_valid ]]; then
        echo "$line is valid"
        echo "$line" >> my_valid_emails.txt
    else
        echo "$line is invalid - skipping"
    fi

Which will return something like this:

┌─[root@Fedora]─[~]─[03:34 pm]
└─[$]› cat my_valid_emails.txt
[email protected]
89%@yahoo.com
89%[email protected]
89&[email protected]
[email protected]

Be easier to use your regex and a `grep` statement to extract all matching lines and redirect output to a new file. `grep ^[a-z0-9!#\$%&'*+/=?^_\`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_\`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?\$ filename >> valid_emails` — ivanivan, Jan 15 '18 at 16:15
ha! good point! I'm always over-complicating things.. will edit and credit :) — RobotJohnny, Jan 15 '18 at 16:20
I think the local part can include "double-quoted strings" to avoid problems with odd characters. https://tools.ietf.org/html/rfc5322#section-3.4.1 — Guy, Jan 15 '18 at 22:57
I definitely think this is over-complicated, and incorrect in some cases. The general way to iterate through lines is `while IFS= read -r line`. Using `for` [isn't recommended](http://mywiki.wooledge.org/DontReadLinesWithFor) (and there's also a useless `cat` in your code). In any case, given that the question just asks to delete the lines, there's no reason to nest the `sed` in the `if`. Just use plain `sed` (or `grep -v`). — Sparhawk, Sep 07 '18 at 23:56
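A minimal sketch of the loop-free approach suggested in the comment above, assuming the same regex as in the answer: `grep -v` saves the rejected lines for review, and the filtered output replaces the original list via a temporary file. The sample addresses and the `email.regex` and `rejected.txt` file names are hypothetical, and `-i` is an optional addition to accept uppercase letters.

```shell
#!/bin/sh
# Hypothetical sample data, only for illustration.
workdir=$(mktemp -d)
list="$workdir/my_email_list.txt"
printf '%s\n' \
    'alice@example.com' \
    '89 is @msn .com' \
    'bob+tag@mail.example.org' \
    'no-at-sign.example.com' > "$list"

# Store the pattern in a file so no shell quoting is needed.
cat > "$workdir/email.regex" <<'EOF'
^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?$
EOF

# -v inverts the match: keep a copy of every line that will be dropped.
grep -E -i -v -f "$workdir/email.regex" "$list" > "$workdir/rejected.txt"

# Write the matching lines to a temporary file and replace the original
# only if grep found at least one match (exit status 0).
if grep -E -i -f "$workdir/email.regex" "$list" > "$list.tmp"; then
    mv "$list.tmp" "$list"
fi

cat "$list"
```

Checking `rejected.txt` before discarding it is worthwhile, since the regex is stricter than the RFC and may reject addresses that are technically valid.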
