1

I have a giant mbox file (8.8 GB) that contains many quoted-printable attachments, consisting of 75-character lines terminated by the "soft linebreak" sequence (equal sign, carriage return, linefeed).

I would like to search for all occurrences of a particular regular expression in the mbox content. However, any particular match may be split across multiple lines, so I need to delete each soft linebreak sequence before performing the regex search.

I'm having trouble figuring out the best way to do this. The solution for GNU sed that I found here fails because it apparently instructs sed to treat the entire file as one line. The length of that one line is greater than INT_MAX, which causes sed to exit with an error message. I also see that there is a solution here that uses ripgrep. However, it fails to actually join the lines: the equal signs are removed, but the line that ended with the equal sign remains separate from the following line.

Brian Bi
  • Maybe you can open it with `mutt` and search in it with `l` (lower case L to `limit` the display) and search with `~b your-regexp` to search on the decoded contents? – Stéphane Chazelas Apr 26 '23 at 16:23
  • What exactly does "search for all occurrences of a particular regular expression" mean? Display the lines that match the expression? Please add a piece of example input, the regular expression and the expected result. – Bodo Apr 26 '23 at 16:23
  • Try preprocessing with `perl -pe 's/=\r?\n//'` (` – Stéphane Chazelas Apr 26 '23 at 16:33
  • google the Unix tools `formail` and `procmail` - one of them can probably help you. – Ed Morton Apr 26 '23 at 21:33
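
The preprocessing idea from the comments can also be sketched as a streaming filter in Python, which never holds more than one logical line in memory. This is only a rough sketch: `join_soft_breaks` and `search_joined` are hypothetical helper names, and, like the `perl` one-liner, it assumes that every line ending in `=` is a soft linebreak, which is not guaranteed outside quoted-printable parts.

```python
import io
import re


def join_soft_breaks(lines):
    """Yield logical lines with quoted-printable soft linebreaks removed.

    `lines` is any iterable of bytes lines, e.g. a file opened in binary mode.
    Assumption: every line ending in "=" is a soft linebreak.
    """
    pending = b""
    for line in lines:
        if line.endswith(b"=\r\n"):
            pending += line[:-3]      # drop "=" CR LF and keep accumulating
        elif line.endswith(b"=\n"):
            pending += line[:-2]      # tolerate bare LF line endings too
        else:
            yield pending + line      # a hard linebreak ends the logical line
            pending = b""
    if pending:
        yield pending                 # file ended in the middle of a soft break


def search_joined(stream, pattern):
    """Yield every match of a compiled bytes pattern in the joined lines."""
    for logical in join_soft_breaks(stream):
        for m in pattern.finditer(logical):
            yield m.group(0)


if __name__ == "__main__":
    sample = io.BytesIO(b"see https://exam=\r\nple.org/abc for details\r\nplain line\r\n")
    print(list(search_joined(sample, re.compile(rb"https?://\S+"))))
```

For a real mbox, the `io.BytesIO` object would be replaced by `open(path, "rb")`. Note that this does not decode `=XX` escapes or base64 parts; it only joins the lines.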

2 Answers

2

What you're describing amounts to parsing the mbox into email contents (among which I count the attachments) and then searching through those. That's the proper approach!

So do exactly that: parse the mbox file as actual mail messages, and you'll be reasonably happy. Here's my off-the-top-of-my-head, cobbled-together code; it's untested, but my editor doesn't flag anything in red:

#!/usr/bin/env python3
import mailbox
import re as regex

pattern = regex.compile("BRIANSREGEX")

mb = mailbox.mbox("Briansmails.mbox", create=False)
for msg in mb:
    print(f"{msg['Subject']}")
    for part in msg.walk():
        # multipart containers have no payload of their own; skip them
        if part.is_multipart():
            continue
        print(f"| {part.get_content_type()}, {part.get_content_charset()}")
        print("-"*120)
        # decode=True undoes the quoted-printable / base64 transfer encoding
        payload = part.get_payload(decode=True)
        # get_content_charset() may return None; fall back to ASCII
        charset = part.get_content_charset() or "ascii"
        if isinstance(payload, bytes):
            try:
                content = payload.decode(charset)
            except (LookupError, UnicodeDecodeError):
                # unknown or wrong charset: decode leniently instead of failing
                content = payload.decode("ascii", errors="replace")
            match = pattern.search(content)
            # do what you want with that match…
            if match:
                print(f"| matched at {match.start()}")
        print("-"*120)
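
To see why parsing sidesteps the soft-linebreak problem, here is a small self-contained sketch. The message in `RAW` is made up for illustration; it is parsed with the standard `email` package, and `get_payload(decode=True)` should undo the quoted-printable encoding, including the `=` CR LF sequence in the middle of the URL.

```python
import email

# A made-up message whose body is quoted-printable, with a URL split by a
# soft linebreak and a non-ASCII character encoded as =C3=A9.
RAW = (
    b"From: a@example.org\r\n"
    b"Subject: demo\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"link: https://exam=\r\n"
    b"ple.org/caf=C3=A9\r\n"
)

msg = email.message_from_bytes(RAW)

# decode=True reverses the Content-Transfer-Encoding and returns bytes
payload = msg.get_payload(decode=True)

# the charset comes from the Content-Type header; it may be missing
text = payload.decode(msg.get_content_charset() or "ascii")
print(text)
```

The printed text should contain the URL in one piece, so a regex search on `text` (or on `payload`) no longer has to care about line joins.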
Marcus Müller
  • There are some errors here (`get_content_charset` is missing the `()` needed to call it, and we must take into account that it might return `None`) but this is a good approach overall; I realized that it also matches inside base64 attachments, not just quoted-printable ones. – Brian Bi Apr 27 '23 at 01:54
  • In addition, if `part.is_multipart()` is true, then attempting to get the payload should just be skipped (it will return `None` anyway). – Brian Bi Apr 27 '23 at 13:48
  • @BrianBi as said, that was just quickly written down without any testing or sense. – Marcus Müller Apr 27 '23 at 21:03
0

I've given Marcus Müller credit for his answer, but here's the version that I ultimately ended up using, which is based on his answer:

#!/usr/bin/env python3
import mailbox
import re
import sys

# raw strings with an escaped dot, so that "imgur.com" is matched literally
byte_pattern = re.compile(rb"https?://[^/]*imgur\.com/[a-zA-Z0-9/.]*")
str_pattern = re.compile(r"https?://[^/]*imgur\.com/[a-zA-Z0-9/.]*")

mb = mailbox.mbox(sys.argv[1], create=False)
for msg in mb:
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True)
        if type(payload) is bytes:
            # first, search it as a binary string
            for match in byte_pattern.findall(payload):
                print(match.decode('ascii'))
            # then, try to decode it in case it's utf-16 or something weird
            charset = part.get_content_charset()
            if charset and charset != 'utf-8':
                try:
                    content = payload.decode(charset)
                    for match in str_pattern.findall(content):
                        print(match)
                except (LookupError, UnicodeDecodeError):
                    # unknown charset label or undecodable bytes: skip this part
                    pass
        else:
            print('failed to get message part as bytes')

Note that `part` may be a multipart message, i.e., not a leaf node in the tree, because `walk` performs a depth-first traversal that includes non-leaf nodes.
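
A quick way to see this is to walk a small two-part message. The following is only an illustrative sketch with a made-up message built via `email.message.EmailMessage`:

```python
from email.message import EmailMessage

# Build a made-up multipart/alternative message: one container, two leaves.
msg = EmailMessage()
msg["Subject"] = "demo"
msg.set_content("plain text body")
msg.add_alternative("<p>HTML body</p>", subtype="html")

for part in msg.walk():
    # For the container, get_payload(decode=True) returns None, not bytes
    payload = part.get_payload(decode=True)
    print(part.get_content_type(), part.is_multipart(), type(payload).__name__)
```

The first line printed is for the `multipart/alternative` container itself, for which `is_multipart()` is true and the decoded payload is `None`; the two leaf parts follow.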

For leaf nodes only, I first search for my pattern in the payload as a byte string, then attempt to decode the payload as text using the indicated charset if it is not UTF-8. (If it is UTF-8, which is most common, any match will already have been found by the byte-string search, since the pattern is pure ASCII.)
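
The reason for the second pass can be shown with a tiny sketch. The URL and the simplified patterns here are made up for illustration and are not the ones used in the script above:

```python
import re

# Simplified, hypothetical patterns: one for bytes, one for str
byte_pattern = re.compile(rb"https?://\S+")
str_pattern = re.compile(r"https?://\S+")

url = "https://example.org/x"
utf8_payload = url.encode("utf-8")
utf16_payload = url.encode("utf-16")   # BOM plus two bytes per character

# ASCII-only text is byte-identical in UTF-8, so the bytes search finds it
print(byte_pattern.findall(utf8_payload))

# In UTF-16 the characters are interleaved with NUL bytes, so it does not
print(byte_pattern.findall(utf16_payload))

# After decoding with the declared charset, the str search finds it again
print(str_pattern.findall(utf16_payload.decode("utf-16")))
```

The middle search comes back empty, which is why payloads with a non-UTF-8 charset get decoded and searched a second time.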

Brian Bi