I am trying to upload around 20 years' worth of Usenet archives to archive.org, but my first batch was rejected because some of the archives contain trojans encoded in base64. Since I have around 400GB of files to work with, fixing things manually is out of the question. All of the files are in mbox format, which is plain text. My first thought was to find and remove every message in the mbox files that contains "Content-Type: application/x-msdownload", but that could be pretty difficult. I'm now thinking that an easier brute-force method would be to delete all base64 blocks.
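For what it's worth, here is a rough sketch of the first idea using awk. It assumes every message starts with a "From " separator line (body lines that begin with "From " are assumed to be escaped already), and that the header is spelled exactly as shown. The sample file and its contents are made up for illustration. Note that it drops the whole message, not just the attachment.

```shell
# Hypothetical two-message sample standing in for a real mbox file.
sample=$(mktemp)
printf '%s\n' \
  'From a@example.com Mon Jan  1 00:00:00 2001' \
  'Subject: clean' \
  '' \
  'Hello' \
  '' \
  'From b@example.com Tue Jan  2 00:00:00 2001' \
  'Subject: infected' \
  'Content-Type: application/x-msdownload; name="x.exe"' \
  '' \
  'QUJD' \
  > "$sample"

# Buffer one message at a time and print it only if no line in it
# carried the unwanted Content-Type header.
kept=$(awk '
  function emit() { if (msg != "" && !bad) printf "%s", msg; msg = ""; bad = 0 }
  /^From / { emit() }                                  # new message starts
  /^Content-Type: application\/x-msdownload/ { bad = 1 }
  { msg = msg $0 "\n" }                                # append line to buffer
  END { emit() }                                       # flush last message
' "$sample")

printf '%s\n' "$kept"
rm -f "$sample"
```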
From this question, I see that it's possible to find base64 blocks with grep, but I don't know how to set up the same thing with sed and that's why I'm asking. Thanks!
Edit: what I've tried so far
According to this page, ^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$ should be the regex needed to find base64 text, but when I use it with sed, the matching lines are not deleted.
Example:
cat clari.local.california.sfbay.biz.mbox | sed -e '#^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$#d' > clari.local.california.sfbay.biz.mbox.test
clari.local.california.sfbay.biz.mbox.test still contains the base64 text.
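Some things that may explain this: sed treats a script beginning with # as a comment, so the command above does nothing at all. A custom address delimiter has to be introduced with a backslash (\#...#). sed also has no (?:...) groups, which are a Perl/PCRE feature, and interval syntax like {4} needs -E (or escaping) to work. Finally, the regex as written also matches empty lines and short words such as "Okay", which would damage an mbox file. Below is a minimal sketch of an adapted command, run on a made-up sample file; the minimum of 10 groups (40 characters) is my own arbitrary assumption, not something from the linked page.

```shell
# Hypothetical sample standing in for a real mbox file.
sample=$(mktemp)
printf '%s\n' \
  'From alice@example.com Mon Jan  1 00:00:00 2001' \
  'Subject: test' \
  '' \
  'Hello world' \
  'Okay' \
  'QUJDQUJDQUJDQUJDQUJDQUJDQUJDQUJDQUJDQUJDQUJDQUJD' \
  'QUJDQUJDQUJDQUJDQUJDQUJDQUJDQUJDQUJDQUJDQUJDQUI=' \
  > "$sample"

# -E enables extended regexes; \#...# makes '#' the address delimiter.
# Plain (...) groups replace (?:...), and {10,} requires at least 40
# base64 characters so blank lines and short words are left alone.
cleaned=$(sed -E '\#^([A-Za-z0-9+/]{4}){10,}([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$#d' "$sample")

printf '%s\n' "$cleaned"
rm -f "$sample"
```

This only removes lines that consist entirely of base64 characters, so the MIME headers around each attachment stay in place, and any long line of ordinary text that happens to use only those characters would be removed too.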