New Scam Email Indexing Method (again!)

It’s my third iteration on the same basic principle: take a carefully filtered and enhanced archive of 150,000 email messages and then sort, categorize and analyze them, then put them in a defanged, indexable/searchable list format so that people can browse them.

The first was a program I wrote in Perl back in 2004: a POP sucker that connected to the mailbox, attempted to extract message parts and rewrote them as an HTML page. While it worked, I was never happy with my efforts to disentangle nested messages and alternate body parts, which meant that a lot of emails showed up with lots of Base64 and other garbage (e.g. ScamDB_S_74.php).
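For anyone fighting the same Base64 garbage, here is a minimal sketch of how the body-extraction step could look today. It is written in Python rather than my original Perl, it uses only the standard library `email` package, and the function name `readable_body` is made up for this illustration. It is not the code the 2004 program used.

```python
from email import message_from_bytes, policy


def readable_body(raw: bytes) -> str:
    """Return the most readable text body found in a raw email message."""
    msg = message_from_bytes(raw, policy=policy.default)
    # get_body() looks through multipart/alternative and multipart/related
    # containers and returns the preferred part (plain text first, then HTML).
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        # Fallback: take the first text part anywhere in the tree.
        # walk() also descends into forwarded (message/rfc822) messages.
        for candidate in msg.walk():
            if candidate.get_content_maintype() == "text":
                part = candidate
                break
    if part is None:
        return ""
    # get_content() undoes Base64 / quoted-printable and decodes the charset.
    return part.get_content()
```

Scam emails often declare bogus character sets, so a real version would probably want to catch `LookupError` around `get_content()` and fall back to a lenient decode.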

The next try was a mail archive indexer program called ‘Hypermail’. This was mostly successful at splitting messages into component parts, but it was still not quite flexible enough for my needs and the indexes were way too long (e.g. HYPMAIL/date.php).

So this spring, I am trying a whole new system that I rewrote in PHP, my code of choice for the decade. I am still mailbox based, mainly so that I can prune spam that has sneaked through my filters, but that may change soon.

This is how the Scamdex Engine works:

  1. Scam Emails arrive in the honeypot mailbox.
  2. Using Thunderbird with various Add-ons, I semi-manually sort the scam emails into a holding mailstore and throw away the junk.
  3. A program runs nightly which:
    1. Sorts each email in the holding mailstore into one of 5 categories (419/AFF, Auctions, Jobs, Phishing, Lottery).
    2. Adds some extra Headers to the email.
    3. Moves it to the correct mailbox archive location.
    4. Runs MHonArc to create the indexed archive and HTML-ized emails.
    5. Post-processes the MHonArc-ized pages to add a PHP index include file, update the (MySQL) database and distribute the keywords and scoring to the META tags and the nice little graph widget.
    6. err… that’s it!
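To give a feel for the categorize-and-tag part of the nightly run, here is a rough sketch in Python (the real engine is PHP). The keyword lists, the `X-Scam-Category` and `X-Scam-Score` header names and the function names are all invented for this illustration; the actual rules and headers are not shown in this post.

```python
from email import message_from_string, policy

# Hypothetical keyword lists, one per category named in the steps above.
KEYWORDS = {
    "419/AFF": ("next of kin", "beneficiary", "transfer of funds"),
    "Auctions": ("ebay", "escrow", "shipping agent"),
    "Jobs": ("work from home", "payment processor", "part-time"),
    "Phishing": ("verify your account", "suspended", "login"),
    "Lottery": ("winning number", "lottery", "claim your prize"),
}


def classify(text):
    """Return (best_category, scores) based on simple keyword counts."""
    lowered = text.lower()
    scores = {
        category: sum(lowered.count(word) for word in words)
        for category, words in KEYWORDS.items()
    }
    # Ties (including all-zero scores) resolve to the first category listed.
    best = max(scores, key=scores.get)
    return best, scores


def tag_message(raw):
    """Parse a raw email, classify it, and add the extra (made-up) headers."""
    msg = message_from_string(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    text = (msg["Subject"] or "") + "\n" + (body.get_content() if body else "")
    category, scores = classify(text)
    msg["X-Scam-Category"] = category
    msg["X-Scam-Score"] = str(scores[category])
    return msg
```

A tagged message could then be appended to the matching archive mailbox before the indexer runs. Plain keyword counting is only a stand-in here; any scoring scheme that returns one of the five categories would slot in the same way.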

It’s not pretty or fast, but it works and I can understand it. It’s easy to fix and add to. It’s annoying having to run the process every night from scratch, but until I work out how to use the MHonArc system to add/delete emails from the archive, it’s all I can do. If you have any suggestions about how I can do this better, let me hear them!

(send to scamblog(a)
