Fighting Spam with Bayesian Statistics




UPDATE: The war is going nicely, and I am winning (well, Bayesian statistics are).  Here are a few statistics to prove that point.

Start date of POPFile: June 23, 2003
Total messages classified: 101,840
Classification errors: 76 (false positives)
Overall accuracy: 99.92%
Percentage of mail that was spam: 97.55%
Percentage of mail that I wanted: 2.44%

Basically, that's 99,349 messages that I never saw, never had to delete, and that therefore never bothered me.
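For the number freaks, here is a quick sketch that re-derives the percentages above from the raw counts. It assumes the 99,349 unseen messages are the spam count, and it assumes the percentages are truncated to two decimals rather than rounded (that is my guess from the figures, not something POPFile documents here). The function name is made up for illustration.

```python
import math

total = 101_840   # total messages classified
errors = 76       # classification errors
spam = 99_349     # messages I never saw (assumed to be the spam count)

def pct_truncated(part, whole):
    """Percentage of whole, truncated (not rounded) to two decimals."""
    return math.floor(part / whole * 10_000) / 100

accuracy = pct_truncated(total - errors, total)
spam_share = pct_truncated(spam, total)
wanted_share = pct_truncated(total - spam, total)

print(accuracy, spam_share, wanted_share)  # 99.92 97.55 2.44
```

Truncation would also explain why the spam and wanted percentages add up to 99.99% instead of 100%.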


A little history on my quest for a spam-free world: If you don't know what spam is, consider yourself lucky. Spam is the term used for those junk email messages that clog your mail folder, take up your bandwidth and time, and are mostly not the sort of thing you would let your kids see. Because I work for the government, my email address is conveniently published on lists along with that of every other person who has purchasing authority. As a result, I get TONS of email. I should mention that everything here works via POP3. If you get your mail some other way, this probably isn't for you.

In the beginning of my quest, I tried simple message filters. For example: if the new message contains <enter your favorite vulgar expression here>, then delete it without even showing it to me. This doesn't work. It does filter out some spam (not very well any more, as spammers have begun obfuscating their text), but it also inadvertently deletes lots of legitimate email. You would be absolutely amazed (as I was) at the number of vulgar words that magically appear in any tiff/gif/bmp/jpg picture that your friends and family happen to send you. A picture is a seemingly random binary file that isn't human readable, yet those random bits inevitably spell out readable vulgarities. So, the escalation process begins...
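To make both failure modes concrete, here is a minimal sketch of that kind of filter. It is not any real mail client's filter engine; the word list, the function name, and the sample messages are all made up for illustration, and the "picture" is just a few hand-built bytes standing in for an attachment.

```python
# Hypothetical blocked-word list (mild stand-ins for the real thing).
BLOCKED_WORDS = [b"damn", b"hell"]

def naive_filter(raw_message: bytes) -> bool:
    """Return True if any blocked word appears anywhere in the raw bytes."""
    lowered = raw_message.lower()
    return any(word in lowered for word in BLOCKED_WORDS)

# Obfuscated spam: the digits break up the word, so nothing matches.
spam = b"Buy ch3ap p1lls now, d4mn good prices"

# Stand-in for picture data: binary bytes that happen to contain "HeLl".
picture = bytes([0xFF, 0xD8, 0xFF, 0xE0]) + b"\x00\x10HeLl\x9c\x01" + bytes(range(200, 256))

print(naive_filter(spam))     # False: the spam gets through
print(naive_filter(picture))  # True: the harmless picture gets deleted
```

The filter has no idea what a word means or where it appears, so it misses the disguised spam and throws away the innocent attachment.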

Bring on Mailwasher. For a while I was really impressed with it, and I still think it's the best choice for dialup users (because it works by initially grabbing only the mail headers). However, it became clear that the tide was turning and the spammers were beginning to win the war. Mailwasher works by combining filtering (which doesn't work by itself) with blacklisting and whitelisting. If an email address or an entire domain is on the blacklist, the email is deleted; if it is on the whitelist, the email is kept, no matter what filter (if any) it matches. This isn't working either, because it is apparently way too easy for a spammer to get a new email address or domain from which to send me spam (not to mention totally bogus addresses). So, we escalate again.
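The list logic described above can be sketched roughly like this. This is not Mailwasher's actual code; the function, its parameters, and the addresses are hypothetical, and checking the whitelist before the blacklist is my assumption for the case where a sender lands on both.

```python
def decide(sender, blacklist, whitelist, filter_says_spam):
    """Return "keep" or "delete" for a message from the given sender."""
    address = sender.lower()
    domain = address.rsplit("@", 1)[-1]
    # The lists override the filters (whitelist checked first: an assumption).
    if address in whitelist or domain in whitelist:
        return "keep"
    if address in blacklist or domain in blacklist:
        return "delete"
    # Only senders on neither list fall through to the filters.
    return "delete" if filter_says_spam else "keep"

blacklist = {"pills-r-us.example"}
whitelist = {"mom@family.example"}

print(decide("sales@pills-r-us.example", blacklist, whitelist, False))  # delete
print(decide("mom@family.example", blacklist, whitelist, True))         # keep
# A fresh domain is on neither list, so only the (weak) filter stands in the way.
print(decide("sales@pills-r-us2.example", blacklist, whitelist, False))  # keep
```

The last call is the whole problem: the blacklist only knows about yesterday's addresses.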

The latest weapon is a really cool little program called POPFile, which basically implements Bayesian statistics (developed by the Reverend Thomas Bayes, whose work was published in 1763, alas, two years after his death), or more importantly Bayes' theorem, which shows how to calculate the probability of one event given that you know some other event has occurred (if you want to know more, just enter any of the above terms into your favorite search engine). POPFile is slick. The documentation is a little hard to follow (and that's saying something when it comes from me), but it isn't as difficult as it first looks. It works with a combination of statistics, white/black lists (implemented as "magnets"), "buckets" (where every piece of mail gets dropped), and filtering in your usual mail client. Unfortunately for you dialup users, POPFile must download each entire email message, not just the headers.
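In spam terms, Bayes' theorem says P(spam | word) = P(word | spam) x P(spam) / P(word): how likely a message is to be spam, given that a particular word showed up in it. Here is a minimal sketch of a word-counting classifier built on that idea. It is a generic naive Bayes classifier with add-one smoothing, not POPFile's actual implementation; the class name, the bucket names, and the training messages are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

class BucketClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> messages trained

    def train(self, bucket, text):
        """Record the words of one message known to belong in bucket."""
        self.word_counts[bucket].update(text.lower().split())
        self.message_counts[bucket] += 1

    def classify(self, text):
        """Return the bucket with the highest posterior score for text."""
        words = text.lower().split()
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_messages = sum(self.message_counts.values())
        scores = {}
        for bucket, counts in self.word_counts.items():
            # Start from log P(bucket), the share of training mail in this bucket.
            score = math.log(self.message_counts[bucket] / total_messages)
            bucket_total = sum(counts.values())
            for word in words:
                # Add log P(word | bucket); the +1 keeps unseen words from zeroing it out.
                score += math.log((counts[word] + 1) / (bucket_total + len(vocabulary)))
            scores[bucket] = score
        return max(scores, key=scores.get)

classifier = BucketClassifier()
classifier.train("spam", "cheap pills buy now")
classifier.train("spam", "buy cheap watches now")
classifier.train("inbox", "meeting agenda for monday")
classifier.train("inbox", "lunch on monday")

print(classifier.classify("buy cheap pills"))  # spam
print(classifier.classify("monday meeting"))   # inbox
```

The scores are summed as logarithms so that multiplying many small probabilities doesn't underflow. The more mail you train it on, the better the word counts reflect what your own spam and your own real mail look like, which is why this approach keeps up with spammers where fixed word lists don't.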

I also recently discovered Mailinator. Mailinator is a cool website that basically allows you to make up any email address and submit it on any form. The mail address only exists for a couple of hours, and absolutely anyone can read the mail if they can guess your email address there. If you don't immediately realize the value of this, then you obviously don't have a spam problem. I love this service.

Thunderbird (by the Firefox people) has also recently built Bayesian statistics into their product. I have tested it, and once you train it, it functions nicely. Alas, there are no cool statistics like POPFile gives me, so for the time being I am sticking with what gives me the best statistics (I am sort of a number freak).

OK, spammers, bring it on.


This site was last updated 03/13/05