BSpam - A probabilistic spam filter

Updated: 6/30/2004
John Rauser - jmr /at/ visi /dot/ com

IMPORTANT NOTE

BSpam is inactive. Shortly after the last release of BSpam, I took a new job and moved across the country. When I moved, I closed my account with my existing ISP, started getting my mail via POP for easy portability, and started using POPFile. At that time I put BSpam development on the back burner, fully intending to return to it one day. Well, almost a year has passed, and I still find myself fully absorbed in other activities, so I am officially declaring BSpam inactive. I encourage you to look at other packages such as CRM114, bogofilter, or POPFile (which does its job pretty darn well).

BSpam is a perl implementation of the Bayesian spam filtering system that Paul Graham describes in A Plan for Spam.

For more information or to contribute to the BSpam project, see the SourceForge project page.

DOWNLOAD

You can download BSpam from the download section of the SourceForge project page.

FEATURES

Compatible with procmail.
Correct parsing and decoding of MIME attachments to defeat spammers who hide their content in base64 encoded attachments.
Simple HTML parsing to defeat spammers who include HTML comments or bogus tags to mask their content.
Whitespace compression to defeat spammers who disguise their content l i k e t h i s.
Configurable via an rc file.
Small, simple and hackable. Graham's system draws as much or more from the discipline of engineering as the disciplines of text processing or machine learning. I try to follow this philosophy. "Never let perfect become the enemy of good enough."
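
As a rough illustration of the three content-normalization features listed above, here is a minimal sketch in Python. It is not BSpam's perl code: the function names, the regular expressions, and the fallback charset are all assumptions made for this example, and BSpam's own handling may differ.

```python
import email
import re

def decoded_text_parts(raw_message):
    """Return the text of every text/* MIME part, with base64 or
    quoted-printable transfer encoding undone."""
    msg = email.message_from_string(raw_message)
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and binary attachments
        payload = part.get_payload(decode=True)  # bytes, or None
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"  # assumed fallback
        try:
            texts.append(payload.decode(charset, errors="replace"))
        except LookupError:  # charset name Python does not recognize
            texts.append(payload.decode("latin-1"))
    return texts

def strip_html(text):
    """Drop HTML comments and tags so that words split by them rejoin."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    return re.sub(r"<[^>]*>", "", text)

def compress_spaced_letters(text):
    """Join runs of three or more single characters separated by single spaces."""
    return re.sub(r"\b(?:\w ){2,}\w\b",
                  lambda m: m.group(0).replace(" ", ""), text)
```

Deleting tags outright rejoins a word that a spammer has split with a bogus tag, at the cost of also gluing together words that were separated only by markup. A more careful filter might replace some tags with a space instead.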

INSTALLATION AND CONFIGURATION

BSpam is meant to be used on unix systems with procmail and perl. BSpam assumes unix style line endings. I intended it to be used by folks who read e-mail with text based mailers like elm, mutt, or pine, but perhaps it could be adapted for other systems.

Unpack the tarball into a directory we'll call bspamhome
Edit bspamhome/bspamrc to taste
Put a corpus of spam in the location you indicated in bspamrc
Put a corpus of good email in the location you indicated in bspamrc
Run bspam-tune. This will create bspamhome/model
Edit ~/.forward. Make it look like this, WITH the quotes:
"|exec /usr/local/bin/procmail"
Edit ~/.procmailrc, add two recipes:
# Run through bspam if we haven't already
:0 fw
* ! ^X-BSpam-Verdict:
|bspamhome/bspam -bspamhome=bspamhome
# If bspam says it's spam, put it in ~/spam
:0 H
* ^X-BSpam-Verdict: Spam
spam
Make ~/.forward, ~/.procmailrc, bspamhome/* world readable.

Note: There is a known bug in bspam 0.4 and 0.5 that causes the scripts to look for bspamrc in ~/bspam and ., but ignore the -bspamhome argument. If you install into a directory other than ~/bspam, a workaround is to make a ~/bspam directory and then make ~/bspam/bspamrc a softlink to the real location of bspamrc. This is fixed in 0.5.1. My apologies for any confusion.

Ok. Now you're ready to go. Send yourself some innocuous mail and make sure it hits your regular inbox. Then send yourself an e-mail with a bunch of spam words in it (be creative), and see that it gets dumped into ~/spam.

MAINTENANCE

To update and maintain the model, append good email to your good corpus, and spam to your bad corpus. Whenever you think it appropriate, run bspam-tune.
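
For readers curious about what rebuilding a model involves, the following is a minimal Python sketch of the statistics described in Graham's A Plan for Spam. It is not the code of bspam-tune: the tokenizer, the function names, and the use of in-memory lists of messages are assumptions for illustration. The constants (doubled good counts, a minimum of five occurrences, clamping to 0.01 and 0.99, 0.4 for unknown tokens, the 15 most interesting tokens) come from Graham's essay, and BSpam may use different values.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")  # simplified tokenizer (assumption)

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def token_probabilities(good_msgs, bad_msgs):
    """Map each sufficiently common token to its estimated spam probability."""
    if not good_msgs or not bad_msgs:
        raise ValueError("both corpora must contain at least one message")
    good = Counter(t for m in good_msgs for t in tokenize(m))
    bad = Counter(t for m in bad_msgs for t in tokenize(m))
    ngood, nbad = len(good_msgs), len(bad_msgs)
    probs = {}
    for tok in set(good) | set(bad):
        g = 2 * good[tok]  # good counts are doubled to bias against false positives
        b = bad[tok]
        if g + b < 5:
            continue  # too rare to say anything about
        bf = min(1.0, b / nbad)
        gf = min(1.0, g / ngood)
        probs[tok] = max(0.01, min(0.99, bf / (gf + bf)))
    return probs

def spam_score(text, probs, unknown=0.4, n_interesting=15):
    """Combine the probabilities of the tokens furthest from 0.5."""
    ps = [probs.get(t, unknown) for t in set(tokenize(text))]
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = ps[:n_interesting]
    prod = math.prod(top)
    inv = math.prod(1.0 - p for p in top)
    return prod / (prod + inv)
```

In Graham's essay a message is treated as spam when the combined score exceeds 0.9. Because the probabilities are derived entirely from the two corpora, appending misclassified messages to the right corpus and rebuilding is what moves the scores over time.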

TROUBLESHOOTING

If the machine that processes your e-mail is highly loaded and low on memory, procmail or perl can run out of memory and crash while processing your mail. When this happens, you'll see messages come through with no bspam headers at all. You can use the following procmail recipe to cause mail to be requeued in this case. A word of caution: combined with the right kind of bug in bspam, this recipe could cause an infinite loop. I am unaware of any bugs in bspam that could cause such a loop, but it could happen. I know of bspam users who are using this strategy to good effect, but your mileage may vary.

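# If there's still no bspam header, something has gone wrong, requeue.
:0
* ! ^X-BSpam-Verdict:
{
  # 75 is EX_TEMPFAIL, which asks the mail transport to retry later
  EXITCODE=75
  # Unsetting HOST makes procmail stop processing and exit
  HOST
}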

MOTIVATION

My goal with BSpam was to be small, simple, and hackable. When I looked over the existing implementations of Graham's system, I wanted one that was a) in a language I'm comfortable with (C, C++, perl), b) designed for use with procmail and text based mailers like elm or mutt, and c) small and simple. Bogofilter almost fit the bill, but it seemed to me to be overly big and complicated: thousands of lines of code spread over more than a dozen files [1]. When I first read Graham's article, I figured I could implement the system in about 200 lines of perl.

Indeed, my original implementation of Graham's system was just under 200 lines of perl. The current version has a lot of logic to decode MIME attachments, parse HTML, compress whitespace and otherwise defeat spammer tricks, and it still tips the scales at just over 1,000 lines of perl.

MY EXPERIENCE

I've been running BSpam for a couple of months on my mailbox, and have been happy with the results. I get only about 20 spams a day, so it took a while to build up my spam corpus. As my corpus grew I correctly identified 85-90% of spams. My spam corpus is now well over 1,000 messages and my accuracy has risen to about 99%. The spams I miss now are mostly what Graham calls "spam of the future": very innocuous text followed by a link, or no text at all, just an IMG tag. I have gotten a handful of false positives, but they have been e-mails from companies and the content has looked vaguely spammy (unsubscribe instructions, lots of HTML, etc.), and I wouldn't have been too upset had I missed them completely.

I suspect that Graham's scoring method really depends on a large corpus (around a thousand messages). Based on the experiments of the SpamBayes folks, I suspect that Robinson's method does better with a small corpus. With large corpora, I suspect that the two methods are roughly equal, though perhaps Robinson's method still has an edge.
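
One way to see why corpus size might matter is to look at how Robinson's essay (listed in the references) treats rarely seen words. The Python sketch below is an illustration under assumptions, not code from BSpam or SpamBayes: the function name is made up, and the defaults s=1.0 and x=0.5 are illustrative starting values for the strength of the prior and the assumed probability of an unseen word.

```python
def robinson_f(spam_with_word, ham_with_word, nspam, nham, s=1.0, x=0.5):
    """Robinson-style estimate that a message containing a word is spam.

    The raw ratio p is pulled toward the prior x; the fewer messages the
    word has appeared in (n), the stronger the pull.
    """
    n = spam_with_word + ham_with_word
    if n == 0:
        return x  # never-seen word: fall back on the prior alone
    b = spam_with_word / nspam  # fraction of spam containing the word
    g = ham_with_word / nham    # fraction of good mail containing the word
    p = b / (b + g)
    return (s * x + n * p) / (s + n)
```

With ten messages in each corpus, a word seen in a single spam and no good mail has a raw ratio of 1.0 but an adjusted value of 0.75, while a word seen in eight spams rises to about 0.94. A single sighting therefore counts for less, which is plausibly what helps when the corpus is small.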

REFERENCES

Paul Graham, A Plan for Spam. http://www.paulgraham.com/spam.html

Paul Graham, Better Bayesian Filtering. http://www.paulgraham.com/better.html

Gary Robinson, Untitled Rant. http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html

The spambayes project. Bayesian classifier in Python. http://spambayes.sourceforge.net

The bogofilter project. Bayesian classifier in C. http://bogofilter.sourceforge.net

The POPFile project. A Bayesian classifier in a POP3 proxy. Written in perl. http://popfile.sourceforge.net

NOTES

[1] Bogofilter is big for good reason: speed was a stated design goal, which drove its author(s) to C and lex. Expressing text filtering in C takes more lines of code than in perl. Also, bogofilter is solid, production quality code; BSpam is not (yet).