Anti-SPAM Techniques: Bayesian Content Filtering

Description

This article is a part of the series on undesired email (spam, phishing, viruses, etc.). The material covers the Poisons and the Remedies.

By Stas Bekman.

Published: May 15th 2006

With Bayesian analysis, statistics do all the work. The catch is that a Bayesian filter requires training: when you start it for the first time, you need to feed it both your good mail and your undesired mail, telling it which is which. From that point on the filter tries to decide what is spam and what is not, and sorts the email into different folders. You still need to review both folders regularly and tell the filter about any misplaced email (sometimes it will miss a spam, and sometimes it will put a valid email into the spam folder). Assuming that the email you receive is fairly homogeneous in nature, the filter will start making fewer and fewer mistakes in a relatively short time. However, since spammers try to outsmart the statistics, they come up with emails full of gibberish content, which often cause a miss, and you get spam in your INBOX.
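To make the train / classify / correct cycle concrete, here is a minimal sketch of a word-frequency filter in Python. It is a toy naive Bayes classifier with add-one smoothing, not the algorithm of any product listed in this article; the class name, the tokenizer, and the sample messages are all made up for illustration.

```python
import math
import re
from collections import Counter


class ToyBayesFilter:
    """Hypothetical toy filter: multinomial naive Bayes over words."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Very naive tokenizer (an assumption): lowercase runs of
        # letters, digits, "$" and "'".
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        """Tell the filter that `text` is 'spam' or 'ham'."""
        self.words[label].update(self.tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Return the estimated P(spam | text), between 0 and 1."""
        vocab = set(self.words["spam"]) | set(self.words["ham"])
        total = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Prior: how much of the training mail had this label.
            score = math.log((self.messages[label] + 1) / (total + 2))
            # Add-one smoothing so unseen words never give zero.
            denom = sum(self.words[label].values()) + len(vocab) + 1
            for word in self.tokenize(text):
                score += math.log((self.words[label][word] + 1) / denom)
            log_score[label] = score
        # Turn the two log scores into a probability; clamp the
        # difference so math.exp cannot overflow on long messages.
        diff = log_score["ham"] - log_score["spam"]
        diff = max(-700.0, min(700.0, diff))
        return 1.0 / (1.0 + math.exp(diff))


# Initial training: the user labels some good and some undesired mail.
bayes = ToyBayesFilter()
bayes.train("buy cheap pills now", "spam")
bayes.train("cheap pills special offer", "spam")
bayes.train("meeting agenda for monday", "ham")
bayes.train("lunch on monday?", "ham")

obvious_spam = bayes.spam_probability("cheap pills offer")
obvious_ham = bayes.spam_probability("meeting agenda monday")

# A valid email lands in the spam folder; the user corrects the
# filter by training the message with the right label.
misfiled = "special offer for you"
before = bayes.spam_probability(misfiled)
bayes.train(misfiled, "ham")
after = bayes.spam_probability(misfiled)

print(obvious_spam, obvious_ham, before, after)
```

With this tiny training set the misfiled message scores above 0.5 before the correction and below 0.5 after it. A real filter needs far more mail than this before its verdicts mean anything.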

The main disadvantage here is that it requires constant training, even though after a certain point it will catch most of the spam and produce almost no false positives. This approach works best if each user has their own filter, since different users receive different emails. As they say: someone's spam is someone else's ham.
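The "someone's spam is someone else's ham" point can be sketched by keeping one set of word counts per user and scoring the same message against each. The Python below is an illustration only: the user names, the messages, and the simplified log-odds score (which ignores class priors for brevity) are assumptions, not taken from any real product.

```python
import math
from collections import Counter, defaultdict


def new_filter():
    # Separate word counts for each label.
    return {"spam": Counter(), "ham": Counter()}


# One independent filter per user, created on first use.
filters = defaultdict(new_filter)


def train(user, text, label):
    filters[user][label].update(text.lower().split())


def spam_score(user, text):
    """Sum of smoothed log-odds: above 0 leans spam, below 0 leans ham."""
    counts = filters[user]
    vocab = len(set(counts["spam"]) | set(counts["ham"])) + 1
    spam_total = sum(counts["spam"].values()) + vocab
    ham_total = sum(counts["ham"].values()) + vocab
    score = 0.0
    for word in text.lower().split():
        p_spam = (counts["spam"][word] + 1) / spam_total
        p_ham = (counts["ham"][word] + 1) / ham_total
        score += math.log(p_spam / p_ham)
    return score


# A pharmacist gets legitimate mail about pills...
train("pharmacist", "new pills in stock", "ham")
train("pharmacist", "supplier discount on pills", "ham")
train("pharmacist", "win a free cruise", "spam")

# ...while for an engineer the same words only appear in spam.
train("engineer", "build failed on main", "ham")
train("engineer", "code review at noon", "ham")
train("engineer", "discount pills online", "spam")
train("engineer", "cheap pills discount", "spam")

message = "discount pills in stock"
pharmacist_score = spam_score("pharmacist", message)
engineer_score = spam_score("engineer", message)
print(pharmacist_score, engineer_score)
```

The same message scores as ham for the pharmacist and as spam for the engineer. A single shared filter would have to settle on one verdict for both of them.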

In this approach it is the users who spend their time training the filter, so if your organisation has lots of users then you may be wasting a lot of time across the board. However, this solution usually doesn't cost the company anything, since the real knowledge base is provided by the users.

IMHO

In my humble opinion this technique can be very useful if each user trains their own Bayesian filter. However, it doesn't scale as well as other techniques that remove most of the spam at the gateway: in a big organisation, each user spending a few minutes feeding the Bayesian filter adds up to a lot of time across the organisation.

Vendors

Here are some vendors supporting this technique (including open-source solutions):

Kaspersky Internet Security (http://www.kaspersky.com)
(Commercial) and other Kaspersky products use Bayesian filtering.

CRM114 (http://crm114.sourceforge.net/)
(OSS) - the Controllable Regex Mutilator. Supports regexes, sparse binary polynomial matching with a Bayesian Chain Rule evaluator, Hidden Markov Model, and more.

Death2Spam (http://death2spam.net/)
(Commercial) provides filtering for personal email accounts via a POP proxy server.

POPFile (http://popfile.sourceforge.net/)
(OSS) is an automatic mail classification tool. Once properly set up and trained, it will scan all email as it arrives and classify it based on your training. You can give it a simple job, like separating out junk e-mail, or a complicated one -- like filing mail into a dozen folders. Think of it as a personal assistant for your inbox.

SpamAssassin (http://spamassassin.apache.org/)
(OSS) is a mail filter that attempts to identify spam using a variety of mechanisms, including text analysis, Bayesian filtering, DNS blocklists, and collaborative filtering databases.

SpamBayes (http://spambayes.sourceforge.net/)
(OSS) provides a statistical (commonly, although a little inaccurately, referred to as Bayesian) anti-spam filter, initially based on the work of Paul Graham. The major difference between this and other, similar projects is the emphasis on testing newer approaches to scoring messages. While most anti-spam projects are still working with the original Graham algorithm, the developers found that a number of alternate methods yielded a more useful response.

SpamProbe (http://spamprobe.sourceforge.net/)
(OSS) relies on a Bayesian analysis of the frequency of words used in spam and non-spam emails received by an individual person. The process is completely automatic and tailors itself to the kinds of emails that each person receives.

SpamSweep (http://www.bainsware.com/spamsweep/)
(Commercial) is an advanced Bayesian spam filter with a simple, easy-to-understand interface. SpamSweep seamlessly combines many filtering technologies, including domain and relay blacklists, sender whitelisting, and a Bayesian filter, to automatically delete spam messages before they're downloaded by your email client.

trimMail Inbox (http://www.trimmail.com/)
(Commercial) is an easy, powerful, affordable way to protect your mail servers from spam, viruses, dictionary attacks, and other hazards of the internet.

Bogofilter (http://bogofilter.sourceforge.net/)
(OSS) is a mail filter that classifies mail as spam or ham (non-spam) by a statistical analysis of the message's header and content (body). The program is able to learn from the user's classifications and corrections.

SpamOracle (http://pauillac.inria.fr/~xleroy/software.html)
(OSS) proceeds by statistical analysis of the words that appear in the e-mail, comparing the frequencies of words with those found in a user-provided corpus of known spam and known legitimate e-mail. The classification algorithm is based on Bayes' formula, and is described in Paul Graham's paper, A plan for spam (http://www.paulgraham.com/spam.html).

Sophos PureMessage (http://www.sophos.com/products/)
(Commercial) performs statistical analysis too.

DSPAM (http://www.nuclearelephant.com/projects/dspam/)
(OSS) is an adaptive filter, which means it is capable of learning and adapting to each user's email. Implemented by Jonathan Zdziarski, the author of the book Ending Spam.

SpamSieve (http://c-command.com/spamsieve/)
(Commercial) adds superior, easy-to-use Bayesian spam filtering to Mac OS X email clients.

MX Logic (http://www.mxlogic.com/mxladvantage/)
(Commercial) targets small and medium-sized businesses.

McAfee SpamKiller (http://www.mcafee.com/)
(Commercial) provides Bayesian filtering technology.

GFI MailEssentials for Exchange/SMTP/Lotus (http://www.gfi.com/mes/)
(Commercial) is an anti-spam filter for mail servers (Exchange Server, Lotus Domino and others) which uses Bayesian filtering among other methods of catching spam, such as third-party DNSBL checks, IP reputation filtering, email header analysis and support for Sender Policy Framework (SPF).

Brightmail (http://www.brightmail.com/)
(Commercial) was acquired by Symantec.

Please notify me if you know of others.