
Anti-SPAM Techniques: Collaborative Content Filtering


This article is a part of the series on undesired email (spam, phishing, viruses, etc.). The material covers the Poisons and the Remedies.

By Stas Bekman.

Published: May 15th 2006


Collaborative filtering is a relatively new approach to content filtering. It is one of those Web 2.0 killer technologies. Here, rather than employing someone to attract and analyse spam, or having each user train a Bayesian filter, the whole community works together. The technique is to have many users share their judgements of what is an undesired email and what is not.

It works as follows: every time you receive a message, a special application may suggest whether it is spam or not. You can then accept the suggestion or override it. That is, if the application has flagged a message as spam and you think it is not, you say: no, it's not spam. And if you think the message is spam but the application didn't flag it, you mark it as spam. Whatever the outcome, it is reported to a central database (usually as a simple message checksum or a more elaborate fingerprint, since spammers are smart enough to randomise parts of the content, which defeats simple checksums). So the next time someone receives the same message, the system will be able to make a better suggestion, based on how you and others "voted" on it.
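The report-and-suggest loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the `fingerprint` function, the `CentralDatabase` class and the simple majority rule are all hypothetical, and a plain SHA-256 of the lightly normalised body stands in for the "simple message checksum".

```python
import hashlib
from collections import defaultdict


def fingerprint(body: str) -> str:
    """Return a simple checksum of the message body (hypothetical scheme)."""
    # Collapse whitespace and case so trivial reformatting keeps the same hash.
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class CentralDatabase:
    """In-memory stand-in for the shared database of user reports."""

    def __init__(self):
        # fingerprint -> {"spam": count, "ham": count}
        self.votes = defaultdict(lambda: {"spam": 0, "ham": 0})

    def report(self, body: str, is_spam: bool) -> None:
        """Record one user's verdict on a message."""
        key = "spam" if is_spam else "ham"
        self.votes[fingerprint(body)][key] += 1

    def suggest(self, body: str) -> bool:
        """Suggest 'spam' when spam reports outnumber not-spam reports."""
        counts = self.votes.get(fingerprint(body))
        if counts is None:
            return False  # never reported before: nothing to suggest
        return counts["spam"] > counts["ham"]
```

A mail client would call `suggest()` when a message arrives and `report()` once the user accepts or overrides the suggestion. Only the fingerprint needs to leave the user's machine, not the message itself.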

Now imagine a million people collaborating on their email filtering. If there is a virus outbreak, the first users to receive it will quickly report it as spam, and users who get to their mailbox later will already be told by their system that the message is spam. Now add the fact that these million users are spread around the world -- there is always someone awake and reading their email. So if your mailbox receives some spam while you are asleep, by the time you wake up other users in the collaborative community will already have alerted your system, which in turn will tell you which messages are spam.

Of course there is always a way to defeat the system, but the more users collaborate, the harder it gets. The central system maintains a reputation for each user, so if you report a spam message as not spam while 10,000 users have reported it as spam, the system will simply ignore your vote for this email, and if you keep doing that it will ignore your input for future emails too.
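To make the reputation idea concrete, here is a minimal sketch, assuming a very simple scheme: every user starts with a reputation of 1.0, each vote against the consensus halves it, and users who fall below a cut-off are ignored. The `ReputationTally` class and all its parameters are hypothetical, chosen only for illustration. Real systems use their own, more elaborate rules.

```python
from collections import defaultdict


class ReputationTally:
    """Weigh each vote by the voter's reputation (illustrative scheme only)."""

    def __init__(self, initial=1.0, penalty=0.5, min_reputation=0.1):
        self.reputation = defaultdict(lambda: initial)  # user -> weight
        self.penalty = penalty                # multiplier for a dissenting vote
        self.min_reputation = min_reputation  # below this, votes are ignored
        self.votes = defaultdict(dict)        # fingerprint -> {user: is_spam}

    def vote(self, user, fp, is_spam):
        """Record (or overwrite) one user's verdict on a fingerprint."""
        self.votes[fp][user] = is_spam

    def score(self, fp):
        """Reputation-weighted share of spam votes, or None if no trusted votes."""
        spam = ham = 0.0
        for user, is_spam in self.votes.get(fp, {}).items():
            weight = self.reputation[user]
            if weight < self.min_reputation:
                continue  # this user's input is ignored
            if is_spam:
                spam += weight
            else:
                ham += weight
        total = spam + ham
        return spam / total if total else None

    def settle(self, fp, threshold=0.5):
        """Lower the reputation of users who voted against the consensus."""
        share = self.score(fp)
        if share is None:
            return
        consensus = share > threshold
        for user, is_spam in self.votes[fp].items():
            if is_spam != consensus:
                self.reputation[user] *= self.penalty
```

With these assumed defaults, a user who contradicts the community on four settled messages drops to a reputation of 0.0625, below the 0.1 cut-off, so later votes from that user no longer count.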

The main advantage of this technique is that an outbreak of undesired email can be detected in a very short time, sometimes in under one minute. Therefore, if there are many participating users, the damage done by such an outbreak can be very minor.

Another advantage is that neither anti-spam vendors nor end users need to spend resources on spam detection (well, end users spend a short time only if they are the first to receive a new undesired email while the rest of the collaborating community was away, reading the latest Harry Potter sequel).


In my humble opinion this is a killer technology and, if implemented right, it will be very hard for spammers to defeat. It's hard to beat an army of a million users doing the filtering and sharing the info with the rest of the world.

Here at MailChannels, Corp we have integrated Cloudmark's (http://cloudmark.com/) implementation of this technology, and so far it seems to be working really well, but time will tell whether spammers (who are usually very smart people) will find a way to defeat that system.

Maybe one day, instead of trying to outsmart each other, all those very smart folks from both camps (spam and anti-spam) could direct their energy to making the world a better place to be in.


Here are some vendors supporting this technique (including open-source solutions):

Please notify me if you know of others.

Vipul's Razor (http://razor.sourceforge.net/)
Vipul's Razor is a distributed, collaborative, spam detection and filtering network. It is open source software. Through user contribution, Razor establishes a distributed and constantly updating catalogue of spam in propagation that is consulted by email clients to filter out known spam. Detection is done with statistical and randomized signatures that efficiently spot mutating spam content. User input is validated through reputation assignments based on consensus on report and revoke assertions, which in turn are used for computing confidence values associated with individual signatures.

Cloudmark (http://cloudmark.com/)
Cloudmark is at the moment one of the most effective and highest-performing anti-spam and anti-phishing solutions available. It consistently blocks over 98% of spam and phishing attacks in real time with near-zero false positives. Its products provide zero-hour virus protection for more rapid response and an additional layer of protection alongside traditional anti-virus solutions. It's free for non-commercial users, since those users power the technology :) Cloudmark's solution was originally based on Vipul's Razor. MailChannels, Corp's TrafficControl integrates Cloudmark (http://cloudmark.com/) as a gateway solution in a transparent SMTP proxy.

SpamWatch (http://www.cs.berkeley.edu/%7Ezf/spamwatch/)
SpamWatch should be considered alpha-quality software because of its research prototype nature...


Pyzor (http://pyzor.sourceforge.net/)
Pyzor is a Python implementation of Vipul's Razor, but using a different protocol.

Distributed Checksum Clearinghouse (http://www.rhyolite.com/anti-spam/dcc/)
The DCC (Distributed Checksum Clearinghouse) is an anti-spam content filter. The distributed checksums include values that are constant across common variations in bulk messages, including "personalizations."
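Several of the projects above rely on checksums that stay the same when a bulk message is lightly varied. The sketch below shows the general idea only. It is not the algorithm used by DCC, Razor or Pyzor. The `fuzzy_checksum` function and its normalisation rules (masking email addresses and numbers, ignoring case and spacing) are assumptions made up for this example.

```python
import hashlib
import re

# Hypothetical normalisation rules: mask the parts a bulk mailer typically
# personalises, so that two copies of one campaign hash to the same value.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+")
NUMBER_RE = re.compile(r"\d+")


def fuzzy_checksum(body: str) -> str:
    """Checksum that ignores addresses, numbers, case and spacing."""
    text = body.lower()
    text = EMAIL_RE.sub("<email>", text)   # recipient addresses
    text = NUMBER_RE.sub("<num>", text)    # tracking codes, amounts, dates
    text = " ".join(text.split())          # runs of whitespace and newlines
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = "Dear alice@example.com,\nyour prize code 12345 is waiting!"
    b = "Dear  bob99@example.org,\nyour prize code 98765 is waiting!"
    print(fuzzy_checksum(a) == fuzzy_checksum(b))
```

The trade-off is that the more aggressively a message is normalised, the harder it is for a spammer to dodge the checksum, but the higher the risk that two unrelated legitimate messages collide.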



Related Links

And here are some pointers for additional information on the subject:


Personal, Collaborative Spam Filtering
A paper from Trinity College, Dublin (pdf)

Wikipedia on DCC
Additional information about Collaborative Filtering

Vipul's Razor Documentation (http://razor.sourceforge.net/docs/)
Installation details

Related Books
A selection of books on Collaborative Filtering

Fighting Spam with Reputation Systems (http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=346)
Leveraging the power of communities and reputations can be an effective weapon against spam.

Schrödinger's Collaborative Spam Filter (http://www.unto.net/unto/work/schrdingers-collaborative-spam-filter/)
DeWitt Clinton presents a different Collaborative Spam Filter idea.

Spam Agent Architecture (http://linux.ucla.edu/%7Elarva/spam-agent/)
The goal of Spam Agent is the construction of a system that automatically detects and blocks unsolicited email in a distributed fashion.

Reputation Network Analysis for Email Filtering (http://trust.mindswap.org/papers/emailPaper/)
In addition to traditional spam detection applications, new methods of filtering messages - including whitelist and social network based filters - are being investigated to further improve on mail sorting and classification. In this paper, we present an email scoring mechanism based on a social network augmented with reputation ratings.

Collaborative Filtering Research Papers (http://jamesthornton.com/cf/)
55+ recent papers



Continue reading about other Remedies or jump to the email-related Poisons section.
