By Stas Bekman.
Published: May 15th 2006
Collaborative filtering is a relatively new approach to content filtering. It is one of those Web 2.0 killer technologies. Here, rather than employing someone to attract and analyse spam, or having each user train a Bayesian filter, the whole community works together. The technique is to have many users share their judgements of which email is undesired and which is not.
It works as follows: every time you receive a message, a special application suggests whether it is spam or not. You can then accept the suggestion or override it: if the application has flagged a message as spam and you think it is not, you say so; if you think a message is spam but the application did not flag it, you mark it as spam. Whatever the outcome, it is reported to a central database (usually as a simple message checksum or a more elaborate fingerprint, since spammers are smart enough to randomise parts of the content to defeat simple checksums). The next time someone receives the same message, the system can make a better suggestion, based on how you "voted" on it.
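The report-and-suggest loop above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the normalisation inside fingerprint(), the CentralDatabase class and its report()/suggest() methods are hypothetical names, not how any real product does it, and real fingerprints are far more elaborate.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(body: str) -> str:
    """Hash a normalised copy of the body (hypothetical, deliberately naive)."""
    # Keep letters only, so digits, punctuation and spacing tricks are ignored.
    text = re.sub(r"[^a-z]+", " ", body.lower()).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class CentralDatabase:
    """In-memory stand-in for the shared vote store (an assumption)."""

    def __init__(self) -> None:
        # fingerprint -> [spam votes, not-spam votes]
        self.votes = defaultdict(lambda: [0, 0])

    def report(self, body: str, is_spam: bool) -> None:
        """Record one user's verdict on a message."""
        self.votes[fingerprint(body)][0 if is_spam else 1] += 1

    def suggest(self, body: str) -> bool:
        """Suggest spam when spam votes outnumber not-spam votes."""
        spam, ham = self.votes.get(fingerprint(body), (0, 0))
        return spam > ham
```

With this sketch, one user reporting "Buy pills now! ref 12345" as spam is enough for a later copy with a different reference number to be suggested as spam, because both bodies normalise to the same fingerprint.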
Now imagine a million people collaborating on their email filtering. If there is a virus outbreak, the first users to receive it will quickly report it as spam, and users who get to their mailboxes later will already be told by their systems that the message is spam. Add to that the fact that these million users are spread around the world: there is always someone awake and reading their email. So if spam arrives in your mailbox while you are asleep, by the time you wake up other users in the collaborative community will already have alerted your system, which in turn will tell you which messages are spam.
Of course there is always a way to defeat the system, but the more users collaborate, the harder it gets. The central system maintains a reputation for each user, so if you report a spam message as not spam while 10,000 users have reported it as spam, the system will simply ignore your vote for this email, and if you keep doing that it will ignore your input for future emails too.
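To make the reputation idea concrete, here is a minimal sketch, again under assumptions of my own: the ReputationTracker class, the halving penalty and the IGNORE_BELOW cut-off are invented for illustration and are not taken from any real system.

```python
from collections import defaultdict

IGNORE_BELOW = 0.2  # hypothetical cut-off: votes from users below it are ignored


class ReputationTracker:
    """Toy reputation-weighted voting (all names and numbers are assumptions)."""

    def __init__(self) -> None:
        self.reputation = defaultdict(lambda: 1.0)  # every user starts at 1.0
        self.votes = defaultdict(dict)  # fingerprint -> {user: is_spam}

    def vote(self, user: str, fp: str, is_spam: bool) -> None:
        """Store the user's latest vote on the message with fingerprint fp."""
        self.votes[fp][user] = is_spam

    def verdict(self, fp: str) -> bool:
        """Reputation-weighted majority; distrusted users do not count."""
        score = 0.0
        for user, is_spam in self.votes[fp].items():
            weight = self.reputation[user]
            if weight >= IGNORE_BELOW:
                score += weight if is_spam else -weight
        return score > 0

    def settle(self, fp: str) -> None:
        """Halve the reputation of everyone who voted against the verdict."""
        outcome = self.verdict(fp)
        for user, is_spam in self.votes[fp].items():
            if is_spam != outcome:
                self.reputation[user] *= 0.5
```

A user who keeps voting against a large majority loses half of their reputation each time a message is settled, so after three such votes they drop to 0.125, below the cut-off, and their input stops counting.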
The main advantage of this technique is that an outbreak of undesired email can be detected in a very short time, sometimes under one minute. Therefore, if there are many participating users, the damage done by the outbreak can be very minor.
Another advantage is that neither anti-spam vendors nor end users need to spend resources on spam detection (well, end users spend a short time only if they are the first to receive a new undesired email while the rest of the collaborating community was away, reading the latest Harry Potter sequel).
In my humble opinion this is a killer technology and, if implemented right, it will be very hard for spammers to defeat. It's hard to beat an army of a million users doing the filtering and sharing the information with the rest of the world.
Here at MailChannels, Corp we have integrated Cloudmark's (http://cloudmark.com/) implementation of this technology, and so far it seems to be working really well, but time will tell whether spammers (who are usually very smart people) will find a way to defeat that system.
Maybe one day, instead of trying to outsmart each other, all those very smart folks from both camps (spam and anti-spam) could direct their energy to making the world a better place to be in.
Here are some vendors supporting this technique (including open-source solutions):
Please notify me if you know of others.
And here are some pointers for additional information on the subject:
Spam with Reputation Systems (http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=346)
Collaborative Spam Filter (http://www.unto.net/unto/work/schrdingers-collaborative-spam-filter/)
Network Analysis for Email Filtering (http://trust.mindswap.org/papers/emailPaper/)
Filtering Research Papers (http://jamesthornton.com/cf/)