Spam is a very real
problem that many people have to deal with on a daily basis. For
those who have decided to do something about it and have started to
investigate spam filtering, this article
provides a brief introduction to the types of spam
filters available.
Despite the
bewildering array of spam filters available today, all claiming to
be the best “of its kind”, there are really just five filtering
methodologies in general use, and all products rely on one, or
a combination, of these:
Content-Based Filters
“In the
beginning, there were content-based filters.”
These filters
scan the contents of the message and look for tell-tale signs that the
message is spam. In the early days of spamming it was quite simple
to look out for “kill words” such as
“Lose Weight” and mark a message as spam if one was found.
Very soon
though, spammers got wise to this and started resorting to all kinds
of tricks to get their message past the filters. The days of
“obfuscation” had begun.
We started getting messages containing the phrase “L0se Welght”
(Notice the zero for “o” and “l” for “i”) and even more bizarre –
and sometimes quite ingenious – variations.
This rendered
basic content-based filters somewhat ineffective, although there are
one or two on the market now that are clever enough to “see through”
these attempts and still provide good results.
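One way a content-based filter might “see through” simple obfuscation is to normalise look-alike characters before searching for kill words. The sketch below is illustrative only: the kill-word list, the look-alike map and the function names are assumptions, not taken from any particular product.

```python
import re

# Hypothetical kill-word list and look-alike map, for illustration only.
KILL_WORDS = ["lose weight"]
LOOKALIKES = str.maketrans(
    {"0": "o", "1": "i", "l": "i", "3": "e", "@": "a", "$": "s"}
)


def normalise(text: str) -> str:
    """Lower-case the text, fold look-alike characters together and
    drop anything that is not a letter or a space."""
    folded = text.lower().translate(LOOKALIKES)
    return re.sub(r"[^a-z ]+", "", folded)


def is_spam(message: str) -> bool:
    """Return True if any kill word appears in the normalised message.

    The kill words go through the same normalisation, so "lose" and
    "l0se" both end up as the same string before they are compared.
    """
    cleaned = normalise(message)
    return any(normalise(word) in cleaned for word in KILL_WORDS)
```

Because “l” and “i” are folded together, a filter like this trades some precision for coverage; a real product would need a far larger map and some protection against matching innocent words.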
Bayesian Based Filters
“The
Reverend Bayes comes to the rescue”
Born in London
in 1702, the son of a minister, Thomas Bayes developed a formula which
allowed him to determine the probability of an event occurring based
on the probabilities of two or more independent evidentiary events.
Bayesian filters
“learn” from studying known good and bad messages. Each message is
split into single “word bytes”, or tokens, and these tokens are
placed into a database along with how often they are found in each
kind of message.
When a new
message arrives to be tested by the filter, the new message is also
split into tokens and each token is looked up in the database.
Extrapolating results from the database and applying a form of the
good reverend’s formula, known as the “Naive Bayesian” formula, the
message is given a “spamicity” rating and can be dealt with
accordingly.
Bayesian filters
are typically capable of achieving very good accuracy rates (>97% is
not uncommon), and require very little ongoing maintenance.
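The training and scoring steps described above can be sketched in a few lines. This is a minimal illustration, assuming a simple in-memory “database”, add-one smoothing and a regular-expression tokeniser; the class and method names are hypothetical, and real filters differ in how they choose and combine tokens.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into lower-case word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    def __init__(self):
        # For each kind of message: how many messages contained each token.
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the tokens of a known good ("ham") or bad ("spam") message."""
        self.counts[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def spamicity(self, text):
        """Return a rating between 0 and 1; higher means more spam-like."""
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        # Start from the prior odds, then add the evidence of each token.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for token in set(tokenize(text)):
            # Add-one smoothing keeps unseen tokens from zeroing the result.
            p_spam = (self.counts["spam"][token] + 1) / (n_spam + 2)
            p_ham = (self.counts["ham"][token] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))
```

After training on a handful of known good and bad messages, a rating above some chosen cut-off (0.5 here would be the neutral point) could be treated as spam. The “naive” part is the assumption that tokens occur independently of one another, which is what allows their evidence to be simply added up.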
Whitelist/Blacklist Filters
“Who goes
there, friend or foe?”
This very basic
form of filtering is seldom used on its own nowadays, but can be
useful as part of a larger filtering strategy.
A “whitelist” is
nothing more than a list of e-mail addresses from which you wish to
accept communications. A whitelist filter would only accept messages
from these people, and all others would be rejected.
A “blacklist”,
conversely, is a list of e-mail addresses, and sometimes IP
addresses (the numeric addresses that identify computers on a network), from which
communications will not be accepted.
While both lists may
seem like a good idea at the outset, a whitelist methodology is
too restrictive for most people and, as virtually all spam e-mails
carry a forged “from” address, there is little point in adding
that address to a blacklist, as it is very unlikely to be the
same next time.
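In code, both lists amount to simple membership checks. The sketch below uses made-up addresses and a hypothetical `classify` function; the `whitelist_only` flag switches between the strict whitelist behaviour and a more permissive mode that only rejects blacklisted senders.

```python
# Hypothetical lists, for illustration only.
WHITELIST = {"alice@example.com"}
BLACKLIST_ADDRESSES = {"offers@spam.example"}
BLACKLIST_IPS = {"203.0.113.7"}


def classify(sender, source_ip, whitelist_only=False):
    """Return "accept" or "reject" for a message.

    Whitelisted senders are always accepted and blacklisted addresses
    or IP addresses are always rejected. Everyone else is rejected
    only when the filter runs in strict whitelist mode.
    """
    sender = sender.strip().lower()
    if sender in WHITELIST:
        return "accept"
    if sender in BLACKLIST_ADDRESSES or source_ip in BLACKLIST_IPS:
        return "reject"
    return "reject" if whitelist_only else "accept"
```

Note that the check trusts the “from” address as given, which is exactly the weakness described above.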
There are organisations
on the Internet that maintain lists of known “bad” sources of
e-mail. Many filters today have the ability to query these servers
to see if the message they are looking at comes from a source
identified by such an Internet-based blacklist, often called an RBL
(Real-time Blackhole List). Although quite effective, these lists do
tend to suffer from “false positives”, where
good messages are incorrectly identified as spam. This happens often
with newsletters.
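Many such lists are published over DNS: the filter reverses the octets of the sender's IP address, appends the name of the list's zone, and looks the result up; an answer means the address is listed. The following sketch assumes that convention and an IPv4 source. The zone name is passed in by the caller and any value used with it (for example `rbl.example.org`) is a placeholder, not a real service.

```python
import ipaddress
import socket


def rbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a given zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the lookup succeeds, i.e. the address is listed.

    Any resolution failure is treated as "not listed", so a network
    problem will silently let the message through.
    """
    try:
        socket.gethostbyname(rbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

For example, `rbl_query_name("192.0.2.99", "rbl.example.org")` produces the name `99.2.0.192.rbl.example.org`.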
Challenge/Response Filters
“Open
sesame!”
Challenge/Response filters are characterised by their ability to
automatically send a response to a previously unknown sender asking
them to take some further action before their message will be
delivered. This is often referred to as a "Turing Test" - named
after a test devised by British mathematician Alan Turing to
determine if machines could “think”.
Recent years
have seen the appearance of some Internet services which
automatically perform this Challenge/Response function for the user
and require the sender of an e-mail to visit their web site
before the message is delivered.
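The hold-and-release logic behind such a filter can be sketched as follows. This is a simplified illustration with hypothetical names: it keeps everything in memory, holds one message per challenge, and leaves out actually sending the challenge e-mail or hosting the confirmation page.

```python
import secrets


class ChallengeResponseFilter:
    def __init__(self):
        self.known_senders = set()
        self.pending = {}  # challenge token -> (sender, held message)

    def receive(self, sender, message):
        """Deliver mail from known senders; hold and challenge the rest."""
        if sender in self.known_senders:
            return ("deliver", None)
        token = secrets.token_urlsafe(16)
        self.pending[token] = (sender, message)
        # The token would be sent back to the sender inside the challenge.
        return ("challenge", token)

    def confirm(self, token):
        """Release the held message once the sender answers the challenge.

        Returns the held message, or None if the token is unknown.
        """
        held = self.pending.pop(token, None)
        if held is None:
            return None
        sender, message = held
        self.known_senders.add(sender)
        return message
```

Once a sender has answered one challenge, later messages from the same address are delivered directly, which is what makes the scheme behave like a self-maintaining whitelist.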
Critics of this
system claim it to be too drastic a measure and that it sends a
message that "my time is more important than yours" to the people
trying to communicate with you.
For some
low-traffic e-mail users though, this system alone may be a perfectly
acceptable method of completely eliminating spam from their inbox -
one step above the "Whitelist" system outlined above.
Community Filters
“A united
front”
These types of
filters work on the principle of "communal knowledge" of spam. When
a user receives a spam message, they simply mark it as such in their
filter. This information is sent to a central server where a
“fingerprint” of the message is stored.
After enough
people have “voted” this message to be spam, then it is stopped from
reaching all the other people in the community.
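The fingerprint-and-vote mechanism can be sketched as below. The class name, the vote threshold of three and the use of a SHA-256 hash of the whitespace-normalised text are all assumptions made for illustration; an exact hash like this is easily defeated by small changes to a message, so real services are likely to use more tolerant fingerprints.

```python
import hashlib


class CommunityServer:
    def __init__(self, threshold=3):
        self.threshold = threshold  # votes needed before a message is blocked
        self.votes = {}  # fingerprint -> set of users who reported it

    @staticmethod
    def fingerprint(message):
        """Hash the lower-cased, whitespace-normalised message text."""
        normalised = " ".join(message.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def report_spam(self, user, message):
        """Record one user's vote; repeat votes by the same user count once."""
        self.votes.setdefault(self.fingerprint(message), set()).add(user)

    def is_spam(self, message):
        """Return True once enough distinct users have reported the message."""
        voters = self.votes.get(self.fingerprint(message), ())
        return len(voters) >= self.threshold
```

Storing voters as a set means a single enthusiastic user cannot push a message over the threshold alone.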
This type of
filtering can prove to be quite effective, although it stands to
reason that it can never be 100% effective, as a few people have to
receive the spam for it to be “flagged” in the first place. Just
like its cousin, the Internet blacklist (RBL), this system
can also suffer from “false positives”, or messages incorrectly
identified as spam.
Hopefully you
are now armed with a little more information to be able to make an
informed decision on the best spam filter for you.