Application of Biological Metaphors for Identifying and Killing Spam
By Shawn Evans
“Tricksy spammers, they’ll stop at nothing to get my precious”

Introduction
Spam has become the first great plague of the 21st century. Over 60% of all e-mails are spam, costing U.S. corporations more than $10 billion annually, on top of the productivity lost from scanning through e-mail and deleting spam. Along with this, an estimated 5% of spam campaigns are outright scams, with the remaining majority pitching products that are dubious at best. Parents used to worry about their kids surfing the web and finding pornographic websites; now we have to worry more about our kids opening an e-mail client and finding a pornographic spam message. Spam must be stopped before it cripples the infrastructure of the internet and drives users away from one of the greatest forms of communication, e-mail.

Can Laws Defeat Spam?
No. This has to be one of the greatest misconceptions of users. The internet is an international network that cannot be governed by one country’s laws. Spammers can exist anywhere on the internet, meaning they can sling their wares from anywhere in the world, making the laws of one country completely irrelevant. Also, the decentralized, self-organizing design of the internet makes it nearly impossible to regulate by external means. It would be easier to regulate the weather than to regulate the internet.

Spam as a Living Organism
Up until recently, most researchers in the fight against spam have failed to classify it as an artificial living organism, hindering the development of effective tools and techniques to kill it. While this classification may sound strange, consider the following:
The Plan of Attack
To begin with, let’s start with a view of the filter processes from 10,000 feet up:
Other features that any implementation of this filter should consider:
Spam Gene Markers
There are literally hundreds of different markers we could use to identify a message as spam, but for the sake of simplicity, we’re going to view (what I would consider) the top 9 markers.
Here’s an analysis of why I would consider these some of the top markers.

Is the format of the e-mail HTML?
If you’re like me, the dominant message format of your ham is plain text or RTF, while most of the HTML messages you get are spam. The reason for this is that spammers LOVE HTML. It allows them to include images that track who looked at the message and who didn’t, it allows them to hide words that fool Bayesian analysis, and, spammers being marketers, it allows them to create a visually appealing sales pitch. In other words, if an e-mail is formatted in HTML, it has a higher probability of being spam than if it were not.

Is the e-mail formatted in valid HTML?
To hide phrases and sets of keywords from the user (and not from spam filters; these keywords are there to break the filter), spammers take advantage of the fact that HTML can be formatted improperly, with certain sections, such as the pitch, still visible to the user. If this trait exists, the message has a high probability of being spam. Again, spammers will include these to intentionally break a spam filter (but not ours!).

Is the e-mail encoding base64?
Since most spam filters cannot convert base64 encoded text to readable text (but e-mail clients can), encoding messages in this format allows spammers to hide the sales pitch from the filter, while still delivering it to the user. Since only spammers use base64 (too much overhead for normal message use), if this trait exists, it is almost certainly spam.

Does the e-mail contain image links?
Again, since spammers want to hide the sales pitch from spam filters and not from users, certain spam messages will contain one line of text that links to an image on the internet, which gets displayed in the e-mail client. Usually, when this image is loaded by the mail client, it launches a program on the spammer’s server telling them you viewed this e-mail. Hence, you’re a better target for marketing, meaning, congratulations, you get more spam (I cannot believe that e-mail clients do not disable this feature by default). Since most normal mail does not contain images linked to websites, the existence of this trait means the message has a high probability of being spam.

Does the e-mail contain “hidden” text that the user cannot see?
A large portion of HTML-based spam contains sections of text that are completely hidden from the eyes of users. This is done either by setting the top of the page outside the viewing pane of the e-mail client, or by simply setting the foreground color to the same as the background. Usually, these sections contain a list of random non-spam words. The whole purpose of this exercise is to thwart spam filters, because most look at the entire message rather than the piece the user sees. With these random, non-spam words, a standard Bayesian spam filter will calculate a lower spam probability for the e-mail, allowing it through as ham. If this trait exists, the e-mail is spam.
NOTE: Our filter will completely ignore these areas of hidden text, and look just at the message the user sees.

Does this e-mail have a large number of recipients?
This gene may or may not apply, so its implementation is optional. Generally, spam messages will have a long list of recipients. But if you receive work-related e-mail, or your friend shoots an e-mail out to 20 people at a time, this is no longer a significant trait. This is one I leave to you to implement.

What’s the ratio of links to words in this e-mail?
Most spam will contain at least one link to the spammer’s website, or to wherever you can “unsubscribe” (otherwise known as proving your e-mail address is valid and adding it to 200 other spam lists). I would not consider this a dominant spam trait, but if you have 5 links and only 10 words in the e-mail body, then there’s a good chance this is spam.

What’s the ratio of misspelled words to words in this e-mail?
Because spammers will stop at nothing to hide words that will tip off a spam filter, they will often fill an e-mail with phrases like “v1agr@a” and other fun gibberish to thwart a filter. Again, I would not necessarily consider this a dominant trait of just spam (come on, we all have that friend who spells at a 2nd grade level), but combined with other traits, it might be a good indication of whether this is spam or not.

What’s the Bayesian spam probability of this e-mail?
If the other traits described above do not tip off the filter, this one will become very dominant. The Bayesian spam probability is what most e-mail filters to date use to filter e-mail. Basically, it examines all string tokens within an e-mail, and calculates how often each token appears in spam messages. But, because of all the nasty little tricks spammers now play, this generally only works if none of the above traits exist. Since our filter is smart enough to look at just what the user sees, it becomes much more effective in our application.

Now that we have defined what the overall process is going to look like, let’s delve into the code.

Putting it Together
Let’s go ahead and get into the good stuff by going over the code for our application.
NOTE: The sample code for this application is in C#. C# was chosen over C++ so beginners could better see the structures of the process, and C# was chosen over Java because of the inherent performance advantages of .NET. If anyone would like to re-factor the code into C++ or Java, please do and I will include it with the distribution of this article.

The namespace defined for our engine is SpamAwareLib. Its structure is as follows:
DataObjects – As the name implies, the objects here represent units and collections of data within the engine.
NOTE: The Dictionary.cs file is the object that will contain the dictionary for spell checking. I did include the serialized version of the object along with a 178,000+ word dictionary text file. If the Dictionary.cs file is modified, you will need to reload and re-serialize the Dictionary object with the file.
MsgProcessors – This is where the primary message parsing and gene finding takes place. The ChromosomeGen object will generate a chromosome by passing a mail message through each one of the processors in this namespace.
NeuralNet – Again, as the name implies, a collection of objects within this namespace that represents our ANN and its internal data objects. This is where our chromosome gets processed to give us the net probability that the message is spam, ham, or mystery meat.
Utility – This is where our chromosome generator and neural network generator live. Our chromosome generator takes as input the mail message (along with the dictionary, the mail corpuses, and the neural net), and then generates the appropriate chromosome for the message, and the net probability that the message is ham, spam, or mystery meat.

To see a code representation of how the engine works, let’s take a look at the chromosome generator method GenerateChromosomeAndCheck.
public static Chromosome GenerateChromosomeAndCheck(Message msg,
                                                    Dictionary dict,
                                                    Corpus ham,
                                                    Corpus spam,
                                                    NNSpamAware net)
{
    string body = msg.Body;
    Chromosome stats = new Chromosome();
    body = body.ToLower();

    // Cheap structural genes, computed straight from the raw body
    stats.IsHTML =
        Chromosome.ConvertBoolToDouble(HTMLParser.IsHTML(body));
    stats.ContainsImgLinks =
        Chromosome.ConvertBoolToDouble(HTMLParser.ContainsImgLinks(body));
    stats.IsValidHTMLBody =
        Chromosome.ConvertBoolToDouble(HTMLParser.IsHTMLBodyValid(body));
    stats.ContainsHiddenText =
        Chromosome.ConvertBoolToDouble(
            HTMLParser.ContainsInvisibleHTMLText(body));
    stats.LinkDegree =
        Chromosome.CalcLinkDegree(HTMLParser.GetLinkCount(body));
    stats.RecipientDegree =
        Chromosome.CalcRecipientDegree(msg.GetRecipients());

    // Run parts of the chromosome against a pre-net
    // (This is an optimization)
    stats.SpamProbability = net.RunPreNet(stats);

    // This can be adjusted... Calculating the misspelled word ratio and
    // any Bayesian probability is time consuming
    if (stats.SpamProbability < .66)
    {
        string parsedBody = "";
        if (stats.IsHTML > .5)
        {
            // Keep only the text the user actually sees
            parsedBody = HTMLParser.CleanHTML(body);
        }
        else
        {
            parsedBody = body;
        }

        stats.MisspelledWordRatio =
            TokenParser.GetMisspelledWordRatio(parsedBody, dict);
        stats.BayesianSpamProbability =
            BayesianAnalyzer.CalcSpamProbabiliy(msg.Subject +
                " " + parsedBody, ham, spam);

        // Run the full chromosome against the full net
        stats.SpamProbability = net.RunFullNet(stats);
    }
    return stats;
}
In the first part of this code, we create a new Chromosome data object, and then we pass the lowercase body of the mail message to various message processors, which give us our basic gene information. The next step can be skipped depending on your implementation, but for the sake of speed, we calculate a “pre-probability” that the message is spam by passing the chromosome data we have at this point to our neural network. If the probability of it being spam is less than .66 (that is, the network tells us it’s either mystery meat or ham at this point), we calculate the rest of the chromosome, and pass it to our other neural network. We do this mainly to save time, because the spell check and the Bayesian probability test can use up a lot of processing time. After the completion of this method, we will know if our message is ham or spam, and we should then move the message to the appropriate mail folder.
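To make the last step concrete, here is a minimal sketch of turning the final probability into a verdict and a folder. The .66 spam threshold comes from the code above; the Verdict enum, the .33 lower threshold, and the folder names are assumptions made for illustration and are not part of SpamAwareLib.

```csharp
// Hypothetical helper; the Verdict enum, the 0.33 lower threshold, and the
// folder names are assumptions for illustration, not part of SpamAwareLib.
public enum Verdict { Ham, MysteryMeat, Spam }

public static class VerdictSketch
{
    // Maps a spam probability in [0, 1] to one of three verdicts.
    // Only the 0.66 spam threshold is taken from the article's code.
    public static Verdict Classify(double spamProbability,
        double hamBelow = 0.33, double spamAtOrAbove = 0.66)
    {
        if (spamProbability >= spamAtOrAbove) return Verdict.Spam;
        if (spamProbability < hamBelow) return Verdict.Ham;
        return Verdict.MysteryMeat;
    }

    // Picks a destination folder name for a verdict.
    public static string FolderFor(Verdict verdict)
    {
        switch (verdict)
        {
            case Verdict.Spam: return "Spam";
            case Verdict.Ham: return "Inbox";
            default: return "Review"; // mystery meat goes to a human
        }
    }
}
```

With something like this in place, a caller could write `VerdictSketch.FolderFor(VerdictSketch.Classify(stats.SpamProbability))`. Both thresholds are worth tuning against your own mail.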
NOTE: For further details on the implementation of the message processors (gene finders), please see the code attached to the project.
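For readers without the attached code at hand, the following is a rough sketch of what a gene finder for the links-to-words ratio could look like. It is not the article's HTMLParser or Chromosome.CalcLinkDegree; the class name, the regular expressions, and the clamping to the range 0 to 1 are all assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical gene finder, not the article's HTMLParser or Chromosome code.
public static class LinkGeneSketch
{
    // Matches opening anchor tags that carry an href attribute.
    private static readonly Regex AnchorTag =
        new Regex(@"<a\s[^>]*href\s*=", RegexOptions.IgnoreCase);

    // Matches any HTML tag so it can be stripped before counting words.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Returns links per visible word, clamped to [0, 1] so the value can
    // be fed to a neural network input directly.
    public static double LinkToWordRatio(string body)
    {
        int links = AnchorTag.Matches(body).Count;

        // Replace tags with spaces, then split on whitespace to get words.
        string text = AnyTag.Replace(body, " ");
        string[] words = text.Split(
            new[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);

        if (words.Length == 0)
        {
            // Links with no visible words is the most suspicious case.
            return links > 0 ? 1.0 : 0.0;
        }
        return Math.Min(1.0, (double)links / words.Length);
    }
}
```

A regular expression is a crude way to read HTML, and a deliberately malformed message may slip past it, so treat the result as one weak signal among the other genes rather than a verdict on its own.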
Training the Networks
Before our filter will work, we need to complete an important first step: training the neural networks on our existing e-mail. First ensure that you have two separate sources of e-mail, one for ham (good e-mail), and one for spam. Then, load the mail Corpus data objects, the chromosome history, and the dictionary object, and we’re ready to generate our neural networks. The code below illustrates this through the GenerateNetwork method within the ANNTrainer.cs file.
public static NNSpamAware GenerateNetwork(Message[] hamMsgs,
                                          Message[] spamMsgs,
                                          Corpus hamCorpus, Corpus spamCorpus,
                                          ChromosomeHistory hist, Dictionary dict)
{
    // First, add the tokens of our messages to our corpuses
    for (int i = 0; i < hamMsgs.Length; i++)
    {
        TokenParser.AddMsgTokensToCorpus(hamMsgs[i].Subject,
            hamMsgs[i].Body, hamCorpus);
    }
    for (int i = 0; i < spamMsgs.Length; i++)
    {
        TokenParser.AddMsgTokensToCorpus(spamMsgs[i].Subject,
            spamMsgs[i].Body, spamCorpus);
    }

    // Generate the chromosomes for all the mail items
    // (the last argument is the expected output: 0 for ham, 1 for spam)
    for (int i = 0; i < hamMsgs.Length; i++)
    {
        Chromosome chromosome =
            ChromosomeGen.GenerateChromosome(hamMsgs[i],
                dict, hamCorpus,
                spamCorpus, 0);
        hist.AddChromosome(hamMsgs[i].Id.ToString(), chromosome);
    }
    for (int i = 0; i < spamMsgs.Length; i++)
    {
        Chromosome chromosome =
            ChromosomeGen.GenerateChromosome(spamMsgs[i],
                dict, hamCorpus,
                spamCorpus, 1);
        hist.AddChromosome(spamMsgs[i].Id.ToString(), chromosome);
    }

    // Perform Network Training
    NNSpamAware ann = new NNSpamAware();

    // Train for 1,000 epochs. A better method would be to look for an
    // acceptable MSE, but for simplicity's sake, we train for a set
    // number of epochs
    for (int epoch = 0; epoch < 1000; epoch++)
    {
        for (int i = 0; i < hist.Count; i++)
        {
            ann.TrainPreNet(hist[i]);  // Train the pre net
            ann.TrainFullNet(hist[i]); // Train the full net
        }
    }
    return ann;
}
As the code depicts, training our networks is a three-stage approach. First, we add the tokens of the messages we’re going to train on to their respective corpus. Once we have done this, we can then generate a chromosome for each message, which we add to our chromosome history. Finally, we pass chromosomes from our history into the training procedure of our neural network, which in turn causes the network to learn which chromosomes represent spam.
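The comment in the training loop mentions that stopping at an acceptable mean squared error (MSE) would be better than a fixed epoch count. The sketch below shows that idea in isolation. TinyNet is a hypothetical single-neuron stand-in for NNSpamAware, whose training interface is not shown in this article, and the learning rate and method names are assumptions.

```csharp
using System;

// Hypothetical stand-in for the article's NNSpamAware: one sigmoid neuron.
public class TinyNet
{
    private readonly double[] weights;
    private double bias;
    private const double LearningRate = 0.5; // assumed value

    public TinyNet(int inputs) { weights = new double[inputs]; }

    // Returns the neuron's output in (0, 1) for the given gene values.
    public double Run(double[] x)
    {
        double sum = bias;
        for (int i = 0; i < weights.Length; i++) sum += weights[i] * x[i];
        return 1.0 / (1.0 + Math.Exp(-sum));
    }

    // One gradient step on a single sample; returns the squared error
    // measured before the weights were updated.
    public double Train(double[] x, double target)
    {
        double output = Run(x);
        double error = target - output;
        double delta = error * output * (1.0 - output);
        for (int i = 0; i < weights.Length; i++)
            weights[i] += LearningRate * delta * x[i];
        bias += LearningRate * delta;
        return error * error;
    }
}

public static class MseTrainingSketch
{
    // Trains until the mean squared error over one epoch drops below
    // targetMse, or until maxEpochs is reached. Returns the epochs used.
    public static int TrainUntilConverged(TinyNet net, double[][] samples,
        double[] targets, double targetMse, int maxEpochs)
    {
        for (int epoch = 1; epoch <= maxEpochs; epoch++)
        {
            double sumSquaredError = 0.0;
            for (int i = 0; i < samples.Length; i++)
                sumSquaredError += net.Train(samples[i], targets[i]);

            if (sumSquaredError / samples.Length < targetMse)
                return epoch;
        }
        return maxEpochs;
    }
}
```

Keeping the epoch cap as a fallback matters: if the ham and spam chromosomes are not cleanly separable, the MSE may never reach the target, and the loop still has to end.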
Results
The engine presented in this article has proved to be extremely accurate in the identification of spam. In testing, the model has been shown to produce one false negative per 1,000 e-mails. With further tweaking and with the identification of more genes, I’m quite sure this could be improved to about 1 in every 5,000 e-mails or better.

Conclusions
Obviously, spam cannot be eliminated with just an algorithm or an approach. To effectively kill the practice of spam, first and foremost, people must stop buying things from spammers. All products, companies, and organizations who partake in spam should be boycotted, and a good start would be compiling a list of them and posting it on the internet.

With that said, a sucker is born every day, and certain individuals will purchase items from spammers no matter how shady or underhanded the promotion is. As long as these people continue to buy from spammers, and the cost of spamming does not escalate, spamming will continue. To combat this problem, the “see-no-evil, buy-no-evil” approach can be taken. With this, vendors need to include EFFECTIVE anti-spam tools within their e-mail applications. And not just in new versions, but also through the development and deployment of service packs and free add-ons to existing and previous product lines. The more spam messages the user is oblivious to, the less likely he or she is to partake in a spam promotion. This in turn makes spam less effective, significantly decreasing the return on investment of the practice, which will in turn greatly reduce it and/or kill it.

Last but not least, no single, elementary solution can be used to effectively filter and kill spam. It will take a solution that recognizes spam as an organism, and that can evolve as quickly as spam can.

Contact Information
If you would like to contribute feedback, alternative methods, code, a high-paying job, etc., you can contact me at jarhead4067@hotmail.com (assuming the spammers don’t kill my account first).
Submitted: 02/03/2004 Article content copyright © Shawn Evans, 2004.
All content copyright © 1998-2007, Generation5 unless otherwise noted.