At the forefront of Artificial Intelligence
  Home Articles Reviews Interviews JDK Glossary Features Discussion Search
Home » Articles » Natural Language Processing » Spam Filtering

Application of Biological Metaphors for Identifying and Killing Spam

By Shawn Evans

“Tricksy spammers, they’ll stop at nothing to get my precious”

Introduction

Spam has become the first great plague of the 21st century. Over 60% of all e-mails are spam, costing U.S. corporations more than $10 billion annually, on top of the productivity lost from scanning through e-mail and deleting spam. Along with this, an estimated 5% of spam campaigns are a pure and outright scam, with the remaining majority pitching products that are dubious at best. It used to be parents had to worry about their kids surfing and finding pornographic websites, now we have to worry more about our kids opening an e-mail client and finding a pornographic spam message. Spam must be stopped before it cripples the infrastructure of the internet and drives users away from one of the greatest forms of communication, E-mail.

Can Laws Defeat Spam? No. This has to be one of the greatest misconceptions of users. The internet is just that, an “INTERnational NETwork” that cannot be governed by one country’s laws. Spammers can exist anywhere on the internet, meaning they can sling their wares from anywhere in the world, making the laws of one country completely irrelevant. Also, the decentralized, self-organizing design of the internet makes it nearly impossible to regulate by external means. It would be easier to regulate the weather than to regulate the internet.

Spam as a Living Organism

Up until recently, most researchers in the fight against spam have failed to classify it as an artificial living organism, hindering the development of effective tools and techniques to kill it. While this classification may sound strange, consider the following:
  1. Spam evolves and adapts based off the rules of natural selection

    Through the fight against spam, spam has demonstrated an uncanny ability to adapt to the conditions of its environment, namely the internet. When one barrier against a strain of spam is put up, another, resistant strain appears. This is similar to how bacteria builds immunity against antibiotics, the strains that are not immune will die, while the ones that are immune take over and become the dominant, drug resistant strain. This leads to the belief that spam will not die until the barriers of its environment evolve faster than it does.

  2. Spam lives within an eco-system, and we’re its food

    The internet is a complex chain of systems that all rely on each for the other’s survival. Without an internet protocol, a web browser couldn’t exist. Without web servers, the web wouldn’t exist. Without … (you get the picture). This chain of systems can be likened to an eco-system, with spam existing at a parasitic level of species within this system. It consumes resources (bandwidth, servers, time) in its attempt to reach its primary host: us. Once spam reaches its target, its sole purpose is to solicit its “food” from us, primarily money. If it is effective, that strain of spam lives and continues to propagate, otherwise it will die. Can the internet eco-system be modified so spam can’t feed?

  3. Spam has genetic traits and markers

    Just like any organism, spam contains certain traits that uniquely identify it. This can be a combination of words, information inside the header of the e-mail, the format of the message (HTML, plain text, rtf), the message encoding (base64), does it contain image links, the number of links, does it contain hidden text, so on and so forth. Up until recently, spam filters have primarily focused on just one of these traits, the wording of the e-mail. Spam, being an organism, evolved so this marker was hidden within its code, making it difficult at best to filter. It did this by including random, non-spam words in hidden areas of the e-mail, by modifying words like Viagra with V1@gr@, sending spam as image links, and by encoding the message in a format that filters could not read. The good news is this “gene” is still present, and can be unlocked by identifying the defensive genes within the spam message.

    The goal of this article is to present a highly adaptive spam filtering technology that treats spam as an artificial organism. To achieve this goal, the technique described will employ an Artificial Neural Network (ANN) that will learn what features within a set of gene markers qualify an e-mail as spam or “ham” (good e-mail). Prior knowledge of ANNs and the methods required to train them is highly recommended before continuing with this article.

    NOTE: This article will present just the filter engine. How to implement the engine on specific clients and servers is left entirely to the reader.

The Plan of Attack

To begin with, let’s start with a view of the filter processes from 10,000 feet up:
  1. Process and parse an e-mail into string tokens.
  2. Look at the defined genetic markers within the e-mail, and generate numerical statistics on each marker between {0, 1}.
  3. Take the collection of statistics generated, and feed them as input vectors into our Artificial Neural Network, a Multi-Layer Perceptron (MLP).
  4. The MLP will have one output, which will be a probability between {0, 1} on whether the e-mail is spam, ham (good e-mail), or mystery meat (unknown status).
  5. Based off the output of the MLP, the e-mail will be moved to an appropriate mail corpus.
Once we have the above defined, we are now ready to train our mail filter. What is so powerful about this approach is that the filter can learn from the user’s preference of what is ham and what is spam (one man’s trash is another’s treasure). To do this, we will need two mail corpuses, one of ham and one of spam. The process of training the filter is as follows:
  1. Process and parse both corpuses of e-mail into string tokens.
  2. Look at the defined genetic markers within each e-mail, and generate two collections of statistics on each marker between {0, 1} for each message within each corpus. One collection will be for the spam corpus, and the other for the ham corpus.
  3. Take each collection of statistics generated, and using gradient descent/back-propagation, train the MLP on each collection of statistics, with the desired output of the spam input vectors being 1, and the desired output of the ham input vectors being 0. Continue training until an acceptable mean-squared error (MSE) is reached.
What is great about this method is that a generic, base MLP can be distributed with the system or updated through the internet, and then the user can do further training of the network to tweak it for their own personal preferences based off their mail corpuses.

Other features that should be considered in any implementation of this filter should be:

  1. A sender white list and black list. If an e-mail is from someone or from some domain on the white list, it is automatically accepted as ham. If it is from someone on the black list, the inverse occurs and it is automatically rejected as spam. If the sender is on neither list, than the e-mail is processed through the above system.
  2. A way that the user can augment sensitivities to the input vector of our MLP. For instance, say we parse an e-mail to see if it has HTML image links (this would be one of the genes we track). If it does contain an image link, we know that 99.9% of messages that contain this trait are spam. But, certain people may not want these e-mails categorized as spam, because they deal with good e-mail that contains image links. Giving the user the ability to inhibit this marker with a setting like “How often do you receive valid e-mail that contains images?” would serve well.
  3. Because of the adaptive nature of spam, a central server that could deploy new filters on a day-to-day basis would be a must have. The filter described here will adapt on its own to the gene markers its familiar with, but not to new defensive gene markers that spammers will likely include down the road. It would make this system extremely efficient, and with wider use, make it cost prohibitive for spammers to continue their ways. In essence, we want this system to evolve as fast as spam.
Now that the major processes are defined, let’s take a look at what genes we’re looking for.

Spam Gene Markers

There are literally hundreds of different markers we could use to identify a message as spam, but for the sake of simplicity, we’re going to view (what I would consider) the top 9 markers.
  1. Is the format of the e-mail HTML?
  2. Is the e-mail formatted in valid HTML?
  3. Is the e-mail encoding base64?
  4. Does the e-mail contain image links?
  5. Does the e-mail contain “hidden” text that the user cannot see?
  6. Does this e-mail have a large number of recipients?
  7. What’s the ratio of links to words in this e-mail?
  8. What’s the ratio of misspelled words to words in this e-mail?
  9. What’s the Bayesian spam probability of this e-mail?
These markers may or may not signify an e-mail as spam, depending on the type of ham a user receives. That’s why we allow our users to train the system, so that the choice of whether an e-mail is ham or spam is not based on just one statistic, but a set statistics that are based on the user’s choice of what is spam and what is not. We allow the system to scan previous e-mails that the user has classified as ham or spam, calculate the values for each of these genes, and then plug the calculated values into our MLP so that the filter can learn the good from the evil. The power of this method is that our filter can adapt as quickly as the spammers.

Here’s an analysis of why I would consider these some of the top markers.

Is the format of the e-mail HTML?

If you’re like me, the dominant message format of my ham is plain text or rtf, while most of the HTML messages I get are spam. The reason for this is spammers LOVE HTML. It allows them to include images that track who looked at this message and who didn’t, it allows them to hide words that fool Bayesian analysis, and spammers being marketers, allows them to create a visually appealing sales pitch. In other words, if an e-mail is formatted in HTML, it has a higher probability of being spam then if were not.

Is the e-mail formatted in valid HTML?

To hide phrases and sets of keywords from the user (and not from spam filters, these keywords are there to break the filter), spammers take advantage of the fact that html can be formatted improperly, with certain sections, such as the pitch, still visible to the user. If this trait exists, the message has a high probability of being spam. Again, spammers will include these to intentionally break a spam filter (but not ours!).

Is the e-mail encoding base64?

Since most spam filters cannot convert base64 encoded text to readable text (but e-mail clients can), encoding messages in this format allows spammers to hide the sales pitch from the filter, while still delivering it to the user. Since only spammers use base64 (too much overhead for normal message use), if this trait exists, it is almost certainly spam.

Does the e-mail contain image links?

Again, since spammers want to hide the sales pitch from spam filters and not from users, certain spam message will contain one line of text that links to an image on the internet, which gets displayed in the e-mail client. Usually, when this image is loaded by the mail client, it launches a program on the spammers server telling them you viewed this e-mail, hence, you’re a better target for marketing, meaning, congratulations, you get more spam (I cannot believe that e-mail clients do not support a way to disable this feature by default). Since most normal mail does not contain images linked to websites, the existence of this trait means the message has a high probability for being spam.

Does the e-mail contain “hidden” text that the user cannot see?

A large portion of spam based HTML contains sections of text that are completely hidden from the eyes of users. This is done by either setting the top of the page outside the viewing pane of the e-mail client, or by simply setting the foreground color to the same as the background. Usually, these sections contain a list of random non-spam words. The whole purpose of this exercise is to thwart spam filters, because most look at the entire message, versus the piece the user sees. With these random, non-spam words, a standard Bayesian spam filter will calculate a lower spam probability for the e-mail, allowing it through as ham. If this trait exists, the e-mail is spam.

NOTE: Our filter will completely ignore these areas of hidden text, and look just at the message the user sees.

Does this e-mail have a large number of recipients?

This gene may or may not apply, so its implementation is optional. Generally, spam messages will have a long list of recipients. But, if you receive work related e-mail, or your friend shoots an e-mail out to 20 people at a time, this no longer becomes a significant trait. This is one I leave to you to implement.

What’s the ratio of links to words in this e-mail?

Most spam will contain at least one link to their website or wherever so that you can “unsubscribe” (otherwise known as prove your e-mail is valid, and add it to 200 other spam lists). I would not consider this a dominant spam trait, but if you have 5 links and only 10 words in the e-mail body, then there’s a good chance this is spam.

What’s the ratio of misspelled words to words in this e-mail?

Because spammers will stop at nothing to hide words that will tip off a spam filter, they will often fill an e-mail with phrases like “v1agr@a” and other fun gibberish to thwart a filter. Again, I would not necessarily consider this a dominant trait of just spam (come on, we all have that friend who spells at a 2nd grade level), but combined with other traits, it might be a good indication if this is spam or not.

What’s the Bayesian spam probability of this e-mail?

If the other traits described above to not tip off the filter, this one will become very dominant. The Bayesian spam probability is what most e-mail filters to date use to filter e-mail. Basically, it examines all string tokens within an e-mail, and calculates if the token appears often in spam messages. But, because of all the nasty little tricks spammers now play, this generally only works if none of the above traits exist. But, since our filter is smart enough to look at just what the user sees, is becomes much more effective in our application.

Now that we have defined what the overall process is going to look like, let’s delve into the code.

Putting it Together

Let’s go ahead and get into the good stuff by going over the code for our application.

NOTE: The sample code for this application is in C#. C# was chosen over C++ so beginners could better see the structures of the process, and C# was chosen over Java because of the inherent performance advantages of .NET. If anyone would like to re-factor the code into C++ or Java, please do and I will include it with the distribution of this article.

The namespace defined for our engine is SpamAwareLib. Its structure is as follows:

  • SpamAwareLib
    • DataObjects
      • AddressList.cs - Represents collection of addresses and domains
      • Chromosome.cs - Collection of “genetic” traits regarding an e-mail
      • ChromosomeHistory.cs - Collection of chromosomes represents a history of chromosome data over time
      • Corpus.cs - Collection of mail tokens, the number of times each token has been encountered, and the number of messages that this corpus is based off of
      • Dictionary.cs - StringCollection of words that represent a dictionary; used for spell checking
      • Message.cs - Collection of data that represents a mail message
    • MsgProcessors
      • BayesianAnalyzer.cs - Computes a Bayesian probability if a message is spam from the body of the message, a ham corpus, and a spam corpus.
      • HTMLParser.cs - Simple HTML parser that scans a message for genetic traits specific to HTML messages
      • TokenParser.cs - Parsers a message in to string tokens, and posts the resulting tokens into a corpus and collection
    • NeuralNet
      • Neuron.cs - Neuron model for a Multi-Layer Perceptron
      • NNMLP.cs - Model for a simple Multi-Layer Perceptron
      • NNSpamAware.cs - ANN implementation of the MLP used for the spam engine
    • Utility
      • ChromosomeGen.cs - Factory that generates a Chromosome and/or executes it against the MLP
      • ANNTrainer.cs - Factory that generates our neural networks
We have four main sub-namespaces under SpamAwareLib, they are as follows:

DataObjects – As the name implies, the objects here represent units and collections of data within the engine.

NOTE: The Dictionary.cs file is the object that will contain the dictionary for spell checking. I did include the serialized version of the object along with a 178,000+ word dictionary text file. If the Dictionary.cs file is modified, you will need to reload and re-serialize the Dictionary object with the file.

MsgProcessors – This is where the primary message parsing and gene finding takes place. The ChromosomeGen object will generate a chromosome by passing a mail message through each one of the processors in this namespace.

NeuralNet – Again, as the name implies, a collection of objects within this namespace that represents our ANN and its internal data objects. This is where our chromosome gets processed to give us the net probability if the message is spam, ham, or mystery meat.

Utility – This is where our chromosome generator and neural network generator exists. Our chromosome generator takes as an input the mail message (along with the dictionary, the mail corpuses, and the neural net), and then generates the appropriate chromosome for the message, and the net probability if the message is ham, spam, or mystery meat.

To see a code representation of how the engine works, let’s take a look at the chromosome generator method GenerateChromosomeAndCheck.

public static Chromosome GenerateChromosomeAndCheck(Message msg, 
			Dictionary dict, 
			Corpus ham,
			Corpus spam,
			NNSpamAware net)
{
  string body = msg.Body;

  Chromosome stats = new Chromosome();
  
  body = body.ToLower();
  stats.IsHTML = 
    Chromosome.ConvertBoolToDouble(HTMLParser.IsHTML(body));
  stats.ContainsImgLinks = 
    Chromosome.ConvertBoolToDouble(HTMLParser.ContainsImgLinks(body));
  stats.IsValidHTMLBody = 
    Chromosome.ConvertBoolToDouble(HTMLParser.IsHTMLBodyValid(body));
  stats.ContainsHiddenText = 
    Chromosome.ConvertBoolToDouble(
      HTMLParser.ContainsInvisibleHTMLText(body));
  stats.LinkDegree = 
    Chromosome.CalcLinkDegree(HTMLParser.GetLinkCount(body));
  stats.RecipientDegree = 
    Chromosome.CalcRecipientDegree(msg.GetRecipients());

  // Run parts of chromosone against a pre-net
  // (This is an optimization)
  stats.SpamProbability = net.RunPreNet(stats);
			
  // This can be adjusted...  Calculating the misspelled word ratio and
  // any Bayesian probability is time consuming
  if (stats.SpamProbability < .66)
  {
    string parsedBody = "";

    if (stats.IsHTML > .5)
    {
      parsedBody = HTMLParser.CleanHTML(body);
    }
    else
    {
      parsedBody = body;
    }

    stats.MisspelledWordRatio = 
      TokenParser.GetMisspelledWordRatio(parsedBody, dict);
    stats.BayesianSpamProbability =	
      BayesianAnalyzer.CalcSpamProbabiliy( msg.Subject + 
                             " " + parsedBody, ham, spam );

    // Run full choromosone against a net (This is an optimization)
    stats.SpamProbability = net.RunFullNet(stats);
  }

  return stats;
}
In the first part of this code, we create a new Chromosome data object, and then we pass the lowercase body of the mail message to various message processors, that give us our basic gene information. The next step can be skipped based on your implementation, but for sake of speed, we calculate a “pre-probability” to see if the message is spam by passing the chromosome data we have at this point to our neural network. If the probability of it being spam is less than .66 (if the network tells us it’s either mystery meat or ham at this point), we calculate the rest of the chromosome, and pass it to our other neural network. We do this mainly to save time, for the simple fact the spell check and the Bayesian probability test can use up a lot of processing time. After the completion of this method, we will know if our message is ham or spam, and we should then move the message to the appropriate mail folder.

NOTE: For further details on the implementation of the message processors (gene finders), please see the code attached to the project.

Training the Networks

Before our filter will work, we need to complete an important first step, training of the neural networks on our existing e-mail. First ensure that you have two separate sources of e-mail, one for ham (good e-mail), and one for spam. Then, load the mail Corpus data objects, the chromosome history, and the dictionary object, and we’re ready to generate our neural networks. The code below illustrates this through the GenerateNetwork method within the ANNTrainer.cs file.

public static NNSpamAware GenerateNetwork(Message [] hamMsgs, 
                                          Message [] spamMsgs,
			      Corpus hamCorpus, Corpus spamCorpus,
			ChromosomeHistory hist, Dictionary dict)
{			
// First, add the tokens of our messages to our corpuses
  for (int i = 0; i < hamMsgs.Length; i++)
  {
    TokenParser.AddMsgTokensToCorpus(hamMsgs[i].Subject, 
                                     hamMsgs[i].Body, hamCorpus);
  }

  for (int i = 0; i < spamMsgs.Length; i++)
  {
    TokenParser.AddMsgTokensToCorpus(spamMsgs[i].Subject, 
                                     spamMsgs[i].Body, spamCorpus);
  }

  // Generate the chromosomes for all the mail items
  for (int i = 0; i < hamMsgs.Length; i++)
  {
    Chromosome chromosome = 
          ChromosomeGen.GenerateChromosome(hamMsgs[i], 
                                           dict, hamCorpus,
                                           spamCorpus, 0);
   hist.AddChromosome(hamMsgs[i].Id.ToString(), chromosome);
  }

  for (int i = 0; i < spamMsgs.Length; i++)
  {
    Chromosome chromosome = 
          ChromosomeGen.GenerateChromosome(spamMsgs[i], 
                                           dict, hamCorpus, 
                                           spamCorpus, 1);
    hist.AddChromosome(spamMsgs[i].Id.ToString(), chromosome);
  }

  // Perform Network Training
  NNSpamAware ann = new NNSpamAware();

  // Train for 1,000 epochs.  A better method would be to look for an
  // acceptable MSE, but for simplicities sake, we train for a set 
  // number of epochs
  for (int epoch = 0; epoch < 1000; epoch ++)
  {
    for (int i = 0; i < hist.Count; i++)
    {
      ann.TrainPreNet(hist[i]);     // Train the pre Net
      ann.TrainFullNet(hist[i]);	// Train the full Net
    }
  }

  return ann;
}
As the code depicts, training our networks is a three stage approach. First, we add the tokens of the messages we’re going to train to their respective corpus. Once we have done this, we can then generate a chromosome for each message, which we add to our chromosome history. Finally, we pass chromosomes from our history into the training procedure of our neural network, which in turn causes the network to learn what chromosomes represent spam.

Results

The engine presented in this article has proved to be extremely accurate in the identification of spam. In testing, the model has shown to have one false negative per 1000 e-mails. With further tweaking and with the identification of more genes, I’m quite sure this could be increased to about 1 in every 5000 e-mails or greater.

Conclusions

Obviously, spam cannot be eliminated with just an algorithm or an approach. To effectively kill the practice of spam, first and foremost, people must stop buying things from spammers. All products, companies, and organizations who partake in spam should be boycotted, and a good start would be a list of this information, and posting of it on the internet.

With that said, a sucker is born everyday, and certain individuals will purchase items from spammers no matter how shady or underhanded the promotion is. As long as these people continue to buy from spammers, and the cost of spamming does not escalate, spamming will continue.

To combat this problem, the “see-no-evil, buy-no-evil” approach can be taken. With this, vendors need to include EFFECTIVE anti-spam tools within their e-mail applications. And not just in new versions, but also through the development and deployment of service packs and free add-ons to existing and previous product lines. The more spam messages the user is oblivious to, the least likely he or she will partake in a spam promotion. This in turn makes spam less effective, significantly decreasing the return on investment of the practice, which will in turn greatly reduce it and/or kill it.

Last but not least, no single, elementary solution can be used to effectively filter and kill spam. It will take a solution that recognizes spam as an organism, and that can evolve as quickly as it can.

Contact Information

If would like to contribute feedback, alternative methods, code, a high-paying job, etc, you can contact me at jarhead4067@hotmail.com (assuming the spammers don’t kill my account first).

Submitted: 02/03/2004

Article content copyright © Shawn Evans, 2004.
 Article Toolbar
Print
BibTeX entry

Search

Latest News
- Generation5 10-year Anniversary (03/09/2008)
- New Generation5 Design! (09/04/2007)
- Happy New Year 2007 (02/01/2007)
- Where has Generation5 Gone?! (04/11/2005)
- NeuroEvolving Robotic Operatives (NERO) (25/06/2005)

What's New?
- Back-propagation using the Generation5 JDK (07/04/2008)
- Hough Transforms (02/01/2008)
- Kohonen-based Image Analysis using the Generation5 JDK (11/12/2007)
- Modelling Bacterium using the JDK (19/03/2007)
- Modelling Bacterium using the JDK (19/03/2007)


All content copyright © 1998-2007, Generation5 unless otherwise noted.
- Privacy Policy - Legal - Terms of Use -