Is your spam filter really adaptive? (Probably not)

2008-03-01

Jonathan Zdziarski

Secure Computing, USA
Editor: Helen Martin

Abstract

Jonathan Zdziarski describes adaptive parsing - a technique that can be applied to a spam classifier’s parser to allow it to learn the most effective way to parse a corpus of mail.


It has been several years since I first entered the spam scene, and during this time I have observed carefully how machine learning has degenerated – seemingly unnoticed – into a cat and mouse game between spammers and filter authors.

Ironically, this is what we thought we had avoided when statistical content filters went mainstream. At the time, most of us believed that Bayesian filters had the moxie to fight our battles for us. This was because these filters adapted to new forms of content on their own, leaving our customers only to push the ‘spam’ button. What used to mean having to update rules in heuristic filters became a service performed automagically by our statistical classifiers instead. This technique did, in fact, work for several years, and as a result filter authors became lazy. Reality has hit us in recent years, though, and forced many of us to sit up and observe the different spammer techniques that have effectively (and in a seemingly demoralizing manner) rendered many adaptive filters helpless. Sure, Nigerian diplomat spam isn’t bothering us any more, but image spam, ASCII spam, spear phishing, and (in one case) Klingon spam seem to put our spam filters through hell.

Blinding the parser

As mentioned, adaptive filters are capable of learning new forms of content. This was the holy grail of geek research when these filters first hit the scene. The inherent problem has been how to define what exactly content is and how to present it to the classifier so that the filter can adapt to it.

One key component of every spam filter is its parser (or tokenizer). The parser is responsible for ‘reading’ incoming mail and turning it into a set of features that the filter uses to build a hypothesis space (its logic). In its simplest operation, the parser might simply pull out every word found in an email, making the word selection based on where the spaces are, and call each word a feature. The parser represents the eyes of a spam filter, and nearly every recent attack by a spammer can be traced back to a weakness in parsing.

Whether spammers are smart enough to target the parser intentionally is unknown, but the likely reason that so many flaming arrows are shot at it is because it is the weakest part of even a well-written spam filter. Spammers have long given up trying to attack other parts of the filter, having become wise to the fact that it is very difficult to attack mathematics. Throwing random words or excerpts from books into an email has historically proven a glorious waste of time, and so in recent years, spammers have turned their attention from confusing the spam filter to trying to blind it instead.

In order to blind a spam filter, you have to know how one sees. Unlike a classifier’s machine-learning components, the spam filter’s parser is one of the few pieces that is written to have a predetermined set of rules – parsing rules, that is. In other words, the filter’s creator hasn’t trained the spam filter how to read, but rather told it how to read. Since every copy of a particular piece of spam-filtering software will read in the same way, spammers can easily – through trial and error – figure out how to hamper the filter’s ability to do its job. If you can prevent the parser from distinguishing content, you can prevent it from making any sense of it. In human terms, this would be the equivalent of speaking Pig Latin to a human who was just learning the English language. While thoroughly amusing, it would take the human some time before realizing they should smack you. A well-written parser attack can keep the computer in a confused haze for long enough to get spam through.

To add insult to injury, parsers also lack the practical ability to read more than a few different languages effectively. Most are limited to parsing the character sets for which they were designed, as many languages use basic rules such as white space to determine word breaks. This presents a rather difficult problem when trying to classify mail in languages like Chinese, which use no white space between words, or when analysing languages that have special accents. Moreover, if you’re classifying PDF files or other quasi-binary content, the filter will have no way of parsing their binary components intelligently. In order to account for more complex languages and formats, the filter author himself would need to have foreknowledge of what languages he is supporting. Consider the following parse-o-grams, and you can get an idea of what Chinese email must look like to an English-‘speaking’ spam filter:

THEREDWELLSAMISSLATE        ENDANGERSPARSEAMANSWORDS
THE RED WELL, SAM IS LATE   ENDANGER! SPAR, SEAMAN SWORDS!
THERE DWELLS A MISS, LATE   END ANGERS; PARSE A MAN’S WORDS
(G. Sinnamon)               (D. Higgs)

Even though machines have proven their amazing capacity for learning, we still seem to stick with using static parsing rules in an otherwise statistical classifier. However, the computer can teach itself how to read better than the human who programmed it can, if only given the chance.

Adaptive parsing

Adaptive parsing is a technique applied to a classifier’s parser to allow it to learn the most effective way to parse a corpus of mail. Notice I didn’t say the best way or the most intelligible way, but the most effective. ‘Effective’ here could be defined as ‘rendering the most descriptive data’ to the rest of the classifier. This is done by instructing the parser to measure the quality of its own output and reprogram itself when necessary. Data can be considered to be high quality when its occurrence within the corpus results in it having a very high or a very low probability – something that could potentially affect the result of the classification. On a Bayesian filter, good quality data could be considered to be somewhere in the neighbourhood of 0.0 – 0.1 and 0.9 – 1.0.

To find the best parsing technique, a hypothesis space of parsing semantics is built. From this, the parser can measure which techniques work best for whatever data has historically been presented. If the filter finds that it is not generating data as useful as it could, then it can reprogram itself dynamically to use a different set of parsing techniques. The change of techniques will cause the data to be extracted differently – inevitably either improving or degrading its results.

The simplest and most natural parsing technique is distinguishing between a token separator and a constituent character. As a simple example, let’s take all possible byte values (that’s 0x00 to 0xFF) and turn them into a hypothesis space. These represent the 256 possible bytes that could be used by the filter as either token separators or constituent characters. Most spam filters will generally use only a handful of these, such as white space, punctuation, and the like. Here, all 256 possible characters are analysed within our hypothesis space, so we can later choose the ones that we think will give us the best results. In reality, this could be extended to include wide characters, or if you want to get fancy, kGrams of bytes.
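As a rough sketch of what such a hypothesis space might look like in code (the class and method names here are mine, purely illustrative, not from any particular filter), the 256 byte values can simply be tracked as counters of how often each byte appears in interesting versus uninteresting tokens:

```python
# Sketch of the byte-level hypothesis space described above.
# All names are illustrative, not taken from any real filter.

class ByteHypothesisSpace:
    """Tracks, for each of the 256 byte values, how often it appears
    in tokens the classifier found interesting vs. uninteresting."""

    def __init__(self):
        self.interesting = [0] * 256    # appearances in useful tokens
        self.uninteresting = [0] * 256  # appearances in useless tokens

    def record(self, token, probability):
        # A token is 'interesting' when its probability is extreme --
        # the 0.0-0.1 and 0.9-1.0 bands suggested in the text.
        useful = probability <= 0.1 or probability >= 0.9
        counts = self.interesting if useful else self.uninteresting
        for byte in token.encode("latin-1", errors="replace"):
            counts[byte] += 1
```

Extending this to wide characters or kGrams would simply mean widening the counter tables beyond 256 entries.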

As mail comes in, each feature is analysed by whatever the live parser configuration is using (we’ll get to bootstrapping shortly). As part of the filter’s basic operation, it will assign each feature a probability. A Bayesian filter might end up leaving you with:

 THE        0.55
 HUGE       0.80
 FICKLE     0.02
 FOX        0.03
 LEFT       0.97

Let’s make some observations about this data. Using our example thresholds of 0.0 – 0.1 and 0.9 – 1.0, the letter ‘F’ seems to appear only in interesting data – that is, the words ‘FICKLE’, ‘FOX’, and ‘LEFT’. Similarly, the letter ‘H’ only appears in uninteresting data: ‘THE’ and ‘HUGE’, which both have less than impressive probabilities. The letter ‘E’ appears equally in both, as does the space character, which is unseen.

If we were to use only this data to write a set of parsing rules, we could easily put together an order of preference from uninteresting characters to interesting ones:

Uninteresting     »     »     »     Interesting

H     U    [SP]    E    T     O    X     F

To program a parser, you would naturally want to use the characters that were found to be in the least interesting data as token separators first. Why? Because characters like ‘F’ and ‘X’ have a very high probability of appearing in useful data – you want those characters to be constituent characters.

The parser can now be reprogrammed to use the space, ‘H’, ‘U’, and ‘E’ for the next message processed. This will cause the parser to take otherwise uninteresting data and ‘read’ it differently in a way that might render more useful tokens. If successful, the characters present in these new tokens will rise to the top as more interesting, and be used as constituent characters. If the data it generates becomes less interesting, then an even less useful set of characters will be used as token separators.
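Continuing the sketch (again, the function names are mine), ranking the characters by how often they fall in interesting data and re-tokenizing with the least interesting ones as separators might look something like this:

```python
import re

def rank_separators(interesting, uninteresting, n=4):
    """Return the n characters appearing least often in interesting
    data, to be used as token separators for the next message."""
    scores = []
    for byte in range(256):
        total = interesting[byte] + uninteresting[byte]
        if total == 0:
            continue  # never observed; leave it out of the ranking
        scores.append((interesting[byte] / total, chr(byte)))
    scores.sort()                      # least interesting first
    return [ch for _, ch in scores[:n]]

def tokenize(text, separators):
    """Split a message on the current separator set; every other
    character is treated as a constituent character."""
    pattern = "[" + re.escape("".join(separators)) + "]+"
    return [t for t in re.split(pattern, text) if t]
```

Running the example message through `tokenize` with the space, ‘H’, ‘U’ and ‘E’ as separators yields tokens like ‘T’, ‘G’ and ‘FOX’ – gibberish to a human, but exactly the kind of re-reading described above.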

Of course, the data you wind up with may make less sense to a human, but what’s important is that the classifier sees it to be of great interest. Here are a few real-world examples:

 [0.940828] igh (105s, 2i)

  |-|igh, High, H-IGH Interest Mortgage.

 [0.990000] $888 (15s, 0i)

  Yup, dollar signs are more useful.

 [0.990000] ional_Inc.+Now (6s, 0i)

 [0.990000] s0r+C|ubs (12s, 0i)

  Variants of spam that would otherwise be ‘uninteresting’.

In the statistics world, we’re essentially assigning a probability to each character for each token the parser examines. The probability can be assigned to each character using Paul Graham’s original formula from ‘A Plan for Spam’ (http://www.paulgraham.com/spam.html).

Instead of measuring the probability of spam vs. non-spam, we’re now measuring the probability that the character will appear in useful data. And, unlike its original use where totals were based per message, the totals here are assessed on a per-token basis.
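A hedged sketch of that calculation – Graham’s ratio repurposed per character, with counts taken per token rather than per message (function and parameter names are mine; the 2× bias on one side mirrors the bias in Graham’s original formula):

```python
def char_probability(interesting_count, uninteresting_count,
                     total_interesting, total_uninteresting):
    """Graham-style probability that a character will appear in
    useful data, computed from per-token appearance counts."""
    if total_interesting == 0 or total_uninteresting == 0:
        return 0.5  # no evidence yet; stay neutral
    i = min(1.0, interesting_count / total_interesting)
    # Double-weight the uninteresting side, as Graham double-weighted
    # ham to bias against false positives.
    u = min(1.0, 2 * uninteresting_count / total_uninteresting)
    if i + u == 0:
        return 0.5
    # Clamp to (0.01, 0.99) so no character becomes absolute.
    return max(0.01, min(0.99, i / (i + u)))
```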

Bootstrapping

Every time a message is processed, features are extracted using whichever characters are currently ranked least interesting as token separators, with the rest treated as constituent characters. These could be the best N candidates, or a specific threshold could be used to determine how many characters to use. What we haven’t figured out yet is how to get this thing started. There are a few schools of thought with respect to bootstrapping:

The first is to use a predetermined set of delimiters until enough data has been recorded to be confident in programming the parser adaptively. The drawback of this approach is that we don’t know whether the default delimiter set will be able to parse effectively at all.

The second approach is simply to assign random probabilities, in much the same way as a neural network is initialized, and let the parser work out its own programming with training. Depending on the corpus, this could work very well for certain character sets.

Other techniques may work better. The idea is to get the parser going with some instruction on how to parse the first messages. This will lead the parser down the road of generating better programming for itself.
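The two bootstrap strategies above can be sketched in a few lines (illustrative only; the names and the choice of defaults are my own assumptions):

```python
import random
import string

def bootstrap_default():
    """First approach: start from a predetermined delimiter set
    (white space and punctuation) until enough data is recorded."""
    return list(" \t\r\n" + string.punctuation)

def bootstrap_random(n=8, seed=None):
    """Second approach: assign random initial scores to all 256 byte
    values, much as a neural network is initialized, and use the n
    lowest-scoring characters as the first separator set."""
    rng = random.Random(seed)
    scores = [(rng.random(), chr(b)) for b in range(256)]
    scores.sort()
    return [ch for _, ch in scores[:n]]
```

Either way, the point is only to get the first messages parsed at all; training then steers the separator set from there.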

End result

The philosophy here is to teach the parser how to parse by measuring its own ability to generate useful data. Statistically speaking, we are trending the presence of good data with relevant parsing rules. However, what goes into the hypothesis space could really be anything. Human-based rules for parsing certain languages could easily be included, as could other large sets of parsing semantics. With a little more complexity, this could be used to parse binaries statistically. Instead of measuring just what characters to parse on, one could also measure whether to parse before, on, or after the character, giving 768 possible parsing rules to calculate.

Eventually, the parser ends up with rules that look nothing like human rules, giving the filter data that looks nothing like human data – but if the classifier can make a more informed classification as a result of this, then that’s perfectly all right with me.

This article is based on a paper presented at the 2008 MIT Spam Conference. See http://spamconference.org/.
