Letters

2010-11-01

Editor: Helen Martin

Abstract

Letters to the editor on the relevance of spam feeds and the earning potential of cybercrime.


The relevance of spam feeds?

Building a good spam corpus is really important to validate ideas, to develop filters and to evaluate them. There is little literature available on the subject, so it was interesting to read the article ‘On the relevance of spam feeds’ in last month’s issue of Virus Bulletin (see VB, October 2010, p.21), and to see that there are people working in this area.

However, I was a little disappointed with the article. In the introduction, the authors say: ‘If the filters are not trained to detect a specific type of message, whether directly or indirectly, odds are that they won’t detect any subsequent similar ones.’

The main idea behind some types of filters is to separate ham and spam messages. The main idea behind other types of filters is to detect and identify each kind of spam. It seems that the authors’ filter belongs to the latter category. The two ideas lead to very different kinds of filters and to very different approaches to building a corpus of messages.

In the machine-learning community, one efficiency parameter is ‘generalization ability’: the ability to learn from a small number of samples and to classify new ones that have never been seen before (this is explained in Vapnik’s work, among others).

This parameter depends heavily not only on the type of classifier but also on how the learning task is carried out and on the statistical characteristics of the incoming flow. There is usually an appropriate number of messages to learn from – not too few and certainly not too many.

Having a large number of messages is interesting, but is more useful for testing a classifier than for feeding the learning task.

In both situations (learning and testing), the spam feed should be representative of the real incoming flow.

The authors write about ‘pollution’, or errors, in the training corpus. The sensitivity of classifiers to errors in the training corpus varies. This depends not only on the kind of classifier but also (and mainly) on how they are trained. A number of papers have been written about this (D. Sculley, Gordon Cormack, Alexander Kolcz and John Graham-Cumming). For example, training methods known by the acronym TUNE (Train Until No Errors) will generate overfitted classifiers which are very sensitive to errors in the training corpus. It’s good to remove errors, when they’re found, but it’s also interesting to evaluate the expected error rate.
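
As a purely illustrative aside (not drawn from the papers cited above), the following minimal sketch uses synthetic data and a decision tree as a stand-in for a spam classifier; the corpus, the 5% label-flip rate and the choice of model are assumptions made only to demonstrate the TUNE effect.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Synthetic 'messages': 2,000 feature vectors with two classes (ham/spam).
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Pollute the training corpus: flip 5% of the training labels.
flips = rng.rand(len(y_train)) < 0.05
y_polluted = np.where(flips, 1 - y_train, y_train)

# TUNE-style training: grow the tree until no training errors remain,
# so the mislabelled examples are memorized along with everything else.
tune = DecisionTreeClassifier(random_state=0).fit(X_train, y_polluted)

# Regularized alternative: a depth limit averages out the label noise.
capped = DecisionTreeClassifier(max_depth=5,
                                random_state=0).fit(X_train, y_polluted)

print('train-until-no-errors accuracy:', tune.score(X_test, y_test))
print('depth-limited accuracy        :', capped.score(X_test, y_test))

On clean test labels, the depth-limited tree will usually score higher, which illustrates the sensitivity to corpus pollution described above.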

Although I’m using vocabulary from ‘statistical/machine learning’-based filters, the idea is valid for any kind of filter.

Next, the authors concentrate on the elimination of newsletters from the spam corpus. This point deserves a little more thought: should they really remove all newsletters from the spam feed, or try to understand why there were so many newsletters in it? If the user of a spam feed is interested in understanding how newsletters (and ‘grey mail’ in general) are handled, they should remain in the feed, perhaps labelled in a way that allows them to be identified easily.

I was hoping to find an explanation of what constitutes good coverage of the spam spectrum. The authors say something about this in the evaluation section, but it reads more like a recipe specific to their filters than a scientific methodology for building a general-purpose spam feed.

At one point the authors describe splitting the spam feed into a large number of clusters, each one related to a spam campaign. This is also something specific to their filter, and not useful in terms of building a general-purpose spam feed.

The authors conclude with the phrase: ‘None of us can filter spam we do not receive ...’ Again, this is a generalization of what they think a good filter should be. Maybe it should have read: ‘Spam filters like ours can’t filter spam we do not receive.’

In my opinion, if someone wants to create a good spam feed – one that will be useful to many people – the best approach is to collect all spam messages arriving at a number of places, without any filtering, and without mixing them. In other words, the spam feed should be a statistical sample of all spam messages seen in a particular place, and it should maintain the distribution of messages per genre (pharmacy, pornography, and so on). Users of the corpus will then be able to adapt it to their needs: for training or testing filters, or for analysing spam traffic. In my view, this is the best way to build a spam feed that fits everyone’s needs.
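
As a rough illustration of this suggestion (a sketch only, with invented genre labels, not an existing tool), a feed built this way amounts to stratified sampling of the raw capture at one collection point, preserving the per-genre proportions and keeping the labels so that newsletters and other grey mail remain identifiable.

import random
from collections import defaultdict

def build_feed(messages, fraction, seed=0):
    """messages: iterable of (genre, raw_message) pairs collected,
    unfiltered, at a single place. Returns a sample that preserves
    the per-genre distribution of the original capture."""
    rng = random.Random(seed)
    by_genre = defaultdict(list)
    for genre, msg in messages:
        by_genre[genre].append(msg)
    feed = []
    for genre, msgs in by_genre.items():
        keep = max(1, round(len(msgs) * fraction))  # same share per genre
        feed.extend((genre, m) for m in rng.sample(msgs, keep))
    return feed

# Hypothetical usage: a 10% sample of one collection point's capture.
# capture = [('pharmacy', msg1), ('pornography', msg2), ...]
# feed = build_feed(capture, fraction=0.10)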

Jose Marcio Martins da Cruz, École des Mines de Paris

Response

An obvious difference in perception between the authors and Mr Martins da Cruz derives from the fact that he is an academic, while we are approaching the issue from a non-academic standpoint. As most readers know, this leads to very different goals and thus different reasoning. While the industry is interested in filters that have high accuracy and a short response time, most academics that I have met are looking for an elegant solution and a strong theoretical framework. Mr Martins da Cruz states that ‘building a good spam corpus is really important to validate ideas, to develop filters and to evaluate them’. While that may have been the case a few years ago, when everyone was looking for solutions to filter spam, from the current perspective I disagree – a relevant spam feed is needed to filter spam. The ideas may have been validated long ago, but spam is continually changing and spam tracking must keep up.

In the article we presented a method for evaluating a spam feed in the context of other, previously acquired spam feeds – that is, by computing their overlap. It is an advantage to be able to measure the impact of a new feed before using it for a long period (and thus, usually, before paying for it). The questions we were trying to answer in the paper were ‘How interesting is this spam feed for me?’ and ‘How can I make it better?’.
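
As a hedged sketch of what such an overlap measurement might look like (the article’s own method is cluster-based; the crude fingerprinting below is an assumption made purely for illustration), one could reduce each message to a rough template fingerprint and ask what fraction of a candidate feed is already covered by the feeds one has.

import hashlib
import re

def fingerprint(body):
    # Crude campaign fingerprint: lower-case the body and strip digits
    # and whitespace so near-identical messages collapse to one hash.
    canon = re.sub(r'\d+|\s+', '', body.lower())
    return hashlib.sha1(canon.encode('utf-8')).hexdigest()

def overlap(candidate_feed, existing_feed):
    """Fraction of the candidate feed's distinct fingerprints that are
    already present in the existing feed."""
    new = {fingerprint(m) for m in candidate_feed}
    old = {fingerprint(m) for m in existing_feed}
    return len(new & old) / len(new) if new else 0.0

A low overlap score suggests the candidate feed contributes spam not already seen, which is one way to decide whether it is worth acquiring.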

As Mr Martins da Cruz observes, we do assume that a message clustering method is in place (which is generally the case), but this does not mean that the clustering method is also a filtering method. The idea of clustering messages can conceptually be separated from the task of filtering ham from spam. The aim is to generate a good feed that can serve any type of filtering.

Although the most obvious use of such a method is when deciding whether to buy a spam feed, it can also be used to reduce the number of messages processed while losing as few representative concepts as possible. The volume of spam processed may not be a concern when tests are conducted on several thousand messages, but when the spam feeds to be processed are measured in the millions per day, the concern is a major one.
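
A minimal sketch of that reduction step, assuming some clustering of the feed is already in place and using an arbitrary cap of three messages per cluster (both assumptions for illustration only), might look like this:

from collections import defaultdict

def reduce_feed(clustered_messages, per_cluster=3):
    """clustered_messages: iterable of (cluster_id, message) pairs,
    e.g. one cluster per spam campaign. Keeps a handful of
    representatives per cluster so every concept stays represented
    while the processed volume shrinks."""
    kept = defaultdict(list)
    for cluster_id, msg in clustered_messages:
        if len(kept[cluster_id]) < per_cluster:
            kept[cluster_id].append(msg)
    return [m for msgs in kept.values() for m in msgs]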

The issues raised regarding last month’s article concern the processing of that feed, which is viewed as interesting only from a partitional clustering point of view, but moot from a hierarchical one. It is stated that ‘In both situations (learning and testing), the spam feed should be representative of the real incoming flow.’ While we agree with that point of view, we strongly disagree with the statement ‘Having a large number of messages is interesting, but is more useful for testing a classifier than for feeding the learning task.’

Generalization ability for most clustering techniques (hierarchical AND partitional) is dependent on the representativeness of the concepts with which the system is trained. However, once the filtering method is chosen, the greater the number of different concepts the classifier is trained with, the higher the probability of obtaining a good cluster set at the end. A good initial overview is always preferable to having to infer from limited traces of data.

But the volume of spam is overwhelming for most machine-learning tasks. In attempting to reduce the sheer volume of samples that must be filtered, there is a significant risk that some existing spam types will no longer be represented. This is the undesirable outcome we’re trying to limit.

From a machine-learning point of view, one might argue that industrial spam filtering is now even less exciting and challenging than it was five years ago. Simpler filters seem to work, whereas more sensitive, complex ones are being forgotten. This statement is, of course, open to challenge from our peers. But a simple look at Spamhaus’s performance in the latest VBSpam tests (see VB, September 2010, p.22) shows the resilience of extremely simple filters that have the capacity to process huge input volumes. The declining number of articles published and talks presented about spam filters over the last few years is also a good indication that many have given up on finding machine-learning techniques that offer 99.5% accuracy and zero false positives. Bayes filtering doesn’t work any more – at least not at the level required to pass tests. Although at some point in the past it showed promising results under laboratory conditions, given a small number of samples, it now fails the reality test.

Moving on to the issue of whether or not newsletters should be left in the spam feed, we must underline the fact that, depending on the way the spam feed is gathered, legitimate messages may come in along with the spam, and that is a known nuisance. We do understand why newsletters end up mixed in with spam – as already argued in the article, one must emulate a real user in order to receive high-quality spam. But the problem with using those feeds in their initial form is exactly like using an annotated corpus for a static test when you know the annotators have mislabelled 5% of the items in it. This introduces an unwarranted level of uncertainty, and any reduction in the percentage of misclassifications is welcome.

We thank Mr Martins da Cruz for his feedback,

Claudiu Musat, BitDefender

Is cybercrime a bigger money earner than drugs?

In the latest editorial (see VB, October 2010, p.2) you mention the rumour of cybercrime being bigger than the drugs trade (‘...today, the profits generated by cybercrime worldwide are rumoured to match the revenues yielded by the illegal drugs trade’). Unfortunately this is something that keeps being repeated by people (and so some are beginning to believe it), but it’s actually utter nonsense.

For more details see: http://www.theregister.co.uk/2009/03/27/cybercrime_mythbusters/.

Graham Cluley, Sophos

Mea culpa

Thank you Graham. Ed.
