Malware mining

Igor Muttik McAfee

Heuristic detection methods are well established and well known in the AV industry. In the end they usually produce a decision as to whether the analysed object is suspicious enough to be detected or not.

As opposed to such traditional heuristics, in this paper we explore the possibilities of using 'weak' heuristics based on data mining technologies in malware analysis. By weak heuristics we mean utilizing a large set of simple features which generally would not qualify as elements for traditional malware heuristic routines (the examples include features like count of sections in a PE file, number of imports, depth of resource tree, size, time stamps, etc.). In fact, some of them may be 'negative' heuristics. 'Weak' heuristics based on such general features may not be suitable for detection but they are still very useful in AV operation.

We describe how data-mining techniques can be used to provide such 'weak' heuristics, we compare the properties of decision trees/forests and support vector machines in application to malware analysis.

We discuss the selection of the features extracted from a sample (or obtained by a behavioural product when a program runs), the construction of a primitive decision tree from these features and the properties of the ROC (receiver operating characteristic) curve. The same decision tree can be used at its strong and weak points on the ROC curve and used to drive different decisions in security products.

We list possible uses of weak heuristics:

to prioritize and de-prioritize samples in research queues
to drive the depth of the sample analysis (emulation, behaviour monitoring) on the endpoint, gateway or a server
to check the most suspicious samples using cloud-based security
etc.