Igor Muttik McAfee
Heuristic detection methods are well established and well known in the AV industry. In the end they usually produce a
decision as to whether the analysed object is suspicious enough to be detected or not.
As opposed to such traditional heuristics, in this paper we explore the possibilities of using 'weak' heuristics based on
data mining technologies in malware analysis. By weak heuristics we mean utilizing a large set of simple features which
generally would not qualify as elements for traditional malware heuristic routines (the examples include features like
count of sections in a PE file, number of imports, depth of resource tree, size, time stamps, etc.). In fact, some of them
may be 'negative' heuristics. 'Weak' heuristics based on such general features may not be suitable for detection but they
are still very useful in AV operation.
We describe how data-mining techniques can be used to provide such 'weak' heuristics, we compare the properties of
decision trees/forests and support vector machines in application to malware analysis.
We discuss the selection of the features extracted from a sample (or obtained by a behavioural product when a program
runs), the construction of a primitive decision tree from these features and the properties of the ROC (receiver operating
characteristic) curve. The same decision tree can be used at its strong and weak points on the ROC curve and used to drive
different decisions in security products.
We list possible uses of weak heuristics:
del.icio.us digg this
- to prioritize and de-prioritize samples in research queues
- to drive the depth of the sample analysis (emulation, behaviour monitoring) on the endpoint, gateway or a server
- to check the most suspicious samples using cloud-based security