Improving heuristics

2008-08-01

Newaz Rafiq

Zheng Group, Paretologic, Canada

Yida Mao

Zheng Group, Paretologic, Canada

Editor: Helen Martin

Abstract

Heuristic detection can provide valuable assistance to help security analysts in achieving zero-day malware detection. Newaz Rafiq and Yida Mao discuss a novel heuristic detection technique with a high level of accuracy and a high level of adaptability to meet the challenge of new malware.

Table of contents

Model

Feature extraction

File size
Obfuscation
Sections
Anomaly
BHOs
Services
Imports

Feature selection

Automatic decision making

Evaluation of our system

Fine-tuning of parameters

Experimental result

Case study

Conclusion

With proven accuracy, predictability, performance and scalability, heuristic detection can help security analysts achieve zero-day malware detection. In this article we discuss a novel heuristic detection technique with two major advantages:

A consistently high level of accuracy in malware prediction.
A high level of adaptability to meet the challenge of new malware.

Model

Our approach starts with a model that resembles the behaviour of our security analysts.

To make a prediction about a sample we need to extract features from it, just as security analysts collect features from sample executables. Analysts have prior knowledge of malware features: they know which features characterize malicious behaviour and which indicate non-malicious files, and they decide whether a sample is malicious based on that knowledge. When a sample exhibits features that are new to the analysts, they update their prior knowledge with details of the new features. Our system works in the same way. Figure 1 shows the model on which our automatic file classification system is based. Because an automated heuristic approach alone cannot be relied on to give 100% accurate detection, a manual check is incorporated before any new features are committed to the knowledge base.

Figure 1. Automated file classification system.

Feature extraction

Features can be extracted from the static and run-time behaviours of malware samples. We are able to extract hundreds of features from each executable; some of the most notable ones are described here:

File size

File size has been shown to be an important feature both in our investigations and in other studies [1]. In our initial experiments we divided executables into three groups based on their file size:

Group 1: executables whose file size was smaller than 1 MB.
Group 2: executables whose file size was smaller than 5 MB and greater than or equal to 1 MB.
Group 3: executables whose file size was greater than or equal to 5 MB.

After normalizing the counts in each group, we arrived at the results shown in Table 1.

Group	Malware (%)	Non-malware (%)
1	53	47
2	58	42
3	3	97

Table 1. File size statistics.

According to Table 1, samples in groups 1 and 2 have an approximately equal chance of being malicious or non-malicious, so for these groups file size alone reveals little that is useful for malware detection. However, executables belonging to group 3 (file size ≥ 5 MB) are significantly more likely to be non-malicious than malicious.
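The grouping above can be sketched in a few lines of Python. This is an illustration only: the function name `file_size_group` is hypothetical, and we assume 1 MB means 2**20 bytes, which the text does not specify. The lookup table simply restates the malware percentages from Table 1.

```python
MB = 1024 * 1024  # assumption: 1 MB = 2**20 bytes (the article does not say)

# Share of malware in each group, copied from Table 1.
MALWARE_SHARE_BY_GROUP = {1: 0.53, 2: 0.58, 3: 0.03}


def file_size_group(size_bytes: int) -> int:
    """Return the file size group (1, 2 or 3) for an executable."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < 1 * MB:
        return 1          # smaller than 1 MB
    if size_bytes < 5 * MB:
        return 2          # at least 1 MB, smaller than 5 MB
    return 3              # 5 MB or larger


group = file_size_group(7 * MB)
print(group, MALWARE_SHARE_BY_GROUP[group])
```

Only group 3 carries a strong signal, so a classifier built on this feature would mostly use it as evidence for the non-malicious class.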

Obfuscation

In our investigations we divided the executables into two groups: obfuscated and non-obfuscated. Obfuscation can be achieved by packing the full sample or a portion of the sample binary, by reordering instructions, and so on. We found that approximately 60% of recent malware is obfuscated. We determined that if an executable is obfuscated, there is a greater than 95% probability that it is malware.

Sections

An executable consists of sections, such as header, text, code and so on. There are generally fewer sections in malicious files than in non-malicious ones. In our analysis, more than 70% of the malware samples consisted of two or three sections, while more than 70% of non-malicious files consisted of four or five sections. In further analysis focusing on section names, we found that over 80% of malicious programs used unconventional section names, whereas only 3% of non-malicious programs used unconventional names. We also found that some executables used duplicate section names, although this was very rare (only 4%). If there is a duplicate section name, then there is a more than 95% probability that the executable is malware.

We found that use of the resource section (.rsrc) was a good indicator of a sample being malicious (with more than 70% probability), the presence of read-only data (.rdata) meant that the sample had a greater than 70% chance of being non-malicious, and the presence of import data (.idata) was also a good indicator of the sample being non-malicious (with more than 80% probability).
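As a rough sketch of how the section-based features might be derived, the following Python function takes a list of section names that has already been read from an executable. The function name, the returned keys and, in particular, the set of names treated as "conventional" are our assumptions for illustration; they are not the lists used in our system.

```python
from collections import Counter

# Assumption: an illustrative set of conventional section names.
CONVENTIONAL_NAMES = {
    ".text", ".data", ".rdata", ".idata", ".edata",
    ".rsrc", ".reloc", ".bss", ".tls", ".pdata",
}


def section_features(section_names):
    """Derive simple section-based features from a list of section names."""
    names = [name.strip().lower() for name in section_names]
    counts = Counter(names)
    return {
        "num_sections": len(names),
        "has_unconventional_name": any(n not in CONVENTIONAL_NAMES for n in names),
        "has_duplicate_name": any(c > 1 for c in counts.values()),
        "has_rsrc": ".rsrc" in counts,
        "has_rdata": ".rdata" in counts,
        "has_idata": ".idata" in counts,
    }


# Hypothetical sample with few sections, an odd name and a duplicate.
print(section_features([".text", "UPX0", "UPX0"]))
```

Each boolean in the result corresponds to one of the observations above and can be fed to the classifier as a separate binary feature.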

Anomaly

Another notable feature relates to peculiarities in the executable structure – for example, some sections in the executable may not be aligned properly. In our analysis, more than 78% of malware revealed an anomaly in the executable structure, while only 5% of non-malicious samples had an anomaly in their structure. If an anomaly exists, there is a more than 93% chance that the sample is malicious.
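The last figure follows from the first two by Bayes' rule if malicious and non-malicious samples are assumed to be equally likely beforehand. That equal prior is our assumption for this sketch, and the function name is hypothetical.

```python
def posterior_malware(p_feature_given_malware, p_feature_given_clean,
                      prior_malware=0.5):
    """P(malware | feature present) by Bayes' rule for two classes."""
    numerator = p_feature_given_malware * prior_malware
    denominator = numerator + p_feature_given_clean * (1.0 - prior_malware)
    return numerator / denominator


# 78% of malware and 5% of non-malicious samples show a structural anomaly.
print(round(posterior_malware(0.78, 0.05), 3))  # 0.94
```

With a different class balance the posterior changes accordingly, which is why the prior is exposed as a parameter.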

BHOs

Browser Helper Objects (BHOs) are program modules (DLLs) designed as plug-ins to provide added functionality for Microsoft’s Internet Explorer web browser [2]. BHOs have access to all the events and properties of a web-browsing session [3]. This means they give developers almost complete control over Internet Explorer functionality. For malware writers this is a compelling reason to use BHOs.

According to our analysis, if an executable uses a BHO, there is a 98% probability that it is malware.

Services

Services enable long-running executable applications to run in their own Windows session [4]. They can be started automatically when the computer boots, can be paused and restarted, and do not require a user interface. A service that is configured to start automatically runs in the background for as long as Windows is running. Services can also run in the security context of a specific user account that is different from the logged-on user or the default computer account.

According to our analysis, if an executable runs as a service, there is a 98% probability that it is malware.

Imports

As part of our investigations we also calculated statistics relating to the importing of DLL files. For example, if an executable imports system32.dll, then the sample has a more than 77% chance of being malware and if it imports kernel32.dll, then the sample has a more than 67% chance of being malware.

Feature selection

The accuracy of malware detection depends heavily on the selected features on which predictions are made [5]. Figure 2 shows our experimental results using two different feature selection algorithms.

Figure 2. Detection rate as the number of features varies.

From Figure 2 we can conclude:

An increase in the number of features does not guarantee better detection.
A feature selection algorithm should be chosen carefully.

To understand how feature selection helps in the malware detection process, assume that we have 500 items, of which half are malicious and half are non-malicious. These will be used to train our system. Also assume that we have detected three features: A, B, and C, for each of the 500 samples.

From our statistical analysis, we obtain the information content of each feature, as shown in Table 2.

In our model, we assign each sample a ‘likelihood’ score. The closer the score is to one, the more likely the sample is to be malware; the closer it is to zero, the more likely the sample is to be non-malicious.

Feature	A	B	C
Information	0.10	0.90	0.80

Table 2. Information content of three features.

Now assume that an executable X has two features: A and C. The likelihood scores for X according to the features selected are given in Table 3.

Table 2 indicates that more information can be drawn from feature C than from feature A. This is also reflected in Table 3. If the feature selection algorithm selects only A, the likelihood score for X is 0.49, which is inconclusive. A similarly inconclusive score (0.43) is obtained when both A and C are selected for the adjudication process. But if feature C alone is selected, the likelihood score is 0.88, which tells us that X is malware.

Feature	A	B	C	A,C
Likelihood score	0.49	0.10	0.88	0.43

Table 3. Likelihood scores for X according to selected features.

For this reason, feature selection is very important for malware detection. We have devised a few simple and time-efficient techniques to select the most informative features that produce a high accuracy of malware predictability. Some of these have been published in our previous work [6].
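To make the idea of ranking features by information content concrete, here is a minimal sketch. It assumes that 'information' is measured as the information gain (in bits) between a binary feature and the class label; this is one common measure and is not necessarily the one behind Table 2. All function names are hypothetical.

```python
import math


def entropy(pos, neg):
    """Binary entropy in bits for a set with pos and neg members."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(mal_with, mal_without, clean_with, clean_without):
    """Information gain of a binary feature with respect to the class."""
    total = mal_with + mal_without + clean_with + clean_without
    h_class = entropy(mal_with + mal_without, clean_with + clean_without)
    n_with = mal_with + clean_with
    n_without = mal_without + clean_without
    h_conditional = 0.0
    if n_with:
        h_conditional += n_with / total * entropy(mal_with, clean_with)
    if n_without:
        h_conditional += n_without / total * entropy(mal_without, clean_without)
    return h_class - h_conditional


def select_top_features(scores, k):
    """Return the names of the k highest-scoring features."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Scores taken from Table 2; with k = 2 the uninformative feature A is dropped.
print(select_top_features({"A": 0.10, "B": 0.90, "C": 0.80}, 2))  # ['B', 'C']
```

On a balanced training set such as the 500-item example, a feature that separates the classes perfectly scores 1 bit and a feature that appears equally often in both classes scores 0.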

Automatic decision making

There are many classification algorithms at our disposal. Currently we are using the naive-Bayes classification algorithm as it is both accurate and simple to implement. The simplified algorithm (assuming that there are only two classes: malware and non-malware) is given in Equation (1).

In Equation (1), x = [x₁, x₂, …, xₙ] is an array of selected features from an executable, P(c|x) is the a posteriori probability that the executable with feature set x is in class c, and P(x|c) is the probability of x occurring in class c.
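A two-class naive Bayes classifier over binary features can be sketched as follows. It uses the standard rule that P(c|x) is proportional to P(c) multiplied by the product of P(xᵢ|c) over the selected features, treating the features as independent given the class. The class name, method names and the use of Laplace smoothing are our assumptions for this illustration rather than a description of our production code, and the sketch expects both classes to be present in the training data.

```python
import math


class NaiveBayes:
    """Two-class naive Bayes over binary feature vectors (1 = feature present)."""

    def fit(self, samples, labels):
        # labels: 1 = malware, 0 = non-malware
        n_features = len(samples[0])
        self.log_prior = {}
        self.p_present = {}
        for c in (0, 1):
            rows = [x for x, y in zip(samples, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(samples))
            # Laplace smoothing (an assumption) avoids zero probabilities.
            self.p_present[c] = [
                (sum(row[i] for row in rows) + 1) / (len(rows) + 2)
                for i in range(n_features)
            ]
        return self

    def likelihood_score(self, x):
        """Return P(malware | x), a value between 0 and 1."""
        log_joint = {}
        for c in (0, 1):
            total = self.log_prior[c]
            for xi, p in zip(x, self.p_present[c]):
                total += math.log(p if xi else 1.0 - p)
            log_joint[c] = total
        # Normalise in log space for numerical stability.
        m = max(log_joint.values())
        e0 = math.exp(log_joint[0] - m)
        e1 = math.exp(log_joint[1] - m)
        return e1 / (e0 + e1)


# Toy training set: the first feature appears only in malware.
model = NaiveBayes().fit([[1, 0], [1, 1], [0, 0], [0, 1]], [1, 1, 0, 0])
print(model.likelihood_score([1, 0]))
```

The returned value plays the role of the likelihood score described earlier: values near one point to malware and values near zero to non-malicious files.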

Evaluation of our system

To evaluate our system, we use the following quantities:

True positive (TP): the number of malicious files classified as malware.
True negative (TN): the number of non-malicious files classified as non-malware.
False positive (FP): the number of non-malicious files classified as malware.
False negative (FN): the number of malicious files classified as non-malware.
True positive rate (TPR): TP / (TP + FN).
False positive rate (FPR): FP / (FP + TN).
False negative rate (FNR): FN / (TP + FN).
Detection rate (DTR):
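These quantities can be computed from the four counts as in the sketch below, which uses the standard definitions of TPR, FPR and FNR. The `accuracy` entry, the fraction of all files classified correctly, is included only as one plausible reading of a detection rate; treating it as DTR is an assumption on our part. The function name and example counts are hypothetical.

```python
def rates(tp, tn, fp, fn):
    """Compute evaluation rates from the four confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),   # malicious files correctly flagged
        "FPR": fp / (fp + tn),   # non-malicious files wrongly flagged
        "FNR": fn / (tp + fn),   # malicious files missed
        # Assumption: overall accuracy as one possible detection rate.
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical counts for 100 malicious and 100 non-malicious files.
print(rates(tp=90, tn=80, fp=20, fn=10))
```

Note that TPR and FNR always sum to one, so they carry the same information; FPR has to be tracked separately.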

Fine-tuning of parameters

K-fold cross validation is one way to determine the characteristics of an algorithm. In this technique, the data set is divided into k subsets. One of the k subsets is used as the test set and the other k − 1 subsets are merged to form a training set. The process is repeated k times, so that each subset serves as the test set exactly once, and the results are averaged. The advantage of this technique is that every sample contributes to the measurement of system performance.
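The splitting step can be sketched as a small generator. This is a simplified illustration: the function name is hypothetical, and the samples are split in their existing order, whereas in practice one would normally shuffle them first (and often keep the malware ratio similar in every fold).

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_splits(10, 3):
    print(len(train), test)
```

Each sample index appears in exactly one test fold, which is what lets every sample contribute to the measured performance.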

We fine-tuned several parameters using the cross-validation technique, but we describe only one of them here: number of features.

To begin, we used around 7,000 known executables (54% of which were malware) to train our system and to fine-tune the initial system parameters. We varied the number of features from five to 30 and plotted the results as shown in Figure 3.

Figure 3. Detection rate as the number of features varies.

As can be seen in Figure 3, our detection algorithm produces the best DTR when the number of features is 15, the lowest FPR when the number of features is 20, and the lowest FNR when the number of features is 10. Because the DTR was best at 15 features, we used that setting when testing our algorithm on newly detected malware samples. The results are described in the following section.

Experimental result

We used one group of non-malware and 28 released malware groups that had been detected by our analysis team in recent months. Each group contained around 150 to 300 samples. The results of our experiment are plotted in Figure 4, where a smooth, dashed curve shows the recognition pattern. In almost all cases, the malware recognition rate is above 90%. As the automatic decision-making system is trained on more malware samples, it utilizes more features and its accuracy continues to rise towards 100%. Our system is currently recognizing non-malware with more than 90% accuracy.

Figure 4. Detection rate across malware groups.

Case study

To gain an understanding of why our system is not 100% accurate, we have referenced the features of two malicious and two non-malicious samples in this section. We consider only those notable features that were described earlier. The features shown in bold are malware-characterizing features and the rest are non-malware-characterizing features.

Malware sample 1: number of sections = 2, no resource usage.

Malware sample 2: kernel32.dll, anomaly, no. of sections = 5, import data.

Non-malware sample 1: kernel32.dll, user32.dll, anomaly, no. of sections = 5, read-only data.

Non-malware sample 2: kernel32.dll, unconventional name, anomaly, obfuscation, import data, read-only data.

From the above information we can see that each malware sample has some malware-characterizing features, but that in these samples the non-malware-characterizing features overpower them. The converse is true for the non-malware samples, in which malware-characterizing features overpower the non-malware-characterizing ones. This means we are very unlikely to achieve 100% detection. However, by using more diverse features and a more sophisticated feature selection algorithm we can aim for a close to perfect detection rate.

Conclusion

The main features of our automatic file classification technique are as follows:

The ability to extract hundreds of features.
An intelligent feature selection algorithm.
The ability to fine-tune system parameters.
The option to update the knowledge base easily.

We are consistently achieving more than 90% accuracy in malware detection. The FPR of our system is around 10%, and we are trying to reduce this by extracting new features and by developing a new feature selection algorithm.

Bibliography

[1] Lu, B. A deeper look at malware – the whole story. Proceedings of the 17th Virus Bulletin International Conference, 2007, pp.9–17.

[2] http://en.wikipedia.org/wiki/Browser_Helper_Object.

[3] http://www.spywareinfo.com/articles/bho/.

[4] Introduction to Windows service applications. http://msdn2.microsoft.com/en-us/library/d56de412(VS.80).aspx.

[5] Goodman, S.; Hunter, A. Feature extraction algorithms for pattern classification. Proceedings of Ninth International Conference on Artificial Neural Networks, vol. 2, 1999, pp.738–742.

[6] Rafiq, A. N. M. E.; Mao, Y. A novel approach for automatic adjudication of new malware. Proceedings of The 12th World Multi-Conference on Systemics, Cybernetics and Informatics: WMSCI 2008 (to be published).
