Data tainting for malware analysis – part two

2009-11-01

Florent Marceau

CERT-LEXSI, France
Editor: Helen Martin

Abstract

In this three-part series Florent Marceau studies the use and advantages of full virtualization in the security field. Following an introduction to full virtualization in part one (see VB, September 2009, p.6), this part looks at the limitations of the technology.


Full virtualization

Many studies have been published on the subject of full virtualization and its limitations; virtualization detection in particular has received a great deal of attention [1]. For semi-assisted virtualization with a hypervisor, detection focuses mainly on finding the hypervisor itself, by looking for the relocation of key operating system structures such as the IDT (Interrupt Descriptor Table), the GDT (Global Descriptor Table) and so on (see [2]).

The main and inherent drawback of full virtualization is the need to implement all the characteristics of the architecture thoroughly. This implementation is sometimes incorrect or incomplete, and thus becomes detectable. Such flaws can be exploited to develop detection code for targeted virtual machines [3], [4], simply by using instruction sequences that behave differently on a real CPU and on an emulated one. For example, the FPU (Floating Point Unit) mnemonic ‘fnstenv’ can be used to store the FPU environment in memory, after which the FPU Instruction Pointer field of this structure holds the address of the last executed FPU opcode. A real CPU updates this field as expected, but on an emulated CPU such as Qemu this generally unused field is never updated. It is thus possible to detect the presence of the emulator.

A much simpler but very popular method of detecting virtualization is simply to check the names of the manufacturers of the different devices. The Qemu hard drive, for example, is named ‘QEMU HARDDISK’ by default – this easily reveals the emulator’s presence (and is also easy to remedy). Moreover, detection kits have been seen shared in the malware community which compare the serial number of the Windows host with the serial numbers of well-known public sandboxes such as Anubis or CWSandbox. Another very common method is to monitor for the activity of a live user (mouse movement and so on). Finally, one common countermeasure – targeting all kinds of automated analysis platforms – is to let the code sleep for a long period (20 minutes on average) in order to break out of the analysis time frame.
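The device-name check is trivial to sketch. The following Python fragment is illustrative only: the ‘QEMU HARDDISK’ string comes from Qemu’s defaults as mentioned above, while the other markers are hypothetical examples of strings malware might look for.

```python
# Illustrative sketch of a naive emulator check on device identification
# strings. Only 'QEMU' HARDDISK' is taken from the text; the other
# markers are assumed examples, not an exhaustive or authoritative list.
KNOWN_EMULATOR_MARKERS = ('QEMU', 'VBOX', 'VMWARE', 'VIRTUAL')

def looks_emulated(disk_model: str) -> bool:
    # Malware commonly greps device strings for such tell-tale substrings.
    return any(marker in disk_model.upper() for marker in KNOWN_EMULATOR_MARKERS)
```

As noted above, this check is as easy to defeat as it is to write: renaming the emulated drive is enough.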

Tainted tags propagation policy

Data tainting is a mechanism that allows us to track the full propagation of a given set of data on an information system. A full description was given in part one of this series (see VB, September 2009, p.6).

We must now define a tainted tags propagation policy. This policy is directly dependent on the potential processing that could be applied to the data we want to track. We must apply our chosen policy by modifying each opcode handler on the virtual CPU.

There are two major considerations in the definition of this policy, which are the type of data to track (e.g. code, confidential data or other data) and the potential processing the data may go through (e.g. obfuscation, encryption).

Given the context and the previous considerations, the heavy transformations imposed on our data through obfuscation and encryption could result in the loss of taint marks on legitimately marked data. It would therefore seem logical to apply a more persistent propagation strategy in order to preserve the maximum number of tainted tags. Unfortunately, a propagation policy that is too permissive will create illegitimately tainted data, generating taintmap pollution and thus many false positives. This pollution will affect a lot of binary data that is, strictly speaking, ‘viral’, massively polluting the output and drowning our configuration file in garbage, making it almost unusable. Pollution will also be caused by legitimate data belonging to the operating system that has been merged with viral data. Here, we have to find a compromise. We will establish the characterization of the configuration file later; for the moment we simply seek to keep propagation consistent on the viral binary.

Many studies have been published on the analysis of execution and data flow, sometimes using static analysis, sometimes dynamic analysis, and sometimes both. Some of these studies are applicable only to high-level languages, while others can be applied to closed-source binaries. In our case we are limited to dynamic analysis of closed-source binaries.

Let’s examine an overview of the theory.

A perfect propagation is exclusive: strictly all of the data to be monitored is marked as tainted, and no other data is. How do we define ‘monitored’ data?

Many studies of data tainting consider the problem from a confidentiality point of view: private data is marked as tainted and the propagation mechanism is used to ensure that this data cannot illegitimately leave the IS (information system). From this point of view, we want to ensure the integrity of the IS’s security and, as defined in the non-interference model created by Goguen and Meseguer [5] (dealing with both users and data), the data must under no circumstances interfere with other data unless explicitly allowed by the IS security policy.

Although our context requires exactly this kind of non-interference between the monitored data and the rest, the type of data we wish to track is radically different: our data is composed of the malicious software’s data and executable code (and there is no confidentiality requirement). While in both cases propagation should ideally be applied without loss, our tainted data has a much larger surface of interaction with the operating system, especially because the code is executed with administrator privileges, allowing partial or complete corruption of the operating system by the malware. Thus, our monitored data clearly breaks out of the established security policies. Worse, since some of our tracked data are pieces of code, attackers have a lot of leeway to perform various emulation detections such as those described earlier, and also to implement anti-tainting techniques via different covert channels (which we will discuss briefly later).

We assume here that taint propagation is only applied on particular mnemonic types, such as assignments, arithmetic operations and logical operations. This is called ‘explicit direct flow’ tracking, and it is the classical dynamic data tainting implementation. We will see that in some cases it is not enough to keep propagation consistent.
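A minimal sketch of this classical explicit direct flow scheme might look as follows. The shadow set and handler names are ours, purely for illustration; a real implementation would apply these rules inside the opcode handlers of the virtual CPU.

```python
# Sketch of explicit direct flow tracking: a shadow set of tainted
# locations (registers or addresses) and one handler per mnemonic class.
# All names are illustrative, not Qemu's.
tainted = set()

def set_taint(loc, mark):
    # Record or clear the taint mark of a location.
    tainted.add(loc) if mark else tainted.discard(loc)

def mov(dst, src):
    # Assignment: the destination takes exactly the source's mark.
    set_taint(dst, src in tainted)

def alu(dst, src):
    # Arithmetic/logical operation: tainted if either operand is.
    set_taint(dst, dst in tainted or src in tainted)

def xor(dst, src):
    # Common special case: xor reg,reg always yields zero,
    # so the result can safely be considered clean.
    if dst == src:
        set_taint(dst, False)
    else:
        alu(dst, src)
```

For instance, after marking ‘eax’ as tainted, `mov('ebx', 'eax')` taints ebx, while `xor('ebx', 'ebx')` clears it again.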

For example, a difficult scenario to handle is the use of a tainted value as an index into a non-tainted character array (the array was originally on the system and therefore not derived from monitored code). The generated strings are probably interesting, but tracking them requires implementing taint propagation between a pointer and the value it points to. This is perfectly achievable. However, in the case of object code, which naturally makes extremely intensive use of pointers, it leads to massive propagation pollution.
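The index case can be sketched as follows (an illustrative model, not an implementation): with explicit direct flows only, the looked-up byte comes out clean, and only an optional pointer-propagation rule restores the mark.

```python
# Sketch: a tainted byte indexes a clean, system-resident table. Under
# explicit-direct-flow rules the result carries no mark; a pointer
# propagation switch restores it, at the pollution cost described above.
TABLE = bytes(range(256))   # stands in for an untainted character table

def table_lookup(index, index_tainted, pointer_propagation=False):
    value = TABLE[index]
    # The table entry itself is untainted; only the pointer rule can
    # transfer the index's mark to the result.
    value_tainted = pointer_propagation and index_tainted
    return value, value_tainted
```

Enabling `pointer_propagation` for every memory access in real object code is exactly what generates the massive pollution mentioned above.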

Let’s look at another illustration of this kind of blind spot. A doubleword (32 bits) received from the network is an array of bit flags used by our process. Bit 7 of these flags is checked as follows:

(1) mov eax,[our_data]
(2) and eax,0x80
( ) je skip
(3) mov ebx,0xffffffff
( ) skip:

The previous case is known as indirect dependence (or control dependence) on our explicit flows. Another way to manage these cases, without having to propagate taint at the level of individual eflags bits, is to taint the PC (Program Counter, EIP on the x86 architecture) at each comparison (or at any mnemonic that affects eflags) with the taint value of the operands involved. Then, after each conditional jump, if the Program Counter is tainted, the values assigned in the underlying basic block (in the static analysis sense) are tainted as well. The Program Counter’s taint value then changes at the end of the basic block or at a new comparison.
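Applied to the listing above, this PC tainting rule can be simulated as follows. This is a Python sketch rather than CPU-level code, and the register and taint bookkeeping are deliberately simplified.

```python
# Simulation of PC tainting on the flag check above: the compare transfers
# the operand's mark to a PC taint, and assignments in the guarded basic
# block inherit it. Illustrative only.
def flag_check(our_data, data_tainted):
    taint = {'eax': False, 'ebx': False}
    eax = our_data                 # (1) mov eax,[our_data]
    taint['eax'] = data_tainted
    eax &= 0x80                    # (2) and eax,0x80 (taint preserved)
    pc_taint = taint['eax']        # je skip: the PC takes the operand's mark
    ebx = 0
    if eax != 0:                   # fall-through path (jump not taken)
        ebx = 0xFFFFFFFF           # (3) mov ebx,0xffffffff
        taint['ebx'] = pc_taint    # assignment under a tainted PC
    return ebx, taint['ebx']
```

When the doubleword is tainted and bit 7 is set, ebx receives both the value 0xffffffff and the taint mark, as intended.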

Note that while this method addresses the previous problem, it is only effective with a very simple control flow. It is extremely easy to modify the previous example to evade this implementation of control dependency tracking:

( ) mov eax,[our_data]
( ) and eax,0x80
( ) je skip
( ) xor ecx,ecx
(1) cmp ecx,1
( ) je skip2
( ) nop
( ) skip2:
( ) mov ebx,0xffffffff
( ) skip:

We simply have to force a new comparison on dummy untainted values in step (1) in order to remove the taint mark from the Program Counter; consequently, we lose the taint on the ebx register. To address this, we could consider stacking the different taint values of the Program Counter according to the control flow call depth. Obviously, this is completely ineffective with obfuscated code; indeed, the purpose of a packer is precisely to add many opaque predicates that complicate the control flow graph. This, combined with a technique of hashing the legitimate control flow [6], finally makes our previous implementation obsolete and possibly vulnerable. Moreover, in our context, the initial state involves a large volume of legitimately tainted data (between 15KB and 1MB on average). The tracking of indirect dependencies (as for pointer dependencies) would generate too great a degree of pollution.
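A Python sketch of this evasion (illustrative only, with simplified bookkeeping) shows how a single PC taint slot is cleared by the dummy comparison before the write to ebx:

```python
# Simulation of the evasion above: the dummy compare at (1) operates on
# untainted values, so the single PC taint slot is overwritten before
# the guarded write, and the mark on ebx is lost.
def evaded_flag_check(our_data, data_tainted):
    taint = {'eax': data_tainted, 'ecx': False, 'ebx': False}
    eax = our_data & 0x80          # mov eax,[our_data] ; and eax,0x80
    pc_taint = taint['eax']        # je skip: PC tainted from eax
    if eax != 0:                   # fall-through path
        ecx = 0                    # xor ecx,ecx (untainted constant)
        pc_taint = taint['ecx']    # (1) cmp ecx,1: PC taint cleared
        # je skip2 / nop / skip2: dummy branch on untainted data
        ebx = 0xFFFFFFFF           # mov ebx,0xffffffff
        taint['ebx'] = pc_taint    # the mark is already gone here
        return ebx, taint['ebx']
    return 0, False
```

Even with the tainted flag set, ebx now comes out of the block unmarked.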

Despite all these problems, we have an interesting advantage in the fact that we mark the monitored code: we can consider implementing the propagation of control dependencies (and pointer dependencies) only from within a tainted piece of code. The inner workings of Qemu lend themselves quite well to this modification, in the sense that Qemu itself works on basic blocks (this could be developed further in the future).

However, the tracking of indirect dependencies doesn’t guarantee that no legitimate taint marks will be lost. The problem lies in tracking implicit indirect flows (a set of assignments brought about by the non-execution of a piece of code). Consider a form similar to the previous example:

( ) mov al,byte ptr [our_data]
( ) mov ecx,0xff
( ) do_it:
(3) mov ebx,1
(1) cmp al,cl
( ) je skip
(2) xor ebx,ebx
( ) skip:
(4) test ebx,ebx
( ) jne done
( ) loop do_it
( ) done:

In this new case (example taken from [7]), we loop on ecx. While the value of cl differs from the tainted value in al (1), the register ebx is tainted (2), and at each new iteration the taint of ebx is cleared by the assignment (3). When equality between al and cl is reached at (1), the value of ebx remains unchanged. Ebx is therefore untainted when (4) tests its non-null value as the loop exit condition. We then reach the label ‘done’ with cl equal to our tainted value, but without being able to propagate the taint mark to the register ecx.
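A small Python simulation of this loop, applying the control dependence rule only on the executed path, shows ecx ending up holding the secret value with no mark. The bookkeeping is simplified and illustrative.

```python
# Simulation of the loop above with per-register taint flags and PC
# tainting applied to executed assignments only; the exit path leaves
# ecx equal to the tainted byte but unmarked.
def implicit_flow_loop(secret_al):
    assert 1 <= secret_al <= 0xff          # fits the 0xff countdown
    taint = {'al': True, 'ebx': False, 'ecx': False}
    ecx = 0xff                             # mov ecx,0xff
    while True:                            # do_it:
        ebx = 1                            # (3) untainted constant
        taint['ebx'] = False               #     clears the mark
        pc_taint = taint['al']             # (1) cmp al,cl taints the PC
        if (ecx & 0xff) != secret_al:      # je skip (not taken)
            ebx = 0                        # (2) executed: inherits PC taint
            taint['ebx'] = pc_taint
        if ebx != 0:                       # (4) test ebx,ebx / jne done
            break                          #     ebx untainted on exit
        ecx -= 1                           # loop do_it
    return ecx & 0xff, taint['ecx']        # done: secret value, no mark
```

The function returns the tainted byte’s value together with `False` for the taint of ecx, which is precisely the loss described above.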

There are various methods to propagate tainting despite this kind of implicit indirect flow; some of them use static pre-analysis, while others apply only to theoretical machine models dedicated to research (see [8]). There is no absolute solution here.

Another common problem referred to in the data-flow literature is the use of covert channels, but since our context does not deal with privacy, information leaks through timing covert channels, as discussed in [7], do not affect us. However, other covert channels could. The previous example (pointer propagation), which uses untainted data originally present on the system to illegitimately generate untainted viral data, is proof of this. And it is not the only example: consider malicious software whose configuration file consists only of cryptographic hashes of its target name strings. It would read the name of the site currently being visited, generate a digest and compare it with those in its configuration file. These kinds of blind spots render the solution completely ineffective.
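Such a hash-based blind spot can be sketched as follows. The hash algorithm, names and strings are our illustrative assumptions, not taken from any real sample.

```python
import hashlib

# Sketch of the hash covert channel: the configuration ships only digests
# (untainted data embedded in the binary), and the decision about the
# tainted visited-site name is made through a comparison, i.e. through
# control flow rather than an explicit data flow.
TARGET_DIGESTS = {hashlib.md5(b'bank.example').hexdigest()}  # untainted config

def is_target(visited_site: bytes) -> bool:
    # The digest of the tainted name would itself be tainted under
    # explicit-flow rules, but the boolean result of the membership
    # test (a comparison) is not.
    return hashlib.md5(visited_site).hexdigest() in TARGET_DIGESTS
```

No tainted byte ever needs to reach the malware’s output for it to act on the tainted information.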

Conclusion

Data tainting is a powerful tool, but one that is very difficult to calibrate. The main difficulty lies in establishing a propagation policy that is sufficiently fine-grained yet does not lead to full pollution of the taintmap, and then in implementing it. This can be done over time by calibrating against different samples of malicious software.

In the third part of this article we will look at the implementation of data tainting.

Bibliography

[1] Raffetseder, T.; Kruegel, C.; Kirda, E. Detecting System Emulators. http://www.seclab.tuwien.ac.at/papers/detection.pdf.

[2] Rutkowska, J. Red Pill... or how to detect VMM using (almost) one CPU instruction. http://www.invisiblethings.org/papers/redpill.html.

[3] Ferrie, P. Attacks on More Virtual Machine Emulators. http://pferrie.tripod.com/papers/attacks2.pdf.

[4] Ormandy, T. An Empirical Study into the Security Exposure to Hosts of Hostile Virtualized Environments. http://taviso.decsystem.org/virtsec.pdf, http://www.iseclab.org/papers/ttanalyze.pdf.

[5] Goguen, J.A.; Meseguer, J. Security Policy and Security Models. Proc. IEEE Symp. Security and Privacy, pp.11–20, 1982. http://www.cs.ucsb.edu/~kemm/courses/cs177/noninter.pdf.

[6] Collberg, C.; Thomborson, C.; Low, D. A Taxonomy of Obfuscating Transformations. http://www.cs.arizona.edu/~collberg/Research/Publications/CollbergThomborsonLow97a/A4.pdf.

[7] Cavallaro, L.; Saxena, P.; Sekar, R. Anti-Taint-Analysis: Practical Evasion Techniques Against Information Flow Based Malware Defense. http://www.seclab.cs.sunysb.edu/seclab/pubs/ata07.pdf.

[8] Le Guernic, G. Confidentiality Enforcement Using Dynamic Information Flow Analyses. http://tel.archives-ouvertes.fr/docs/00/19/86/21/PDF/thesis_report.pdf.
