File-fraction reputation based on digest of high granularity

Ethan YX Chen, Trend Micro

Deciding whether a file is benign or malicious has long been a critical problem for the anti-malware industry. One approach that has recently become popular is to build a large database that stores characteristics of file instances (e.g. download source, prevalence or age) and/or file analysis results for later use. Approaches of this kind are categorized as reputation-based technology.

Reputation-based technology applies statistical methods to these characteristics to determine the reputation of a file. The characteristics it uses (e.g. prevalence, URL) usually differ from those used by content-based technology (e.g. a malware definition composed of byte sequences).

In this paper we propose a solution that combines the reputation-based and content-based approaches. It offers a different perspective on the fight against today's highly polymorphic, micro-distributed malware. The basic idea is to factorize content into 'fractions' using a rolling hash, and then to build reputation information for those fractions. The content to be factorized can come either from raw files or from memory dumps; using memory dumps helps to detect packed malware and also benefits memory forensics. Malware files of the same family often share at least several identical fractions, especially fractions taken from memory dumps. Some fractions can also be identified as part of a particular tool, whether benign/neutral (e.g. AutoIT), aggressive (e.g. a remote control tool) or malicious (e.g. a malware toolkit).
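The idea can be illustrated with a minimal sketch. The abstract does not specify the rolling hash, the window size, the boundary rule, the digest, or the reputation statistic, so everything named below (`WINDOW`, `MASK`, `split_fractions`, `fraction_digest`, `FractionReputation`, the polynomial hash and the use of SHA-256) is an assumption chosen for illustration, not the method used in the paper.

```python
import hashlib
from collections import defaultdict
from typing import Optional

# All constants are illustrative assumptions, not values from the paper.
WINDOW = 16            # number of trailing bytes covered by the rolling hash
MASK = (1 << 6) - 1    # cut when the low 6 bits are zero (about 64-byte fractions)
BASE = 257             # polynomial base
MOD = (1 << 31) - 1    # modulus for the polynomial hash


def split_fractions(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries chosen by a rolling hash."""
    fractions = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the contribution of the byte that slides out of the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        # Once the window is full, the cut decision depends only on the
        # last WINDOW bytes, so identical content yields identical cuts.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            fractions.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        fractions.append(data[start:])
    return fractions


def fraction_digest(fraction: bytes) -> str:
    """Digest used as the database key for one fraction (SHA-256 assumed)."""
    return hashlib.sha256(fraction).hexdigest()


class FractionReputation:
    """Toy in-memory reputation store: per-fraction counts by sample label."""

    def __init__(self) -> None:
        self.counts = defaultdict(lambda: {"benign": 0, "malicious": 0})

    def add_sample(self, data: bytes, label: str) -> None:
        # Count each distinct fraction once per sample.
        for digest in {fraction_digest(f) for f in split_fractions(data)}:
            self.counts[digest][label] += 1

    def malicious_ratio(self, digest: str) -> Optional[float]:
        """Share of samples containing this fraction that were malicious."""
        entry = self.counts.get(digest)
        if entry is None:
            return None  # fraction never seen
        return entry["malicious"] / (entry["benign"] + entry["malicious"])
```

Because the cut points depend only on local content, two inputs that embed the same block behind different prefixes tend to produce identical fractions inside that block, which is what lets reputation be shared across variants. A practical system would probably also bound fraction sizes and weight the counts by prevalence or age; those refinements are omitted here.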

Several possible applications of file-fraction reputation will also be discussed.