`
`Charles LeDoux and Arun Lakhotia
`
`Abstract Malware analysts use Machine Learning to aid in the fight against
`the unstemmed tide of new malware encountered on a daily, even hourly, basis.
`The marriage of these two fields (malware and machine learning) is a match made
`in heaven: malware contains inherent patterns and similarities due to code and code
`pattern reuse by malware authors; machine learning operates by discovering inherent
`patterns and similarities. In this chapter, we seek to provide an overhead, guiding
`view of machine learning and how it is being applied in malware analysis. We do
`not attempt to provide a tutorial or comprehensive introduction to either malware or
`machine learning, but rather the major issues and intuitions of both fields along with
`an elucidation of the malware analysis problems machine learning is best equipped
`to solve.
`
`1 Introduction
`
`Malware, short for malicious software, is the weapon of cyber warfare. It enables
`online sabotage, cyber espionage, identity theft, credit card theft, and many more
`criminal, online acts. A major challenge in dealing with the menace, however, is its
`sheer volume and rate of growth. Tens of thousands of new and unique malware are
`discovered daily. The total number of new malware has been growing exponentially,
`doubling every year over the last three decades.
`Analyzing and understanding this vast sea of malware manually is simply impos-
`sible. Fortunately for the malware analyst, very few of these unique malware are truly
`novel. Writing software is a hard problem, and this remains the case whether said
`software is benign or malicious. Thus, malware authors often reuse code and code
`
`C. LeDoux (B) · A. Lakhotia
`
`Center for Advanced Computer Studies, University of Louisiana at Lafayette,
`PO Box 44330, Lafayette, LA 70504, USA
`e-mail: charles.a.ledoux@gmail.com
`
`A. Lakhotia
`e-mail: arun@louisiana.edu
`
`© Springer International Publishing Switzerland 2015
`R.R. Yager et al. (eds.), Intelligent Methods for Cyber Warfare,
`Studies in Computational Intelligence 563, DOI 10.1007/978-3-319-08624-8_1
`
`
`patterns in creating new malware. The result is the existence of inherent patterns and
`similarities between related malware, a weakness that can be exploited by malware
`analysts.
`In order to capitalize on this inherent similarity and shared patterns between
`malware, the anti-malware industry has turned to the field of Machine Learning, a
`field of research concerned with “teaching” computers to recognize concepts. This
`“learning” occurs through the discovery of indicative patterns in a group of objects
`representing the concept being taught or by looking for similarities between objects.
Though humans too use patterns in learning, such as using color, shape, sound, and
smell to recognize objects, machines can find patterns in large swaths of data that
may be gibberish to a human, such as the patterns in the sequences of bits of a
collection of malware. Machine Learning is thus a natural fit for Malware Analysis:
it can learn and find patterns in the ever-growing corpus of malware far more
rapidly than humans can.
`Both Machine Learning and Malware Analysis are very diverse and varied fields
`with equally diverse and varied ways in which they overlap. In this chapter, we seek to
`provide a guiding, overhead cartography of these varied landscapes, focusing on the
`areas and ways in which they overlap. We do not seek to provide a comprehensive
`tutorial or introduction to either Malware or Machine Learning research. Instead,
we strive to elucidate the major ideas, issues, and intuitions of each field,
pointing to further resources where necessary. It is our intention that a researcher
in either Malware Analysis or Machine Learning can read this chapter and gain a
high-level understanding of the other field and of the problems in Malware Analysis
that Machine Learning has been, is being, and can be used to solve.
`
`2 A Short History of Malware
`
The theory of malware is almost as old as the computer itself, tracing back to
lectures by von Neumann in the late 1940s on self-reproducing automata [1]. These
early malware, if they can be called such, did nothing significantly more than
demonstrate self-reproduction and propagation. For example, one of the earliest
malware to escape "into the wild" was called Elk Cloner; it would simply display a
small poem every 50th time an infected computer was booted:
`
Elk Cloner: The program with a personality

It will get on all your disks
It will infiltrate your chips
Yes it's Cloner!

It will stick to you like glue
It will modify ram too
Send in the Cloner!
`
`
`
`Malware and Machine Learning
`
`
The term computer virus was coined in the early 1980s to describe such
self-replicating programs [2]. The use of the term was influenced by the analogy of
computer malware to biological viruses. A biological virus comes alive after it infects
a living organism. Similarly, the early computer viruses required a host—typically
another program—to be activated. This was necessitated by the limitations of the
computing infrastructure of the time, which consisted of isolated, stand-alone
machines. In order to propagate, that is, to infect currently uninfected machines, a
computer virus had to copy itself onto various drives, tapes, and folders that would be
`accessed by different machines. In order to ensure that the viral code was executed
`when it reached the new machine, the virus code would attach itself to, i.e. infect,
`another piece of code (a program or boot sector) that would be executed when the
`drive or tape reached another machine. When the now infected code would later
`execute, so would the viral code, furthering the propagation.
The early viruses remained mostly pranks. Any damage they caused, such as
crashing a computer or exhausting disk space, was largely unintentional, a side effect
of uncontrolled propagation. However, the number and spread of viruses quickly
grew into enough of a nuisance to prompt the founding of the first anti-virus
companies in the late 1980s. Those early viruses were simple enough that they could
be detected by matching specific sequences of bytes, i.e., signatures.
`The advent of networking, leading to the Internet, changed everything. Since
`data could now be transferred between computers without using an external storage
device, so could the viruses. This freedom to propagate also meant that a virus no
longer needed to infect a host program. A new class of malware, called the worm,
emerged. A worm was a stand-alone program that could propagate from machine to
machine without necessarily attaching to any other program.
Malware writing, too, quickly morphed from simple pranks into malicious
vandalism, such as that done by the ILOVEYOU worm. This worm came as an attachment
`to an email with the (unsurprising) subject line “ILOVEYOU”. When a user would
`open the attachment, the worm would first email itself to the user’s contacts and
`then begin destroying data on the current computer. There were a number of similar
`malware created, designed only to wreak havoc and gain underground notoriety for
`their authors. These “graffiti” malware, however, soon gave way to the true threat:
`malware designed to make money and steal secrets.
Malware today bears little if any resemblance to the malware of the past. For one,
gone are the simple days of pranks and vandalism conducted by bored teenagers and
budding hackers. Modern malware is a well-organized activity forming a complete
underground economy with its own supply chain. Malware is now a tool used by large
`underground organizations for making money and a weapon used by governments
`for espionage and attacks. Malware targeted towards normal, everyday computers
`can be designed to steal bank and credit card information (for direct theft of money),
`harvest email addresses (for selling to spammers), or gain remote control of the
`computer. The major threat from malware, however, comes from malware targeted not
`towards the average computer, but towards a particular corporation or government.
`These malware are designed to facilitate theft of trade or national secrets, steal
`crucial information (such as sensitive emails), or attack infrastructure. For example,
`
`
`
`
`Stuxnet was malware designed to attack and damage various nuclear facilities in
`Iran. These malware often have large organizations (such as rival corporations) or
`even governments behind them.
`
`3 Types of Malware
`
`Whenever there is a large amount of information or data, it helps to categorize and
`organize it so that it can be managed. Classification also aids in communication
`between people, giving them a common nomenclature. The same is true of malware.
`The industry uses a variety of methods to classify and organize malware. The classi-
`fication is often based on the method of propagation, the method of infection, and the
`objective of the malware. There is, however, no known standard nomenclature that
`is used across the industry. Classifications sometimes also come with legal impli-
`cations. For instance, can a program that inserts advertisements as you browse the
`web be termed as malicious. What if the program was downloaded and installed by
`the user, say after being enticed by some free offering? To thwart legal notices the
`industry invented the term potentially unwanted program or PUP to refer to such
`programs.
Though there is no accepted standard for classification of malware in the industry,
there is reasonable agreement on classifying malware by method of propagation
into three types: virus, worm, and trojan (short for Trojan horse).
`Virus, despite being often used as a synonym for malware, technically refers to a
`malware that attaches a copy of itself to a host, as described earlier. Propagation by
`infecting removable media was the only method for transmission available prior to
`the Internet, and this method is still in use today. For instance, modern viruses travel
`by infecting USB drives. This method is still necessary to reach computer systems
`that are not connected to the Internet, and is hypothesized as the way Stuxnet was
`transmitted.
A trojan propagates the same way its namesake entered the city of Troy: by hiding
inside something that seems perfectly innocent. The earliest trojan was a game called
ANIMAL. This simple game would ask the user a series of questions and attempt to
guess what animal the user was thinking of. When the game was executed, a hidden
program, named PERVADE, would install a copy of itself and ANIMAL to every
`location the user had access to. A common modern example of a trojan is a fake
`antivirus, a program that purports to be an anti-virus system but in fact is a malware
`itself.
`A worm, as mentioned earlier, is essentially a self-propagating malware. Whereas
`a virus, after attaching itself to a program or document, relies on an action from a
`user to be activated and spread, a worm is capable of spreading between network
connected computers all by itself. This is typically accomplished in one of two ways:
by exploiting vulnerabilities in a networked service or through email. The worm CODE
`RED was an example of the first type of worm. CODE RED exploited a bug in a
`specific type of server that would allow a remote computer to execute code on the
`
`
`
`
`server. The worm would simply scan the network looking for a vulnerable server.
`Once found, it would attempt to connect to the server and exploit the known bug.
`If successful, it would create another instance of the worm that repeated the whole
`process. The ILOVEYOU worm, discussed earlier, is an example of an email worm
`and spread as an email attachment. When a user opened the attachment, the worm
`would email a copy of itself to everyone in the user’s contact list and damage the
`current machine.
While the above methods of propagation are the most commonly known, they
by no means represent all possible ways in which malware can propagate. In general,
`one of two methods are employed to get a malware onto a system: exploit a bug in
`software installed on the computer or exploit the trust (or ignorance) of the user of
`the computer through social engineering. There are many different types of software
bugs that allow arbitrary code to be executed and almost as many ways to trick
a user into installing a malware. Complicating matters further, there is no technical
reason for a malware to limit itself to only one method of propagation. It is entirely
`conceivable, as was demonstrated by Stuxnet, for a malware to enter a network
`through email or USB, and then spread laterally to other machines by exploiting
`bugs.
`
`4 Malware Analysis Pipeline
`
`The typical end goal of malware analysis is simple: automatically detect malware
`as soon as possible, remove it, and repair any damage it has done. To accomplish
`this goal, software running on the system being protected (desktop, laptop, server,
`mobile device, embedded device, etc.) uses some type of “signatures” to look for
`malware. When a match is made on a “signature”, a removal and repair script is
`triggered. The various portions of the analysis “pipeline” all in one way or another
`support this end goal [3, 4].
`The general phases of creating and using these signatures are illustrated by Fig. 1.
`Creating a signature and removal instructions for a new malware occurs in the “Lab.”
`The input into this malware analysis pipeline is a feed of suspicious programs to
`be analyzed. This feed can come from many sources such as honeypots or other
`companies. This feed first goes through a triage stage to quickly filter out known
`programs and assign an analysis priority to the sample. The remaining programs
are then analyzed to discover what they look like and what they do. The results of the
`analysis phase are used to create a signature and removal/repair instructions which
`are then verified for correctness and performance concerns. Once verified, these
`signatures are propagated to the end system and used by a scanner to detect, remove,
`and repair malware.
`Each of the various phases of the anti-malware analysis process is attempting to
`accomplish a related, but independent task and thus has its own unique goals and
`performance constraints. As a result, each phase can independently be automated
`and optimized in order to improve the performance of the entire analysis pipeline.
`
`
`
`
`Fig. 1 Phases of the malware analysis pipeline
`
`In fact, it is almost a requirement that automation techniques be tailored for the
`specific phase they are applied in, even if the technique could be applied to multiple
`phases. For example, a machine learning algorithm designed to filter out already
`analyzed malware in the triage stage will most likely perform poorly as a scanner.
`While both the triage stage and the scanner are accomplishing the same basic task,
`detect known malware, the standard by which they are evaluated is different.
`
`4.1 Triage
`
`The first phase of analysis, triage, is responsible for filtering out already analyzed
`malware and assigning analysis priority to the incoming programs. Malware ana-
`lysts receive a very large number of new programs for analysis every day. Many
`of these programs, however, are essentially the same as programs that have already
been analyzed and for which signatures exist. A time stamp or other trivial detail
may have been changed, causing the hash of the binary to be unique. Thus, while the
program is technically unique, it does not need to be reanalyzed, as the differences
are inconsequential. One of the purposes of triage is to filter these binaries out.
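This filtering problem can be sketched in a few lines. The byte strings below are invented stand-ins for binaries; the similarity measure (Jaccard similarity over byte n-grams, loosely in the spirit of fuzzy hashing) is one simple possibility, not the industry's actual algorithm:

```python
import hashlib

def ngrams(data: bytes, n: int = 4) -> set:
    """All overlapping byte n-grams occurring in the data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def jaccard(a: bytes, b: bytes) -> float:
    """Jaccard similarity (0.0 to 1.0) of the two files' n-gram sets."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb)

# Two hypothetical "binaries", identical except for an embedded timestamp.
original = b"MZ|code-section|" + b"timestamp:20200101" + b"|more-code|" * 10
variant  = b"MZ|code-section|" + b"timestamp:20240615" + b"|more-code|" * 10

# A cryptographic hash sees them as totally unrelated files...
hashes_differ = hashlib.sha256(original).digest() != hashlib.sha256(variant).digest()

# ...while n-gram similarity still recognizes them as near-duplicates.
similarity = jaccard(original, variant)
print(hashes_differ, similarity > 0.5)
```

Production systems use purpose-built similarity digests rather than raw n-gram sets, but the principle is the same: compare content rather than exact hashes.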
`In addition to filtering out “exact” matches (programs that are essentially the
`same as already analyzed programs), triage is typically also tasked with assigning
`the incoming programs into malware families when possible. A malware family is
`a group of highly related malware, typically originating from common source code.
`If an incoming program can be assigned to a known malware family, any further
`analysis does not need to start with zero a priori knowledge, but can leverage general
`knowledge about the malware family, such as known intent or purpose.
`A final purpose of the triage stage is to assign analysis priority to incoming
programs. Humans still are, and most likely will remain, an integral part of the
analysis pipeline. Like any other resource, the tasks on which the available human
labor is expended must be carefully chosen. Not all malware are created equal; it is more important
`
`
`
`
`that some malware have signatures created before others. For example, malware that
`only affects, say, Microsoft Windows 95 will not have the same priority as malware
`that affects the latest version of Windows.
`The performance concerns for the triage phase are (1) ensuring that programs
`being filtered out truly should be removed and (2) efficient computation in order to
`achieve very high throughput. Programs filtered out by triage are not subjected to
`further analysis and thus it is very important that they do not actually need further
`analysis. Especially dangerous is the case of malware being filtered out as a benign
`program. In this case, that particular malware will remain undetectable. Marking
`a known malware or a benign program as malware for further processing, while
`undesirable, is not disastrous as it can still be filtered out in the later processing stages.
`Along the same lines, it is sufficient that malware be assigned to a particular family
`with only a reasonably high probability rather than near certainty. Finally, speed is
`of the utmost importance in this stage. This stage of the analysis pipeline examines
`the largest number of programs and thus requires the most efficient algorithms.
`Computationally expensive algorithms at this stage would cause a backlog so great
`that analysts would never be able to keep up with malware authors.
`
`4.2 Analysis
`
`In the analysis phase, information about what the program being analyzed does, i.e.
`its behavior, is gathered. This can be done in two ways: statically or dynamically.
`Static analysis is performed without executing the program. Information about
`the behavior of the program is extracted by disassembling the binary and converting
it back into human-readable machine code. This is not high-level source code, such
as C++, but low-level assembly language. An assembly language is the human-readable
form of the instructions given directly to the processor. ARM, PowerPC, and x86 are
among the better-known examples of assembly languages. After disassembly,
`the assembly code (often just called the malware “code” for short) can be analyzed
`to determine the behavior of the program. The methods for doing this analysis con-
`stitute an entire research field called program analysis and as such are outside the
`scope of this chapter. Nielson et al. [5] have a comprehensive tutorial to this field.
Static analysis can theoretically provide perfect information about the behavior of
a program, but in practice provides an over-approximation of the behaviors present.
Only what is in the code can be executed; thus, the code contains everything
the program can do. However, extracting this information from a binary can be
difficult, if not impossible. Many of the problems of static analysis are undecidable,
and so cannot be solved perfectly.
`As an example of the problems faced by static analysis, binary disassembly is
`itself an undecidable problem. Binaries contain both data and code and separating
`the two from each other is undecidable. As a result some disassemblers treat the entire
`binary, including data, as if it were code. This results in a proper extraction of most
`of the original assembly code, along with much code that never originally existed.
`
`
`
`
`There are many other methods of disassembly, such as the recent work by Schwarz
`et al. [6]. While these methods significantly improve on the resulting disassembly,
`none can guarantee correct disassembly. For instance, it is possible that there exists
`“dead code” in the original binary, i.e. code that can never be reached at runtime.
In an ideal disassembly, such code ought to be excluded. Thus, all of static analysis
operates on approximations. Most disassemblers used in practice guarantee neither
an over-approximation nor an under-approximation.
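The code/data ambiguity can be made concrete with a toy linear-sweep disassembler over an invented three-opcode instruction set (not any real architecture). Because the sweep cannot tell where data ends and code resumes, embedded data bytes decode as a phantom instruction that also swallows real code:

```python
# Toy variable-length instruction set (hypothetical, for illustration only):
#   0x01        -> "nop"        (1 byte)
#   0x02 imm    -> "push imm"   (2 bytes)
#   0x03 a b    -> "mov a, b"   (3 bytes)
OPCODES = {0x01: ("nop", 1), 0x02: ("push", 2), 0x03: ("mov", 3)}

def linear_sweep(raw: bytes) -> list:
    """Disassemble by decoding instructions one after another,
    treating every byte as code, including embedded data."""
    out, i = [], 0
    while i < len(raw):
        name, size = OPCODES.get(raw[i], ("db", 1))  # unknown byte -> data
        out.append(name)
        i += size
    return out

code    = bytes([0x01, 0x02, 0x10])     # real code: nop; push 0x10
data    = bytes([0x03, 0x01])           # two embedded data bytes, not code
program = code + data + bytes([0x01])   # real code resumes with a nop

# The data bytes decode as a phantom "mov" whose operands consume the
# real trailing nop, so the disassembly is wrong in both directions.
print(linear_sweep(program))
```

Real disassemblers use much smarter heuristics (recursive traversal, speculative decoding), but the underlying ambiguity remains.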
Dynamic analysis, in contrast with static analysis, is conducted by actually
executing the program and observing what it does. The program can be observed
from either within or without the execution environment. Observation from within
uses the same tools and techniques software developers use to debug their own
programs: tools that observe the operating system state can be utilized, and the
analyzed program can be run in a debugger. Observation from without the execution
environment occurs by using a
`specially modified virtual machine or emulator. The analyzed program is executed
`within the virtual environment and the tools providing the virtualization observe and
`report the behavior of the program.
Dynamic analysis, as opposed to static analysis, generally provides an
under-approximation of the behaviors contained in the analyzed program, but it
guarantees that the returned behaviors can be exhibited. Behaviors discovered by dynamic analysis
`are obviously guaranteed to be possible as the program was observed performing
`these behaviors. Only the observed behaviors can be returned, however. A single
`execution of a program is not likely to exhibit all the behaviors of the program as
`only a single path of execution through the binary is followed per run. A differing
`execution environment or differing input may reveal previously unseen behaviors.
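The under-approximation can be sketched with a stand-in "program" whose observable behavior depends on its input; any single execution reveals only the path that was actually taken. The behavior names and trigger input below are invented for illustration:

```python
def sample_program(user_input: str) -> list:
    """Stand-in for a program under analysis: its observable behavior
    depends on the input it happens to receive (names are invented)."""
    actions = ["read_config"]
    if user_input == "trigger-phrase":
        actions += ["open_network_socket", "exfiltrate_data"]
    else:
        actions += ["sleep"]
    return actions

def dynamic_trace(program, argument) -> set:
    """One 'execution' of the program: only the behaviors actually
    performed on this run are observed and recorded."""
    return set(program(argument))

run_a = dynamic_trace(sample_program, "ordinary input")
run_b = dynamic_trace(sample_program, "trigger-phrase")

# Every observed behavior is genuinely possible, but a single run can
# miss behaviors that a different input would have revealed.
print("exfiltrate_data" in run_a, "exfiltrate_data" in run_b)
```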
`
`4.3 Signatures and Verification
`
`While the most common image conjured by the phrase “malware signatures” is
`specific patterns of bytes (often called strings) used by an Anti-Virus system to detect
`a malware, we do not use the term in that restricted sense. What we mean by signature
`is any method utilized for determining if a program is malware. This can include the
`machine learning system built to recognize malware, a set of behaviors marked as
`malicious, a white list (anything not on the white list is marked as malicious), and
`more. The important thing about a signature is that it can be used to determine if a
`program is malware or not.
`Along with the signatures, instructions for how to remove malware that has
`infected the system and repair any damage it has done must also be created. This
`is usually done manually, utilizing the results of the analysis stage. Observe what
`the malware did, and then reverse it. One major concern here is ensuring that the
`repair instructions do not cause even more damage. If the malware changed a registry
`key, for example, and the original key is unknown, it may be safest to just leave the
key alone. Changing it to a different value or removing it altogether may result
`
`
`
`
`in corrupting the system being “protected.” Thus repair instructions are often very
`conservative, many times only removing the malware itself.
`Once created, the signatures need to be verified for correctness and, more impor-
`tantly, for accuracy. Even more important than creating a signature that matches the
`malware is creating a signature that only matches the malware. Signatures that also
match benign programs are worse than useless; they are acting like malware
themselves! Flagging a benign program as malware, called a false positive,
is an error that cannot be tolerated once the signatures have been deployed to the
`scanner.
`
`4.4 Application
`
`Once created, the signatures are deployed to the end user. At the end system, new
files are scanned using the created signatures. When a file matches a signature, the
associated repair instructions are followed.
The functionality of the scanner depends on the type of signature created.
String-based signatures use a scanner that checks for the existence of the string in
the file. A scanner based on Machine Learning signatures applies what has been
learned through ML to detect malware. A rule-based scanner checks whether the file
matches its rules, and so on.
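A string-based scanner can be sketched in a few lines. The signature names and byte patterns below are invented; real scanners use efficient multi-pattern matching (such as Aho-Corasick) and far richer signature formats:

```python
# Hypothetical signature database: name -> byte pattern ("string").
SIGNATURES = {
    "Example.Trojan.A": bytes.fromhex("deadbeef90909090"),
    "Example.Worm.B":   b"ILOVEYOU-payload-marker",
}

def scan(file_bytes: bytes) -> list:
    """Return the names of every signature whose pattern occurs in the file."""
    return [name for name, pattern in SIGNATURES.items()
            if pattern in file_bytes]

clean    = b"just an ordinary document"
infected = b"header" + bytes.fromhex("deadbeef90909090") + b"rest of file"

print(scan(clean), scan(infected))
```

When a scan returns a non-empty list, the scanner would trigger the repair instructions associated with the matched signature.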
`
`5 Challenges in Malware Analysis
`
`One of the fundamental problems associated with every step of the malware analysis
`pipeline is the reliance on incomplete approximations. In every stage of the pipeline,
an exact solution is generally impossible. Triage cannot perfectly identify every
program that has already been analyzed. Analysis will generate information that is
potentially inaccurate or incomplete. All types of signatures are limited.
`Even verification is limited by what can be practically tested.
`Naturally, malware authors have developed techniques that directly attack each
`stage of the analysis pipeline and shift the error in the inherent approximations to their
`favor. Packing and code morphing are used against triage to increase the number of
`“unique” malware that must be analyzed. Packing, tool detection, and obfuscation are
used against the analysis stage to increase the difficulty of extracting any meaningful
`information.
`While the ultimate goal of the malware authors is obviously to completely avoid
`detection, simply increasing the difficulty of achieving detection can be considered a
“win” for the malware authors. The more resources consumed in analyzing a single
malware, the fewer malware in total can be analyzed and detected. If this per-sample
cost is driven high enough, then detecting any but the most critical malware simply
becomes too expensive.
`
`
`
5.1 Code Morphing
`
`The most common and possibly the most effective attack against the malware analysis
`pipeline targets the first stage: triage. The attack is to simply inundate the pipeline
`with as many unique malware as possible. Unique is not used here to mean novel,
i.e. doing something new; here it simply means that the triage stage considers the
sample something that has not been analyzed before. Analysis stages further down the pipe
`from Triage are allowed to be more expensive because it is assumed Triage has
`filtered out already analyzed malware, severely reducing the number of malware the
`expensive processes are run on. By slipping more malware past Triage and forcing
`the more expensive processes to run, the cost of analysis can be driven up, possibly
`prohibitively high.
`One of the ways this attack is accomplished is through automated morphing of
the malware’s code into a different but semantically equivalent form. Such malware
`is often called metamorphic or polymorphic. Before infecting a new computer, a
`rewriting engine changes what the code looks like through such means as altering
`control flow, utilizing different instructions, and adding instructions that have no
`semantic effect. The changes performed by the rewriting engine only change the
`look or syntax of the code and leave its function or semantics intact. The result is
`that each “generation” of metamorphic malware is functionally equivalent, but the
`code can be radically different.
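The idea can be sketched with a toy one-register "instruction set" and a rewriting engine that swaps instructions for equivalents and inserts no-ops. The instruction set and transformations are invented for illustration; real engines work on actual machine code. Each generation is syntactically different from its parent yet computes the same result:

```python
import random

def run(program, x=0):
    """Tiny interpreter for a one-register toy language."""
    for op, arg in program:
        if op == "add":
            x += arg
        elif op == "sub":
            x -= arg
        # "nop" changes nothing
    return x

def mutate(program, rng):
    """Semantics-preserving rewriter: swap equivalent instructions and
    sprinkle in nops, changing the syntax but not the behavior."""
    out = []
    for op, arg in program:
        if op == "add" and rng.random() < 0.5:
            out.append(("sub", -arg))   # add n is equivalent to sub -n
        else:
            out.append((op, arg))
        if rng.random() < 0.5:
            out.append(("nop", 0))      # dead instruction, no effect
    return out

original   = [("add", 5), ("sub", 2), ("add", 1)]
generation = mutate(original, random.Random(42))

# The new generation looks different but behaves identically.
print(generation != original, run(generation) == run(original))
```

A byte-level signature built for one generation will not match the next, even though the two are functionally identical.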
`While several subtle variations in definitions exist, we view the difference between
`metamorphic and polymorphic malware as where the rewriting engine lies. Metamor-
`phic malware contains its own, internal rewriting engine, that is, the malware binary
rewrites itself. Polymorphic malware, on the other hand, has a separate mutation
engine; a separate binary rewrites the malware binary. This mutation engine can
either be distributed with the malware (client side) or kept on a distribution server
that simply distributes a different version of the malware every time (server side).
`Metamorphic malware is more limited than polymorphic malware in the transfor-
`mations it can safely perform. Any rewriting engine is going to contain limitations
`as to what it can safely take as input. If the engine is designed to modify the con-
`trol flow of the program, for example, it will only be able to rewrite programs for
`which it can identify the existing control flow. Since metamorphic malware contains
`its own rewriting engine, the output of the rewriting engine must be constrained to
`acceptable input. Without this constraint, further mutations would not be possible.
Polymorphic malware, however, does not have this constraint. Since the rewriting
engine is separate and can thus always operate over the exact same input, its output
does not need to be constrained to acceptable input.
`
`
`
5.2 Packing
`
Packing is a process whereby an arbitrary executable is compressed and encrypted
into a “packed” form that must be decrypted and decompressed, i.e. “unpacked”,
before execution. This packed version of the executable is then packaged as data
inside another executable that will decrypt, decompress, and run the original code.
Thus, the end result is a new binary that looks very different from the original but,
when executed, performs the exact same task, albeit with some additional unpacking
work. A program that performs packing is referred to as a packer, and the newly
created executable is called the packed executable.
`Packing directly attacks Triage and static analysis. While packing a binary does not
`modify any of the malware’s code, it drastically modifies the binary itself, potentially
`even changing a number of statistical properties. If there is some randomization
`within the packing routine, a binary that appears truly unique will result every time
`the exact same malware is packed. Unless the Triage stage can first unpack the binary,
`it will not be able to match it to any known malware.
`Packing does more than simply complicate the triage stage, it also directly attacks
`any use of static analysis. As discussed in Sect. 4.2, the first step in static analysis
`is usually to disassemble the binary. Packing, however, often encrypts the original
`binary, preventing direct disassembly. A disassembler will not be able to mean-
`ingfully interpret the stored bits unless it is first unpacked and the original binary
`recovered.
Unpacking a program (recovering the original binary) is usually not a
straightforward task. As one might expect, there exist very complex packers
intentionally designed to foil unpacking. Some packers, for example, only decrypt a
single instruction at a time, while others never fully unpack the binary and instead
run the packed program in a virtual machine with a randomly created instruction
set.
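A deliberately simplistic sketch of the core idea, using single-byte XOR in place of real compression and encryption and omitting the executable unpacking stub: packing the same malware under two different keys yields byte-wise different files that nevertheless unpack to the identical original.

```python
def pack(binary: bytes, key: int) -> bytes:
    """Toy 'packer': XOR every byte with a one-byte key and prepend the
    key. Real packers compress, use real ciphers with randomly chosen
    keys, and prepend an executable unpacking stub."""
    return bytes([key]) + bytes(b ^ key for b in binary)

def unpack(packed: bytes) -> bytes:
    """Recover the original bytes, as an unpacking stub would at runtime."""
    key, body = packed[0], packed[1:]
    return bytes(b ^ key for b in body)

malware  = b"original malicious code"
packed_a = pack(malware, 0x41)   # two packings under different keys...
packed_b = pack(malware, 0xD3)

# ...produce byte-wise different files, defeating exact-match triage,
# yet both unpack to the identical original program.
print(packed_a != packed_b, unpack(packed_a) == unpack(packed_b) == malware)
```

An exact-hash or byte-signature check sees two unrelated files; only after unpacking does the shared original become visible again.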
`It might seem that simply detecting that an executable was packed would be
`sufficient to determine that it was malware. There are, however, legitimate uses for
`packing. First, packing is capable of reducing the overall size of the binary. The
`compression rate of the original binary is often large enough that even with the
`additional unpacking routine (which can be made fairly small), the packed binary
`is smaller in size than the original binary. Of course, when size is the only concern,
`the encryption part of packing is unnecessary. So, perhaps detecting encryption is
`sufficient? Unfortunately, no. Encryption has a legitimate application in protecting
`intellectual property. A software developer may compress and encrypt the executables
`they sell and ship to prevent a competitor from reversing the program and discovering
`trade secrets.
`
`
`
5.3 Obfuscation
`
`While packing attempts to create code that cannot be interpreted at all, obfuscation
`attempts to make extracting meaning from the code, statically or dynamically, as dif-
`ficult as possible. In general, obfuscation refers to writing or transforming a program
`into a form that hides its true functionality. The simplest example of a source code
`obfuscation is to give all variables meaningless names. Without descriptive names,
`the analyst must determine the purpose of each variable. At the binary level, examples
`of obfuscation include adding dead code (valid code that is never executed), inter-
`leaving several procedures within each other, and running all control flow through a
single switch statement (called control flow flattening). An in-depth treatment of
code obfuscation, including methods for deobfuscating the code, is given by Collberg
and Nagra [7].
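Control flow flattening can be illustrated at source level with a hypothetical pair of functions: both compute the same sum, but the flattened version routes every step through a single dispatch loop, hiding the original loop structure from a reader:

```python
def straightforward(n: int) -> int:
    """Clear version: sum the integers 1..n."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def flattened(n: int) -> int:
    """Same computation with flattened control flow: every step runs
    through one dispatch loop, obscuring the original loop structure."""
    state, total, i = 0, 0, 1
    while state != 3:
        if state == 0:        # loop test
            state = 1 if i <= n else 3
        elif state == 1:      # loop body
            total += i
            state = 2
        elif state == 2:      # increment
            i += 1
            state = 0
    return total

print(straightforward(10), flattened(10))
```

The two functions are behaviorally identical, but recovering the simple loop from the flattened form requires reconstructing the state machine first, which is exactly the extra work the obfuscation imposes on an analyst.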
`
`5.4 Tool Detection
`
`A major problem in dynamic analysis is malware detecting that it is being analyzed
`and modifying its behavior. Static analysis has a slight advantage in that the analyzed
`malware has no control over the analysis process. In dynamic analysis, however,
`the malware is actually being executed and so can be made capable of altering its
`behavior. Thus, malware authors will often check to see if any of the observation
`tools often used



