`Based Malware Detection
`
`Yanfang Ye
`Comodo Security Solutions
`Beijing, 100082, P.R.China
`yanfang@comodo.com
`
`Weiwei Zhuang
`Xiamen University
`Xiamen, 361005, P.R.China
`zhuangweiwei@gmail.com
`
`Shenghuo Zhu
`NEC Laboratories America
`Cupertino, CA, 95129, USA
`zsh@sv.nec-labs.com
`
`Tao Li
`School of Computer Science
`Florida International University
`Miami, FL, 33199, USA
`taoli@cs.fiu.edu

Egemen Tas, Umesh Gupta
Comodo Security Solutions
New Jersey, NJ, 07310, USA
{egemen,umesh}@comodo.com

Melih Abdulhayoglu
Comodo Security Solutions
New Jersey, NJ, 07310, USA
melih@comodo.com
`
`ABSTRACT
Because of the damage it does to Internet security, malware (such as
viruses, worms, trojans, spyware, backdoors, and rootkits) and its
detection have held the attention of both the anti-malware industry
and researchers for decades. Data mining methods such as Naive Bayes
and Support Vector Machines have been used for malware detection,
resting on the analysis of file content extracted from the file
samples, such as Application Programming Interface (API) calls,
instruction sequences, and binary strings. However, besides file
content, relations among file samples (for example, a "Downloader" is
always associated with many Trojans) can provide invaluable
information about the properties of file samples. In this paper, we
study how file relations can be used to improve malware detection
results and develop a file verdict system (named "Valkyrie") that
builds on a semi-parametric classifier model to combine file content
and file relations for malware detection. To the best of our
knowledge, this is the first work to use both file content and file
relations for malware detection. We perform a comprehensive
experimental study on a large collection of PE files obtained from
the clients of anti-malware products of Comodo Security Solutions
Incorporation to compare various malware detection approaches.
Promising experimental results demonstrate that the accuracy and
efficiency of our Valkyrie system outperform other popular
anti-malware software tools such as Kaspersky AntiVirus and McAfee
VirusScan, as well as alternative data mining based detection
systems. Our system has already been incorporated into the scanning
tool of Comodo's Anti-Malware software.
`
`Categories and Subject Descriptors
`I.2.6 [Artificial Intelligence]: Learning; D.4.6 [Operating Sys-
`tem]: Security and Protection - Invasive software
`
`General Terms
`Algorithms, Experimentation, Security
`
`Keywords
`cloud based malware detection, file content, file relation, semi-
`parametric model for learning from graph
`
1. INTRODUCTION
`1.1 Cloud Based Malware Detection
Malware is software designed to infiltrate or damage a computer
system without the owner's informed consent (e.g., viruses, worms,
trojans, spyware, backdoors, and rootkits) [23]. Numerous attacks
made by malware pose a major security threat to Internet users [8].
Hence, malware detection is one of the Internet security topics of
great interest [4, 25, 13, 15, 19, 20, 27, 24]. Currently, the most
significant line of defense against malware is anti-malware software
products, such as Kaspersky, McAfee and Comodo's Anti-Malware
software. Typically, these widely used malware detection tools use
the signature-based method to recognize threats. A signature is a
short string of bytes that is unique to each known malware, so that
future examples of that malware can be correctly classified with a
small error rate.
`
`Permission to make digital or hard copies of all or part of this work for
`personal or classroom use is granted without fee provided that copies are
`not made or distributed for profit or commercial advantage and that copies
`bear this notice and the full citation on the first page. To copy otherwise, to
`republish, to post on servers or to redistribute to lists, requires prior specific
`permission and/or a fee.
`KDD’11, August 21–24, 2011, San Diego, California, USA.
`Copyright 2011 ACM 978-1-4503-0813-7/11/08 ...$10.00.
`
`Figure 1: The Increment of Malware Samples (Data source:
`Comodo China Anti-Malware Lab).
`
However, driven by economic benefits, malware writers quickly invent
countermeasures against proposed malware analysis techniques, chief
among them being automated obfuscation [20]. Because of automated
obfuscation, today's malware samples are created at a rate of
thousands per day. Figure 1 shows the increasing trend of malware
samples in P.R.China from 2003 to 2010 (data provided by Comodo China
Anti-Malware Lab). The number of malware samples has increased
sharply since 2008. In fact, the number of malware samples in 2008
alone is much larger than the total of the previous five years.

IPR2023-00124
CrowdStrike EX1015 Page 1

Figure 2: The Workflow of Comodo Cloud Based Malware Detection Scheme.

Nowadays, malware samples increasingly employ techniques such as
polymorphism [2], metamorphism [1], packing, instruction
virtualization, and emulation to bypass signatures and defeat
attempts to analyze their inner mechanisms [20]. In order to remain
effective, many anti-malware vendors have moved from the classic
signature-based method to cloud (server) based detection. The
workflow of the cloud based detection method adopted by Comodo
Security Solutions Incorporation is shown in Figure 2 and can be
described as follows:
`
1. On the client side, users may receive new files from emails,
media, or IM (instant messaging) tools when they are using the
Internet.
`
`2. Anti-malware products will first use the signature set on the
`clients for scanning. If these new files are not detected by
`existing signatures, then they will be marked as “unknown”.
`
`3. In order to detect malware from the unknown file collection,
`file features (like file content as well as the file relations) are
`extracted and sent to Comodo Cloud Security Center.
`
`4. Based on these collected features, the classifier(s) on the cloud
`server will predict and generate the verdicts for the unknown
`file samples, either benign or malicious.
`
`5. Then the cloud server will send the verdict results to the
`clients and notify the clients immediately.
`
`6. According to the response from the cloud server, the scan-
`ning process can detect new malware samples and remove
`the threats.
`
`7. Due to the fast response from the cloud server, the client
`users can have most up-to-date security solutions.
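
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not Comodo's implementation: the `FileSample` fields, the hash-based signature lookup, and the convention that a positive classifier score means "malicious" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical verdict labels used only in this sketch.
BENIGN, MALICIOUS, UNKNOWN = "benign", "malicious", "unknown"


@dataclass
class FileSample:
    name: str
    sha1: str            # assumed identifier used for the signature lookup
    features: List[int]  # placeholder for extracted content/relation features


def client_scan(sample: FileSample, whitelist: Set[str], blacklist: Set[str]) -> str:
    """Step 2: signature-based scan on the client."""
    if sample.sha1 in blacklist:
        return MALICIOUS
    if sample.sha1 in whitelist:
        return BENIGN
    return UNKNOWN


def cloud_verdict(features: List[int], classifier: Callable[[List[int]], float]) -> str:
    """Step 4: the cloud-side classifier scores the features (score > 0: malicious)."""
    return MALICIOUS if classifier(features) > 0 else BENIGN


def scan(samples: Iterable[FileSample], whitelist: Set[str], blacklist: Set[str],
         classifier: Callable[[List[int]], float]) -> Dict[str, str]:
    """Steps 2-6: scan locally and send only the unknown files' features to the cloud."""
    verdicts = {}
    for s in samples:
        verdict = client_scan(s, whitelist, blacklist)
        if verdict == UNKNOWN:
            verdict = cloud_verdict(s.features, classifier)  # steps 3-5
        verdicts[s.name] = verdict
    return verdicts
```

Only files that the local signature set cannot resolve reach the cloud classifier, which keeps the traffic to the server limited to the gray list.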
`
To sum up, with the cloud-based architecture, malware detection is
conducted in a client-server manner. At the client (user) side, the
signature-based method authenticates valid software programs from a
whitelist and blocks invalid software programs from a blacklist. At
the cloud (server) side, any unknown software (i.e., the gray list)
is classified, and the verdicts are returned to the clients within
seconds. The gray list, which contains unknown software programs that
could be either benign or malicious, used to be authenticated or
rejected manually by malware analysts. With the development of
malware writing techniques, the number of file samples in the gray
list that need to be analyzed by malware analysts on a daily basis is
constantly increasing. For example, the gray list collected by the
Anti-Malware Lab of Comodo Security Solutions Incorporation usually
contains about 500,000 file samples per day. Therefore, there is an
urgent need for the anti-malware industry to develop intelligent
methods for efficient and effective malware detection at the cloud
(server) side.
Recently, many research efforts have been devoted to developing
intelligent malware detection systems [4, 25, 13, 15, 19, 20, 27,
24]. In these systems, the detection process is generally divided
into two steps: feature extraction and classification. In the first
step, various features such as Application Programming Interface
(API) calls [27] and program strings [13, 19, 18] are extracted to
capture the characteristics of the file samples. In the second step,
intelligent classification techniques such as decision trees [17],
Naive Bayes, and associative classifiers [24, 13, 19, 27] are used to
automatically classify the file samples into different classes based
on computational analysis of the feature representations. These
intelligent malware detection systems vary in their use of feature
representations and classification methods. For example, IMDS [27]
performs association classification on Windows API calls extracted
from executable files, while Naive Bayes methods are applied to the
extracted strings and byte sequences in [19].
These intelligent techniques have had isolated successes in
classifying particular sets of malware samples, but they have
limitations that leave considerable room for improvement. In
particular, none of these techniques has taken the relationships
among file samples into consideration for malware detection. Simply
treating files as independent samples allows many off-the-shelf
classification tools to be directly adapted for malware
classification. However, the relationships among file samples may
imply interdependence among them, and thus the usual i.i.d.
(independent and identically distributed) assumption may not hold for
malware samples. As a result, ignoring the relations among file
samples is a significant limitation of current malware classification
methods.
`1.2 Relations Among File Samples
For malware detection, the relations among file samples provide
invaluable information about their properties. We use some examples
for illustration. Based on the file lists collected from clients, we
construct a co-occurrence graph to describe the relations among file
samples. Generally, two files are related if they are shared by many
clients (or equivalently, file lists). As shown in Figure 3, the file
"yy(1).exe" is associated with many trojans, which are marked in
purple. In fact, this "yy(1).exe" file is a kind of Trojan-Downloader
malware. Trojan-Downloader refers to any malicious software that
downloads and installs multiple unwanted applications of adware and
malware from remote servers. Malware samples of this type are spread
from malicious websites or by emails as attachments or links, and are
installed secretly without the user's consent. Therefore, from the
relations shown in Figure 3, we can infer that if an unknown file
always co-occurs with many kinds of trojans on users' computers, then
it is most likely a malicious Trojan-Downloader file.
`
`Figure 3: File Relations Between a Trojan-Downloader and its
`Related Trojans.
`
Another example, showing the relations among benign files, is
illustrated in Figure 4. An unknown file "everest.exe" can plausibly
be recognized as benign since it is always associated with known
benign files, which are marked in green. In fact, "everest.exe" is a
benign system diagnostic application that always co-occurs with its
related Dynamic Link Library files, such as "everest_start.dll",
"everest_mondiag.dll", "everest_rcs.dll" and so on.
Sometimes it is not easy to determine whether a file is malicious
based solely on its file content. According to the experience and
knowledge of our anti-malware experts, file relations among samples
can be a novel and practical feature representation for malware
detection. Some malware samples may have stronger connections with
benign files than with malicious ones. In such cases, those file
samples might be infected files. These unexpected relations can be
filtered out and removed, because the infected samples can be
detected independently using the infected file detector developed by
our anti-malware experts.
`1.3 Combining File Content and File Relation
To improve the performance of file sample classification for malware
detection, in this paper we utilize both file content and file
relation information. However, relation information and file content
have different properties. Relation information provides a graph
structure in the data and induces pairwise similarity between
objects, while the file content provides inherent characteristic
information about the file samples. Although the relation information
and the file content can each be used independently to classify file
samples, classification algorithms that make use of both
simultaneously should be able to achieve better performance.

Figure 4: File Relations Between a Benign Application and its
Related Dynamic Link Library files.

The problem of combining content information and relation
information (i.e., link information) has been widely studied for web
document categorization in the data mining and information retrieval
communities [26, 9]. The approaches for combining content and link
information generally fall into two categories: (1) feature
integration, which treats the relation information as additional
features and enlarges the feature representation [3, 11, 16]; and (2)
kernel integration, which integrates the data at the similarity
computation or kernel level [10, 14]. However, both types of
approaches have limitations: feature integration may degrade the
quality of information, as file relations and file content typically
have different properties, while kernel integration fails to explore
the correlation and the inherent consistency between the content
information and the relation information [31].
`1.4 Contributions of Our Paper
In this paper, we propose a semi-parametric classification model for
combining file content and file relations. The semi-parametric model
consists of two components: a parametric component reflecting file
content information and a non-parametric component reflecting file
relation information. The model seamlessly integrates these two
components and formulates the classification problem using the graph
regularization framework. Our model can be viewed as an extension of
recently developed joint-embedding approaches, which aim to seek a
common low-dimensional embedding via joint factorization of both the
content and relation information [31, 5, 30]. However, unlike the
joint-embedding approaches, our model does not explicitly infer the
embedding and is directly optimized for classification. We develop a
file verdict system (named "Valkyrie") using the proposed model to
integrate file content and file relations for malware detection. To
the best of our knowledge, this is the first work to use both file
content and file relations for malware detection. In short, our
Valkyrie system has the following major traits:
• Novel Usage of File Relations: Unlike previous studies on malware
detection, we use not only file content but also file relations.
• A Principled Model for Combining File Content and File Relations:
We propose a semi-parametric classification model to seamlessly
combine file content and file relations, and formulate the
classification problem using the graph regularization framework.
• A Practical System Developed for Real Industry Application: Based
on 37,930 clients, we obtain 30,950 malware samples, 225,830 benign
files and 434,870 unknown files from Comodo Cloud Security Center. We
build a practical system for malware detection and provide a
comprehensive experimental study.
`
All these traits make our Valkyrie system a practical solution for
automatic malware detection. Case studies on a large, real daily
malware collection from Comodo Cloud Security Center demonstrate the
effectiveness and efficiency of our Valkyrie system. As a result, our
Valkyrie system has already been incorporated into the scanning tool
of Comodo's Anti-Malware software.
`1.5 Organization of The Paper
The rest of this paper is organized as follows. Section 2 presents an
overview of our Valkyrie system. Section 3 describes the feature
extraction and representation. Section 4 introduces the proposed
semi-parametric model, which combines file content and file relations
for malware detection. In Section 5, using the daily data collection
obtained from Comodo Cloud Security Center, we systematically
evaluate the effectiveness and efficiency of our Valkyrie system in
comparison with other proposed classification methods, as well as
some popular anti-malware software such as Kaspersky and NOD32.
Section 6 presents the details of system development and operation.
Section 7 discusses the related work. Finally, Section 8 concludes
the paper.
`
`2. SYSTEM ARCHITECTURE
`Figure 5 shows the system architecture of our Valkyrie system.
`We briefly describe each component below.
`
`• Training:
`1. User File List and File Sample Collector: It collects
`the file lists from the clients which contain the poten-
`tial relations between file samples, together with the file
`samples.
2. File Content Feature Extractor: Application Programming
Interface (API) calls reflect the behavior of program code
well, and they can be extracted more efficiently than
dynamic feature representations. Therefore, our file
content feature extractor extracts the API calls from the
collected malicious and benign Windows Portable
Executable (PE) files. (See Section 3.1 for details.)
3. File Relation Feature Extractor: Based on the collected
file lists from clients, a co-occurrence graph is
constructed to describe the file relations. Note that many
unexpected relations (like relations between infected
samples and benign files) are removed using infected
file detectors. (See Section 3.2 for details.)
`4. Semi-Parametric Model Based Classifier: Our pro-
`posed semi-parametric model integrates file content and
`relation information and formulates the classification
`problem using the graph regularization framework. (See
`Section 4 for details.)
`
`Figure 5: The System Architecture of Valkyrie.
`
`• Prediction: On the clients, our Comodo Anti-Malware soft-
`ware products authenticate valid software from a whitelist
`and block invalid software from a blacklist using the signature-
`based method. The gray list, containing unknown software
`programs which could be either normal or malicious, is then
`fed into our Valkyrie system. After file content and file rela-
`tion feature extractions, the semi-parametric model is applied
`to the gray list for prediction.
`
`3. FEATURE EXTRACTION
Our Valkyrie system operates directly on Windows Portable Executable
(PE) files. PE is designed as a common file format for all flavors of
the Windows operating system, and PE malware accounts for the
majority of the malware that has emerged in recent years [27]. In
this section, we introduce the file content and file relation feature
extraction methods we adopted.
`3.1 File Content
We extract the Application Programming Interface (API) calls from the
Import Tables [27] of collected malicious and benign PE files,
convert them to a group of 32-bit global IDs (for example, the API
"MAPI32.MAPIReadMail" is encoded as 0x00000F12) as the content
features, and store these features in the signature database.
`A sample file content signature database is shown in Figure 6, in
`which there are 6 fields: record ID, PE file name, file type ("-1"
`represents benign file while "1" is for malicious file), called APIs
`name, called API ID, the total number of called API functions.
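
As a rough illustration of this encoding, the sketch below maps imported API names to integer IDs and builds a binary content feature vector. It assumes the import names have already been read from the PE Import Table; the ID assignment (consecutive IDs in sorted order) is a made-up scheme for the example and is not the global ID table used by Valkyrie.

```python
from typing import Dict, Iterable, List


def build_api_index(api_names: Iterable[str], first_id: int = 1) -> Dict[str, int]:
    """Assign each distinct API name an integer ID (hypothetical scheme: sorted order)."""
    return {name: first_id + i for i, name in enumerate(sorted(set(api_names)))}


def encode_imports(imports: Iterable[str], api_index: Dict[str, int]) -> List[str]:
    """Return the IDs of a file's imported APIs as sorted 32-bit hex strings.

    APIs that are missing from the index are skipped.
    """
    ids = sorted({api_index[name] for name in imports if name in api_index})
    return [f"0x{i:08X}" for i in ids]


def to_binary_vector(imports: Iterable[str], api_index: Dict[str, int]) -> List[int]:
    """Binary content feature vector: one slot per known API, 1 if the file imports it."""
    if not api_index:
        return []
    first_id = min(api_index.values())
    vec = [0] * len(api_index)
    for name in imports:
        if name in api_index:
            vec[api_index[name] - first_id] = 1
    return vec
```

With tens of thousands of distinct APIs, most entries of such a vector are zero, so a sparse representation (for instance, the sorted ID list alone) would be the more practical storage format.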
`3.2 File Relations
Based on the collected file lists from clients, we construct a
co-occurrence graph to describe the relations among file samples.
Generally, two files are related if they are shared by many clients
(or equivalently, file lists). Note that many unexpected relations
(like relations between infected samples and benign files) are first
removed using infected file detectors.
The co-occurrence graph is defined as $G = \langle V, E \rangle$,
where $V$ is the set of file samples. Given two file samples $v_i$
and $v_j$, let $S_i$ be the set of file lists containing $v_i$ and
$S_j$ be the set of file lists containing $v_j$. Then the similarity
between $v_i$ and $v_j$ is computed as

$$\mathrm{sim}(v_i, v_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}, \tag{1}$$

where $|S|$ denotes the size of a set $S$. If the similarity between
a pair of file samples is greater than 0, then there is an edge
between them, and $E$ is the set of edges between vertices.
An example graph is shown in Figure 7, illustrating the real
relations between some file samples, where the size of each edge
indicates its weight.

Figure 6: Sample File Content Features in the Signature
Database.

Figure 7: An example graph of real relations between some
file samples (purple color-malware samples, green color-benign
files, transparent color-unknown files).

4. A SEMI-PARAMETRIC MODEL FOR COMBINING FILE CONTENT AND FILE RELATIONS
In this section, we propose a semi-parametric model to combine file
content and file relations for classification using the graph
regularization framework. The semi-parametric model consists of two
components: a parametric component reflecting file content
information and a non-parametric component reflecting file relation
information.
Let $f$ be a vector, each of whose elements is the label (i.e.,
malicious or benign) of a file example to be predicted. The vector
$f$ is generated from two parts, a parametric one and a
non-parametric one. The parametric component follows a linear model,
$X^\top w$, where each column of the matrix $X$ is the content
feature vector of a file example, and $w$ is the vector of
coefficients. The non-parametric part is a vector $h$, each element
of which corresponds to a value of a file example. Combining the two
parts, we have $f = X^\top w + h$.
Now consider the labeling information vector $y$. Let $y_i = 1$ if
the $i$-th file sample is malicious, $y_i = -1$ if the $i$-th file
sample is benign, and $y_i = 0$ if the $i$-th file sample is
unlabeled. We can use hinge loss for labeled file samples as in
Support Vector Machines, or use L2 loss for labeled data as in least
squares problems. For simplicity, we follow [29] and use L2 loss on
all data points, i.e., $\|y - f\|^2$. We also consider the global
consistency on the co-occurrence graph [29], $f^\top L f$, where the
symmetric matrix $L$ is the normalized Laplacian of the graph. Thus
the total loss is

$$\frac{1}{2}\|y - f\|^2 + \frac{\alpha}{2} f^\top L f, \tag{2}$$

where $\alpha$ is the weight for combining the two parts of
information, and the factor $\frac{1}{2}$ is added just for
convenience.
To limit the model complexity, we add the regularization terms for
$w$ and $h$, which are

$$\frac{1}{2\beta} w^\top w + \frac{1}{2\gamma} h^\top h, \tag{3}$$

where $\beta$ and $\gamma$ are the regularization parameters.
Putting Eq. (2) and Eq. (3) together, we have the optimization
problem:

$$\min_{f, w, h} \; \frac{1}{2}\|y - f\|^2 + \frac{\alpha}{2} f^\top L f + \frac{1}{2\beta} w^\top w + \frac{1}{2\gamma} h^\top h \quad \text{subject to} \quad f = X^\top w + h. \tag{4}$$

To solve Eq. (4), we introduce the Lagrange multiplier $\xi$:

$$\mathcal{L}(f, w, h; \xi) = \frac{1}{2}\|y - f\|^2 + \frac{\alpha}{2} f^\top L f + \frac{1}{2\beta} w^\top w + \frac{1}{2\gamma} h^\top h + \xi^\top (f - X^\top w - h).$$

As $\partial\mathcal{L}/\partial w = 0$,
$\partial\mathcal{L}/\partial h = 0$,
$\partial\mathcal{L}/\partial \xi = 0$, and
$\partial\mathcal{L}/\partial f = 0$, we have

$$w = \beta X \xi \tag{5}$$
$$h = \gamma \xi \tag{6}$$
$$f = X^\top w + h \tag{7}$$
$$y = f + \alpha L f + \xi \tag{8}$$

Plugging Eqs. (5, 6) into Eq. (7), we have

$$f = (\beta X^\top X + \gamma I)\,\xi,$$

or

$$\xi = (\beta X^\top X + \gamma I)^{-1} f.$$

Plugging it into Eq. (8), we have

$$f = \left(I + \alpha L + (\beta X^\top X + \gamma I)^{-1}\right)^{-1} y. \tag{9}$$

This model is an extension of [29] that considers the parametric part.
Note that if there are no content features, then $f = h$ and this
`model reduces to traditional semi-supervised learning. Different
`from [31] and [30], this model does not infer the embedding.
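
To make Eq. (1) and Eq. (9) concrete, here is a small pure-Python sketch that builds the co-occurrence weights from file lists, forms the normalized Laplacian, and evaluates the closed-form solution on a toy problem. The file names, the feature matrix, and the dense Gaussian-elimination solver are all choices made for this illustration; they are not taken from the Valkyrie implementation, and dense solves are only sensible for tiny inputs.

```python
from itertools import combinations


def jaccard(a, b):
    """Eq. (1): |a & b| / |a | b| for two sets of file-list indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cooccurrence_weights(file_lists, files):
    """Symmetric matrix W with W[i][j] = sim(v_i, v_j) and a zero diagonal."""
    member = {f: {k for k, fl in enumerate(file_lists) if f in fl} for f in files}
    n = len(files)
    W = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        W[i][j] = W[j][i] = jaccard(member[files[i]], member[files[j]])
    return W


def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2}; isolated vertices get a zero scaling factor."""
    n = len(W)
    s = [sum(row) ** -0.5 if sum(row) > 0 else 0.0 for row in W]
    return [[(1.0 if i == j else 0.0) - s[i] * W[i][j] * s[j] for j in range(n)]
            for i in range(n)]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def predict(X, L, y, alpha=0.1, beta=0.1, gamma=0.1):
    """Eq. (9): f = (I + alpha L + (beta X^T X + gamma I)^{-1})^{-1} y, X is p x n."""
    p, n = len(X), len(y)
    K = [[beta * sum(X[k][i] * X[k][j] for k in range(p)) + (gamma if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # Column j of K^{-1} is the solution of K z = e_j.
    Kinv_cols = [solve(K, [1.0 if i == j else 0.0 for i in range(n)]) for j in range(n)]
    A = [[(1.0 if i == j else 0.0) + alpha * L[i][j] + Kinv_cols[j][i]
          for j in range(n)] for i in range(n)]
    return solve(A, y)


# Toy data (hypothetical): u1.exe co-occurs with malware, u2.exe with a benign file.
files = ["mal.exe", "u1.exe", "ben.exe", "u2.exe"]
file_lists = [{"mal.exe", "u1.exe"}, {"ben.exe", "u2.exe"}, {"ben.exe", "u2.exe"}]
X = [[1, 1, 0, 0], [0, 0, 1, 1]]   # two made-up API features, one column per file
y = [1, 0, -1, 0]                  # +1 malicious, -1 benign, 0 unlabeled
L = normalized_laplacian(cooccurrence_weights(file_lists, files))
f = predict(X, L, y)
```

In this toy setting the unlabeled file that shares a file list with the malicious sample receives a positive score and the one that co-occurs with the benign file receives a negative score, which is the behavior the model is designed to produce.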
Computation Issues: We need to solve

$$\left(I + \alpha L + (\beta X^\top X + \gamma I)^{-1}\right) f = y. \tag{10}$$

Let the size of $X$ be $p \times n$, where $p$ is the number of
features and $n$ is the number of instances, and let the average
number of nonzero elements per row of $L$ be $\kappa$. As long as
$p \ll n$, we can apply the Woodbury identity, and Eq. (10) becomes

$$\left((1 + \gamma^{-1}) I + \alpha L - \gamma^{-1} X^\top (\gamma\beta^{-1} I + X X^\top)^{-1} X\right) f = y. \tag{11}$$

To solve Eq. (11), we can use the conjugate gradient method.
Computing $X X^\top$ is $O(np^2)$, the inverse of
$(\gamma\beta^{-1} I + X X^\top)$ is $O(p^3)$, and we precompute
$(\gamma\beta^{-1} I + X X^\top)^{-1} X$ in $O(np^2 + p^3)$. In each
iteration of conjugate gradient, we compute
$\left((1 + \gamma^{-1}) I + \alpha L - \gamma^{-1} X^\top (\gamma\beta^{-1} I + X X^\top)^{-1} X\right) v$
for some vector $v$. The computation of each iteration is
$O(n(p + \kappa))$. The convergence rate depends on the condition
number of the left-hand-side matrix of
`Eq. (11).
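
The following pure-Python sketch illustrates this computation: a matrix-free operator built with the Woodbury form $\gamma^{-1} I - \gamma^{-1} X^\top (\gamma\beta^{-1} I + X X^\top)^{-1} X$ of $(\beta X^\top X + \gamma I)^{-1}$, plugged into a textbook conjugate gradient loop. It is an illustration under simplifying assumptions, not the production solver: all helper names are invented here, and $L$ is stored densely, so a matrix-vector product costs $O(n^2)$ rather than the $O(n\kappa)$ a sparse representation would give.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def solve_dense(A, b):
    """Gauss-Jordan elimination with partial pivoting for the small p x p system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def make_operator(X, L, alpha, beta, gamma):
    """Return v -> ((1 + 1/gamma) I + alpha L - (1/gamma) X^T S^{-1} X) v,
    where S = (gamma/beta) I + X X^T and X is p x n."""
    p, n = len(X), len(X[0])
    S = [[dot(X[a], X[b]) + (gamma / beta if a == b else 0.0) for b in range(p)]
         for a in range(p)]
    # Precompute P = S^{-1} X column by column (the p x n matrix named in the text).
    P_cols = [solve_dense(S, [X[a][j] for a in range(p)]) for j in range(n)]

    def apply(v):
        z = [sum(P_cols[j][a] * v[j] for j in range(n)) for a in range(p)]  # P v
        corr = [sum(X[a][i] * z[a] for a in range(p)) for i in range(n)]    # X^T (P v)
        Lv = matvec(L, v)
        return [(1.0 + 1.0 / gamma) * v[i] + alpha * Lv[i] - corr[i] / gamma
                for i in range(n)]

    return apply


def conjugate_gradient(apply_A, b, tol=1e-12, max_iter=1000):
    """Solve A x = b for a symmetric positive definite operator A."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    d = list(r)          # search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 <= tol:
            break
        Ad = apply_A(d)
        step = rs / dot(d, Ad)
        x = [xi + step * di for xi, di in zip(x, d)]
        r = [ri - step * adi for ri, adi in zip(r, Ad)]
        rs_new = dot(r, r)
        d = [ri + (rs_new / rs) * di for ri, di in zip(r, d)]
        rs = rs_new
    return x
```

Conjugate gradient is applicable because the operator is symmetric positive definite: $L$ is positive semidefinite and $(\beta X^\top X + \gamma I)^{-1}$ is positive definite for $\beta, \gamma > 0$.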
`
5. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we conduct three sets of experimental studies using
our data collection obtained from Comodo Cloud Security Center to
fully evaluate the performance of our Valkyrie system: (1) in the
first set of experiments, we evaluate the effectiveness of the file
content based classifier and the file relation based classifier for
malware detection; (2) in the second set of experiments, we evaluate
our proposed semi-parametric model based classifier by comparing it
with alternative methods for combining file content and file
relations; (3) in the last set of experiments, we compare our
Valkyrie system with some popular anti-malware software products such
as Kaspersky Anti-Virus, McAfee VirusScan, and Bitdefender.
`5.1 Experimental Setup
We measure the malware detection performance of different classifiers
using the following evaluation measures:
• True Positive (TP): the number of samples correctly classified as
malicious files.
• True Negative (TN): the number of samples correctly classified as
benign files.
• False Positive (FP): the number of samples mistakenly classified as
malicious files.
• False Negative (FN): the number of samples mistakenly classified as
benign files.
• Accuracy (ACY): $\frac{TP + TN}{TP + TN + FP + FN}$.
• Recall (RC): $\frac{TP + TN + FP + FN}{\text{the number of files in the total file collection}}$.
The dataset we obtained from Comodo Cloud Security Center includes
37,930 user file lists that describe file relations between 30,950
malware samples, 225,830 benign files and 434,870 unknown files
(according to analysis by the anti-malware experts of Comodo Security
Lab, 39,138 of the unknown files are malware, while 395,732 are
benign files). We also have the file relation information for all the
file samples. Using the feature extraction methods described in
Section 3 on this data collection: 1) from the API calls extracted
from the known file samples, we obtain 210,850 training file content
vectors with 86,757 dimensions (because some file samples have
invalid Import Tables, API calls could be extracted from only 23,610
malicious files and 187,240 benign files); 2) from the collected user
file lists, after excluding the unexpected relations (like relations
between infected samples and benign files), we construct a graph with
248,986 vertices (29,006 represent malicious files, while 219,980
represent benign samples) and 356,134 edges.
`All the experimental studies are conducted under the environ-
`ment of Windows 7 operating system plus Intel(R) Core(TM) i3
`CPU and 4 GB of RAM.
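
For concreteness, the two ratio measures can be computed as in the short sketch below. It reads RC as the share of the file collection that received a verdict, which is the reading that reproduces the ACY and RC figures reported in the result tables; the function names are introduced here only for illustration.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """ACY = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp: int, fp: int, tn: int, fn: int, total_files: int) -> float:
    """RC = (TP + TN + FP + FN) / total_files, the fraction of files with a verdict."""
    return (tp + tn + fp + fn) / total_files


# CR_SPM testing row of Table 2, evaluated on the 434,870 unknown files.
acy = accuracy(34_675, 563, 356_991, 1_798)
rc = recall(34_675, 563, 356_991, 1_798, total_files=434_870)
print(f"ACY = {acy:.4f}, RC = {rc:.4f}")  # ACY = 0.9940, RC = 0.9061
```

Note that RC defined this way measures coverage rather than the usual TP / (TP + FN), so it should not be compared directly with recall values reported elsewhere.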
5.2 Comparisons of File Content and File Relation Based Classifiers
In this set of experiments, we evaluate the effectiveness of malware
detection results based on different feature representations: file
content and file relations. The large collection of file sample data,
along with its high dimensionality and sparseness, requires the
classification methods for malware detection to be scalable and
robust. Because it can handle a large feature space without
overfitting, the Support Vector Machine (SVM) has shown
state-of-the-art results in classification problems [22, 12, 28].
Therefore, in this section, we use SVM [7] as the base classifier.
For file content based classification, SVM is applied to the API call
features. For file relation based classification, we treat file
relations as the features of each file sample, i.e., the i-th feature
is the similarity with the i-th file sample. Linear SVM [7] is used
in both cases, and the regularization parameter of SVM is selected
using cross-validation.
From Table 1 and Figure 8, we observe that the accuracy of the file
relation based classifier is similar to that of the file content
based classifier, while the recall of the file relation based
classifier is much higher for unknown file verdicts.
`
Training    TP      FP     TN       FN     ACY     RC
F_Content   23,585  32     187,208  25     0.9997  0.8211
F_Relation  27,018  880    219,100  1,988  0.9885  0.9696

Testing     TP      FP     TN       FN     ACY     RC
F_Content   23,358  2,230  236,196  6,423  0.9677  0.6168
F_Relation  25,969  6,880  312,100  9,988  0.9525  0.8162

Table 1: Comparisons of File Content and File Relation Based
Classifiers. Remark: "F_Content"-File Content based Classifier,
"F_Relation"-File Relation based Classifier.
`
Figure 8: Comparisons of File Content and File Relation Based
Classifiers. Remark: "UM"-the number of malware from unknown file
collection still unrecognized by classifier, "UB"-the number of
benign files from unknown file collection still unrecognized by
classifier.

5.3 Comparisons of Different Classifiers Combining File Content and File Relation
In this section, we compare our semi-parametric model with the
following methods of combining file relations and file content
information: (1) SVM on feature integration: we combine the content
features and the relation features and then apply SVM on the enlarged
feature space. We use different weights for these two sets of
features, and the weights are selected using cross-validation. (2)
SVM on kernel integration: we average the linear kernel on the
content and the relation similarity (note that the co-occurrence
graph can be viewed as a kernel) and apply SVM on the composite
kernel. (3) Joint factorization: we use supervised joint matrix
factorization on both the content and relation information and then
perform SVM on the resulting low-dimensional embedding. For more
details on this method, please refer to [31]. For our semi-parametric
model, the parameters α, β, and γ are all set to 0.1.
`The results as shown in Table 2 and Figure 9 demonstrate that:
`(1) Combining file relation with file content can improve the classi-
`fication effectiveness for malware detection; (2) Combining file re-
`lation with file content, our proposed semi-parametric model based
`classifier outperforms other alternative methods.
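
The first two baselines compared above differ only in where the two information sources are merged. The sketch below shows that difference on plain Python lists; the weights and helper names are illustrative assumptions, and the SVM training step itself (which would consume either the enlarged feature vectors or the composite kernel) is not shown.

```python
from typing import List, Sequence


def feature_integration(content_vec: Sequence[float], relation_vec: Sequence[float],
                        w_content: float = 1.0, w_relation: float = 1.0) -> List[float]:
    """Baseline (1): scale each feature set by its weight and concatenate them."""
    return [w_content * x for x in content_vec] + [w_relation * s for s in relation_vec]


def linear_kernel(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def kernel_integration(content_vecs: Sequence[Sequence[float]],
                       similarity: Sequence[Sequence[float]]) -> List[List[float]]:
    """Baseline (2): average the linear content kernel and the relation similarity matrix."""
    n = len(content_vecs)
    return [[0.5 * (linear_kernel(content_vecs[i], content_vecs[j]) + similarity[i][j])
             for j in range(n)] for i in range(n)]
```

In the experiments the feature weights are chosen by cross-validation, whereas this sketch simply takes them as arguments.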
`
`Figure 9: Comparisons of Different Classifiers Combining File
`Content and File Relation. Remark: "F_Content"-File Con-
`tent based Classifier, "F_Relation"-File Relation based Classi-
`fier, "CR_C1"-SVM on Feature Integration, "CR_C2"-SVM
`on Kernel Integration, "CR_C3"-Joint-factorization Classifier,
`"CR_SPM"-our proposed Semi-Parametric Model.
`
Testing     TP      FP     TN       FN     ACY     RC
F_Content   23,358  2,230  236,196  6,423  0.9677  0.6168
F_Relation  25,969  6,880  312,100  9,988  0.9525  0.8162
CR_C1       29,002  7,454  350,100  7,471  0.9621  0.9061
CR_C2       28,123  8,358  349,196  8,350  0.9576  0.9061
CR_C3       30,789  7,572  349,982  5,486  0.9664  0.9061
CR_SPM      34,675  563    356,991  1,798  0.9940  0.9061

Table 2: Comparisons of Different Classifiers Combining File
Content and File Relation. Remark: "F_Content"-File Content based
Classifier, "F_Relation"-File Relation based Classifier,
"CR_C1"-SVM on Feature Integration, "CR_C2"-SVM on Kernel
Integration, "CR_C3"-Joint-factorization Classifier, "CR_SPM"-our
proposed Semi-Parametric Model.

5.4 Comparisons with Different AV Vendors
In this section, we apply the Valkyrie system in real applications to
evaluate its malware detection effectiveness and efficiency on the
daily data collection described in Section 5.1.
5.4.1 Comparisons of Detection Effectiveness between Different AV Vendors
Based on 434,870 unknown files (according to analysis by the
anti-malware experts of Comodo Security Lab, 39,138 of them are
malware, while 395,732 of them are benign files), we first compare
the malware detection effectiveness of the Valkyrie system with some
popular AV products: Kaspersky (Kasp), NOD32, McAfee, Bitdefender
(BD) and Avira. For comparison purposes, we use the latest signature
database of every anti-virus scanner as of the same day (Feb 14th,
2011). Table 3 and Figure 10 show that the malware detection
effectiveness of our Valkyrie system exceeds that of the other
popular AV products on our large data collection.

AV        TP      FP     TN       FN     ACY     RC
Kasp      27,954  711    0        0      0.9752  0.0659
Nod32     26,589  923    0        0      0.9665  0.0633
McAfee    23,951  1,011  0        0      0.9595  0.0574
BD        28,763  780    0        0      0.9736  0.0679
Avira     29,009  1,887  0        0      0.9389  0.0710
Valkyrie  34,675  563    356,991  1,798  0.9940  0.9061

Table 3: The malware detection results of different AV software
products on the collection with 434,870 unknown files.

Figure 10: The comparisons of malware detection results of
different AV software products on the collection with 434,870
unknown files.
`
`
5.4.2 Comparisons of Detection Efficiency between Different AV Vendors
`In this set



