Combining File Content and File Relations for Cloud
`Based Malware Detection
`
`Yanfang Ye
`Comodo Security Solutions
`Beijing, 100082, P.R.China
`yanfang@comodo.com
`
`Weiwei Zhuang
`Xiamen University
`Xiamen, 361005, P.R.China
`zhuangweiwei@gmail.com
`
`Shenghuo Zhu
`NEC Laboratories America
`Cupertino, CA, 95129, USA
`zsh@sv.nec-labs.com
`
`Tao Li
`School of Computer Science
`Florida International University
`Miami, FL, 33199, USA
taoli@cs.fiu.edu

Egemen Tas, Umesh Gupta
Comodo Security Solutions
New Jersey, NJ, 07310, USA
{egemen,umesh}@comodo.com

Melih Abdulhayoglu
Comodo Security Solutions
New Jersey, NJ, 07310, USA
melih@comodo.com
`
`ABSTRACT
Because of the damage malware causes to Internet security, the
detection of malware (such as viruses, worms, trojans, spyware,
backdoors, and rootkits) has attracted the attention of both the
anti-malware industry and researchers for decades. Resting on the
analysis of file content extracted from the file samples, such as
Application Programming Interface (API) calls, instruction sequences,
and binary strings, data mining methods such as Naive Bayes and
Support Vector Machines have been used for malware detection.
However, besides file content, relations among file samples (for
example, a “Downloader” is usually associated with many Trojans) can
provide invaluable information about the properties of file samples.
In this paper, we
`study how file relations can be used to improve malware detec-
`tion results and develop a file verdict system (named “Valkyrie”)
`building on a semi-parametric classifier model to combine file con-
`tent and file relations together for malware detection. To the best
`of our knowledge, this is the first work of using both file content
`and file relations for malware detection. A comprehensive exper-
`imental study on a large collection of PE files obtained from the
`clients of anti-malware products of Comodo Security Solutions In-
`corporation is performed to compare various malware detection ap-
`proaches. Promising experimental results demonstrate that the ac-
`curacy and efficiency of our Valkyrie system outperform other pop-
`ular anti-malware software tools such as Kaspersky AntiVirus and
`McAfee VirusScan, as well as other alternative data mining based
`detection systems. Our system has already been incorporated into
`the scanning tool of Comodo’s Anti-Malware software.
`
`Categories and Subject Descriptors
`I.2.6 [Artificial Intelligence]: Learning; D.4.6 [Operating Sys-
`tem]: Security and Protection - Invasive software
`
`General Terms
`Algorithms, Experimentation, Security
`
`Keywords
`cloud based malware detection, file content, file relation, semi-
`parametric model for learning from graph
`
1. INTRODUCTION
`1.1 Cloud Based Malware Detection
`Malware is software designed to infiltrate or damage a computer
`system without the owner’s informed consent (e.g., virus, worms,
`trojans, spyware, backdoors, and rootkits) [23]. Numerous attacks
`made by the malware pose a major security threat to Internet users
[8]. Hence, malware detection is an Internet security topic of great
interest [4, 25, 13, 15, 19, 20, 27, 24]. Currently, the most
significant line of defense against malware is anti-malware software
products, such as Kaspersky, McAfee and Comodo’s Anti-Malware
software. Typically, these widely used malware detection tools use
the signature-based method to recognize threats. A signature is a
short string of bytes that is unique to each known malware, so that
future instances of it can be correctly classified with a small
error rate.
`
`Permission to make digital or hard copies of all or part of this work for
`personal or classroom use is granted without fee provided that copies are
`not made or distributed for profit or commercial advantage and that copies
`bear this notice and the full citation on the first page. To copy otherwise, to
`republish, to post on servers or to redistribute to lists, requires prior specific
`permission and/or a fee.
`KDD’11, August 21–24, 2011, San Diego, California, USA.
`Copyright 2011 ACM 978-1-4503-0813-7/11/08 ...$10.00.
`
`Figure 1: The Increment of Malware Samples (Data source:
`Comodo China Anti-Malware Lab).
`
However, driven by economic benefits, malware writers quickly invent
counter-measures against proposed malware analysis techniques, chief
among them being automated obfuscation [20]. Because of automated
obfuscation, today’s malware samples are created at a rate of
thousands per day. Figure 1 shows the increasing trend of malware
samples in P.R.China from 2003 to 2010 (this data is provided by
Comodo China Anti-Malware Lab). It can be observed that the number
of malware samples has increased sharply since 2008. In fact, the
number of malware samples in 2008 alone is much larger than the
total of the previous five years.

IPR2023-00124
CrowdStrike EX1015 Page 1

Figure 2: The Workflow of Comodo Cloud Based Malware Detection Scheme.

Nowadays malware samples increasingly employ techniques such as
polymorphism [2], metamorphism [1], packing, instruction
virtualization, and emulation to bypass signatures and defeat
attempts to analyze their inner mechanisms [20]. In order to remain
effective, many anti-malware vendors have moved from their classic
signature-based method to cloud (server) based detection. The
workflow of the cloud based detection scheme adopted by Comodo
Security Solutions Incorporation is shown in Figure 2 and can be
described as follows:
`
`1. On the client side, users may receive new files from emails,
`media or IM(Instant Message) tools when they are using the
`Internet.
`
`2. Anti-malware products will first use the signature set on the
`clients for scanning. If these new files are not detected by
`existing signatures, then they will be marked as “unknown”.
`
`3. In order to detect malware from the unknown file collection,
`file features (like file content as well as the file relations) are
`extracted and sent to Comodo Cloud Security Center.
`
`4. Based on these collected features, the classifier(s) on the cloud
`server will predict and generate the verdicts for the unknown
`file samples, either benign or malicious.
`
`5. Then the cloud server will send the verdict results to the
`clients and notify the clients immediately.
`
`6. According to the response from the cloud server, the scan-
`ning process can detect new malware samples and remove
`the threats.
`
`7. Due to the fast response from the cloud server, the client
`users can have most up-to-date security solutions.
`
To sum up, with the cloud-based architecture, malware detection is
now conducted in a client-server manner: the client (user) side
authenticates valid software programs from a whitelist and blocks
invalid software programs from a blacklist using the signature-based
method, while the cloud (server) side predicts the class of any
unknown software (i.e., the gray list) and returns the verdicts to
the clients within seconds. The gray list, containing unknown
software programs which could be either benign or malicious, was
previously authenticated or rejected manually by malware analysts.
With the development of malware writing techniques, the number of
file samples in the gray list that need to be analyzed by malware
analysts on a daily basis is constantly increasing. For example, the
gray list collected by the Anti-Malware Lab of Comodo Security
Solutions Incorporation usually contains about 500,000 file samples
per day. Therefore, there is an urgent need for the anti-malware
industry to develop intelligent methods for efficient and effective
malware detection at the cloud (server) side.
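The client-server split described above can be illustrated with a minimal Python sketch. It is not Comodo's implementation: the names `client_verdict`, `scan`, and `toy_cloud_classify`, the use of file names as identifiers, and the stand-in cloud rule are all assumptions made for illustration.

```python
from enum import Enum


class Verdict(Enum):
    BENIGN = "benign"
    MALICIOUS = "malicious"
    UNKNOWN = "unknown"  # member of the gray list


def client_verdict(file_id, whitelist, blacklist):
    """Client side: signature-style lookup against the two known lists."""
    if file_id in whitelist:
        return Verdict.BENIGN
    if file_id in blacklist:
        return Verdict.MALICIOUS
    return Verdict.UNKNOWN


def scan(file_ids, whitelist, blacklist, cloud_classify):
    """Resolve known files locally and send only the gray list to the cloud."""
    verdicts = {}
    gray_list = []
    for file_id in file_ids:
        verdict = client_verdict(file_id, whitelist, blacklist)
        if verdict is Verdict.UNKNOWN:
            gray_list.append(file_id)
        else:
            verdicts[file_id] = verdict
    if gray_list:
        # cloud_classify is a hypothetical callable: list of ids -> {id: Verdict}
        verdicts.update(cloud_classify(gray_list))
    return verdicts


def toy_cloud_classify(gray_list):
    """Stand-in for the cloud classifier (an assumption, not a real rule)."""
    return {
        file_id: Verdict.MALICIOUS if "dropper" in file_id else Verdict.BENIGN
        for file_id in gray_list
    }


result = scan(
    ["notepad.exe", "known_trojan.exe", "new_dropper.exe", "new_tool.exe"],
    whitelist={"notepad.exe"},
    blacklist={"known_trojan.exe"},
    cloud_classify=toy_cloud_classify,
)
```

Only the two files that appear in neither list reach the cloud-side callable, which mirrors the reason the gray list volume drives the need for automated classification.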
`Recently, many research efforts have been conducted on devel-
`oping intelligent malware detection systems [4, 25, 13, 15, 19, 20,
`27, 24].
`In these systems, the detection process is generally di-
`vided into two steps: feature extraction and classification. In the
`first step, various features such as Application Programming Inter-
`face (API) calls [27] and program strings [13, 19, 18] are ex-
`tracted to capture the characteristics of the file samples.
`In the
`second step, intelligent classification techniques such as decision
`trees [17], Naive Bayes, and associative classifiers [24, 13, 19,
`27] are used to automatically classify the file samples into different
`classes based on computational analysis of the feature representa-
`tions. These intelligent malware detection systems are varied in
`their use of feature representations and classification methods. For
`example, IMDS [27] performs association classification on Win-
`dows API calls extracted from executable files, while Naive Bayes
`methods on the extracted strings and byte sequences are applied in
`[19].
These intelligent techniques have had isolated successes in
classifying particular sets of malware samples, but they have
limitations that leave considerable room for improvement. In
particular, none of these techniques takes the relationships among
file samples into consideration for malware detection. Simply
treating file programs as independent samples allows many
off-the-shelf classification tools to be directly adapted for
malware classification. However, the relationships among file
samples may imply interdependence among them, and thus the usual
i.i.d. (independent and identically distributed) assumption may not
hold for malware samples. As a result, ignoring the relations among
file samples is a significant limitation of current malware
classification methods.
`1.2 Relations Among File Samples
`For malware detection, the relations among file samples provide
`invaluable information about their properties. Here we use some
`examples for illustration. Based on the collected file lists from
`clients, we construct a co-occurrence graph to describe the rela-
`tions among file samples. Generally, two files are related if they
`are shared by many clients (or equivalently, file lists). As shown
`in Figure 3, we can observe that the file “yy(1).exe” is associ-
`ated with many trojans which are marked as purple color. Ac-
`tually, this “yy(1).exe” file is a kind of Trojan-Downloader mal-
`ware. Trojan-Downloader refers to any malicious software that
`downloads and installs multiple unwanted applications of adware
`and malware from remote servers. Malware samples of this type
`are spread from malicious websites or by emails as attachments or
`links, and are installed secretly without the user’s consent. There-
`fore, from the relations shown in Figure 3, we can infer that if an
`unknown file always co-occurs with many kinds of trojans in users’
`computers, then most likely, it is a malicious Trojan-Downloader
`file.
`
`Figure 3: File Relations Between a Trojan-Downloader and its
`Related Trojans.
`
`Another example showing the relations among benign files is il-
`lustrated in Figure 4. From Figure 4, we can observe that an un-
`known file “everest.exe” can be possibly recognized as benign since
`it is always associated with known benign files marked in green
`color. Actually, this “everest.exe” is a benign system diagnostic
`application which always co-occurs with its related Dynamic Link
`Library files, such as, “everest_start.dll”, “everest_mondiag.dll”,
`“everest_rcs.dll” and so on.
Sometimes it is not easy to determine whether a file is malicious
`or not solely based on file content information itself. According
`to the experience and knowledge of our anti-malware experts, file
`relations among samples can be a novel and practical feature repre-
`sentation for malware detection. Some malware samples may have
`stronger connections with benign files than malicious ones. In such
`cases, those file samples might be infected files. Actually, these
`unexpected relations can be filtered and removed, because the in-
`fected samples can be detected independently using the infected file
`detector which is developed by our anti-malware experts.
`1.3 Combining File Content and File Relation
To improve the performance of file sample classification for malware
detection, in this paper, we utilize both file content and file
relation information. However, relation information and file content
have different properties. Relation information provides a graph
structure in the data and induces pairwise similarity between
objects, while the file content provides inherent characteristic
information about the file samples. Although both the relation
information and file content can be used independently to classify
file samples, classification algorithms that make use of them
simultaneously should be able to achieve a better performance.

Figure 4: File Relations Between a Benign Application and its
Related Dynamic Link Library files.

The problem of combining content information and relation
information (i.e., link information) has been widely studied for web
document categorization in the data mining and information retrieval
communities [26, 9]. The approaches for combining content and
`link information generally fall into two categories: (1) feature inte-
`gration which treats the relation information as additional features
`and enlarges the feature representation [3, 11, 16]; and (2) Kernel
`Integration which integrates the data at the similarity computation
`or the Kernel level [10, 14]. However, both types of approaches
`have limitations: feature integration may degrade the quality of in-
`formation as file relations and file content typically have different
`properties, while kernel integration fails to explore the correlation
`and the inherent consistency between the content information and
`the relation information [31].
`1.4 Contributions of Our Paper
`In this paper, we propose a semi-parametric classification model
`for combining file content and file relations. The semi-parametric
`model consists of two components: a parametric component re-
`flecting file content information and a non-parametric component
reflecting file relation information. The model seamlessly
integrates these two components and formulates the classification
problem using the graph regularization framework. Our model can be
viewed as an extension of recently developed joint-embedding
approaches, which aim to seek a common low-dimensional embedding via
joint factorization of both the content and relation information
[31, 5, 30]. However, different from the joint-embedding
`approaches, our model does not explicitly infer the embedding and
`is directly optimized for classification. We develop a file verdict
`system (named "Valkyrie") using the proposed model to integrate
`file content and file relations for malware detection. To the best
`of our knowledge, this is the first work of using both file content
`and file relations for malware detection. In short, our developed
`Valkyrie system has the following major traits:
• Novel Usage of File Relation: Different from previous studies for
malware detection, we not only make use of file content, but also
use the file relations for malware detection.
• A Principled Model for Combining File Content and File Relations:
We propose a semi-parametric classification model to seamlessly
combine file content and file relation, and formulate the
classification problem using the graph regularization framework.
• A Practical Developed System for Real Industry Application: Based
on 37,930 clients, we obtain 30,950 malware samples, 225,830 benign
files and 434,870 unknown files from Comodo Cloud Security Center.
We build a practical system for malware detection and provide a
comprehensive experimental study.
`
`All these traits make our Valkyrie system a practical solution
`for automatic malware detection. The case studies on large and
`real daily malware collection from Comodo Cloud Security Center
`demonstrate the effectiveness and efficiency of our Valkyrie sys-
`tem. As a result, our Valkyrie system has already been incorporated
`into the scanning tool of Comodo’s Anti-Malware software.
`1.5 Organization of The Paper
`The rest of this paper is organized as follows. Section 2 presents
`the overview of our Valkyrie system. Section 3 describes the fea-
`ture extraction and representation; Section 4 introduces the pro-
`posed semi-parametric model combining file content and file rela-
`tions together for malware detection; In Section 5, using the daily
`data collection obtained from Comodo Cloud Security Center, we
`systematically evaluate the effectiveness and efficiency of our Valkyrie
`system in comparison with other proposed classification methods,
`as well as some of the popular Anti-Malware software such as
`Kaspersky and NOD32. Section 6 presents the details of system
`development and operation. Section 7 discusses the related work.
`Finally, Section 8 concludes the paper.
`
`2. SYSTEM ARCHITECTURE
`Figure 5 shows the system architecture of our Valkyrie system.
`We briefly describe each component below.
`
`• Training:
`1. User File List and File Sample Collector: It collects
`the file lists from the clients which contain the poten-
`tial relations between file samples, together with the file
`samples.
2. File Content Feature Extractor: Besides being more efficient to
extract than dynamic feature representations, Application
Programming Interface (API) calls reflect the behaviors of program
code pieces well. Therefore, our file content feature extractor
extracts the API calls from the collected malicious and benign
Windows Portable Executable (PE) files. (See Section 3.1 for
details.)
3. File Relation Feature Extractor: Based on the collected file
lists from clients, a co-occurrence graph is constructed to describe
the file relations. Note that many unexpected relations (like
relations between infected samples and benign files) are removed
using infected file detectors. (See Section 3.2 for details.)
`4. Semi-Parametric Model Based Classifier: Our pro-
`posed semi-parametric model integrates file content and
`relation information and formulates the classification
`problem using the graph regularization framework. (See
`Section 4 for details.)
`
`
`Figure 5: The System Architecture of Valkyrie.
`
`• Prediction: On the clients, our Comodo Anti-Malware soft-
`ware products authenticate valid software from a whitelist
`and block invalid software from a blacklist using the signature-
`based method. The gray list, containing unknown software
`programs which could be either normal or malicious, is then
`fed into our Valkyrie system. After file content and file rela-
`tion feature extractions, the semi-parametric model is applied
`to the gray list for prediction.
`
`3. FEATURE EXTRACTION
Our Valkyrie system operates directly on Windows Portable Executable
(PE) code. PE is designed as a common file format for all flavors of
the Windows operating system, and PE malware accounts for the
majority of the malware that has emerged in recent years [27]. In
this section, we introduce the file content and file relation
feature extraction methods we adopted.
`3.1 File Content
We extract the Application Programming Interface (API) calls from
the Import Tables [27] of the collected malicious and benign PE
files, convert them to a group of 32-bit global IDs (for example,
the API "MAPI32.MAPIReadMail" is encoded as 0x00000F12) as the
content features, and store these features in the signature
database. A sample file content signature database is shown in
Figure 6, in which there are 6 fields: record ID, PE file name, file
type ("-1" represents a benign file while "1" represents a malicious
file), called API names, called API IDs, and the total number of
called API functions.
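The encoding step can be sketched as follows. This is an illustrative assumption rather than the authors' extractor: only the ID 0x00000F12 for "MAPI32.MAPIReadMail" comes from the text, while the other table entries and the helper names `encode_imports` and `to_indicator_vector` are hypothetical. Parsing the PE Import Table itself is outside the scope of the sketch, which starts from a list of imported API names.

```python
# Global API-name -> 32-bit ID table. Only the first entry is taken from the
# paper; the remaining IDs are hypothetical placeholders.
API_IDS = {
    "MAPI32.MAPIReadMail": 0x00000F12,
    "KERNEL32.CreateFileA": 0x00000001,  # hypothetical
    "KERNEL32.WriteFile": 0x00000002,    # hypothetical
    "WS2_32.connect": 0x00000003,        # hypothetical
}


def encode_imports(imported_apis, api_ids=API_IDS):
    """Map imported API names to sorted, de-duplicated 32-bit IDs.

    Names missing from the table are skipped in this sketch.
    """
    return sorted({api_ids[name] for name in imported_apis if name in api_ids})


def to_indicator_vector(ids, vocabulary):
    """Binary content feature vector: 1 if the API ID is imported, else 0."""
    present = set(ids)
    return [1 if api_id in present else 0 for api_id in vocabulary]


vocabulary = sorted(API_IDS.values())
sample_imports = ["KERNEL32.CreateFileA", "MAPI32.MAPIReadMail", "USER32.NotInTable"]
ids = encode_imports(sample_imports)
vector = to_indicator_vector(ids, vocabulary)
```

With 86,757 dimensions, as reported in Section 5.1, a production system would presumably store the ID list (a sparse representation) rather than the dense indicator vector shown here.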
`3.2 File Relations
Based on the collected file lists from clients, we construct a
co-occurrence graph to describe the relations among file samples.
Generally, two files are related if they are shared by many clients
(or equivalently, file lists). Note that many unexpected relations
(like relations between infected samples and benign files) are first
removed using infected file detectors.
The co-occurrence graph is defined as $G = \langle V, E \rangle$,
where $V$ is the set of file samples. Given two file samples $v_i$
and $v_j$, let $S_i$ be the set of file lists containing $v_i$ and
$S_j$ be the set of file lists containing $v_j$. Then the similarity
between $v_i$ and $v_j$ is computed as

$$ sim(v_i, v_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}, \tag{1} $$

where $|S|$ denotes the size of a set $S$. If the similarity between
a pair of file samples is greater than 0, then there is an edge
between them, and $E$ is the set of edges between vertices.
An example graph is shown in Figure 7 illustrating the real
relations between some file samples, where the size of each edge
indicates its weight.

Figure 6: Sample File Content Features in the Signature Database.

Figure 7: An example graph of real relations between some file
samples (purple: malware samples, green: benign files, transparent:
unknown files).

4. A SEMI-PARAMETRIC MODEL FOR COMBINING FILE CONTENT AND FILE RELATIONS
In this section, we propose a semi-parametric model to combine file
content and file relations for classification using the graph
regularization framework. The semi-parametric model consists of two
components: a parametric component reflecting file content
information and a non-parametric component reflecting file relation
information.
Let $f$ be a vector, each of whose elements is the label (i.e.,
malicious or benign) of a file example to be predicted. The vector
$f$ is generated from two parts, a parametric one and a
non-parametric one. The parametric component follows a linear model,
$X^\top w$, where each column of the matrix $X$ is the content
feature vector of a file example, and $w$ is the coefficient vector.
The non-parametric part is a vector $h$, each element of which
corresponds to a value for a file example. Combining the two parts,
we have $f = X^\top w + h$.
Now consider the labeling information vector $y$. Let $y_i = 1$ if
the $i$-th file sample is malicious, $y_i = -1$ if the $i$-th file
sample is benign, and $y_i = 0$ if the $i$-th file sample is
unlabeled. We can use hinge loss for labeled file samples as in the
Support Vector Machine, or use L2 loss for labeled data as in least
squares problems. For simplicity, we follow [29] and use L2 loss on
all data points, i.e., $\|y - f\|^2$. We also consider the global
consistency on the co-occurrence graph [29], $f^\top L f$, where the
symmetric matrix $L$ is the normalized Laplacian of the graph. Thus
the total loss is

$$ \frac{1}{2}\|y - f\|^2 + \frac{\alpha}{2} f^\top L f, \tag{2} $$

where $\alpha$ is the weight for combining the two parts of
information; the factor $\frac{1}{2}$ is added just for convenience.
To limit the model complexity, we add the regularization terms for
$w$ and $h$, which are

$$ \frac{1}{2\beta} w^\top w + \frac{1}{2\gamma} h^\top h, \tag{3} $$

where $\beta$ and $\gamma$ are the regularization parameters.
Putting Eq. (2) and Eq. (3) together, we have the optimization
problem:

$$ \min_{f, w, h} \; \frac{1}{2}\|y - f\|^2 + \frac{\alpha}{2} f^\top L f + \frac{1}{2\beta} w^\top w + \frac{1}{2\gamma} h^\top h \quad \text{subject to} \quad f = X^\top w + h. \tag{4} $$

To solve Eq. (4), we introduce the Lagrange multiplier $\xi$:

$$ \mathcal{L}(f, w, h; \xi) = \frac{1}{2}\|y - f\|^2 + \frac{\alpha}{2} f^\top L f + \frac{1}{2\beta} w^\top w + \frac{1}{2\gamma} h^\top h + \xi^\top (f - X^\top w - h). $$

Setting $\partial\mathcal{L}/\partial w = 0$,
$\partial\mathcal{L}/\partial h = 0$,
$\partial\mathcal{L}/\partial \xi = 0$, and
$\partial\mathcal{L}/\partial f = 0$, we have

$$ w = \beta X \xi \tag{5} $$
$$ h = \gamma \xi \tag{6} $$
$$ f = X^\top w + h \tag{7} $$
$$ y = f + \alpha L f + \xi \tag{8} $$

Plugging Eqs. (5, 6) into Eq. (7), we have

$$ f = (\beta X^\top X + \gamma I)\xi, $$

or

$$ \xi = (\beta X^\top X + \gamma I)^{-1} f. $$

Plugging it into Eq. (8), we have

$$ f = \left[ I + \alpha L + (\beta X^\top X + \gamma I)^{-1} \right]^{-1} y. \tag{9} $$

This model extends [29] by considering the parametric part.
Note that if there are no content features, then f = h and this
model reduces to traditional semi-supervised learning. Different
from [31] and [30], this model does not infer the embedding.
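To make the closed-form solution concrete, the following pure-Python sketch builds the co-occurrence graph with the Jaccard similarity of Section 3.2, forms the normalized Laplacian, and computes the scores of Eq. (9) on a four-file toy example. It is a sketch under stated assumptions, not the Valkyrie implementation: the file names, file lists, and two-dimensional content features are invented, the helper names are hypothetical, the convention L = I - D^{-1/2} W D^{-1/2} is assumed for the normalized Laplacian, and a dense Gaussian elimination stands in for a scalable solver. Instead of inverting matrices, it solves ((I + αL)K + I)ξ = y with K = βXᵀX + γI and then sets f = Kξ, which is algebraically equivalent to Eq. (9) by Eqs. (5)-(8).

```python
from itertools import combinations
from math import sqrt


def jaccard_graph(files, file_lists):
    """Eq. (1): sim(v_i, v_j) = |S_i & S_j| / |S_i | S_j|."""
    # S[f] = indices of the client file lists that contain file f
    S = {f: {k for k, fl in enumerate(file_lists) if f in fl} for f in files}
    n = len(files)
    W = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        union = S[files[i]] | S[files[j]]
        if union:
            W[i][j] = W[j][i] = len(S[files[i]] & S[files[j]]) / len(union)
    return W


def normalized_laplacian(W):
    """Assumed convention: L = I - D^{-1/2} W D^{-1/2}."""
    n = len(W)
    s = [1.0 / sqrt(sum(row)) if sum(row) > 0 else 0.0 for row in W]
    return [[(1.0 if i == j else 0.0) - s[i] * W[i][j] * s[j]
             for j in range(n)] for i in range(n)]


def solve(A, b):
    """Gaussian elimination with partial pivoting (toy sizes only)."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def semi_parametric_scores(X, L, y, alpha=0.1, beta=0.1, gamma=0.1):
    """Scores f of Eq. (9), computed via xi: ((I + alpha L) K + I) xi = y, f = K xi."""
    n = len(y)
    cols = list(zip(*X))  # X is p x n: one column per file
    K = [[beta * sum(a * b for a, b in zip(cols[i], cols[j]))
          + (gamma if i == j else 0.0) for j in range(n)] for i in range(n)]
    A = [[(1.0 if i == j else 0.0) + alpha * L[i][j] for j in range(n)]
         for i in range(n)]
    B = [[sum(A[i][k] * K[k][j] for k in range(n)) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    xi = solve(B, y)
    return [sum(K[i][j] * xi[j] for j in range(n)) for i in range(n)]


# Toy data (all invented): two labeled files and two unknown files.
files = ["known_bad.exe", "known_good.exe", "unknown_1.exe", "unknown_2.exe"]
file_lists = [
    {"known_bad.exe", "unknown_1.exe"},
    {"known_bad.exe", "unknown_1.exe"},
    {"known_good.exe", "unknown_2.exe"},
    {"known_good.exe", "unknown_2.exe"},
]
X = [[1, 0, 1, 0],  # hypothetical API indicator 1
     [0, 1, 0, 1]]  # hypothetical API indicator 2
y = [1, -1, 0, 0]   # 1 malicious, -1 benign, 0 unlabeled

L = normalized_laplacian(jaccard_graph(files, file_lists))
f = semi_parametric_scores(X, L, y)
verdicts = ["malicious" if score > 0 else "benign" for score in f]
```

In this toy setting the unknown file that co-occurs with the labeled malware receives a positive score and the one that co-occurs with the labeled benign file receives a negative score, each smaller in magnitude than the score of its labeled neighbor. Thresholding at zero is an assumption of the sketch.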
Computation Issues: We need to solve

$$ \left[ I + \alpha L + (\beta X^\top X + \gamma I)^{-1} \right] f = y. \tag{10} $$

Let the size of $X$ be $p \times n$, where $p$ is the number of
features and $n$ is the number of instances, and let the average
number of nonzero elements per row of $L$ be $\kappa$. As long as
$p \ll n$, we can apply the Woodbury identity, and Eq. (10) becomes

$$ \left[ (1 + \gamma^{-1}) I + \alpha L - \gamma^{-1} X^\top (\gamma\beta^{-1} I + X X^\top)^{-1} X \right] f = y. \tag{11} $$

To solve Eq. (11), we can use the conjugate gradient method.
Computing $XX^\top$ is $O(np^2)$, the inverse of
$(\gamma\beta^{-1} I + XX^\top)$ is $O(p^3)$, and we precompute
$(\gamma\beta^{-1} I + XX^\top)^{-1} X$ in $O(np^2 + p^3)$. In each
iteration of conjugate gradient, we compute

$$ \left[ (1 + \gamma^{-1}) I + \alpha L - \gamma^{-1} X^\top (\gamma\beta^{-1} I + X X^\top)^{-1} X \right] v $$

for some $v$. The computation of each iteration is
$O(n(p + \kappa))$. The convergence rate depends on the condition
number of the LHS matrix of Eq. (11).
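The step from Eq. (10) to Eq. (11) can be spelled out as a short worked derivation. The block below is an added illustration, not part of the original text; it assumes, as stated above, that X is p × n (one column per file), and it writes I_n and I_p for the identity matrices of the two sizes.

```latex
% Woodbury identity:
%   (A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
% applied with A = \gamma I_n, U = X^\top, C = \beta I_p, V = X.
\begin{align*}
(\beta X^\top X + \gamma I_n)^{-1}
  &= \gamma^{-1} I_n
     - \gamma^{-1} X^\top \left( \beta^{-1} I_p + \gamma^{-1} X X^\top \right)^{-1} X \gamma^{-1} \\
  &= \gamma^{-1} I_n
     - \gamma^{-1} X^\top \left( \gamma\beta^{-1} I_p + X X^\top \right)^{-1} X .
\end{align*}
% Adding I_n + \alpha L to both sides gives the system matrix
%   (1 + \gamma^{-1}) I_n + \alpha L
%     - \gamma^{-1} X^\top (\gamma\beta^{-1} I_p + X X^\top)^{-1} X .
```

The only matrix that has to be inverted is then p × p rather than n × n, which is where the condition that p be much smaller than n pays off.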
`
5. EXPERIMENTAL RESULTS AND ANALYSIS
`In this section, we conduct three sets of experimental studies
`using our data collection obtained from Comodo Cloud Security
`Center to fully evaluate the performance of our developed Valkyrie
`system: (1) In the first set of experiments, we evaluate the effective-
`ness of file content based classifier and file relation based classifier
`for malware detection; (2) In the second part of experiments, we
`evaluate our proposed semi-parametric model based classifier by
`comparing it with alternative methods for combining file content
`and file relations. (3) In the last set of experiments, we compare
our Valkyrie system with some popular anti-malware software products
such as Kaspersky Anti-Virus, McAfee VirusScan, and Bitdefender.
`5.1 Experimental Setup
`We measure the malware detection performance of different clas-
`sifiers using the following evaluation measures:
• True Positive (TP): the number of samples correctly classified as
malicious files.
• True Negative (TN): the number of samples correctly classified as
benign files.
• False Positive (FP): the number of samples mistakenly classified
as malicious files.
• False Negative (FN): the number of samples mistakenly classified
as benign files.
• Accuracy (ACY): $\frac{TP + TN}{TP + TN + FP + FN}$.
• Recall (RC): $\frac{TP + TN + FP + FN}{N}$, where $N$ is the
number of files in the total file collection.

The dataset we obtained from Comodo Cloud Security Center includes
37,930 user file lists that describe file relations between 30,950
malware samples, 225,830 benign files and 434,870 unknown files
(analyzed by the anti-malware experts of Comodo Security Lab, 39,138
of them are malware, while 395,732 of them are benign files). We
also have the file relation information for all the file samples.
Using the feature extraction methods described in
Section 3, based on this data collection, 1) resting on the API
calls extracted from the known file samples, we obtain 210,850
training file content vectors with 86,757 dimensions (since the
Import Tables of some file samples are invalid, API calls are
successfully extracted from 23,610 malicious files and 187,240
benign files); 2) from the collected user file lists, after
excluding the unexpected relations (like relations between infected
samples and benign files), we construct a graph of 248,986 vertices
(29,006 representing malicious files and 219,980 representing benign
files) with 356,134 edges.
`All the experimental studies are conducted under the environ-
`ment of Windows 7 operating system plus Intel(R) Core(TM) i3
`CPU and 4 GB of RAM.
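The two ratio measures can be checked with a few lines of Python. The function names are ours; the input counts are the CR_SPM row of Table 2, and the total of 434,870 is the size of the unknown file collection stated above. Note that RC, as defined in this section, measures the fraction of the whole collection that receives a verdict.

```python
def accuracy(tp, tn, fp, fn):
    """ACY = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, tn, fp, fn, total_files):
    """RC as defined in Section 5.1: classified files over the whole collection."""
    return (tp + tn + fp + fn) / total_files


# Counts from the CR_SPM row of Table 2, on the 434,870 unknown files.
acy = accuracy(tp=34675, tn=356991, fp=563, fn=1798)
rc = recall(tp=34675, tn=356991, fp=563, fn=1798, total_files=434870)
print(f"ACY = {acy:.4f}, RC = {rc:.4f}")  # prints "ACY = 0.9940, RC = 0.9061"
```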
5.2 Comparisons of File Content and File Relation Based Classifiers
`In this set of experiments, we evaluate the effectiveness of mal-
`ware detection results based on different feature representations:
`file content and file relations. The large collection of file sample
`data along with the high dimensionality and sparseness requires
`the classification methods for malware detection to be scalable and
robust. With the advantage of handling large feature spaces without
overfitting, Support Vector Machine (SVM) has shown state-of-the-art
results in classification problems [22, 12, 28]. Therefore, in this
`section, we use SVM [7] as the base classifier. For file content
`based classification, SVM is applied on the features of API calls.
`For file relation based classification, we treat file relations as the
`features for each file sample, i.e., the i-th feature is the similarities
`with the i-th file sample. Linear SVM [7] is used in both cases
`and the regularization parameter of SVM is selected using cross-
`validation.
`From Table 1 and Figure 8, we observe that the accuracy of file
`relation based classifier is similar to file content based classifier,
`while the recall of the file relation based classifier greatly outper-
`forms the file content based classifier for unknown file verdicts.
`
Training      TP       FP      TN        FN      ACY     RC
F_Content     23,585   32      187,208   25      0.9997  0.8211
F_Relation    27,018   880     219,100   1,988   0.9885  0.9696

Testing       TP       FP      TN        FN      ACY     RC
F_Content     23,358   2,230   236,196   6,423   0.9677  0.6168
F_Relation    25,969   6,880   312,100   9,988   0.9525  0.8162

Table 1: Comparisons of File Content and File Relation Based
Classifiers. Remark: "F_Content"-File Content based Classifier,
"F_Relation"-File Relation based Classifier.
`
5.3 Comparisons of Different Classifiers Combining File Content and File Relation
`In this section, we compare our semi-parametric model with the
`following methods of combining file relations and file content in-
`formation: (1) SVM on feature integration: We combine the con-
`tent features and the relation features and then apply SVM on the
`enlarged feature space. We use different weights for these two
`sets of features and the weights are selected using cross-validation.
`(2) SVM on kernel integration: We average the linear kernel on
`the content and the relation similarity (note that the co-occurrence
`graph can be viewed as a kernel) and apply SVM on the composite
`kernel. (3) joint-factorization: We use the supervised joint matrix
`factorization on both the content and relation information and then
perform SVM on the resulting low dimensional embedding. For more
details on this method, please refer to [31]. For our
semi-parametric model, the parameters α, β, and γ are all set to 0.1.
The results shown in Table 2 and Figure 9 demonstrate that: (1)
combining file relation with file content can improve the
classification effectiveness for malware detection; (2) when
combining file relation with file content, our proposed
semi-parametric model based classifier outperforms the alternative
methods.

Figure 8: Comparisons of File Content and File Relation Based
Classifiers. Remark: "UM"-the number of malware from unknown file
collection still unrecognized by classifier, "UB"-the number of
benign files from unknown file collection still unrecognized by
classifier.
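The difference between the first two baselines of this section can be made concrete with a small sketch of the kernels they hand to the SVM. It is an illustration under assumptions: the toy content vectors and similarity matrix are invented, the function names are hypothetical, the relation features of a file are taken to be its row of the similarity matrix (as in Section 5.2), and self-similarity is set to 1 on the diagonal. No SVM is trained here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def feature_integration_kernel(x_i, r_i, x_j, r_j, w_content=1.0, w_relation=1.0):
    """Linear kernel on the concatenated, weighted content and relation features."""
    z_i = [w_content * a for a in x_i] + [w_relation * a for a in r_i]
    z_j = [w_content * a for a in x_j] + [w_relation * a for a in r_j]
    return dot(z_i, z_j)


def kernel_integration(x_i, x_j, sim_ij):
    """Average of the linear content kernel and the graph similarity itself."""
    return 0.5 * (dot(x_i, x_j) + sim_ij)


# Toy data (invented): content vectors and a symmetric similarity matrix.
content = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]
similarity = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
]

k_feature = feature_integration_kernel(content[0], similarity[0],
                                       content[1], similarity[1])
k_kernel = kernel_integration(content[0], content[1], similarity[0][1])
```

With unit weights, feature integration yields the sum of the two per-view linear kernels (here 1.0 + 1.0 = 2.0), so the relation view enters through inner products of similarity rows, whereas kernel integration uses the pairwise similarity directly (here 0.5 * (1 + 0.5) = 0.75).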
`
`Figure 9: Comparisons of Different Classifiers Combining File
`Content and File Relation. Remark: "F_Content"-File Con-
`tent based Classifier, "F_Relation"-File Relation based Classi-
`fier, "CR_C1"-SVM on Feature Integration, "CR_C2"-SVM
`on Kernel Integration, "CR_C3"-Joint-factorization Classifier,
`"CR_SPM"-our proposed Semi-Parametric Model.
`
`Testing    | TP     | FP    | TN      | FN    | ACY    | RC
`-----------+--------+-------+---------+-------+--------+-------
`F_Content  | 23,358 | 2,230 | 236,196 | 6,423 | 0.9677 | 0.6168
`F_Relation | 25,969 | 6,880 | 312,100 | 9,988 | 0.9525 | 0.8162
`CR_C1      | 29,002 | 7,454 | 350,100 | 7,471 | 0.9621 | 0.9061
`CR_C2      | 28,123 | 8,358 | 349,196 | 8,350 | 0.9576 | 0.9061
`CR_C3      | 30,789 | 7,572 | 349,982 | 5,486 | 0.9664 | 0.9061
`CR_SPM     | 34,675 |   563 | 356,991 | 1,798 | 0.9940 | 0.9061
`
`Table 2: Comparisons of Different Classifiers Combining File
`Content and File Relation. Remark: "F_Content"-File Content
`based Classifier, "F_Relation"-File Relation based Classifier,
`"CR_C1"-SVM on Feature Integration, "CR_C2"-SVM
`on Kernel Integration, "CR_C3"-Joint-factorization Classifier,
`"CR_SPM"-our proposed Semi-Parametric Model.
`
`AV       | TP     | FP    | TN      | FN    | ACY    | RC
`---------+--------+-------+---------+-------+--------+-------
`Kasp     | 27,954 |   711 |       0 |     0 | 0.9752 | 0.0659
`Nod32    | 26,589 |   923 |       0 |     0 | 0.9665 | 0.0633
`McAfee   | 23,951 | 1,011 |       0 |     0 | 0.9595 | 0.0574
`BD       | 28,763 |   780 |       0 |     0 | 0.9736 | 0.0679
`Avira    | 29,009 | 1,887 |       0 |     0 | 0.9389 | 0.0710
`Valkyrie | 34,675 |   563 | 356,991 | 1,798 | 0.9940 | 0.9061
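As a reading aid for the ACY and RC columns, the sketch below recomputes them from the reported counts. It assumes that ACY is the accuracy over the files that receive a verdict, (TP + TN) / (TP + FP + TN + FN), and that RC is the fraction of the 434,870 unknown files that receive a verdict. Both definitions are inferred from the reported numbers rather than quoted from this section, and the function names are hypothetical.

```python
# Illustrative sketch only: the ACY/RC definitions are assumptions inferred
# from the reported counts; the function names are hypothetical.

TOTAL_FILES = 434_870  # size of the unknown file collection

def acy(tp, fp, tn, fn):
    """Accuracy over the files that received a verdict."""
    return (tp + tn) / (tp + fp + tn + fn)

def rc(tp, fp, tn, fn, total=TOTAL_FILES):
    """Fraction of all unknown files that received a verdict."""
    return (tp + fp + tn + fn) / total

# Rows copied from the reported results.
cr_spm = dict(tp=34_675, fp=563, tn=356_991, fn=1_798)
f_content = dict(tp=23_358, fp=2_230, tn=236_196, fn=6_423)
kasp = dict(tp=27_954, fp=711, tn=0, fn=0)

for name, row in [("CR_SPM", cr_spm), ("F_Content", f_content), ("Kasp", kasp)]:
    print(name, round(acy(**row), 4), round(rc(**row), 4))
```

Under these assumed definitions, a signature-based scanner that only issues malware verdicts (TN = FN = 0) gets a high ACY but a low RC, because it says nothing about most of the collection.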
`
`5.4 Comparisons with Different AV Vendors
`In this section, we apply the Valkyrie system in real applications to
`evaluate its malware detection effectiveness and efficiency on the
`daily data collection described in Section 5.1.
`5.4.1 Comparisons of Detection Effectiveness between
`Different AV Vendors
`Based on 434,870 unknown files (as analyzed by the anti-malware
`experts of Comodo Security Lab, 39,138 of them are malware and
`395,732 are benign), we first compare the malware detection
`effectiveness of the Valkyrie system with that of several popular
`AV products: Kaspersky (Kasp), NOD32, McAfee, Bitdefender (BD),
`and Avira. For comparison purposes, we use each anti-virus
`scanner's latest signature database as of the same day (Feb 14th,
`2011). Table 3 and Figure 10 show that Valkyrie outperforms these
`popular AV products in malware detection effectiveness on this
`data collection.
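To make the comparison concrete, the following sketch relates each product's reported true positives to the 39,138 files labeled as malware. This share of detected malware is not a figure reported in the paper; it is simple arithmetic on the reported TP counts, and the variable names are hypothetical.

```python
# Illustrative sketch only: "detected share" is derived arithmetic, not a
# metric reported in the paper; names are hypothetical.

MALWARE_TOTAL = 39_138  # files labeled as malware by the analysts

# True positives per product, copied from the reported results.
true_positives = {
    "Kasp": 27_954,
    "Nod32": 26_589,
    "McAfee": 23_951,
    "BD": 28_763,
    "Avira": 29_009,
    "Valkyrie": 34_675,
}

# Share of the labeled malware that each product flagged.
detected_share = {name: tp / MALWARE_TOTAL for name, tp in true_positives.items()}

# Products ordered from highest to lowest share.
ranking = sorted(detected_share, key=detected_share.get, reverse=True)

for name in ranking:
    print(f"{name:9s} {detected_share[name]:.3f}")
```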
`
`Table 3: The malware detection results of different AV software
`products on the collection with 434,870 unknown files.
`
`Figure 10: The comparisons of malware detection results of
`different AV software products on the collection with 434,870
`unknown files.
`
`5.4.2 Comparisons of Detection Efficiency between
`Different AV Vendors
`In this set
