IPR2023-00124, No. 1015 Exhibit - EX1015 Ye, et al Combining File Content and File Relations for Cloud Based (P.T.A.B. Nov. 30, 2022)

Combining File Content and File Relations for Cloud
`Based Malware Detection
`
`Yanfang Ye
`Comodo Security Solutions
`Beijing, 100082, P.R.China
`yanfang@comodo.com
`
`Weiwei Zhuang
`Xiamen University
`Xiamen, 361005, P.R.China
`zhuangweiwei@gmail.com
`
`Shenghuo Zhu
`NEC Laboratories America
`Cupertino, CA, 95129, USA
`zsh@sv.nec-labs.com
`
`Tao Li
`School of Computer Science
`Florida International University
`Miami, FL, 33199, USA
taoli@cs.fiu.edu

Egemen Tas, Umesh Gupta
Comodo Security Solutions
New Jersey, NJ, 07310, USA
{egemen,umesh}@comodo.com

Melih Abdulhayoglu
Comodo Security Solutions
New Jersey, NJ, 07310, USA
melih@comodo.com
`
`ABSTRACT
Because of the damage it does to Internet security, malware (such as
viruses, worms, trojans, spyware, backdoors, and rootkits) has drawn the
attention of both the anti-malware industry and researchers for decades.
Data mining methods such as Naive Bayes and Support Vector Machines have
been used for malware detection, relying on the analysis of file content
extracted from the file samples, such as Application Programming
Interface (API) calls, instruction sequences, and binary strings.
However, besides file content, relations among file samples (for example,
a “Downloader” is always associated with many Trojans) can provide
invaluable information about the properties of file samples. In this
paper, we study how file relations can be used to improve malware
detection results and develop a file verdict system (named “Valkyrie”)
that builds on a semi-parametric classifier model to combine file content
and file relations for malware detection. To the best of our knowledge,
this is the first work to use both file content and file relations for
malware detection. We perform a comprehensive experimental study on a
large collection of PE files obtained from the clients of anti-malware
products of Comodo Security Solutions Incorporation to compare various
malware detection approaches. Promising experimental results demonstrate
that the accuracy and efficiency of our Valkyrie system outperform other
popular anti-malware software tools such as Kaspersky AntiVirus and
McAfee VirusScan, as well as alternative data mining based detection
systems. Our system has already been incorporated into the scanning tool
of Comodo’s Anti-Malware software.
`
`Categories and Subject Descriptors
`I.2.6 [Artiﬁcial Intelligence]: Learning; D.4.6 [Operating Sys-
`tem]: Security and Protection - Invasive software
`
`General Terms
`Algorithms, Experimentation, Security
`
`Keywords
`cloud based malware detection, ﬁle content, ﬁle relation, semi-
`parametric model for learning from graph
`
1. INTRODUCTION
`1.1 Cloud Based Malware Detection
Malware is software designed to infiltrate or damage a computer system
without the owner’s informed consent (e.g., viruses, worms, trojans,
spyware, backdoors, and rootkits) [23]. The numerous attacks made by
malware pose a major security threat to Internet users [8]. Hence,
malware detection is an Internet security topic of great interest
[4, 25, 13, 15, 19, 20, 27, 24]. Currently, the most significant line of
defense against malware is anti-malware software products, such as
Kaspersky, McAfee and Comodo’s Anti-Malware software. Typically, these
widely used malware detection tools use the signature-based method to
recognize threats. A signature is a short string of bytes that is unique
to each known malware, so that future instances of it can be correctly
classified with a small error rate.
`
`Permission to make digital or hard copies of all or part of this work for
`personal or classroom use is granted without fee provided that copies are
`not made or distributed for proﬁt or commercial advantage and that copies
`bear this notice and the full citation on the ﬁrst page. To copy otherwise, to
`republish, to post on servers or to redistribute to lists, requires prior speciﬁc
`permission and/or a fee.
`KDD’11, August 21–24, 2011, San Diego, California, USA.
`Copyright 2011 ACM 978-1-4503-0813-7/11/08 ...$10.00.
`
`Figure 1: The Increment of Malware Samples (Data source:
`Comodo China Anti-Malware Lab).
`
However, driven by economic benefits, malware writers quickly invent
counter-measures against proposed malware analysis techniques, chief
among them being automated obfuscation [20]. Because of automated
obfuscation, today’s malware samples are created at a rate of thousands
per day. Figure 1 shows the increasing trend of malware samples in
P.R.China from 2003 to 2010 (data provided by Comodo China Anti-Malware
Lab). It can be observed that the number of malware samples has increased
sharply since 2008. In fact, the number of malware samples in 2008 alone
is much larger than the total of the previous five years.

IPR2023-00124
CrowdStrike EX1015 Page 1

Figure 2: The Workflow of Comodo Cloud Based Malware Detection Scheme.

Nowadays, malware samples increasingly employ techniques such as
polymorphism [2], metamorphism [1], packing, instruction virtualization,
and emulation to bypass signatures and defeat attempts to analyze their
inner mechanisms [20]. In order to remain effective, many anti-malware
vendors have moved from the classic signature-based method to cloud
(server) based detection. The workflow of the cloud based detection
method adopted by Comodo Security Solutions Incorporation is shown in
Figure 2 and can be described as follows:
`
`1. On the client side, users may receive new ﬁles from emails,
`media or IM(Instant Message) tools when they are using the
`Internet.
`
`2. Anti-malware products will ﬁrst use the signature set on the
`clients for scanning. If these new ﬁles are not detected by
`existing signatures, then they will be marked as “unknown”.
`
`3. In order to detect malware from the unknown ﬁle collection,
`ﬁle features (like ﬁle content as well as the ﬁle relations) are
`extracted and sent to Comodo Cloud Security Center.
`
`4. Based on these collected features, the classiﬁer(s) on the cloud
`server will predict and generate the verdicts for the unknown
`ﬁle samples, either benign or malicious.
`
`5. Then the cloud server will send the verdict results to the
`clients and notify the clients immediately.
`
`6. According to the response from the cloud server, the scan-
`ning process can detect new malware samples and remove
`the threats.
`
7. Due to the fast response from the cloud server, client users can
have the most up-to-date security solutions.
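
The seven steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not Comodo's implementation: the names `Verdict`, `client_scan`, `scan_with_cloud`, and the `cloud_classify` callback are hypothetical, files are identified by a hash string, and the client-side signature scan is reduced to whitelist and blacklist lookups.

```python
from enum import Enum


class Verdict(Enum):
    BENIGN = "benign"
    MALICIOUS = "malicious"
    UNKNOWN = "unknown"


def client_scan(file_hash, whitelist, blacklist):
    """Step 2: signature-based scan on the client (hypothetical lookup)."""
    if file_hash in blacklist:
        return Verdict.MALICIOUS
    if file_hash in whitelist:
        return Verdict.BENIGN
    return Verdict.UNKNOWN


def scan_with_cloud(files, whitelist, blacklist, cloud_classify):
    """files maps a file hash to its extracted features.

    cloud_classify is an assumed callback standing in for the cloud
    server: it takes the gray list (hash -> features) and returns a
    dict of hash -> Verdict.
    """
    verdicts = {}
    gray_list = {}
    for file_hash, features in files.items():
        verdict = client_scan(file_hash, whitelist, blacklist)
        if verdict is Verdict.UNKNOWN:
            gray_list[file_hash] = features  # step 3: features go to the cloud
        else:
            verdicts[file_hash] = verdict
    # Steps 4-5: the cloud classifier returns verdicts for the unknown files.
    verdicts.update(cloud_classify(gray_list))
    return verdicts
```

In this sketch only the gray list crosses the network, which mirrors the division of labor described above: known files are settled locally and the cloud classifier sees only the unknown ones.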
`
To sum up, with the cloud-based architecture, malware detection is now
conducted in a client-server manner. At the client (user) side, the
signature-based method authenticates valid software programs from a
whitelist and blocks invalid software programs from a blacklist. At the
cloud (server) side, any unknown software (i.e., the gray list) is
classified and the verdict results are returned to the clients within
seconds. The gray list, containing unknown software programs which could
be either benign or malicious, was previously authenticated or rejected
manually by malware analysts. With the development of malware writing
techniques, the number of file samples in the gray list that need to be
analyzed by malware analysts on a daily basis is constantly increasing.
For example, the gray list collected by the Anti-Malware Lab of Comodo
Security Solutions Incorporation usually contains about 500,000 file
samples per day. Therefore, there is an urgent need for the anti-malware
industry to develop intelligent methods for efficient and effective
malware detection at the cloud (server) side.
`Recently, many research efforts have been conducted on devel-
`oping intelligent malware detection systems [4, 25, 13, 15, 19, 20,
`27, 24].
`In these systems, the detection process is generally di-
`vided into two steps: feature extraction and classiﬁcation. In the
`ﬁrst step, various features such as Application Programming Inter-
`face (API) calls [27] and program strings [13, 19, 18] are ex-
`tracted to capture the characteristics of the ﬁle samples.
`In the
`second step, intelligent classiﬁcation techniques such as decision
`trees [17], Naive Bayes, and associative classiﬁers [24, 13, 19,
`27] are used to automatically classify the ﬁle samples into different
`classes based on computational analysis of the feature representa-
`tions. These intelligent malware detection systems are varied in
`their use of feature representations and classiﬁcation methods. For
`example, IMDS [27] performs association classiﬁcation on Win-
`dows API calls extracted from executable ﬁles, while Naive Bayes
`methods on the extracted strings and byte sequences are applied in
`[19].
These intelligent techniques have had isolated successes in classifying
particular sets of malware samples, but they have limitations that leave
considerable room for improvement. In particular, none of these
techniques takes the relationships among file samples into consideration
for malware detection. Simply treating file programs as independent
samples allows many off-the-shelf classification tools to be directly
adapted for malware classification. However, the relationships among
file samples may imply interdependence among them, and thus the usual
i.i.d. (independent and identically distributed) assumption may not hold
for malware samples. As a result, ignoring the relations among file
samples is a significant limitation of current malware classification
methods.
`1.2 Relations Among File Samples
`For malware detection, the relations among ﬁle samples provide
`invaluable information about their properties. Here we use some
`examples for illustration. Based on the collected ﬁle lists from
`clients, we construct a co-occurrence graph to describe the rela-
`tions among ﬁle samples. Generally, two ﬁles are related if they
are shared by many clients (or equivalently, file lists). As shown
in Figure 3, we can observe that the file “yy(1).exe” is associated
with many trojans, which are marked in purple. In fact, this
“yy(1).exe” file is a kind of Trojan-Downloader malware.
Trojan-Downloader refers to any malicious software that
`downloads and installs multiple unwanted applications of adware
`and malware from remote servers. Malware samples of this type
`are spread from malicious websites or by emails as attachments or
`links, and are installed secretly without the user’s consent. There-
`fore, from the relations shown in Figure 3, we can infer that if an
`unknown ﬁle always co-occurs with many kinds of trojans in users’
`computers, then most likely, it is a malicious Trojan-Downloader
`ﬁle.
`
`Figure 3: File Relations Between a Trojan-Downloader and its
`Related Trojans.
`
Another example showing the relations among benign files is illustrated
in Figure 4. From Figure 4, we can observe that an unknown file
“everest.exe” can possibly be recognized as benign since it is always
associated with known benign files marked in green. In fact, this
“everest.exe” is a benign system diagnostic application which always
co-occurs with its related Dynamic Link Library files, such as
“everest_start.dll”, “everest_mondiag.dll”, “everest_rcs.dll” and so on.
Sometimes it is not easy to determine whether a file is malicious
`or not solely based on ﬁle content information itself. According
`to the experience and knowledge of our anti-malware experts, ﬁle
`relations among samples can be a novel and practical feature repre-
`sentation for malware detection. Some malware samples may have
`stronger connections with benign ﬁles than malicious ones. In such
`cases, those ﬁle samples might be infected ﬁles. Actually, these
`unexpected relations can be ﬁltered and removed, because the in-
`fected samples can be detected independently using the infected ﬁle
`detector which is developed by our anti-malware experts.
`1.3 Combining File Content and File Relation
To improve the performance of file sample classification for malware
detection, in this paper we utilize both file content and file relation
information. However, relation information and file content have
different properties. Relation information provides a graph structure in
the data and induces pairwise similarity between objects, while file
content provides inherent characteristic information about the file
samples. Although relation information and file content can each be used
independently to classify file samples, classification algorithms that
make use of both simultaneously should be able to achieve better
performance.

Figure 4: File Relations Between a Benign Application and its
Related Dynamic Link Library files.

The problem of combining content information and relation information
(i.e., link information) has been widely studied for web document
categorization in the data mining and information retrieval communities
[26, 9]. The approaches for combining content and link information
generally fall into two categories: (1) feature integration, which
treats the relation information as additional features and enlarges the
feature representation [3, 11, 16]; and (2) kernel integration, which
integrates the data at the similarity computation or kernel level
[10, 14]. However, both types of approaches have limitations: feature
integration may degrade the quality of information, as file relations
and file content typically have different properties, while kernel
integration fails to explore the correlation and the inherent
consistency between the content information and the relation information
[31].
`1.4 Contributions of Our Paper
In this paper, we propose a semi-parametric classification model for
combining file content and file relations. The semi-parametric model
consists of two components: a parametric component reflecting file
content information and a non-parametric component reflecting file
relation information. The model seamlessly integrates these two
components and formulates the classification problem using the graph
regularization framework. Our model can be viewed as an extension of
recently developed joint-embedding approaches, which aim to seek a
common low-dimensional embedding via joint factorization of both the
content and relation information [31, 5, 30]. However, different from
the joint-embedding approaches, our model does not explicitly infer the
embedding and is directly optimized for classification. We develop a
file verdict system (named "Valkyrie") using the proposed model to
integrate file content and file relations for malware detection. To the
best of our knowledge, this is the first work to use both file content
and file relations for malware detection. In short, our Valkyrie system
has the following major traits:
• Novel Usage of File Relation: Different from previous studies on
malware detection, we make use of not only file content but also
file relations.
• A Principled Model for Combining File Content and File Relations:
We propose a semi-parametric classification model to seamlessly
combine file content and file relations, and formulate the
classification problem using the graph regularization framework.
• A Practical System for Real Industry Application: Based on 37,930
clients, we obtain 30,950 malware samples, 225,830 benign files and
434,870 unknown files from Comodo Cloud Security Center. We build a
practical system for malware detection and provide a comprehensive
experimental study.
`
`All these traits make our Valkyrie system a practical solution
`for automatic malware detection. The case studies on large and
`real daily malware collection from Comodo Cloud Security Center
`demonstrate the effectiveness and efﬁciency of our Valkyrie sys-
`tem. As a result, our Valkyrie system has already been incorporated
`into the scanning tool of Comodo’s Anti-Malware software.
`1.5 Organization of The Paper
`The rest of this paper is organized as follows. Section 2 presents
`the overview of our Valkyrie system. Section 3 describes the fea-
`ture extraction and representation; Section 4 introduces the pro-
`posed semi-parametric model combining ﬁle content and ﬁle rela-
`tions together for malware detection; In Section 5, using the daily
`data collection obtained from Comodo Cloud Security Center, we
`systematically evaluate the effectiveness and efﬁciency of our Valkyrie
`system in comparison with other proposed classiﬁcation methods,
`as well as some of the popular Anti-Malware software such as
`Kaspersky and NOD32. Section 6 presents the details of system
`development and operation. Section 7 discusses the related work.
`Finally, Section 8 concludes the paper.
`
`2. SYSTEM ARCHITECTURE
`Figure 5 shows the system architecture of our Valkyrie system.
`We brieﬂy describe each component below.
`
• Training:
1. User File List and File Sample Collector: It collects the file
lists from the clients, which contain the potential relations
between file samples, together with the file samples.
2. File Content Feature Extractor: Application Programming
Interface (API) calls reflect the behaviors of program code
pieces well and, compared with dynamic feature representation
methods, can be extracted with high efficiency. Therefore, our
file content feature extractor extracts the API calls from the
collected malicious and benign Windows Portable Executable (PE)
files. (See Section 3.1 for details.)
3. File Relation Feature Extractor: Based on the collected file
lists from clients, a co-occurrence graph is constructed to
describe the file relations. Note that many unexpected relations
(like relations between infected samples and benign files) are
removed using infected file detectors. (See Section 3.2 for
details.)
4. Semi-Parametric Model Based Classifier: Our proposed
semi-parametric model integrates file content and relation
information and formulates the classification problem using the
graph regularization framework. (See Section 4 for details.)
`
`
`Figure 5: The System Architecture of Valkyrie.
`
`• Prediction: On the clients, our Comodo Anti-Malware soft-
`ware products authenticate valid software from a whitelist
`and block invalid software from a blacklist using the signature-
`based method. The gray list, containing unknown software
`programs which could be either normal or malicious, is then
`fed into our Valkyrie system. After ﬁle content and ﬁle rela-
`tion feature extractions, the semi-parametric model is applied
`to the gray list for prediction.
`
`3. FEATURE EXTRACTION
Our Valkyrie system operates directly on Windows Portable Executable
(PE) code. PE is designed as a common file format for all flavors of
the Windows operating system, and PE malware accounts for the majority
of the malware emerging in recent years [27]. In this section, we
introduce the file content and file relation feature extraction methods
we adopted.
`3.1 File Content
We extract the Application Programming Interface (API) calls from the
Import Tables [27] of the collected malicious and benign PE files,
convert them to a group of 32-bit global IDs (for example, the API
"MAPI32.MAPIReadMail" is encoded as 0x00000F12) as the content features,
and store these features in the signature database. A sample file
content signature database is shown in Figure 6, in which there are 6
fields: record ID, PE file name, file type ("-1" represents a benign
file while "1" represents a malicious file), called API names, called
API IDs, and the total number of called API functions.
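
The encoding step described above can be illustrated with a small Python sketch. It assumes the imported API names have already been read from each PE file's Import Table (the parsing itself is not shown), and the helper names `build_api_ids` and `encode`, the sample records, and the sequentially assigned IDs are all hypothetical; the actual global ID assignment used in the signature database is not specified here.

```python
def build_api_ids(samples):
    """Assign an integer ID to every distinct API name.

    samples maps a PE file name to the list of API names it imports.
    IDs are assigned in order of first appearance (an assumption made
    only for this sketch).
    """
    api_ids = {}
    for apis in samples.values():
        for name in apis:
            api_ids.setdefault(name, len(api_ids))
    return api_ids


def encode(apis, api_ids):
    """Sparse binary content vector: sorted IDs of the imported APIs."""
    return sorted({api_ids[name] for name in apis if name in api_ids})


def format_id(api_id):
    """Render an ID as a 32-bit hexadecimal value, e.g. 0x00000F12."""
    return f"0x{api_id:08X}"


# Hypothetical records in the spirit of Figure 6.
samples = {
    "a.exe": ["KERNEL32.CreateFileA", "MAPI32.MAPIReadMail"],
    "b.exe": ["KERNEL32.CreateFileA", "USER32.MessageBoxA"],
}
api_ids = build_api_ids(samples)
vectors = {name: encode(apis, api_ids) for name, apis in samples.items()}
```

Storing only the IDs of the APIs that are present keeps each record small, which matters when the feature space has tens of thousands of dimensions.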
Figure 6: Sample File Content Features in the Signature Database.

3.2 File Relations
Based on the collected file lists from clients, we construct a
co-occurrence graph to describe the relations among file samples.
Generally, two files are related if they are shared by many clients
(or equivalently, file lists). Note that many unexpected relations
(like relations between infected samples and benign files) are first
removed using infected file detectors.
The co-occurrence graph is defined as $G = \langle V, E \rangle$, where
$V$ is the set of file samples. Given two file samples $v_i$ and $v_j$,
let $S_i$ be the set of file lists containing $v_i$ and $S_j$ be the set
of file lists containing $v_j$. Then the similarity between $v_i$ and
$v_j$ is computed as

$$\mathrm{sim}(v_i, v_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}, \tag{1}$$

where $|S|$ denotes the size of a set $S$. If the similarity between a
pair of file samples is greater than 0, then there is an edge between
them, and $E$ is the set of edges between vertices.
An example graph is shown in Figure 7 illustrating the real relations
between some file samples, where the size of each edge indicates its
weight.

Figure 7: An example graph of real relations between some
file samples (purple color-malware samples, green color-benign
files, transparent color-unknown files).

4. A SEMI-PARAMETRIC MODEL FOR COMBINING FILE CONTENT AND FILE RELATIONS
In this section, we propose a semi-parametric model to combine file
content and file relations for classification using the graph
regularization framework. The semi-parametric model consists of two
components: a parametric component reflecting file content information
and a non-parametric component reflecting file relation information.
Let $f$ be a vector, each of whose elements is the label (i.e.,
malicious or benign) of a file example to be predicted. The vector $f$
is generated from two parts, a parametric one and a non-parametric one.
The parametric component follows a linear model, $X^\top w$, where each
column of the matrix $X$ is the content feature vector of a file example
and $w$ is the coefficient vector. The non-parametric part is a vector
$h$, each element of which corresponds to the value of a file example.
Combining the two parts, we have $f = X^\top w + h$.
Now consider the labeling information vector $y$. Let $y_i = 1$ if the
i-th file sample is malicious, $y_i = -1$ if the i-th file sample is
benign, and $y_i = 0$ if the i-th file sample is unlabeled. We can use
hinge loss for labeled file samples as in Support Vector Machine, or use
L2 loss for labeled data as in Least Square problems. For simplicity, we
follow [29] and use L2 loss on all data points, i.e., $\|y - f\|^2$. We
also consider the global consistency on the co-occurrence graph [29],
$f^\top L f$, where the symmetric matrix $L$ is the normalized Laplacian
of the graph. Thus the total loss is

$$\frac{1}{2}\|y - f\|^2 + \frac{\alpha}{2} f^\top L f, \tag{2}$$

where $\alpha$ is the weight for combining the two parts of information;
the factor $\frac{1}{2}$ is added just for convenience.
To limit the model complexity, we add the regularization terms for $w$
and $h$, which are

$$\frac{1}{2\beta} w^\top w + \frac{1}{2\gamma} h^\top h, \tag{3}$$

where $\beta$ and $\gamma$ are the regularization parameters.
Putting Eq. (2) and Eq. (3) together, we have the optimization problem:

$$\min_{f, w, h} \; \frac{1}{2}\|y - f\|^2 + \frac{\alpha}{2} f^\top L f + \frac{1}{2\beta} w^\top w + \frac{1}{2\gamma} h^\top h \quad \text{subject to} \quad f = X^\top w + h. \tag{4}$$

To solve Eq. (4), we introduce the Lagrange multiplier $\xi$:

$$\mathcal{L}(f, w, h; \xi) = \frac{1}{2}\|y - f\|^2 + \frac{\alpha}{2} f^\top L f + \frac{1}{2\beta} w^\top w + \frac{1}{2\gamma} h^\top h + \xi^\top (f - X^\top w - h).$$

As $\partial\mathcal{L}/\partial w = 0$, $\partial\mathcal{L}/\partial h = 0$,
$\partial\mathcal{L}/\partial \xi = 0$, and $\partial\mathcal{L}/\partial f = 0$,
we have

$$w = \beta X \xi, \tag{5}$$
$$h = \gamma \xi, \tag{6}$$
$$f = X^\top w + h, \tag{7}$$
$$y = f + \alpha L f + \xi. \tag{8}$$

Plugging Eqs. (5, 6) into Eq. (7), we have

$$f = (\beta X^\top X + \gamma I)\,\xi,$$

or

$$\xi = (\beta X^\top X + \gamma I)^{-1} f.$$

Plugging it into Eq. (8), we have

$$f = \left[ I + \alpha L + (\beta X^\top X + \gamma I)^{-1} \right]^{-1} y. \tag{9}$$

This model is an extension of [29] obtained by considering the
parametric part. Note that if there are no content features, then
$f = h$ and this model reduces to traditional semi-supervised learning.
Different from [31] and [30], this model does not infer the embedding.
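
To make the construction concrete, the following NumPy sketch builds the co-occurrence graph with the similarity of Eq. (1), forms a normalized Laplacian, and evaluates the closed form of Eq. (9) with dense matrices. It is a toy illustration under stated assumptions, not the production system: the function names, the toy file lists, and the two-feature content matrix are invented for the example, and the normalized Laplacian is taken to be $I - D^{-1/2} W D^{-1/2}$.

```python
import numpy as np


def jaccard_graph(file_lists, files):
    """Weighted adjacency matrix with entries |Si & Sj| / |Si | Sj| (Eq. 1)."""
    containing = {v: set() for v in files}
    for idx, file_list in enumerate(file_lists):
        for v in file_list:
            containing[v].add(idx)
    n = len(files)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            si, sj = containing[files[i]], containing[files[j]]
            union = si | sj
            if union:
                W[i, j] = W[j, i] = len(si & sj) / len(union)
    return W


def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2} (assumed form of the normalized Laplacian)."""
    d = W.sum(axis=1)
    inv_sqrt = np.zeros_like(d)
    inv_sqrt[d > 0] = 1.0 / np.sqrt(d[d > 0])
    return np.eye(len(d)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]


def solve_semi_parametric(X, L, y, alpha=0.1, beta=0.1, gamma=0.1):
    """Dense evaluation of Eq. (9); X is p x n with one column per file."""
    n = X.shape[1]
    identity = np.eye(n)
    k_inv = np.linalg.inv(beta * X.T @ X + gamma * identity)
    return np.linalg.solve(identity + alpha * L + k_inv, y)


# Toy data (hypothetical): two malware files, two benign files, one unknown.
files = ["m1.exe", "m2.exe", "b1.dll", "b2.dll", "u.exe"]
file_lists = [
    ["m1.exe", "m2.exe", "u.exe"],
    ["m1.exe", "u.exe"],
    ["b1.dll", "b2.dll"],
    ["b1.dll", "b2.dll"],
]
X = np.array([[1.0, 1.0, 0.0, 0.0, 1.0],   # feature 0: an API seen in malware
              [0.0, 0.0, 1.0, 1.0, 0.0]])  # feature 1: an API seen in benign files
y = np.array([1.0, 1.0, -1.0, -1.0, 0.0])  # 0 marks the unlabeled file

W = jaccard_graph(file_lists, files)
f = solve_semi_parametric(X, normalized_laplacian(W), y)
```

In this toy setting the unlabeled file shares both its file lists and its content feature with the malware samples, so its entry in `f` comes out positive; the sign of each entry is read as the verdict. Dense inversion is only practical for small n.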
Computation Issues: We need to solve

$$\left[ I + \alpha L + (\beta X^\top X + \gamma I)^{-1} \right] f = y. \tag{10}$$

Let the size of $X$ be $p \times n$, where $p$ is the number of features
and $n$ is the number of instances, and let the average number of
nonzero elements in each row of $L$ be $\kappa$. As long as $p \ll n$,
we can apply the Woodbury identity, and Eq. (10) becomes

$$\left[ (1 + \gamma^{-1}) I + \alpha L - \gamma^{-1} X^\top (\gamma\beta^{-1} I + X X^\top)^{-1} X \right] f = y. \tag{11}$$

To solve Eq. (11), we can use the conjugate gradient method. Computing
$X X^\top$ is $O(np^2)$, the inverse of $(\gamma\beta^{-1} I + X X^\top)$
is $O(p^3)$, and we precompute $(\gamma\beta^{-1} I + X X^\top)^{-1} X$
with $O(np^2 + p^3)$. In each iteration of conjugate gradient, we
compute

$$\left[ (1 + \gamma^{-1}) I + \alpha L - \gamma^{-1} X^\top (\gamma\beta^{-1} I + X X^\top)^{-1} X \right] v$$

for some $v$. The computation of each iteration is $O(n(p + \kappa))$.
The convergence rate depends on the condition number of the
left-hand-side matrix of Eq. (11).
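
A minimal sketch of this procedure is given below, assuming NumPy and a hand-written conjugate gradient loop. The function names are hypothetical, and `L` is kept as a dense array for brevity; a sparse matrix would be needed to obtain the per-iteration cost stated above.

```python
import numpy as np


def make_matvec(X, L, alpha, beta, gamma):
    """Return v -> A v, where A is the left-hand-side matrix of Eq. (11)."""
    p = X.shape[0]
    # Precompute B = (gamma/beta * I + X X^T)^{-1} X, a p x n matrix.
    B = np.linalg.solve((gamma / beta) * np.eye(p) + X @ X.T, X)

    def matvec(v):
        # Two thin products (B @ v, then X.T @ ...) avoid any n x n inverse.
        return (1.0 + 1.0 / gamma) * v + alpha * (L @ v) - (X.T @ (B @ v)) / gamma

    return matvec


def conjugate_gradient(matvec, y, tol=1e-10, max_iter=1000):
    """Standard conjugate gradient for a symmetric positive definite system."""
    f = np.zeros_like(y, dtype=float)
    r = y - matvec(f)          # residual
    d = r.copy()               # search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ad = matvec(d)
        step = rs / (d @ Ad)
        f += step * d
        r -= step * Ad
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return f
```

Because only products with `X`, `B`, and `L` are formed, the n x n inverse in Eq. (10) is never built; on small problems the result can be compared with a direct solve of Eq. (10) as a sanity check.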
`
`5. EXPERIMENTAL RESULTS
`AND ANALYSIS
In this section, we conduct three sets of experimental studies using our
data collection obtained from Comodo Cloud Security Center to fully
evaluate the performance of our Valkyrie system: (1) in the first set of
experiments, we evaluate the effectiveness of the file content based
classifier and the file relation based classifier for malware detection;
(2) in the second set of experiments, we evaluate our proposed
semi-parametric model based classifier by comparing it with alternative
methods for combining file content and file relations; (3) in the last
set of experiments, we compare our Valkyrie system with some popular
anti-malware software products such as Kaspersky Anti-Virus, McAfee
VirusScan, and Bitdefender.
`5.1 Experimental Setup
`We measure the malware detection performance of different clas-
`siﬁers using the following evaluation measures:
• True Positive (TP): the number of samples correctly classified as
malicious files.
• True Negative (TN): the number of samples correctly classified as
benign files.
• False Positive (FP): the number of samples mistakenly classified as
malicious files.
• False Negative (FN): the number of samples mistakenly classified as
benign files.
• Accuracy (ACY): $\frac{TP + TN}{TP + TN + FP + FN}$.
• Recall (RC): $\frac{TP + TN + FP + FN}{\text{TheNumberOfTotalFileCollection}}$.
The dataset we obtained from Comodo Cloud Security Center includes
37,930 user file lists that describe file relations between 30,950
malware samples, 225,830 benign files and 434,870 unknown files (as
analyzed by the anti-malware experts of Comodo Security Lab, 39,138 of
them are malware, while 395,732 of them are benign files). We also have
the file relation information for all the file samples. Using the
feature extraction methods described in Section 3 on this data
collection: 1) from the API calls extracted from the known file samples,
we obtain 210,850 training file content vectors with 86,757 dimensions
(since the Import Tables of some file samples are invalid, API calls can
be successfully extracted from 23,610 malicious files and 187,240 benign
files); 2) from the collected user file lists, after excluding the
unexpected relations (like relations between infected samples and benign
files), we construct a graph including 248,986 vertices (29,006
represent malicious files, while 219,980 represent benign samples) with
356,134 edges.
`All the experimental studies are conducted under the environ-
`ment of Windows 7 operating system plus Intel(R) Core(TM) i3
`CPU and 4 GB of RAM.
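
The two ratio measures can be computed directly from the four counts. The short Python sketch below follows the definitions given above, under which RC is the share of the file collection that receives a verdict. The function names are illustrative, and the training-set denominator of 256,780 (30,950 malware plus 225,830 benign files) is an assumption inferred from the dataset description rather than stated explicitly.

```python
def accuracy(tp, fp, tn, fn):
    """ACY: correctly classified samples over all classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fp, tn, fn, total_files):
    """RC as defined in this section: classified samples over the whole collection."""
    return (tp + tn + fp + fn) / total_files


# File content based classifier, training row of Table 1.
# Assumed denominator: 30,950 malware + 225,830 benign = 256,780 known files.
train_acy = accuracy(23585, 32, 187208, 25)        # rounds to 0.9997
train_rc = recall(23585, 32, 187208, 25, 256780)   # rounds to 0.8211

# File content based classifier, testing row (434,870 unknown files).
test_acy = accuracy(23358, 2230, 236196, 6423)
test_rc = recall(23358, 2230, 236196, 6423, 434870)
```

Rounded to four decimals, these values agree with the corresponding ACY and RC entries reported in Table 1.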
`5.2 Comparisons of File Content and File Re-
`lation Based Classiﬁers
`In this set of experiments, we evaluate the effectiveness of mal-
`ware detection results based on different feature representations:
`ﬁle content and ﬁle relations. The large collection of ﬁle sample
`data along with the high dimensionality and sparseness requires
`the classiﬁcation methods for malware detection to be scalable and
`robust. With the advantage of handling large feature space without
overfitting, Support Vector Machine (SVM) has shown state-of-the-art
`results in classiﬁcation problems [22, 12, 28]. Therefore, in this
`section, we use SVM [7] as the base classiﬁer. For ﬁle content
`based classiﬁcation, SVM is applied on the features of API calls.
`For ﬁle relation based classiﬁcation, we treat ﬁle relations as the
`features for each ﬁle sample, i.e., the i-th feature is the similarities
`with the i-th ﬁle sample. Linear SVM [7] is used in both cases
`and the regularization parameter of SVM is selected using cross-
`validation.
`From Table 1 and Figure 8, we observe that the accuracy of ﬁle
`relation based classiﬁer is similar to ﬁle content based classiﬁer,
`while the recall of the ﬁle relation based classiﬁer greatly outper-
`forms the ﬁle content based classiﬁer for unknown ﬁle verdicts.
`
Training     TP       FP      TN        FN      ACY      RC
F_Content    23,585   32      187,208   25      0.9997   0.8211
F_Relation   27,018   880     219,100   1,988   0.9885   0.9696

Testing      TP       FP      TN        FN      ACY      RC
F_Content    23,358   2,230   236,196   6,423   0.9677   0.6168
F_Relation   25,969   6,880   312,100   9,988   0.9525   0.8162

Table 1: Comparisons of File Content and File Relation Based
Classifiers. Remark: "F_Content"-File Content based Classifier,
"F_Relation"-File Relation based Classifier.
`
Figure 8: Comparisons of File Content and File Relation Based
Classifiers. Remark: "UM"-the number of malware from unknown file
collection still unrecognized by classifier, "UB"-the number of benign
files from unknown file collection still unrecognized by classifier.

5.3 Comparisons of Different Classifiers Combining File Content and File Relation
In this section, we compare our semi-parametric model with the following
methods of combining file relations and file content information:
(1) SVM on feature integration: We combine the content features and the
relation features and then apply SVM on the enlarged feature space. We
use different weights for these two sets of features, and the weights
are selected using cross-validation. (2) SVM on kernel integration: We
average the linear kernel on the content and the relation similarity
(note that the co-occurrence graph can be viewed as a kernel) and apply
SVM on the composite kernel. (3) Joint-factorization: We use the
supervised joint matrix factorization on both the content and relation
information and then perform SVM on the resulting low dimensional
embedding. For more details on this method, please refer to [31]. For
our semi-parametric model, the parameters α, β, and γ are all set to 0.1.
The results shown in Table 2 and Figure 9 demonstrate that:
(1) combining file relation with file content can improve the
classification effectiveness for malware detection; (2) when combining
file relation with file content, our proposed semi-parametric model
based classifier outperforms the alternative methods.
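
For illustration, the first two baselines can be sketched with NumPy as follows. The sketch only builds the inputs that would be handed to an SVM; the SVM training itself is omitted. The function names and the toy matrices are hypothetical, the content matrix is stored here with one row per file (the transpose of the $X$ used in Section 4), and equal weights are assumed where the experiments select them by cross-validation.

```python
import numpy as np


def feature_integration(content, relation, w_content=1.0, w_relation=1.0):
    """Enlarged feature space: weighted content features next to relation features.

    content:  n x p matrix, one row of API features per file.
    relation: n x n similarity matrix; row i holds the similarities of file i
              to every other file, used as additional features.
    """
    return np.hstack([w_content * content, w_relation * relation])


def kernel_integration(content, relation):
    """Composite kernel: average of the linear content kernel and the relation similarity."""
    return 0.5 * (content @ content.T) + 0.5 * relation


# Toy inputs (hypothetical): three files, two API features.
content = np.array([[1.0, 0.0],
                    [1.0, 1.0],
                    [0.0, 1.0]])
relation = np.array([[0.0, 0.5, 0.0],
                     [0.5, 0.0, 1.0],
                     [0.0, 1.0, 0.0]])

enlarged = feature_integration(content, relation)   # shape (3, 5)
composite = kernel_integration(content, relation)   # shape (3, 3), symmetric
```

The enlarged matrix would be passed to a linear SVM, while the composite matrix would be passed to an SVM that accepts a precomputed kernel. Neither baseline models the consistency between the two views, which is the limitation noted in Section 1.3.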
`
`Figure 9: Comparisons of Different Classiﬁers Combining File
`Content and File Relation. Remark: "F_Content"-File Con-
`tent based Classiﬁer, "F_Relation"-File Relation based Classi-
`ﬁer, "CR_C1"-SVM on Feature Integration, "CR_C2"-SVM
`on Kernel Integration, "CR_C3"-Joint-factorization Classiﬁer,
`"CR_SPM"-our proposed Semi-Parametric Model.
`
Testing      TP       FP      TN        FN      ACY      RC
F_Content    23,358   2,230   236,196   6,423   0.9677   0.6168
F_Relation   25,969   6,880   312,100   9,988   0.9525   0.8162
CR_C1        29,002   7,454   350,100   7,471   0.9621   0.9061
CR_C2        28,123   8,358   349,196   8,350   0.9576   0.9061
CR_C3        30,789   7,572   349,982   5,486   0.9664   0.9061
CR_SPM       34,675   563     356,991   1,798   0.9940   0.9061

Table 2: Comparisons of Different Classifiers Combining File
Content and File Relation. Remark: "F_Content"-File Content based
Classifier, "F_Relation"-File Relation based Classifier,
"CR_C1"-SVM on Feature Integration, "CR_C2"-SVM on Kernel
Integration, "CR_C3"-Joint-factorization Classifier,
"CR_SPM"-our proposed Semi-Parametric Model.

5.4 Comparisons with Different AV Vendors
In this section, we apply the Valkyrie system in real applications to
evaluate its malware detection effectiveness and efficiency on the
daily data collection described in Section 5.1.
5.4.1 Comparisons of Detection Effectiveness between Different AV Vendors
Based on 434,870 unknown files (as analyzed by the anti-malware experts
of Comodo Security Lab, 39,138 of them are malware, while 395,732 of
them are benign files), we first compare the malware detection
effectiveness of the Valkyrie system with some popular AV products:
Kaspersky (Kasp), NOD32, McAfee, Bitdefender (BD) and Avira. For
comparison purposes, we use the latest signature database of each
anti-virus scanner as of the same day (Feb 14th, 2011). Table 3 and
Figure 10 show that the malware detection effectiveness of our Valkyrie
system outperforms the other popular AV products on our large data
collection.

AV         TP       FP      TN        FN      ACY      RC
Kasp       27,954   711     0         0       0.9752   0.0659
Nod32      26,589   923     0         0       0.9665   0.0633
Mcafee     23,951   1,011   0         0       0.9595   0.0574
BD         28,763   780     0         0       0.9736   0.0679
Avira      29,009   1,887   0         0       0.9389   0.0710
Valkyrie   34,675   563     356,991   1,798   0.9940   0.9061

Table 3: The malware detection results of different AV software
products on the collection with 434,870 unknown files.
`
Figure 10: Comparison of the malware detection results of
different AV software products on the collection of 434,870
unknown files.
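The ACY and RC columns reported in Tables 2 and 3 can be cross-checked from the raw counts. The sketch below assumes a reading that is inferred from the numbers rather than taken from a definition in this section: ACY is treated as accuracy over the files a method actually labeled, (TP + TN) / (TP + FP + TN + FN), and RC as the fraction of the 434,870 unknown files that received a label. The class and method names are hypothetical. Under this reading, the three rows used here agree with the reported values to four decimal places.

```python
from dataclasses import dataclass

# Size of the unknown file collection stated in Section 5.4.1.
TOTAL_UNKNOWN = 434_870


@dataclass(frozen=True)
class Counts:
    """Confusion counts for one classifier or AV product."""
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def classified(self) -> int:
        # Number of files that received any verdict.
        return self.tp + self.fp + self.tn + self.fn

    def accuracy(self) -> float:
        # Assumed meaning of ACY: correct verdicts over all verdicts.
        return (self.tp + self.tn) / self.classified

    def coverage(self, total: int = TOTAL_UNKNOWN) -> float:
        # Assumed meaning of RC: share of the collection that got a verdict.
        return self.classified / total


# Counts copied from Tables 2 and 3.
rows = {
    "F_Content": Counts(23_358, 2_230, 236_196, 6_423),
    "CR_SPM / Valkyrie": Counts(34_675, 563, 356_991, 1_798),
    "Kasp": Counts(27_954, 711, 0, 0),
}

for name, c in rows.items():
    print(f"{name:18s} ACY={c.accuracy():.4f} RC={c.coverage():.4f}")
```

Read this way, the signature-based products reach high accuracy on the small share of files they label, while the coverage figure shows how much of the unknown collection each method leaves undecided.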
`

5.4.2 Comparisons of Detection Efficiency between
Different AV Vendors
In this set
