Analysis of Communities of Interest in Data Networks

William Aiello, Charles Kalmanek, Patrick McDaniel,
Subhabrata Sen, Oliver Spatscheck, and Jacobus Van der Merwe

Department of Computer Science, University of British Columbia,
Vancouver, B.C. V6T 1Z4, Canada
aiello@cs.ubc.ca
AT&T Labs – Research,
Florham Park, NJ 07932, U.S.A.
{crk,sen,spatsch,kobus}@research.att.com
Department of Computer Science and Engineering, Penn State University,
University Park, PA 16802, U.S.A.
mcdaniel@cse.psu.edu

Abstract. Communities of interest (COI) have been applied in a variety of environments
ranging from characterizing the online buying behavior of individuals to detecting fraud
in telephone networks. The common thread among these applications is that the historical
COI of an individual can be used to predict future behavior as well as the behavior of
other members of the COI. It would clearly be beneficial if COIs could be used in the same
manner to characterize and predict the behavior of hosts within a data network. In this
paper, we introduce a methodology for evaluating various aspects of COIs of hosts within
an IP network. In the context of this study, we broadly define a COI as a collection of
interacting hosts. We apply our methodology using data collected from a large enterprise
network over an eleven-week period. First, we study the distributions and stability of the
size of COIs. Second, we evaluate multiple heuristics to determine a stable core set of
COIs and determine the stability of these sets over time. Third, we evaluate how much of
the communication is not captured by these core COI sets.
`
`1 Introduction
`
Data networks are growing in size and complexity. A myriad of new services, mobility,
and wireless communication make managing, securing, or even understanding these networks
significantly more difficult. Network management platforms and monitoring infrastructures
often provide little relief in untangling the Gordian knot that many environments
represent.

In this paper, we aim to understand how hosts communicate in data networks by studying
host level communities of interest (COIs). A community of interest is a collection of
entities that share a common goal or environment. In the context of this study, we broadly
define a community of interest as a collection of interacting hosts. Using data collected
from a large enterprise network, we construct community graphs representing the existence
and density of host communications. Our hypothesis is that the behavior of a collection of
hosts has a great deal of regularity and structure. Once such structure is illuminated, it
can be used to form parsimonious models that can become the basis of management policy.
This study seeks to understand the structure and nature of communities of interest,
ultimately to determine if communities of interest are a good approximation of these
models. If true, communities of interest will be useful for many purposes, including:
`
– network management: because of similar goals and behavior, communities will serve as
  natural aggregates for management
– resource allocation: allocating resources (e.g., printers, disk arrays, etc.) by
  community will increase availability and ensure inter-community fairness
– traffic engineering: profiles of communal behavior will aid capacity planning and
  inform prioritization of network resource use
– security: because communities behave in a consistent manner, departure from the norm
  may indicate malicious activity
`
Interactions between social communities and the Web have been widely studied [1, 2].
These works have shown that the Web exhibits the small world phenomenon [3, 4], i.e., any
two points in the Web are only separated by a few links. These results indicate that
digital domains are often rationally structured and may be a reflection of the physical
world. We hypothesize that host communication reflects similar structure and rationality,
and hence can be used to inform host management. In their work in network management,
Tan et al. assumed that hosts with similar connection habits play similar roles within the
network [5]. They focused on behavior within local networks by estimating host roles, and
describe algorithms that segment a network into host role groups. The authors suggest that
such groups are natural targets of aggregated management. However, these algorithms are
targeted to partitioning hosts based on some a priori characteristic. This differs from
the present work in that we seek to identify those characteristics that are relevant.
Communities of interest can also expose aberrant behavior. Cortes et al. illustrated this
ability in a study of fraud in the telecommunications industry [6]. They found that people
who re-subscribed under a different identity after defaulting on an account could be
identified by looking at the similarity of the new account’s community.

This paper extends these and many other works in social and digital communities of
interest by considering their application to data networks. We begin this investigation in
the following section by outlining our methodology. We develop the meaning of communities
of interest in data networks and then explain how our data was collected and
pre-processed. While the data set that we analyze is limited to traffic from an enterprise
network, we believe that the methodology is more broadly applicable to data networks in
general. In Section 3 we present the results of our analysis, and we conclude the paper in
Section 4 with a summary and indication of future work.
`
`2 Methodology
`
`In this section we consider the methodology we applied to the COI study. First we
`develop an understanding of what COI means in the context of a data network. Then we
`
`IA1023
`
`Page 2 of 14
`
`

`

`explain how we collected the data from an enterprise network and what pre-processing
`we had to perform on the data before starting our analysis.
`
`2.1 Communities of Interest
`
We have informally defined COI for a data network as a collection of interacting hosts.
In the broadest sense this would imply that the COI of a particular host consists of all
hosts that it interacts with. We call the host for which we are trying to find a COI the
target-host. We begin our analysis by exploring this broad COI definition, by looking at
the total number of hosts that target-hosts from our data set interact with. Thus in this
first step we only look at the COI set size and its stability over time.

Considering all other hosts that a target-host ever communicates with to be part of its
COI might be too inclusive. For example, this would include one-time-only exchanges which
should arguably not be considered part of a host’s COI. Intuitively we want to consider as
part of the COI the set of hosts that a target-host interacts with on a regular basis. We
call this narrower COI definition the core COI.

In this work it is not our goal to come up with a single core COI definition. Instead, it
is our expectation that depending on the intended application of COI, different
definitions might be relevant. For example, in a resource allocation application the
relevant COI might be centered around specific protocols or applications to ensure that
the COI for those applications receives adequate resources. On the other hand an intrusion
detection application might be concerned about deviations from some “normal” COI. However,
in order to evaluate our methodology, we do suggest and apply to our data two example
definitions of a core COI:
`
– Popularity: We determine the COI for a group of target-hosts by considering a host to
  be part of the COI if the percentage of target-hosts interacting with it exceeds a
  threshold T over some time period of interest.
– Frequency: A host is considered to be part of the COI of a target-host if the
  target-host interacts with it at least once in every small time period (the bin-size)
  within some larger time period of interest.
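
The two definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is taken to be a list of hypothetical (target-host, other-host, timestamp) tuples already restricted to the period of interest, and the function and parameter names are ours, not part of the study's tooling.

```python
from collections import defaultdict

def popularity_coi(interactions, target_hosts, threshold_pct):
    """Hosts contacted by more than threshold_pct percent of the target-hosts.

    interactions: iterable of (target_host, other_host, timestamp) tuples,
    assumed to be already restricted to the period of interest.
    """
    targets = set(target_hosts)
    contacted_by = defaultdict(set)
    for target, other, _ts in interactions:
        if target in targets:
            contacted_by[other].add(target)
    return {
        other
        for other, who in contacted_by.items()
        if 100.0 * len(who) / len(targets) > threshold_pct
    }

def frequency_coi(interactions, target, period_start, period_end, bin_size):
    """Hosts that `target` interacts with at least once in every bin of the period."""
    n_bins = int((period_end - period_start) // bin_size)
    covered_end = period_start + n_bins * bin_size
    bins_seen = defaultdict(set)
    for host, other, ts in interactions:
        if host == target and period_start <= ts < covered_end:
            bins_seen[other].add(int((ts - period_start) // bin_size))
    return {other for other, bins in bins_seen.items() if len(bins) == n_bins}
```

With a threshold of zero, or a bin-size equal to the whole period, both functions return every host seen in the period, which matches the limiting cases of the two definitions.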
`
Intuitively these two definitions attempt to capture two different constituents of a core
COI. The most obvious is the Frequency COI, which captures any interaction that happens
frequently, for example access to a Web site containing news that gets updated frequently.
The Popularity COI attempts to capture interactions that might happen either frequently or
infrequently but are performed by a large part of the user population. An example would be
access to a time-reporting server or a Web site providing travel related services.
From the COI definitions it is clear that the Popularity COI becomes more inclusive in
terms of allowing hosts into the COI as the threshold (T) decreases. Similarly the
Frequency COI becomes more inclusive as the bin-size increases. For the Popularity case
where the threshold is zero, all hosts active in the period of interest are considered to
be part of the COI. Similarly, for the Frequency case where the bin-size is equal to the
period of interest, all hosts in that period are included in the COI. When the period of
interest is the same for the two core COI definitions, these two special cases (i.e., a
threshold of zero for the Popularity COI and a bin-size equal to the period of interest
for the Frequency COI) therefore produce the same COI set.
Notice that the Popularity COI defines a core COI set for a “group” of hosts, whereas the
Frequency COI defines a per-host COI. We have made our core COI definitions in the most
general way by applying them to “hosts”, i.e., not considering whether the host was the
initiator (or client) or responder (or server) in the interaction. While these general
definitions hold, in practice it might be useful to take directionality into account. For
example, the major servers in a network can be identified by applying the Popularity
definition to the percentage of clients initiating connections to servers. Similarly, the
Frequency definition can be limited to clients connecting to servers at least once in
every bin-size interval to establish a per-client COI.

In the second step of our analysis we drill deeper into the per-host interactions of hosts
in our data set to determine the different core COI sets. Specifically, we determine the
Popularity COI and the Frequency COI from a client perspective and consider their
stability over time.

Ultimately we hope to be able to predict future behavior of hosts based on their COIs. We
perform an initial evaluation of how well core COIs capture the future behavior of hosts.
Specifically, we combine all the per-host Client-Frequency COIs with the shared Popularity
COI to create an Overall COI. We construct this COI using data from a part of our
measurement period and then evaluate how well it captures host behavior for the remainder
of our data by determining how many host interactions are not captured by the Overall COI.
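
As a rough sketch of this evaluation step, the Overall COI can be modeled as a per-client set (the client's own Frequency COI plus the shared Popularity COI), and the later interactions can be checked against it. The function names and the (client, server) input format are assumptions made for illustration; treating a previously unseen client as having only the Popularity COI is also our choice.

```python
def overall_coi(popularity_set, frequency_sets):
    """Per-client Overall COI: the client's Frequency COI plus the shared Popularity COI.

    frequency_sets: dict mapping each client to its set of Frequency-COI servers.
    """
    return {client: set(servers) | set(popularity_set)
            for client, servers in frequency_sets.items()}

def uncaptured_fraction(overall, later_interactions, popularity_set=frozenset()):
    """Fraction of later (client, server) interactions that fall outside the Overall COI.

    ASSUMPTION: a client absent from the training period is checked against the
    Popularity COI only.
    """
    later = list(later_interactions)
    if not later:
        return 0.0
    missed = sum(1 for client, server in later
                 if server not in overall.get(client, popularity_set))
    return missed / len(later)
```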
`
`2.2 Data Collection and Pre-processing
`
To perform the analysis presented in this paper we collected eleven weeks’ worth of flow
records from a single site in a large enterprise environment consisting of more than 400
distributed sites connected by a private IP backbone and serving a total user population
in excess of 50,000 users. The flow records were collected from a number of LAN switches
using the Gigascope network monitor [7]. The LAN switches and Gigascope were configured to
monitor all traffic for more than 300 hosts, which included desktop machines, notebooks
and lab servers. This set of monitored hosts, for which we captured traffic in both
directions, is referred to as the local hosts and forms the focal point of our analysis.
In addition to some communication amongst themselves, the local hosts mostly communicated
with other hosts in the enterprise network (referred to as internal hosts) as well as with
hosts outside the enterprise environment (i.e., external hosts). We exclude communication
with external hosts from our analysis as our initial focus is on intra-enterprise traffic.
During the eleven-week period we collected flow records corresponding to more than 4.5
TByte of network traffic. In our traces we only found TCP, UDP and ICMP traffic, except
for a small amount of RSVP traffic between two test machines, which we ignored. For this
initial analysis we also removed weekend data from our data set, thus ensuring a more
consistent per-day traffic mix. Similarly, we also excluded from the analysis any hosts
that were not active at least once a week during the measurement period.
`
`

`

Our measurement infrastructure generated unidirectional flow-records for monitored
traffic in 5 minute intervals or bins. A flow is defined using the normal 5-tuple of IP
protocol type, source/destination addresses and source/destination port numbers. We record
the number of bytes and number of packets for each flow. In addition, each flow record
contains the start time of the 5 minute bin and timestamps for the first packet and last
packet of the flow within the bin interval. The collected “raw” flow-records need to be
processed in a number of ways before being used for our analysis:
Dealing with DHCP: First, because of the use of Dynamic Host Configuration Protocol
(DHCP), not all IP addresses seen in our raw data are unique host identifiers. We use IP
address to MAC address mappings from DHCP logs to ensure that all the flow records of each
unique host are labeled with a unique identifier.
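
One way such a relabeling could work is sketched below. The lease tuple format (IP, lease start, lease end, MAC) is a hypothetical simplification of a DHCP log, not the format used in the study, and falling back to the IP address when no lease matches (e.g., for statically addressed machines) is our assumption.

```python
import bisect

def build_lease_index(leases):
    """leases: iterable of (ip, start_time, end_time, mac) -> {ip: sorted [(start, end, mac)]}."""
    index = {}
    for ip, start, end, mac in leases:
        index.setdefault(ip, []).append((start, end, mac))
    for entries in index.values():
        entries.sort()
    return index

def host_id(index, ip, ts):
    """Return the MAC holding `ip` at time `ts`, or the IP itself when no lease matches."""
    entries = index.get(ip, [])
    # Find the last lease whose start time is <= ts.
    pos = bisect.bisect_right(entries, (ts, float("inf"), "")) - 1
    if pos >= 0:
        start, end, mac = entries[pos]
        if start <= ts < end:
            return mac
    return ip  # ASSUMPTION: no lease on record, keep the address as the identifier
```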
Flow-record processing: The second pre-processing step involves combining flows in
different 5 minute intervals that belong together from an application point of view. For
example, consider a File Transfer Protocol (FTP) application which transfers a very large
file between two hosts. If the transfer spans several 5 minute intervals then the flow
records in each interval corresponding to this transfer should clearly be combined to
represent the application level interaction. However, even for this simple well-known
application, correctly representing the application semantics would in fact involve
associating the FTP-control connection with the FTP-data connection, the latter of which
is typically initiated from the FTP-server back to the FTP-client.
Applying such application specific knowledge to our flow-records is not feasible in
general because of the sheer number of applications involved and the often undocumented
nature of their interactions. We therefore make the following simplifying definition in
order to turn our flow records into a data set that captures some application specific
semantics. We define a server as any host that listens on a socket for the purpose of
other hosts talking to it. Further, we define a client as any host that initiates a
connection to such a server port. Clearly this definition does not perfectly capture
application level semantics. For example, applying this definition to our FTP interaction,
only the control connection would be correctly identified in terms of application level
semantics. This client/server definition does however provide us with a very general
mechanism that can correctly classify all transport level semantics while capturing some
of the application level semantics.
To summarize then, during the second pre-processing step we combine or splice
flow-records in two ways: First, flow-records for the same interaction that span multiple
5 minute intervals should be combined. Second, we combine two uni-directional flow-records
into a single record representing client-server interaction.
To splice flow-records that span multiple 5-minute intervals, we use the 5-tuple of
protocol and source/destination addresses and ports. We deal with the potential of long
time intervals between matching flows by defining an aggregation time such that if the
time gap between two flow records using the same 5-tuple exceeds the aggregation time, the
new flow-record is considered the start of a new interaction. If the aggregation time is
too short, later flow-records between these hosts will be incorrectly classified as a new
interaction. Making the aggregation time too long can introduce erroneous classification
for short lived interactions. We experimented with different values of aggregation time
and found a value of 120 minutes provided a good compromise between incorrectly splitting
flows that fit together and incorrectly combining separate flows.
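
The splicing rule can be illustrated with the following sketch. The record layout (a dict with a `key` 5-tuple, `first` and `last` timestamps in seconds, and packet and byte counts) is a hypothetical stand-in for the actual flow-record format; only the 120-minute aggregation time comes from the text.

```python
AGGREGATION_TIME = 120 * 60  # seconds; the 120-minute value reported in the text

def splice(flow_records, aggregation_time=AGGREGATION_TIME):
    """Merge per-bin records that share a 5-tuple into interactions.

    flow_records: iterable of dicts with keys 'key' (the 5-tuple),
    'first', 'last' (timestamps in seconds), 'packets' and 'bytes'.
    A gap longer than aggregation_time starts a new interaction.
    """
    open_by_key = {}
    interactions = []
    for rec in sorted(flow_records, key=lambda r: r["first"]):
        cur = open_by_key.get(rec["key"])
        if cur is not None and rec["first"] - cur["last"] <= aggregation_time:
            # Same interaction: extend it and accumulate the counters.
            cur["last"] = max(cur["last"], rec["last"])
            cur["packets"] += rec["packets"]
            cur["bytes"] += rec["bytes"]
        else:
            # Gap too long (or first record for this 5-tuple): new interaction.
            cur = dict(rec)  # copy so the input records are left untouched
            open_by_key[rec["key"]] = cur
            interactions.append(cur)
    return interactions
```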
The 5-tuple is again used to combine two unidirectional flows into a single interaction.
For TCP and UDP, two flow-records are combined into a single record if the flows are
between the same pair of hosts and use the same port numbers in a swapped fashion (i.e.,
the source port in one direction is the same as the destination port in the reverse
direction). For ICMP traffic, flow-records are combined if they are between the same pair
of hosts. The result of splicing two unidirectional flows together is an edge-record, and
we present the data as a directed graph in which each edge represents a communication
between a client and a server and each node represents a unique host. The direction of the
edge represents client/server designation and the labels on the edge indicate the number
of packets and bytes flowing in each direction between the two nodes.
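
A minimal sketch of this pairing step follows. The field names are hypothetical, and the rule used here to decide which endpoint is the client (the side whose first packet is seen earlier) is an assumption made for the example; the text above does not spell out how the designation is derived from the records.

```python
def to_edges(interactions):
    """Pair opposite-direction interactions into directed client->server edge records.

    interactions: dicts with 'proto', 'src', 'dst', 'sport', 'dport', 'first',
    'packets' and 'bytes'.
    ASSUMPTION: the side whose first packet is seen earlier is treated as the client.
    """
    def pair_key(r):
        if r["proto"] == "icmp":  # ICMP: match on the host pair only
            return ("icmp", frozenset((r["src"], r["dst"])))
        # TCP/UDP: same pair of hosts with the port numbers swapped
        ends = frozenset(((r["src"], r["sport"]), (r["dst"], r["dport"])))
        return (r["proto"], ends)

    groups = {}
    for r in interactions:
        groups.setdefault(pair_key(r), []).append(r)

    edges = []
    for recs in groups.values():
        recs.sort(key=lambda r: r["first"])
        client, server = recs[0]["src"], recs[0]["dst"]
        fwd = [r for r in recs if r["src"] == client]
        rev = [r for r in recs if r["src"] == server]
        edges.append({
            "proto": recs[0]["proto"], "client": client, "server": server,
            "pkts_c2s": sum(r["packets"] for r in fwd),
            "bytes_c2s": sum(r["bytes"] for r in fwd),
            "pkts_s2c": sum(r["packets"] for r in rev),
            "bytes_s2c": sum(r["bytes"] for r in rev),
        })
    return edges
```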
We evaluated the experimental error introduced by our flow-record processing as follows.
We consider a [...]-week subset of our total 11-week data set for this evaluation. We note
that flows labeled with a client port number below 1024 and a server port number above
1024 are highly likely to be incorrect for all but a few services (as this is not
consistent with the normal use of reserved ports), and the reverse (server port below
1024, and client port above 1024) are likely to be correct. We bound experimental error by
calculating the ratio of incorrectly to correctly labeled flows based on this heuristic
(after removing known services that violate this property, e.g., ftp-data, NFS traffic
through sunrpc). This approximation yields a 2.187% role assignment error for all traffic,
while the numbers for TCP and UDP are 2.193% and 2.181%, respectively. Each instance of
mis-interpreted directionality introduces an additional flow into the data set. Hence,
such errors do not change the structure of the community, but slightly amplify a host’s
role as a client or server.
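
The reserved-port heuristic amounts to a simple ratio, sketched below. The input format (a list of (client port, server port) pairs) and the `exempt_ports` parameter, which stands in for the known services removed before the calculation, are illustrative assumptions.

```python
def role_error_ratio(edges, exempt_ports=frozenset()):
    """Ratio of likely-incorrect to likely-correct role labels under the reserved-port heuristic.

    edges: iterable of (client_port, server_port) pairs.
    exempt_ports: ports of services known to violate the heuristic (hypothetical parameter).
    """
    incorrect = correct = 0
    for client_port, server_port in edges:
        if client_port in exempt_ports or server_port in exempt_ports:
            continue
        if client_port < 1024 < server_port:
            incorrect += 1  # "client" on a reserved port: label is probably swapped
        elif server_port < 1024 < client_port:
            correct += 1    # server on a reserved port: label is probably right
    return incorrect / correct if correct else float("nan")
```

Edges where both ports fall on the same side of 1024 carry no evidence either way and are ignored by this sketch.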
Removing unwanted traffic: Since we are interested in characterizing the “useful” traffic
in the enterprise network, the third pre-processing step involves removing all graph edges
for suspected unwanted traffic, such as network scans or worm activity. Doing such
cleaning with 100% accuracy is infeasible because unwanted traffic is often
indistinguishable from useful traffic. We use the following heuristics:
`
– TCP: We clean the data by removing all edges which do not have more than 3 packets in
  each direction. We chose the number three since a legitimate application layer data
  transfer needs more than three packets to open, transfer and close the TCP connection.
  This cleaning removes 16% of all edges, indicating that a large fraction of traffic in
  the monitored network does not complete an application-level data transfer.
– UDP: We observe that there are two types of legitimate UDP uses. One is
  request/response type interaction such as performed by DNS and RPC. The other is a long
  lived UDP flow as used by many streaming applications. In both cases we expect an edge
  which performs a useful task to be associated with at least two packets, either in the
  same direction or in opposing directions. Therefore, we remove all edges for which the
  sum of packets in both directions is smaller than 2.
– ICMP: We do not perform any cleaning on the ICMP data since a single ICMP datagram is a
  legitimate use of ICMP.
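
The three heuristics translate directly into a per-edge filter. The edge fields used here (`proto`, `pkts_c2s`, `pkts_s2c`) are hypothetical names for the per-direction packet counts described earlier.

```python
def keep_edge(edge):
    """Apply the per-protocol cleaning heuristics to one edge record."""
    proto = edge["proto"]
    if proto == "tcp":
        # Keep only edges with more than 3 packets in each direction.
        return edge["pkts_c2s"] > 3 and edge["pkts_s2c"] > 3
    if proto == "udp":
        # Keep edges with at least 2 packets in total, in either direction.
        return edge["pkts_c2s"] + edge["pkts_s2c"] >= 2
    return True  # ICMP: no cleaning

def clean(edges):
    """Return only the edges that pass the heuristics."""
    return [e for e in edges if keep_edge(e)]
```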
`
[Figure 1. Panel (a): scatter plot, log-log axes; X-axis: # Servers to which local client
connects; Y-axis: # Clients connecting to local server. Panel (b): CCDF versus Num hosts,
log-log axes; curves: Local Client, Local Server, Local Host.]

Fig. 1. (a) Scatterplot of 151 local hosts: Clients using the local host as a server and
the local host talking to servers as a client. (b) CCDF: local host communication for the
total 11-week period.

`3 Results
`
In this section we present the COI analysis as applied to the enterprise data we
collected. After pre-processing, the final data set we used for the analysis consisted of
6.1 million edge-records representing 151 local hosts and 3823 internal hosts and
corresponding to 2.6 TBytes worth of network traffic. We will characterize only the set of
151 local hosts, but consider all their interactions, with both other local and internal
hosts.
`
`3.1 Community of Interest Set Size
First we evaluate the COI of the set of local hosts in our data set based on the broadest
definition of COI. Specifically we consider the number of other hosts that each local host
interacts with. We look at the total number of such hosts and then do a breakdown based on
whether the target local host was acting as a client or a server.

We first perform this analysis for all hosts over the entire measurement period. Figure
1(a) shows a scatter plot of the in/out-degree of the set of 151 local hosts considering
all observed traffic. The Y-axis shows the number of clients connecting to the local host
acting as a server (i.e., in-degree). The X-axis shows the number of servers that the
local host connects to acting as a client (i.e., out-degree). Observe from Figure 1(a)
that most hosts act as both client and server over the observation period. Indeed for the
total traffic breakdown shown, all hosts act as both client and server during the
measurement period. The general observation that most hosts act as both client and server
holds when data is analyzed on a per-protocol basis. Specifically, counting the number of
hosts that acted purely as clients on a per-protocol basis we get only 3 for TCP, 2 for
UDP and 1 for ICMP. Similarly, counting the number of hosts acting purely as servers on a
per-protocol basis we get none for TCP, 2 for UDP and 5 for ICMP. Further, as indicated by
the density below the diagonal line, the majority of local hosts are mostly acting as
clients. For the plot shown, 111 hosts are below and 35 hosts above the diagonal line. The
implication of the simple observation that most hosts act as both clients and servers is
that security schemes that rely on hosts acting exclusively as clients or servers are
likely to be infeasible in current enterprise networks.
`
[Figure 2. Panel (a) Maximum: CCDF versus Max Daily Num. hosts, log-log axes. Panel (b)
Normalized standard-deviation: CCDF versus Num hosts (Norm. Std. Dev.), log-log axes.
Curves in both panels: Local Client, Local Server, Local Host.]

Fig. 2. CCDFs for the number of hosts communicated with on a daily basis.

Figure 1(b) shows the empirical Complementary Cumulative Distribution Function (CCDF) of
the number of machines that our local hosts communicate with for all traffic over the
entire 11-week measurement period. The “Local Host” curve corresponds to the total number
of hosts (either local or internal) that a particular local host interacts with, whether
as a client or as a server. The plot shows that each of the local hosts communicates with
a fairly small community of other hosts even over a period of several weeks. For example,
90% of the local hosts talk to fewer than 186 other hosts. Considering the client/server
breakdown, the same holds true, with local hosts interacting with a fairly small number of
servers and clients. The final 10% of the “Local Server” curve shows that a small number
of local machines acting as servers have higher numbers of clients talking to them than
the other 90% of the local servers. These machines most likely correspond to “real”
servers that serve a significant client population, as opposed to hosts that are servers
on the basis of the protocol interaction only.
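
For readers who want to reproduce this kind of plot, an empirical CCDF can be computed as sketched below. We assume the convention P(X >= x), under which the smallest observed value maps to 1; the text does not state which convention the figures use.

```python
from collections import Counter

def empirical_ccdf(values):
    """Return [(x, P(X >= x))] for each distinct x, in increasing order of x.

    ASSUMPTION: the CCDF is taken as P(X >= x) rather than P(X > x).
    """
    n = len(values)
    counts = Counter(values)
    remaining = n  # number of observations that are >= the current x
    points = []
    for x in sorted(counts):
        points.append((x, remaining / n))
        remaining -= counts[x]
    return points
```

Applied to the per-host COI sizes, the x value at which the curve drops to 0.1 is the size exceeded by only about 10% of the hosts.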
We next look at the COI of each host on a daily basis and examine the statistical
properties of these daily values over the complete observation period. First, Figure 2(a)
shows the CCDF of the maximum daily number of hosts that each local host communicates with
over the entire eleven weeks. These maximum-number-per-day CCDFs are similar to those for
the total over the entire measurement period, Figure 1(b), but the numbers are lower
(i.e., the curves are “shifted” to the left). For example, the 90th percentile number for
the “Local Host” curve in Figure 2(a) is only 77 compared with 186 for the same percentile
in Figure 1(b). Also similar to Figure 1(b), there is an inflection at the 10% point in
Figure 2(a) for the “Local Server” (and “Local Host”) curves, which is likely caused by
“real” servers.

The relatively small sizes of the total number of hosts communicated with over the entire
period, as well as the small per-day maximums for the vast majority of hosts, suggest that
a simple anomaly detection approach based on monitoring the normal COI size has the
potential to detect abnormal activities like port scans and worm spreads. These anomalies
are often marked by a host communicating with a large number of other machines within a
very short time span.
Next we consider the variability of the per-day COI size for each local host over the
entire measurement period. Figure 2(b) shows the resulting CCDF of the normalized standard
deviation (normalized by the mean for each local host). Note that some of the variability
is a result of hosts being inactive on some days, one contributing reason being
telecommuting users. Hosts for these users might either be inactive because they are not
being used, or in the case of notebooks, might not be visible to our monitoring
infrastructure. The graph shows that approximately 70% of the local hosts have normalized
standard deviations in their per-day COI size that are less than [...]. Assuming that all
of the traffic in our data set was indeed legitimate, this would mean a simplistic
approach to detect abnormal behavior for these hosts, based on a policy that restricts
“normal” per-day COI size to [...] times the respective per-day means, would result in
false alarms being generated only 5% of the time. Note also from Figure 2(b) that the
standard deviation for the “Local Client” curve is less skewed than the “Local Server”
curve. This suggests that on a daily basis the number of servers which a local client
talks to is more stable than the number of clients that talk to a local server. The
implication of this is that network management policies derived from observations close to
the initiator of communication (client) are likely to be more stable than policies derived
from traffic close to the communication responder (server).
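
The per-host statistic and the simplistic size-based policy can be sketched as follows. The use of the population standard deviation and the `factor` parameter are assumptions for illustration; the sketch does not reproduce the specific values used in the study.

```python
import statistics

def normalized_std(daily_sizes):
    """Standard deviation of the per-day COI sizes divided by their mean.

    ASSUMPTION: population standard deviation (statistics.pstdev).
    """
    mean = statistics.fmean(daily_sizes)
    return statistics.pstdev(daily_sizes) / mean if mean else float("nan")

def alarm_rate(daily_sizes, factor):
    """Fraction of days on which the COI size exceeds `factor` times the per-day mean.

    `factor` is a hypothetical policy parameter.
    """
    mean = statistics.fmean(daily_sizes)
    return sum(1 for s in daily_sizes if s > factor * mean) / len(daily_sizes)
```

If all observed traffic is legitimate, `alarm_rate` computed on the historical data is the false alarm rate such a policy would have produced.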
`
3.2 Core Communities of Interest
`
We next explore our two example core COI definitions, Popularity and Frequency, and their
interactions.
`
[Figure 3. X-axis: Threshold (%), log scale; Y-axis: # Servers in set; curves: week 1,
week 4, week 7, week 11.]

Fig. 3. Size of Popularity COI set for all traffic.
`
Popularity COI: Recall that for the Popularity COI we consider a host to be part of the
COI for a group of target-hosts if the percentage of target-hosts interacting with it
exceeds a threshold T over some period of interest. Here we identify the Popularity COI of
the local hosts from a client view point, for each of the 11 weeks in our data set (i.e.,
the period of interest is one week). Figure 3 shows the size of the Popularity core COI
set as a function of the threshold T for 4 equally spaced weeks out of the total 11 weeks,
for traffic across all protocols. The graph shows the expected decline of the set size as
one progresses from a threshold of 0% (which would include all hosts) to a threshold of
100%, at which point the size is expected to be very small as it would require all
target-hosts to communicate with each member of the set. We observe that the size of the
core COI set as a function of the threshold is very similar across the different weeks.
This suggests that deviations from the Popularity COI size distribution, for a set of
hosts monitored over time, would be a strong indication of a network anomaly.

[Figure 4. Panel (a): X-axis: Threshold (%), log scale; Y-axis: Num. Servers in set;
curves: Union (11, 6 and 3 weeks), week 1, Intersection (3, 6 and 11 weeks). Panel (b):
X-axis: Time period (weeks); Y-axis: Num. Servers in set; curves: Union and Intersection
for T = 5%, 10%, 20% and 30%.]

Fig. 4. Popularity COI: Union and intersection set size (a) as a function of threshold T;
(b) as a function of the length of the time window in weeks.

While the stability in the core COI set size is encouraging, we are also interested in
the stability and predictability of the core COI set membership. To evaluate this, we
determine the core COI set for each week in our data and then explore how the membership
of these sets changes over the measurement period. We do this by calculating the union
(the set of servers that belong to the core set in at least one week) and the intersection
(the set of servers that belong to the core set in every week) of the COI sets. For any
two sets the difference between the size of the union and intersection represents a
measure of the “churn” between the two sets, that is, the total number of elements that
need to be added to or removed from one set to transform it into the other set. Therefore,
for a window of several COI sets, the difference between the union and intersection of all
the sets represents an upper bound on the churn between any pair of sets in the window. By
looking at this bound we get a worst case estimate of how much the COI membership changes
over the time window. By progressively increasing the length of the time window, we
determine how this worst case estimate changes over time.
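
The union/intersection bound described above can be computed with plain set operations. The sketch below assumes the weekly core COI sets are available as a list of Python sets ordered by week; the function names are ours.

```python
def churn_bound(coi_sets):
    """Return (union size, intersection size, churn bound) for a window of COI sets."""
    sets = [set(s) for s in coi_sets]
    union = set().union(*sets)
    intersection = set.intersection(*sets) if sets else set()
    return len(union), len(intersection), len(union) - len(intersection)

def growing_windows(weekly_sets):
    """Churn bound for windows covering week 1 only, weeks 1 to 2, weeks 1 to 3, and so on."""
    return [churn_bound(weekly_sets[:n]) for n in range(1, len(weekly_sets) + 1)]
```

For any two sets A and B in the window, the number of elements in exactly one of them (the symmetric difference) cannot exceed the returned bound, because every such element lies in the union but not in the intersection of the whole window.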
[Figure 5. Panel (a): X-axis: Bin Size (Hours); Y-axis: Set Size; curves: Union Size (11,
6 and 3 weeks), Set Size (week 1), Intersection Size (3, 6 and 11 weeks). Panel (b):
X-axis: Time Period (weeks); Y-axis: Set Size; curves: Union Size and Intersection Size
for Bin = 60H, 24H and 12H.]

Fig. 5. Overall Frequency COI: Union and intersection set size (a) as a function of bin
size; (b) as a function of the length of the time window in weeks.

Figure 4(a) depicts the sizes of the union and intersection of core COI sets for weeks 1
to 3, 1 to 6 and 1 to 11, as a function of the threshold T for all traffic. For comparison
the core COI set size for week 1 is also shown. For all curves (i.e., for all time periods
considered), the difference between the union and intersection set sizes, i.e., the churn,
tends to decrease as the threshold increases. Figure 4(b) shows the same data, but in this
case we show the union and intersection set sizes for selected thresholds for increasing
time windows of interest, starting from week 1, i.e., weeks 1 to 2, weeks 1 to 3, etc. As
expected, the union set size increases and the intersection set size decreases for a given
threshold as the time window increases. Notice though from Figure 4(b) that for any
threshold, the union and intersection set sizes change in a sub-linear fashion with
increasing window length. In fact the intersection seems to flatten within 6 to 8 weeks.
While the union set size shows a continued small growth, the maximum union set size, for
