throbber
Communities of Interest
`
`Corinna Cortes, Daryl Pregibon & Chris Volinsky
`
`AT&T Shannon Research Labs
`
`Florham Park, New Jersey, USA
`
`Abstract. We consider problems that can be characterized by large dy—
`namic graphs. Communication networks provide the prototypical exam—
`ple of such problems where nodes in the graph are network IDs and the
`edges represent communication between pairs of network IDs. In such
`graphs, nodes and edges appear and disappear through time so that
`methods that apply to static graphs are not sufficient. We introduce a
`data structure that captures, in an approximate sense, the graph and
`its evolution through time. The data structure arises from a bottom—up
`representation of the large graph as the union of small subgraphs, called
`Communities of Interest (COI), centered on every node. These subgraphs
`are interesting in their own right and we discuss two applications in the
`area of telecommunications fraud detection to help motivate the ideas.
`
`1
`
`Introduction
`
`Transactional data consists of records of interactions between pairs of entities
`occurring over time. For example, a sequence of credit card transactions consists
`of purchases of retail goods by individual consumers from individual merchants.
`Transactional data can be represented by a graph where the nodes represent the
`transactors and the edges represent the interactions between pairs of transac—
`tors. Viewed in this way, interesting new questions can be posed concerning the
`connectivity of nodes, the presence of atomic subgraphs, or whether the graph
`structure leads to the identification and characterization of “interesting” nodes.
`For example, Kleinberg (1998) introduces the notion of “hubs” and “authori-
`ties” as interesting nodes on the internet. The data used by Kleinberg differ
`significantly from the data we consider in that he uses static links to induce a
`graph over web pages. In our case, we use actual network traffic, as captured by
`interactions between pairs of transactors, to define our graph. Thus in a very
`real sense, the graph we consider is dynamic since nodes and edges appear and
`disappear from the graph through time.
`There are many challenging issues that arise for dynamic graphs and we have
`used a specific application to focus our research, namely the graph induced by
`calls carried on a large telecommunications network. This application is inter—
`esting, both because of its size (i.e., hundreds of millions of nodes) and its rate
`of change (i.e., hundreds of millions of new edges each day). Like all networks, it
`is also diverse in the sense that some nodes are relatively inactive while others
`are superactive.
`
`Page 1 of 10
`
`IA1021
`
`IA1021
`
`Page 1 of 10
`
`

`

`In thinking about dynamic graphs, the first question that arises concerns the
`definition of gt, namely the graph g at time t. The intuitive notion is that g,
`consists of the nodes and edges active at time t, or in a small interval around
`If. We consider discrete time applications where new sets of nodes and edges
`corresponding to the transactions from time step t to t+1 only become available
`at the end of the time step, for example once a day. Associated with every edge
`is a weight that is derived from an aggregation function applied to all (directed)
`transactions between a pair of nodes at time step t. For example, the aggregation
`function can be the “total duration of calls77 or the “number of calls” from one
`node to another.
`
`Let the graph corresponding to the transactions during time step t be 9,5. We
`can define 9,; from 9,- where i = 1, .
`.
`. ,t in several ways. Let us first define the
`sum of two graphs 9 and h
`
`G = ag G9 fih
`
`where 04 and [3 are scalars. The nodes and edges in G are obtained from the
`union of the nodes and edges in g and h. The weight of an edge in G is
`
`W(G) = 04W(9) + fiWUL)
`
`where the weight of an edge is set to zero if the edge is absent from the graph.
`If one defines gt 2 gt, gt is very unstable as it might change dramatically at
`each time step. On the other hand, if one defines
`
`t
`
`Qt=gi®g2®~®gt=$gu
`i=1
`
`gt is perhaps too stable, as it includes all historic transactions from the beginning
`of time. To allow the graph gt to track the dynamics of the transactional data
`stream, one can define 9,; as a moving window over n time steps:
`t
`
`gtzgtinQBQtinJrlQBH-EBgt: 69 913
`iztin
`
`However, this definition suffers from two problems: first, one needs to store the
`graphs 9,- corresponding to the last n time steps to compute Q, and second, it
`gives equal weight to all time steps.
`To allow for a smooth dynamic evolution of Q, without incurring the storage
`problems of the moving window approach, we adopt the recursive definition:
`
`gt 2 ggt—l EB (1 — 0)gt
`
`(1)
`
`where 0 g 0 g 1 is a parameter that allows more ((9 near 1) or less ((9 near 0)
`history to influence the current graph. An alternative representation of (1) is
`obtained by expanding the recursion:
`
`Gt 2 mm 63 wgg2 63 .
`
`.
`
`. 63 wtgt = Esz-gi
`
`(2)
`
`Page 2 of 10
`
`IA1021
`
`IA1021
`
`Page 2 of 10
`
`

`

`where w, = 0t_i(1 — 0). This representation highlights the fact that more re—
`cent data contributes more heavily to g, than older data. Figure 1 displays this
`graphically. If processing occurs daily, then a value of 0 = 0.85 roughly corre—
`sponds to gt capturing a moving month of network activity. Finally note that the
`convex combination defined in (1) is “smoother” than a simple 30 day moving
`window as network peaks and troughs, induced say by holiday traffic patterns,
`are effectively moderated by 6.
`
`
`
`Weight
`
`factor
`
`Fig. 1. Damping factor of edge weights as a function of time steps (days) in the recur-
`sive definition (2) of g. The values of0 correspond to efiectiuely no influence from an
`edge after 1 (0 : 0.40), 2 {6 : 0.65), 3 (0 = 0.77), 07‘ 4 weeks (0 = 0.83).
`
`The paper is organized as follows. Section 2 describes the data structure
`that we use to capture network activity and to evolve it through time. We also
`briefly discuss the strategy we employ to traverse this data structure to quickly
`and efficiently build graphs around individual nodes. Section 3 introduces a pair
`of examples that illustrate how these subgraphs are used in practice. Section 4
`summarizes the findings and discusses future work.
`
`2 Data Structure
`
`We propose a constructive approach to evolving a large time-varying graph. Con-
`sider a node in the graph, its associated directed edges, and weights associated
`with each edge. A data structure that consists of these weighted directed edge
`sets for each node is a representation of the complete graph. This data structure
`is redundant since it is indexed by nodes so that edges must be stored twice,
`once for the originating node and once for the terminating node. In contrast, a
`data structure that stores each edge once must be doubly indexed by nodes. The
`
`Page 3 of 10
`
`IA1021
`
`IA1021
`
`Page 3 of 10
`
`

`

`cost of edge duplication is often mitigated by gains in processing speed when
`subgraphs around nodes are expanded. For this reason we have chosen to repre—
`sent our graphs as a singly indexed list of nodes, each with an associated array
`of weighted directed edges.
`
`The data structure outlined above is complete in the sense that it captures
`the entire graph. However in the applications that we are familiar with, the com—
`putational horsepower to maintain complete graphs with hundreds of millions
`and nodes and billions of edges is neither feasible nor desirable. Instead we de-
`fine a new graph where the atomic unit is the subgraph consisting of a node
`and its directed top—k edges to other nodes. The meaning of “top” is relative to
`the aggregation function applied to transactions associated with each edge, so
`it might be the top-k: edges in terms of the “number of calls.” In addition to
`the top-k inbound and top-k: outbound edges, we also define an overflow node,
`called “other”, for aggregating traffic to/from nodes not contained in the top-k:
`slots. While the value of k determines how well our data structure approximates
`the true network graph, it is worth noting that larger is not necessarily better
`when network graphs are used to study connectivity of the nodes. For example,
`assuming that a few percent of all calls are misdialed, do we really want the
`corresponding edges reflected in the graph? In our experience we have found the
`answer to this question to be “no” and have used a value of k that balances
`computational complexity (e.g., as regards speed and storage) with empirically
`determined accuracy (see below).
`
`In practice, the edges may not always be retained in a symmetric fashion.
`If one node receives calls from many nodes, like SOOCALLATT, it overflows its
`top—k slots so most callers will find their edges absorbed by the “other” node.
`However, any single node making calls to 8OOCALLATT may not have edges to
`more than k: other nodes, and the edge will be contained in its top-k edge set.
`Duplication of edges becomes complete only in the limit as k —> 00. For all finite
`k, the only invariance over the set of subgraphs is that the sum of the outbound
`edge weights equals the sum of the inbound edge weights, where the sum includes
`node “other”. For most of our applications, we use k: 9 since most residential
`long distance accounts do not exceed this number of edges (see Figure 2).
`
`To accommodate the time evolution of the network graph, the data structures
`are updated at fixed time steps. Between updating steps, transactions are col-
`lected and temporarily stored. At the end of that time period, the transactions
`are aggregated and the subgraph updated. The length of the time period rep-
`resents another trade—off in accuracy: the longer the time period, the better an
`estimate of the top—k edge set, but the more outdated the resulting subgraph.
`In the applications discussed in Section 3, we perform daily updates, thereby
`maintaining reasonable accuracy while requiring temporary disk space for only
`one day of data. For a more detailed discussion see Cortes and Pregibon (1999).
`Let @354 denote the top-k: approximation to 9,5,1 at time t — 1 and let gt
`denote the graph derived from the new transactions at time step t. The ap-
`proximation to gt
`is formed from @354 and 9,5, node by node, using a top-k:
`approximation to Eq. 1:
`
`Page 4 of 10
`
`IA1021
`
`IA1021
`
`Page 4 of 10
`
`

`

`inout
`
`25
`
`20
`
`% of all accounts
`
`15
`
`10
`
`5
`
`0
`
`1
`
`2
`
`3
`
`4
`
`5
`6
`Edge Set Size
`
`7
`
`8
`
`9
`
`9+
`
`IA1021
`
`Page 5 of 10
`
`

`

`Subgraphs centered at fraudulent nodes
`
`Subgraphs centered at legitimate nodes
`
`0.6
`
`0.6
`
`% of nodes
`
`0.4
`
`0.2
`
`% of nodes
`
`0.4
`
`0.2
`
`4+
`3
`2
`1
`Shortest path to any fraud node
`
`0.0
`
`4+
`3
`2
`1
`Shortest path to another fraud node
`
`0.0
`
`IA1021
`
`Page 6 of 10
`
`

`

`9
`
`3
`
`4
`
`2
`
`1
`
`1
`
`6
`
`29
`
`37
`
`13
`
`1.0
`
`0.9
`
`0.8
`
`0.7
`
`0.6
`
`0.5
`
`0.4
`
`% new cases labeled as fraud (after investigations)
`
`4
`6
`8
`10
`12
`14
`16
`number of nodes in COI already labeled as fraud
`
`XXX9814445
`
`XXX8882678
`
`XXX2975403
`
`XXX6673034
`
`XXX1781032
`
`XXX8778999
`
`XXX1782195
`
`XXX5325461
`
`XXX8667665
`
`XXX8007245
`
`XXX7711133
`
`XXX7957137
`
`XXX1781031
`
`XXX7832335
`
`XXX1781051
`
`XXX5982605
`
`XXX8660555
`
`XXX1781052
`
`XXX1081805
`
`XXX8001028
`
`XXX6526067
`
`XXX4468111
`
`XXX3260864
`
`XXX8008321
`
`XXX8002333
`
`XXX4477077
`
`XXX8002105
`
`IA1021
`
`Page 7 of 10
`
`

`

`Our process consists of the following steps:
`
`* compute the dg edge sets for all new accounts one week after they are acti-
`vated on the network
`
`— label each of the nodes in the resulting 001 as fraudulent or legitimate based
`on the most recent information from security associates
`— rank the new accounts according to how much fraud appears in their 001
`
`The left panel in Figure 5 provides an illustrative example where five nodes
`surrounding a new suspect (labeled XXX8667665) were recently deactivated for
`fraudulent behavior. The right panel summarizes the performance of the method-
`ology for 105 cases presented to network security. While there is noise in the plot,
`it clearly shows that the probability that an account is fraudulent is an increasing
`function of the number of fraudulent nodes in its COI.
`
`3.2 Record Linkage Using COI-based Matching
`
`Consider the case where we have information on an account that was recently
`disconnected for fraud and we are looking for a new account that has the same
`individual behind it. Assuming that identity—theft was the root cause of the prior
`fraudulent account, it is likely that the new account is under a different name and
`address than the old one (i.e., the fraudster has now assumed the identity of a new
`Victim). We attack this problem with the intuition that while the subscription
`information is not useful for matching network IDs to the same individual, the
`calling patterns of the new account, as characterized by its COI, should not
`change very much from the previous account. The left panel of Figure 6 shows
`a convincing case where two nodes appear to belong to the same individual. We
`now have a problem of matching 001, with the underlying problem of deriving
`a reasonable distance function to quantify the closeness of a pair of 001.
`The matching problem is computationally difficult because of the size of
`our network 7 each day we see tens of thousands of new accounts. For each
`of these, we need to compute their COL and then the distance from each of
`these to the 001 of all recently confirmed fraudulent accounts. Assuming for
`these purposes that we maintain a library of the most recent 1000 fraudulent
`accounts, tens of millions of pairwise distances need to be computed. To carry
`out the computations we use a d2 COI for all accounts in our “fraud library”
`and d1 001 for all new accounts.
`The distance between two 001 depends on both the quantity and the quality
`of the overlapping nodes. The quantity of the overlap is measured by counting
`the number of overlapping nodes and calculating the percentage of a 001 which
`consists of overlapping nodes. However all overlapping nodes are not equally
`informative, so we need a measure of quality as well. Many graphs will inter—
`sect at high—use nodes, such as large telemarketing shops or widely advertised
`customer service numbers. An informative overlapping node is one that has rel-
`atively low in— and out-degree, and in the best case, is shared only by the nodes
`under consideration for a match. We now describe a measure that captures these
`notions.
`
`Page 8 of 10
`
`IA1021
`
`IA1021
`
`Page 8 of 10
`
`

`

`1.0
`
`0.8
`
`% Correct Matches
`
`0.6
`
`0.4
`
`1
`
`2
`3
`4
`5
`6
`7
`8
`9
`10
`Deciles of Predicted Probability of a Match
`
`0.2
`
`0.0
`
`XXX6721477
`
`XXX2320800
`
`XXX4243460
`
`XXX9350316
`
`XXX6835622
`
`XXX2965398
`
`XXX3190804-
`XXX9724031
`
`XXX2549490
`
`XXX9350055
`
`XXX2768275
`
`XXX8534309
`
`XXX9241699
`
`XXX9687354
`
`XXX9724301
`
`XXX2057301
`
`XXX4762258
`
`XXX9749066
`
`XXX3657504
`
`XXX6360824
`
`XXX3584547
`
`XXX9375596
`
`XXX4456363
`
`XXX9724032
`
`IA1021
`
`Page 9 of 10
`
`

`

`
`
`View publication statsView publication stats
`
`IA1021
`
`Page 10 of 10
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket