throbber
Communities of Interest
`
`Corinna Cortes, Daryl Pregibon & Chris Volinsky
`
`AT&T Shannon Research Labs
`Florham Park, New Jersey, USA
`
`Abstract. We consider problemsthat can be characterized by large dy-
`namic graphs. Communication networks provide the prototypical exam-
`ple of such problems where nodesin the graph are network IDs and the
`edges represent communication between pairs of network IDs. In such
`graphs, nodes and edges appear and disappear through time so that
`methods that apply to static graphs are not sufficient. We introduce a
`data structure that captures, in an approximate sense, the graph and
`its evolution through time. The data structure arises from a bottom-up
`representation of the large graph as the union of small subgraphs, called
`Communities of Interest (COI), centered on every node. These subgraphs
`are interesting in their own right and we discuss two applications in the
`area of telecommunications fraud detection to help motivate the ideas.
`
`1
`
`Introduction
`
`Transactional data consists of records of interactions between pairs of entities
`occurring over time. For example, a sequence of credit card transactions consists
`of purchases of retail goods by individual consumers from individual merchants.
`Transactional data can be represented by a graph where the nodesrepresent the
`transactors and the edges represent the interactions between pairs of transac-
`tors. Viewed in this way, interesting new questions can be posed concerning the
`connectivity of nodes, the presence of atomic subgraphs, or whether the graph
`structure leads to the identification and characterization of “interesting” nodes.
`For example, Kleinberg (1998) introduces the notion of “hubs” and “authori-
`ties” as interesting nodes on the internet. The data used by Kleinberg differ
`significantly from the data we consider in that he uses static links to induce a
`graph over web pages. In our case, we use actual networktraffic, as captured by
`interactions between pairs of transactors, to define our graph. Thus in a very
`real sense, the graph we consider is dynamic since nodes and edges appear and
`disappear from the graph through time.
`There are many challenging issues that arise for dynamic graphs and we have
`used a specific application to focus our research, namely the graph induced by
`calls carried on a large telecommunications network. This application is inter-
`esting, both becauseofits size (i.e., hundreds of millions of nodes) and its rate
`of change(i.e., hundreds of millions of new edges each day). Like all networks, it
`is also diverse in the sense that some nodesare relatively inactive while others
`are superactive.
`
`Page 1 of 10
`
`IA1021
`
`IA1021
`
`Page 1 of 10
`
`

`

`In thinking about dynamic graphs, the first question that arises concerns the
`definition of G,, namely the graph G at time t. The intuitive notion is that G;
`consists of the nodes and edges active at time ¢, or in a small] interval around
`t. We consider discrete time applications where new sets of nodes and edges
`corresponding to the transactions from time step ¢ to t+ 1 only becomeavailable
`at the end of the time step, for example once a day. Associated with every edge
`is a weight that is derived from an aggregation function applied to all (directed)
`transactions between a pair of nodes at time step ¢. For example, the aggregation
`function can be the “total duration of calls” or the “numberof calls” from one
`node to another.
`
`Let the graph corresponding to the transactions during time step ¢ be g;. We
`can define G, from g; where i = 1,...,¢ in several ways. Let us first define the
`sum of two graphs g and h
`
`G=ag@ fh
`
`where a and @ are scalars. The nodes and edges in G are obtained from the
`union of the nodes and edges in g and h. The weight of an edgein G is
`
`w(G) = aw(g) + Bw(h)
`
`where the weight of an edgeis set to zero if the edge is absent from the graph.
`If one defines G; = g:, G; is very unstable as it might change dramatically at
`each time step. On the other hand, if one defines
`
`t
`G=nOne...09%=Dau,
`i=1
`
`G; is perhaps toostable, as it includesall historic transactions from the beginning
`of time. To allow the graph G to track the dynamics of the transactional data
`stream, one can define G; as a moving window over n timesteps:
`t
`Gt = Gt_-n © Gt_-nti B®... OM = BD gi-
`i=t—n
`
`However, this definition suffers from two problems: first, one needs to store the
`graphs g; corresponding to the last n time steps to compute G,, and second, it
`gives equal weight to all time steps.
`To allow for a smooth dynamic evolution of G,; without incurring the storage
`problems of the moving window approach, we adopt the recursive definition:
`
`Gi = O0G:_-1 © (1— 9)a
`
`(1)
`
`where 0 < 6 < 1 is a parameter that allows more (@ near 1) or less (6 near 0)
`history to influence the current graph. An alternative representation of (1) is
`obtained by expanding the recursion:
`
`t
`Ge = wigs ® wago @... Burge = CP wigi
`i=l
`
`(2)
`
`Page 2 of 10
`
`IA1021
`
`IA1021
`
`Page 2 of 10
`
`

`

`where w; = 6*-*(1 — 6). This representation highlights the fact that more re-
`cent data contributes more heavily to G,; than older data. Figure 1 displays this
`graphically. If processing occurs daily, then a value of @ = 0.85 roughly corre-
`sponds to G capturing a moving month of network activity. Finally note that the
`convex combination defined in (1) is “smoother” than a simple 30 day moving
`window as network peaks and troughs, induced say by holiday traffic patterns,
`are effectively moderated by 6.
`
`
`
`factor
`
`Weight
`
`Age in days
`
`Fig. 1. Damping factor of edge weights as a function of time steps (days) in the recur-
`sive definition (2) of Gz. The values of 9 correspond to effectively no influence from an
`edge after 1 (9 = 0.40), 2 (0 = 0.65),
`3? (9 =0.77), or 4 weeks (6 = 0.83).
`
`The paper is organized as follows. Section 2 describes the data structure
`that we use to capture network activity and to evolve it through time. We also
`briefly discuss the strategy we employ to traverse this data structure to quickly
`and efficiently build graphs around individual nodes. Section 3 introduces a pair
`of examples that illustrate how these subgraphsare used in practice. Section 4
`summarizes the findings and discusses future work.
`
`2 Data Structure
`
`Weproposea constructive approach to evolving a large time-varying graph. Con-
`sider a nodein the graph, its associated directed edges, and weights associated
`with each edge. A data structure that consists of these weighted directed edge
`sets for each nodeis a representation of the complete graph. This data structure
`is redundant since it is indexed by nodes so that edges must be stored twice,
`once for the originating node and once for the terminating node. In contrast, a
`data structure that stores each edge once must be doubly indexed by nodes. The
`
`Page 3 of 10
`
`IA1021
`
`IA1021
`
`Page 3 of 10
`
`

`

`cost of edge duplication is often mitigated by gains in processing speed when
`subgraphs around nodesare expanded. For this reason we have chosen to repre-
`sent our graphsas a singly indexed list of nodes, each with an associated array
`of weighted directed edges.
`The data structure outlined above is complete in the sense that it captures
`the entire graph. Howeverin the applications that we are familiar with, the com-
`putational horsepower to maintain complete graphs with hundreds of millions
`and nodesand billions of edges is neither feasible nor desirable. Instead we de-
`fine a new graph where the atomic unit is the subgraph consisting of a node
`and its directed top-k edges to other nodes. The meaning of “top” is relative to
`the aggregation function applied to transactions associated with each edge, so
`it might be the top-k edges in terms of the “numberof calls.” In addition to
`the top-k inbound and top-k outbound edges, we also define an overflow node,
`called “other”, for aggregating traffic to/from nodes not contained in the top-k
`slots. While the value of k determines how well our data structure approximates
`the true network graph, it is worth noting that larger is not necessarily better
`when network graphs are used to study connectivity of the nodes. For example,
`assuming that a few percent ofall calls are misdialed, do we really want the
`corresponding edgesreflected in the graph? In our experience we have found the
`answer to this question to be “no” and have used a value of k that balances
`computational complexity (e.g., as regards speed and storage) with empirically
`determined accuracy (see below).
`In practice, the edges may not always be retained in a symmetric fashion.
`If one node receives calls from many nodes, like 8300CALLATT, it overflows its
`top-k slots so most callers will find their edges absorbed by the “other” node.
`However, any single node making calls to 8300CALLATT may not have edges to
`more than k other nodes, and the edge will be contained in its top-k edge set.
`Duplication of edges becomes complete only in the limit as k - oo. Forall finite
`k, the only invariance over the set of subgraphsis that the sum of the outbound
`edge weights equals the sum of the inbound edge weights, where the sum includes
`node “other”. For most of our applications, we use k= 9 since most residential
`long distance accounts do not exceed this number of edges (see Figure 2).
`To accommodate the time evolution of the network graph, the data structures
`are updated at fixed time steps. Between updating steps, transactions are col-
`lected and temporarily stored. At the end of that time period, the transactions
`are aggregated and the subgraph updated. The length of the time period rep-
`resents another trade-off in accuracy: the longer the time period, the better an
`estimate of the top-k edge set, but the more outdated the resulting subgraph.
`In the applications discussed in Section 3, we perform daily updates, thereby
`maintaining reasonable accuracy while requiring temporary disk space for only
`one day of data. For a more detailed discussion see Cortes and Pregibon (1999).
`Let Gi-1 denote the top-k approximation to G;_,; at time ¢ — 1 and let
`denote the graph derived from the new transactions at time step ¢. The ap-
`proximation to G;
`is formed from G;_, and gt, node by node, using a top-k
`approximation to Eq. 1:
`
`Page 4 of 10
`
`IA1021
`
`IA1021
`
`Page 4 of 10
`
`

`

`inout
`
`25
`
`20
`
`% of all accounts
`
`15
`
`10
`
`5
`
`0
`
`1
`
`2
`
`3
`
`4
`
`5
`6
`Edge Set Size
`
`7
`
`8
`
`9
`
`9+
`
`IA1021
`
`Page 5 of 10
`
`

`

`Subgraphs centered at fraudulent nodes
`
`Subgraphs centered at legitimate nodes
`
`0.6
`
`0.6
`
`% of nodes
`
`0.4
`
`0.2
`
`% of nodes
`
`0.4
`
`0.2
`
`4+
`3
`2
`1
`Shortest path to any fraud node
`
`0.0
`
`4+
`3
`2
`1
`Shortest path to another fraud node
`
`0.0
`
`IA1021
`
`Page 6 of 10
`
`

`

`9
`
`3
`
`4
`
`2
`
`1
`
`1
`
`6
`
`29
`
`37
`
`13
`
`1.0
`
`0.9
`
`0.8
`
`0.7
`
`0.6
`
`0.5
`
`0.4
`
`% new cases labeled as fraud (after investigations)
`
`4
`6
`8
`10
`12
`14
`16
`number of nodes in COI already labeled as fraud
`
`XXX9814445
`
`XXX8882678
`
`XXX2975403
`
`XXX6673034
`
`XXX1781032
`
`XXX8778999
`
`XXX1782195
`
`XXX5325461
`
`XXX8667665
`
`XXX8007245
`
`XXX7711133
`
`XXX7957137
`
`XXX1781031
`
`XXX7832335
`
`XXX1781051
`
`XXX5982605
`
`XXX8660555
`
`XXX1781052
`
`XXX1081805
`
`XXX8001028
`
`XXX6526067
`
`XXX4468111
`
`XXX3260864
`
`XXX8008321
`
`XXX8002333
`
`XXX4477077
`
`XXX8002105
`
`IA1021
`
`Page 7 of 10
`
`

`

`Our process consists of the following steps:
`
`— compute the d, edgesets for all new accounts one week after they are acti-
`vated on the network
`— label each of the nodesin the resulting COI as fraudulentor legitimate based
`on the most recent information from security associates
`— rank the new accounts according to how much fraud appears in their COI
`
`The left panel in Figure 5 provides an illustrative example where five nodes
`surrounding a new suspect (labeled XXX8667665) were recently deactivated for
`fraudulent behavior. The right panel summarizes the performance of the method-
`ology for 105 cases presented to network security. While there is noise in the plot,
`it clearly shows that the probability that an accountis fraudulent is an increasing
`function of the number offraudulent nodes in its COI.
`
`3.2. Record Linkage Using COI-based Matching
`
`Consider the case where we have information on an account that was recently
`disconnected for fraud and weare looking for a new account that has the same
`individual behind it. Assuming that identity-theft was the root cause of the prior
`fraudulent account, it is likely that the new accountis under a different name and
`address than the old one(i.e., the fraudster has now assumedtheidentity of a new
`victim). We attack this problem with the intuition that while the subscription
`information is not useful for matching network IDs to the same individual, the
`calling patterns of the new account, as characterized by its COI, should not
`change very much from the previous account. Theleft panel of Figure 6 shows
`a convincing case where two nodes appearto belong to the sameindividual. We
`now have a problem of matching COI, with the underlying problem of deriving
`a reasonable distance function to quantify the closeness of a pair of COI.
`The matching problem is computationally difficult because of the size of
`our network — each day we see tens of thousands of new accounts. For each
`of these, we need to compute their COI, and then the distance from each of
`these to the COIof all recently confirmed fraudulent accounts. Assuming for
`these purposes that we maintain a library of the most recent 1000 fraudulent
`accounts, tens of millions of pairwise distances need to be computed. To carry
`out the computations we use a dy COI for all accounts in our “fraud library”
`and d, COI for all new accounts.
`The distance between two COI depends on both the quantity and the quality
`of the overlapping nodes. The quantity of the overlap is measured by counting
`the numberof overlapping nodesand calculating the percentage of a COI which
`consists of overlapping nodes. Howeverall overlapping nodes are not equally
`informative, so we need a measure of quality as well. Many graphs will inter-
`sect at high-use nodes, such as large telemarketing shops or widely advertised
`customer service numbers. An informative overlapping nodeis one that hasrel-
`atively low in- and out-degree, and in the best case, is shared only by the nodes
`under consideration for a match. We now describe a measure that captures these
`notions.
`
`Page 8 of 10
`
`IA1021
`
`IA1021
`
`Page 8 of 10
`
`

`

`1.0
`
`0.8
`
`% Correct Matches
`
`0.6
`
`0.4
`
`1
`
`2
`3
`4
`5
`6
`7
`8
`9
`10
`Deciles of Predicted Probability of a Match
`
`0.2
`
`0.0
`
`XXX6721477
`
`XXX2320800
`
`XXX4243460
`
`XXX9350316
`
`XXX6835622
`
`XXX2965398
`
`XXX3190804-
`XXX9724031
`
`XXX2549490
`
`XXX9350055
`
`XXX2768275
`
`XXX8534309
`
`XXX9241699
`
`XXX9687354
`
`XXX9724301
`
`XXX2057301
`
`XXX4762258
`
`XXX9749066
`
`XXX3657504
`
`XXX6360824
`
`XXX3584547
`
`XXX9375596
`
`XXX4456363
`
`XXX9724032
`
`IA1021
`
`Page 9 of 10
`
`

`

`
`
`View publication statsView publication stats
`
`IA1021
`
`Page 10 of 10
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket