`
`Corinna Cortes, Daryl Pregibon & Chris Volinsky
`
`AT&T Shannon Research Labs
`Florham Park, New Jersey, USA
`
`Abstract. We consider problemsthat can be characterized by large dy-
`namic graphs. Communication networks provide the prototypical exam-
`ple of such problems where nodesin the graph are network IDs and the
`edges represent communication between pairs of network IDs. In such
`graphs, nodes and edges appear and disappear through time so that
`methods that apply to static graphs are not sufficient. We introduce a
`data structure that captures, in an approximate sense, the graph and
`its evolution through time. The data structure arises from a bottom-up
`representation of the large graph as the union of small subgraphs, called
`Communities of Interest (COI), centered on every node. These subgraphs
`are interesting in their own right and we discuss two applications in the
`area of telecommunications fraud detection to help motivate the ideas.
`
`1
`
`Introduction
`
`Transactional data consists of records of interactions between pairs of entities
`occurring over time. For example, a sequence of credit card transactions consists
`of purchases of retail goods by individual consumers from individual merchants.
`Transactional data can be represented by a graph where the nodesrepresent the
`transactors and the edges represent the interactions between pairs of transac-
`tors. Viewed in this way, interesting new questions can be posed concerning the
`connectivity of nodes, the presence of atomic subgraphs, or whether the graph
`structure leads to the identification and characterization of “interesting” nodes.
`For example, Kleinberg (1998) introduces the notion of “hubs” and “authori-
`ties” as interesting nodes on the internet. The data used by Kleinberg differ
`significantly from the data we consider in that he uses static links to induce a
`graph over web pages. In our case, we use actual networktraffic, as captured by
`interactions between pairs of transactors, to define our graph. Thus in a very
`real sense, the graph we consider is dynamic since nodes and edges appear and
`disappear from the graph through time.
`There are many challenging issues that arise for dynamic graphs and we have
`used a specific application to focus our research, namely the graph induced by
`calls carried on a large telecommunications network. This application is inter-
`esting, both becauseofits size (i.e., hundreds of millions of nodes) and its rate
`of change(i.e., hundreds of millions of new edges each day). Like all networks, it
`is also diverse in the sense that some nodesare relatively inactive while others
`are superactive.
`
`Page 1 of 10
`
`IA1021
`
`IA1021
`
`Page 1 of 10
`
`
`
`In thinking about dynamic graphs, the first question that arises concerns the
`definition of G,, namely the graph G at time t. The intuitive notion is that G;
`consists of the nodes and edges active at time ¢, or in a small] interval around
`t. We consider discrete time applications where new sets of nodes and edges
`corresponding to the transactions from time step ¢ to t+ 1 only becomeavailable
`at the end of the time step, for example once a day. Associated with every edge
`is a weight that is derived from an aggregation function applied to all (directed)
`transactions between a pair of nodes at time step ¢. For example, the aggregation
`function can be the “total duration of calls” or the “numberof calls” from one
`node to another.
`
`Let the graph corresponding to the transactions during time step ¢ be g;. We
`can define G, from g; where i = 1,...,¢ in several ways. Let us first define the
`sum of two graphs g and h
`
`G=ag@ fh
`
`where a and @ are scalars. The nodes and edges in G are obtained from the
`union of the nodes and edges in g and h. The weight of an edgein G is
`
`w(G) = aw(g) + Bw(h)
`
`where the weight of an edgeis set to zero if the edge is absent from the graph.
`If one defines G; = g:, G; is very unstable as it might change dramatically at
`each time step. On the other hand, if one defines
`
`t
`G=nOne...09%=Dau,
`i=1
`
`G; is perhaps toostable, as it includesall historic transactions from the beginning
`of time. To allow the graph G to track the dynamics of the transactional data
`stream, one can define G; as a moving window over n timesteps:
`t
`Gt = Gt_-n © Gt_-nti B®... OM = BD gi-
`i=t—n
`
`However, this definition suffers from two problems: first, one needs to store the
`graphs g; corresponding to the last n time steps to compute G,, and second, it
`gives equal weight to all time steps.
`To allow for a smooth dynamic evolution of G,; without incurring the storage
`problems of the moving window approach, we adopt the recursive definition:
`
`Gi = O0G:_-1 © (1— 9)a
`
`(1)
`
`where 0 < 6 < 1 is a parameter that allows more (@ near 1) or less (6 near 0)
`history to influence the current graph. An alternative representation of (1) is
`obtained by expanding the recursion:
`
`t
`Ge = wigs ® wago @... Burge = CP wigi
`i=l
`
`(2)
`
`Page 2 of 10
`
`IA1021
`
`IA1021
`
`Page 2 of 10
`
`
`
`where w; = 6*-*(1 — 6). This representation highlights the fact that more re-
`cent data contributes more heavily to G,; than older data. Figure 1 displays this
`graphically. If processing occurs daily, then a value of @ = 0.85 roughly corre-
`sponds to G capturing a moving month of network activity. Finally note that the
`convex combination defined in (1) is “smoother” than a simple 30 day moving
`window as network peaks and troughs, induced say by holiday traffic patterns,
`are effectively moderated by 6.
`
`
`
`factor
`
`Weight
`
`Age in days
`
`Fig. 1. Damping factor of edge weights as a function of time steps (days) in the recur-
`sive definition (2) of Gz. The values of 9 correspond to effectively no influence from an
`edge after 1 (9 = 0.40), 2 (0 = 0.65),
`3? (9 =0.77), or 4 weeks (6 = 0.83).
`
`The paper is organized as follows. Section 2 describes the data structure
`that we use to capture network activity and to evolve it through time. We also
`briefly discuss the strategy we employ to traverse this data structure to quickly
`and efficiently build graphs around individual nodes. Section 3 introduces a pair
`of examples that illustrate how these subgraphsare used in practice. Section 4
`summarizes the findings and discusses future work.
`
`2 Data Structure
`
`Weproposea constructive approach to evolving a large time-varying graph. Con-
`sider a nodein the graph, its associated directed edges, and weights associated
`with each edge. A data structure that consists of these weighted directed edge
`sets for each nodeis a representation of the complete graph. This data structure
`is redundant since it is indexed by nodes so that edges must be stored twice,
`once for the originating node and once for the terminating node. In contrast, a
`data structure that stores each edge once must be doubly indexed by nodes. The
`
`Page 3 of 10
`
`IA1021
`
`IA1021
`
`Page 3 of 10
`
`
`
`cost of edge duplication is often mitigated by gains in processing speed when
`subgraphs around nodesare expanded. For this reason we have chosen to repre-
`sent our graphsas a singly indexed list of nodes, each with an associated array
`of weighted directed edges.
`The data structure outlined above is complete in the sense that it captures
`the entire graph. Howeverin the applications that we are familiar with, the com-
`putational horsepower to maintain complete graphs with hundreds of millions
`and nodesand billions of edges is neither feasible nor desirable. Instead we de-
`fine a new graph where the atomic unit is the subgraph consisting of a node
`and its directed top-k edges to other nodes. The meaning of “top” is relative to
`the aggregation function applied to transactions associated with each edge, so
`it might be the top-k edges in terms of the “numberof calls.” In addition to
`the top-k inbound and top-k outbound edges, we also define an overflow node,
`called “other”, for aggregating traffic to/from nodes not contained in the top-k
`slots. While the value of k determines how well our data structure approximates
`the true network graph, it is worth noting that larger is not necessarily better
`when network graphs are used to study connectivity of the nodes. For example,
`assuming that a few percent ofall calls are misdialed, do we really want the
`corresponding edgesreflected in the graph? In our experience we have found the
`answer to this question to be “no” and have used a value of k that balances
`computational complexity (e.g., as regards speed and storage) with empirically
`determined accuracy (see below).
`In practice, the edges may not always be retained in a symmetric fashion.
`If one node receives calls from many nodes, like 8300CALLATT, it overflows its
`top-k slots so most callers will find their edges absorbed by the “other” node.
`However, any single node making calls to 8300CALLATT may not have edges to
`more than k other nodes, and the edge will be contained in its top-k edge set.
`Duplication of edges becomes complete only in the limit as k - oo. Forall finite
`k, the only invariance over the set of subgraphsis that the sum of the outbound
`edge weights equals the sum of the inbound edge weights, where the sum includes
`node “other”. For most of our applications, we use k= 9 since most residential
`long distance accounts do not exceed this number of edges (see Figure 2).
`To accommodate the time evolution of the network graph, the data structures
`are updated at fixed time steps. Between updating steps, transactions are col-
`lected and temporarily stored. At the end of that time period, the transactions
`are aggregated and the subgraph updated. The length of the time period rep-
`resents another trade-off in accuracy: the longer the time period, the better an
`estimate of the top-k edge set, but the more outdated the resulting subgraph.
`In the applications discussed in Section 3, we perform daily updates, thereby
`maintaining reasonable accuracy while requiring temporary disk space for only
`one day of data. For a more detailed discussion see Cortes and Pregibon (1999).
`Let Gi-1 denote the top-k approximation to G;_,; at time ¢ — 1 and let
`denote the graph derived from the new transactions at time step ¢. The ap-
`proximation to G;
`is formed from G;_, and gt, node by node, using a top-k
`approximation to Eq. 1:
`
`Page 4 of 10
`
`IA1021
`
`IA1021
`
`Page 4 of 10
`
`
`
`inout
`
`25
`
`20
`
`% of all accounts
`
`15
`
`10
`
`5
`
`0
`
`1
`
`2
`
`3
`
`4
`
`5
`6
`Edge Set Size
`
`7
`
`8
`
`9
`
`9+
`
`IA1021
`
`Page 5 of 10
`
`
`
`Subgraphs centered at fraudulent nodes
`
`Subgraphs centered at legitimate nodes
`
`0.6
`
`0.6
`
`% of nodes
`
`0.4
`
`0.2
`
`% of nodes
`
`0.4
`
`0.2
`
`4+
`3
`2
`1
`Shortest path to any fraud node
`
`0.0
`
`4+
`3
`2
`1
`Shortest path to another fraud node
`
`0.0
`
`IA1021
`
`Page 6 of 10
`
`
`
`9
`
`3
`
`4
`
`2
`
`1
`
`1
`
`6
`
`29
`
`37
`
`13
`
`1.0
`
`0.9
`
`0.8
`
`0.7
`
`0.6
`
`0.5
`
`0.4
`
`% new cases labeled as fraud (after investigations)
`
`4
`6
`8
`10
`12
`14
`16
`number of nodes in COI already labeled as fraud
`
`XXX9814445
`
`XXX8882678
`
`XXX2975403
`
`XXX6673034
`
`XXX1781032
`
`XXX8778999
`
`XXX1782195
`
`XXX5325461
`
`XXX8667665
`
`XXX8007245
`
`XXX7711133
`
`XXX7957137
`
`XXX1781031
`
`XXX7832335
`
`XXX1781051
`
`XXX5982605
`
`XXX8660555
`
`XXX1781052
`
`XXX1081805
`
`XXX8001028
`
`XXX6526067
`
`XXX4468111
`
`XXX3260864
`
`XXX8008321
`
`XXX8002333
`
`XXX4477077
`
`XXX8002105
`
`IA1021
`
`Page 7 of 10
`
`
`
`Our process consists of the following steps:
`
`— compute the d, edgesets for all new accounts one week after they are acti-
`vated on the network
`— label each of the nodesin the resulting COI as fraudulentor legitimate based
`on the most recent information from security associates
`— rank the new accounts according to how much fraud appears in their COI
`
`The left panel in Figure 5 provides an illustrative example where five nodes
`surrounding a new suspect (labeled XXX8667665) were recently deactivated for
`fraudulent behavior. The right panel summarizes the performance of the method-
`ology for 105 cases presented to network security. While there is noise in the plot,
`it clearly shows that the probability that an accountis fraudulent is an increasing
`function of the number offraudulent nodes in its COI.
`
`3.2. Record Linkage Using COI-based Matching
`
`Consider the case where we have information on an account that was recently
`disconnected for fraud and weare looking for a new account that has the same
`individual behind it. Assuming that identity-theft was the root cause of the prior
`fraudulent account, it is likely that the new accountis under a different name and
`address than the old one(i.e., the fraudster has now assumedtheidentity of a new
`victim). We attack this problem with the intuition that while the subscription
`information is not useful for matching network IDs to the same individual, the
`calling patterns of the new account, as characterized by its COI, should not
`change very much from the previous account. Theleft panel of Figure 6 shows
`a convincing case where two nodes appearto belong to the sameindividual. We
`now have a problem of matching COI, with the underlying problem of deriving
`a reasonable distance function to quantify the closeness of a pair of COI.
`The matching problem is computationally difficult because of the size of
`our network — each day we see tens of thousands of new accounts. For each
`of these, we need to compute their COI, and then the distance from each of
`these to the COIof all recently confirmed fraudulent accounts. Assuming for
`these purposes that we maintain a library of the most recent 1000 fraudulent
`accounts, tens of millions of pairwise distances need to be computed. To carry
`out the computations we use a dy COI for all accounts in our “fraud library”
`and d, COI for all new accounts.
`The distance between two COI depends on both the quantity and the quality
`of the overlapping nodes. The quantity of the overlap is measured by counting
`the numberof overlapping nodesand calculating the percentage of a COI which
`consists of overlapping nodes. Howeverall overlapping nodes are not equally
`informative, so we need a measure of quality as well. Many graphs will inter-
`sect at high-use nodes, such as large telemarketing shops or widely advertised
`customer service numbers. An informative overlapping nodeis one that hasrel-
`atively low in- and out-degree, and in the best case, is shared only by the nodes
`under consideration for a match. We now describe a measure that captures these
`notions.
`
`Page 8 of 10
`
`IA1021
`
`IA1021
`
`Page 8 of 10
`
`
`
`1.0
`
`0.8
`
`% Correct Matches
`
`0.6
`
`0.4
`
`1
`
`2
`3
`4
`5
`6
`7
`8
`9
`10
`Deciles of Predicted Probability of a Match
`
`0.2
`
`0.0
`
`XXX6721477
`
`XXX2320800
`
`XXX4243460
`
`XXX9350316
`
`XXX6835622
`
`XXX2965398
`
`XXX3190804-
`XXX9724031
`
`XXX2549490
`
`XXX9350055
`
`XXX2768275
`
`XXX8534309
`
`XXX9241699
`
`XXX9687354
`
`XXX9724301
`
`XXX2057301
`
`XXX4762258
`
`XXX9749066
`
`XXX3657504
`
`XXX6360824
`
`XXX3584547
`
`XXX9375596
`
`XXX4456363
`
`XXX9724032
`
`IA1021
`
`Page 9 of 10
`
`
`
`
`
`View publication statsView publication stats
`
`IA1021
`
`Page 10 of 10
`
`