US009614814B2

(12) United States Patent
     Fontecchio

(10) Patent No.: US 9,614,814 B2
(45) Date of Patent: Apr. 4, 2017

(54) SYSTEM AND METHOD FOR CASCADING TOKEN GENERATION AND DATA
     DE-IDENTIFICATION

(71) Applicant: Management Science Associates, Inc., Pittsburgh, PA (US)

(72) Inventor: Tony Fontecchio, Irwin, PA (US)

(73) Assignee: Management Science Associates, Inc., Pittsburgh, PA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is
     extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 15/046,202

(22) Filed: Feb. 17, 2016

(65) Prior Publication Data
     US 2016/0182231 A1    Jun. 23, 2016

     Related U.S. Application Data
(63) Continuation of application No. 14/291,805, filed on
     May 30, 2014, now Pat. No. 9,292,707.
     (Continued)

(51) Int. Cl.
     H04L 9/32     (2006.01)
     H04L 29/06    (2006.01)
     H04L 9/06     (2006.01)
     G06F 21/62    (2013.01)
     H04L 9/14     (2006.01)
     H04L 9/30     (2006.01)

(52) U.S. Cl.
     CPC ...... H04L 63/0421 (2013.01); G06F 21/6254
     (2013.01); H04L 9/0643 (2013.01);
     (Continued)

(58) Field of Classification Search
     CPC ............. H04L 63/0807; H04L 63/0876; H04L
     9/0643; H04L 63/0474; H04L 63/0421;
     (Continued)

(56) References Cited

     U.S. PATENT DOCUMENTS

     6,397,224 B1    5/2002  Zubeldia et al.
     6,732,113 B1    5/2004  Ober et al.
     (Continued)

     OTHER PUBLICATIONS

     Bouzelat et al. NPL 1996-Extraction and Anonymity Protocol of
     Medical file.*
     (Continued)

Primary Examiner - Kaveh Abrishamkar
Assistant Examiner - Tri Tran
(74) Attorney, Agent, or Firm - The Webb Law Firm
`
(57) ABSTRACT
A computer-implemented method for de-identifying data by
creating tokens through a cascading algorithm includes the
steps of processing at least one record comprising a plurality
of data elements to identify a subset of data elements
comprising data identifying at least one individual;
generating, with at least one processor, a first hash by hashing at
`least one first data element with at least one second data
`element of the subset of data elements; generating, with at
`least one processor, a second hash by hashing the first hash
`with at least one third data element of the subset of data
`elements; creating at least one token based at least partially
`on the second hash or a subsequent hash derived from the
`second hash, wherein the token identifies the at least one
`individual; and associating at least a portion of a remainder
`of the data elements with the at least one token.
`
`20 Claims, 6 Drawing Sheets
`
`
`DATAVANT, INC. EXHIBIT NO. 1001
`Page 1 of 14
`
`
`
`
Related U.S. Application Data
(60) Provisional application No. 61/830,345, filed on Jun.
     3, 2013.

(52) U.S. Cl.
     CPC .................. H04L 9/14 (2013.01); H04L 9/30
     (2013.01); H04L 9/3234 (2013.01); H04L
     9/3239 (2013.01); H04L 63/0442 (2013.01);
     H04L 63/0807 (2013.01); H04L 63/0876
     (2013.01); H04L 2209/42 (2013.01); H04L
     2463/062 (2013.01)

(58) Field of Classification Search
     CPC . G06F 19/322; G06F 21/6245; G06F 21/6254
     See application file for complete search history.

(56) References Cited

     U.S. PATENT DOCUMENTS

     7,200,578 B2       4/2007  Paltenghe et al.
     7,280,663 B1 *    10/2007  Golomb ................ H04L 9/3066
                                                            380/255
     7,376,677 B2       5/2008  Ober et al.
     7,865,376 B2       1/2011  Ober et al.
     8,473,452 B1       6/2013  Ober et al.
     8,930,404 B2       1/2015  Ober et al.
     2008/0147554 A1 *  6/2008  Stevens ................. G06F 19/322
                                                              705/51

     OTHER PUBLICATIONS

Bouzelat et al., Extraction and Anonymity Protocol of Medical File,
Department of Medical Informatics (Pr. L. Dusserre), Teaching
Hospital of Dijon France, AMIA, Inc., 1996, pp. 323-327.
Fraser et al., Tools for De-Identification of Personal Health
Information, Prepared for the Pan Canadian Health Information Privacy
(HIP) Group, Sep. 2009, 40 pages.
Kunitz et al., Record Linkage Methods Applied to Health Related
Administrative Data Sets Containing Racial and Ethnic Descriptors,
Record Linkage Techniques, 1997, pp. 295-304.
Scheuren, Linking Health Records: Human Rights Concerns,
Record Linkage Techniques, 1997, pp. 404-426.
`
`* cited by examiner
`
`
`
`
[FIG. 1 (Sheet 1 of 6): schematic diagram of system 1000. Labeled
elements: client 106; other data suppliers; public key; data supplier
103 at location 115; data processing entity 108 at location 113,
containing matching engine 109 and token processing engine 110;
de-identified data 111.]
`
`
`
[FIG. 2A (Sheet 2 of 6): cascading hash diagram. Labeled elements:
record 200 with data fields 201, 203, 204; initial key 221; first
hash 211; second hash 213; third hash 215; token 219.]

[FIG. 2B (Sheet 2 of 6): cascading hash diagram with hash keys.
Labeled elements: record 200 with data fields 201, 203, 206; initial
key 221; hash key 223 (shown twice); first hash 211; second hash 213;
Nth hash 217; token 219.]
`
`
`
[FIG. 2C (Sheet 3 of 6): cascading hash diagram showing three
instances of hash function 220, a key input, and token 219 as the
final output.]
`
`
`
[FIG. 3A (Sheet 4 of 6): flow diagram. Legible reference numerals:
301, 303, 305, 308, 309. Steps, in order:
  identify seed from configuration file;
  hash a first data element of record with seed;
  hash a next data element with previous hash result;
  create token based on hash sequence;
  generate transient key unique to session;
  encrypt token with transient key;
  encrypt the encrypted token and transient key with public key
  associated with data processing entity.]
`
`
`
[FIG. 3B (Sheet 5 of 6): flow diagram. Steps, in order:
  315: receive encrypted output file from data supplier;
  317: decrypt encrypted output file with private key to obtain a
       transient key and an encrypted token;
  319: decrypt the encrypted token with the transient key;
  325: hash the token with a seed unique to the client or data
       supplier;
  327: match the token to a record for an individual from a record
       database.]
`
`
`
[FIG. 4 (Sheet 6 of 6): flow diagram, largely illegible in this copy.
Legible labels include: Command Line; "6.1 Read and Validate Command
Line"; Configuration File; "6.2 Read and Validate"; Log File; and
Yes/No decision branches.]
`
`
`
`SYSTEM AND METHOD FOR CASCADING
`TOKEN GENERATION AND DATA
`DE-IDENTIFICATION
`
`CROSS REFERENCE TO RELATED
`APPLICATIONS
`
This application is a continuation of U.S. patent application
Ser. No. 14/291,805, filed May 30, 2014, which claimed
`the benefit of U.S. Provisional Application No. 61/830,345,
`filed on Jun. 3, 2013, the entire disclosures of each of which
`are hereby incorporated by reference.
`
`BACKGROUND OF THE INVENTION
`
`Field of the Invention
The present invention relates generally to data
de-identification and, in particular, to a system and method for
de-identifying data using cascading token generation.
`Description of Related Art
For decades, data including personally-identifying
information has been de-identified through the creation of tokens
`that uniquely identify an individual. This technology has
`been used in connection with consumer package goods data,
`television data, subscriber data, healthcare data, and the like.
`Traditionally, methods for creating tokens for a specific
`record associated with an individual involved concatenating
`selected data elements into a string, and then encrypting that
`string to form a token. However, there are scenarios in which
`concatenated substrings will yield less than optimal results.
Advances in computing power now allow for token generation
to be complex, even across large volumes of data,
`providing for enhanced data security. Moreover, once a
`token is created, additional security measures are desirable
`to prevent reverse-engineering through statistical analysis
`attacks.
`By law, Protected Healthcare Information (PHI) cannot be
`freely disseminated. However, if properly de-identified to
`the point where the risk is minimal that an individual could
`be re-identified, the PHI can be disclosed by a covered entity
`or an entity in legal possession of PHI.
`
`SUMMARY OF THE INVENTION
`
`Generally, it is an object of the present invention to
`provide a system and method for de-identifying data that
`overcomes some or all of the above-described deficiencies
`of the prior art.
`According to a preferred embodiment, provided is a
`computer-implemented method for de-identifying data by
`creating tokens through a cascading algorithm, comprising:
`processing at least one record comprising a plurality of data
`elements to identify a subset of data elements comprising
`data identifying at least one individual; generating, with at
`least one processor, a first hash by hashing at least one first
`data element with at least one second data element of the
subset of data elements; generating, with at least one
processor, a second hash by hashing the first hash with at least
one third data element of the subset of data elements;
creating at least one token based at least partially on the
`second hash or a subsequent hash derived from the second
`hash, wherein the token identifies the at least one individual;
`and associating at least a portion of a remainder of the data
`elements of the plurality of data elements with the at least
`one token.
`According to another preferred embodiment, provided is
a system for de-identifying data, comprising: (a) a data supplier
computer comprising at least one processor and a
de-identification engine, the de-identification engine configured
to: (i) process a data record comprising a plurality of data
elements, wherein a subset of data elements of the plurality
of data elements comprises identifying information; (ii)
generate a token based at least partially on a series of hashes
of individual data elements of the subset of data elements,
wherein a plurality of hashes in the series of hashes are
based at least partially on a previous hash in the series of
hashes; (iii) encrypt at least the token to generate an
encrypted token; (b) a data processing entity computer
remote from the data supplier computer, the data processing
computer comprising at least one processor configured to: (i)
receive the encrypted token and unencrypted data elements
from the data supplier computer; (ii) decrypt the encrypted
token, resulting in the token; (iii) link the token and
unencrypted data elements with at least one other record based at
`least partially on the token.
`According to a further preferred embodiment, provided is
a de-identification system, comprising: a de-identification
subsystem comprising at least one computer-readable
medium containing program instructions which, when
executed by at least one remote processor at a data supplier,
cause the at least one remote processor to: create a token
from at least one record, the token created by performing at
least one hash operation on at least one data element of at
least one record, wherein the at least one data element
comprises personally-identifying information; encrypt the
token with a randomly-generated encryption key, forming an
encrypted token; and encrypt the encrypted token and the
randomly-generated encryption key with a public key,
forming encrypted data; and a record processing subsystem
comprising a server and at least one computer-readable
medium containing program instructions which, when
executed by at least one processor, cause the at least one
processor to: receive the encrypted data; decrypt the
encrypted data with a private key corresponding to the
public key, resulting in the randomly-generated encryption
key and the encrypted token; and decrypt the encrypted
token with the randomly-generated encryption key.
`According to another preferred embodiment, provided is
`a de-identification engine for de-identifying at least one
`record comprising a plurality of data elements, wherein a
subset of the plurality of data elements comprises
personally-identifying data, the de-identification engine comprising at
least one computer-readable medium containing program
instructions that, when executed by at least one processor of
at least one computer, cause the at least one computer to: (a)
generate an initial hash by hashing at least one key and a first
data element of the subset of data elements; (b) generate a
next hash by hashing a next data element of the subset of
data elements with a previous hash value generated by
hashing at least a previous data element of the subset of data
elements; and (c) repeat step (b) for all data elements of the
subset of data elements, resulting in a final hash value.
`These and other features and characteristics of the present
`invention, as well as the methods of operation and functions
`of the related elements of structures and the combination of
`parts and economies of manufacture, will become more
`apparent upon consideration of the following description
`and the appended claims with reference to the accompany-
`ing drawings, all of which form a part of this specification,
`wherein like reference numerals designate corresponding
`parts in the various figures. It is to be expressly understood,
however, that the drawings are for the purpose of illustration
and description only and are not intended as a definition of
the limits of the invention. As used in the specification and
the claims, the singular forms "a", "an", and "the" include
`plural referents unless the context clearly dictates otherwise.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
FIG. 1 is a schematic diagram for a system for
de-identifying data according to the principles of the present
invention;
FIGS. 2A-2C are schematic diagrams for a cascading
hash process for de-identifying data according to the
principles of the present invention;
`FIGS. 3A and 3B are flow diagrams for a system and
`method for de-identifying data according to the principles of
`the present invention; and
`FIG. 4 is a further flow diagram for a system and method
`for de-identifying data according to the principles of the
`present invention.
`
`DESCRIPTION OF THE PREFERRED
`EMBODIMENTS
`
`For purposes of the description hereinafter, it is to be
understood that the invention may assume various alternative
variations and step sequences, except where expressly
`specified to the contrary. It is also to be understood that the
`specific devices and processes illustrated in the attached
`drawings, and described in the following specification, are
`simply exemplary embodiments of the invention. Hence,
`specific dimensions and other physical characteristics
`related to the embodiments disclosed herein are not to be
`considered as limiting.
As used herein, the terms "communication" and
"communicate" refer to the receipt, transmission, or transfer of
`one or more signals, messages, commands, or other type of
`data. For one unit or device to be in communication with
`another unit or device means that the one unit or device is
`able to receive data from and/or transmit data to the other
`unit or device. A communication may use a direct or indirect
`connection, and may be wired and/or wireless in nature.
`Additionally, two units or devices may be in communication
`with each other even though the data transmitted may be
`modified, processed, routed, etc., between the first and
`second unit or device. It will be appreciated that numerous
`other arrangements are possible.
`In a preferred and non-limiting embodiment of the present
`invention, provided is a system for de-identifying data that
includes a de-identification engine configured to hash
personally identifying data within a data record, while at the
same time passing through non-identifying data (e.g., a refill
number and/or the like). In this way, the system has the
ability to perform data cleansing operations (e.g., justification,
padding, range checking, character set validation, date
cleaning, zoned decimal conversion, and/or the like), data
derivation (e.g., ages, combinations of fields, and/or the
like), and/or data translation (e.g., state abbreviations to state
names, or the like). Various other formatting and normalization
functions are also possible.
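The cleansing, derivation, and translation operations named above can be illustrated with a minimal Python sketch. The function names, the accepted date layouts, and the partial state lookup table are all assumptions made for illustration; the specification does not define them.

```python
from datetime import datetime

# Hypothetical, deliberately partial lookup table for data translation.
STATE_NAMES = {"PA": "Pennsylvania", "NY": "New York", "OH": "Ohio"}

# Hypothetical list of input date layouts accepted by the date cleaner.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def pad_zip(value: str) -> str:
    """Padding: left-pad a ZIP code with zeros to five characters."""
    return value.strip().zfill(5)


def clean_date(value: str) -> str:
    """Date cleaning: normalize a date to YYYY-MM-DD."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next known layout
    raise ValueError(f"unrecognized date: {value!r}")


def translate_state(abbreviation: str) -> str:
    """Data translation: state abbreviation to state name."""
    return STATE_NAMES[abbreviation.strip().upper()]


def derive_age(birth_date: str, as_of: str) -> int:
    """Data derivation: age in whole years on the as_of date."""
    born = datetime.strptime(clean_date(birth_date), "%Y-%m-%d").date()
    ref = datetime.strptime(clean_date(as_of), "%Y-%m-%d").date()
    had_birthday = (ref.month, ref.day) >= (born.month, born.day)
    return ref.year - born.year - (0 if had_birthday else 1)
```

Normalizing fields before hashing matters because two spellings of the same value (for example "3/7/1980" and "1980-03-07") would otherwise hash to unrelated values.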
`To create a unique identifier for an individual (i.e., a
`patient, a consumer, or the like), the de-identification engine
of the present invention may support configurable standardization
and hashing of fields. By using multiple fields to
create a unique identifier, the system of the present invention
ensures that statistical analysis or other reverse-engineering
techniques cannot be performed on the hashed values to
determine a person's identity. For example, applying a
hashing algorithm (e.g., SHA-3 or other hashing algorithms)
to the first name "John" will produce a secure token that
cannot be reversed back to the name "John," but potentially
allows for a statistical analysis operation to be performed to
determine that the most frequent first name hash token
represents the name "John." A similar analysis could be
performed on other non-unique fields as well. For that
reason, multiple fields are used to create a distinct (or
sufficiently distinct) de-identification value. For example,
using a first name, last name, date of birth, and zip code may
be considered sufficiently distinct to prevent statistical
cracking.
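The frequency attack described above can be demonstrated with a short Python sketch. The sample records are invented, SHA3-256 is assumed as the SHA-3 variant, and the multi-field hash is formed here by simple joining purely to show the frequency effect (it is not the cascading process described later).

```python
import hashlib
from collections import Counter


def sha3_hex(text: str) -> str:
    """Hash a string with SHA3-256 (assumed variant) and return hex."""
    return hashlib.sha3_256(text.encode("utf-8")).hexdigest()


# Invented sample records: (first name, last name, date of birth, ZIP).
records = [
    ("John", "Smith", "1970-01-01", "15222"),
    ("John", "Jones", "1982-05-17", "15642"),
    ("John", "Brown", "1990-11-30", "15213"),
    ("Mary", "Smith", "1975-03-09", "15222"),
    ("Alan", "Davis", "1968-07-21", "15090"),
]

# Hashing the first name alone: equal names give equal hashes, so the
# hash frequencies mirror the name frequencies.
single_field = Counter(sha3_hex(first) for first, *_ in records)
top_hash, top_count = single_field.most_common(1)[0]

# Hashing several fields together: every value occurs once, so the
# frequency table carries no signal about any individual field.
multi_field = Counter(sha3_hex("|".join(rec)) for rec in records)
```

In this sample, an observer who knows that "John" is the most common first name can label `top_hash` without reversing the hash, while `multi_field` offers no comparable foothold.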
`Referring now to FIG. 1, a system 1000 for de-identifying
`data is shown according to a preferred and non-limiting
`embodiment. A data supplier 103 is in communication with
`a raw data storage unit 104, which may include one or more
data storage devices. The raw data storage unit 104 may
comprise one or more data structures, such as tables,
databases, and/or the like, including records personally
identifying individuals. The data supplier 103 includes one or
more computers, such as servers, user terminals, processors,
and/or the like, and a de-identification engine 107 that
executes on one or more of the data supplier 103 computers.
The de-identification engine 107 may include compiled
program instructions capable of being executed on a data
supplier 103 computer and configured to process data
records from the raw data storage unit 104. The data supplier
103 is also given access to a configuration file 105, a
signature file, and a public key for use in the de-identification
process. The data supplier 103 may be one of many data
suppliers associated with a particular client 106, and
multiple clients may each be associated with multiple data
`suppliers. It will be appreciated that other arrangements are
`possible.
`With continued reference to FIG. 1, a data processing
`entity 108 is shown in communication with the data supplier
103 through a network environment 112, such as the Internet
or any direct or indirect network connection. The data
processing entity 108 is in communication with a
de-identification data storage unit 111 and includes one or more
computers capable of executing a matching engine 109 and
a token processing engine 110. The matching engine 109
and/or token processing engine 110 may include compiled
program instructions capable of being executed on a data
processing entity 108 computer. The token processing
engine 110 may be configured to receive output from the
data supplier 103 and, as explained further below, perform
additional operations on the token or encrypted output such
as, but not limited to, decrypting encrypted output data and
hashing the token generated by the de-identification engine
107 with a seed/key unique to the client 106 and/or data
supplier 103 to produce a new token.
`Still referring to FIG. 1, the matching engine 109 may be
`configured to match tokens among de-identified records,
`received from the data supplier 103, with other records in the
`de-identification data storage unit 111. For example, the
matching engine 109 may use the tokens generated or output
by the de-identification engine 107, or the new tokens
generated or output by the token processing engine 110, to match
the records received with a unique individual, and to link the
record to that individual. The de-identification data storage
unit 111 may include one or more data storage devices that
comprise one or more data structures such as tables,
databases, and/or the like. The system 1000 is distributed such
that the data supplier 103 is in a location 115 remote from a
location 113 of the data processing entity 108. In this way,
the raw data can be de-identified.
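The linking performed by the matching engine 109 can be pictured as grouping records by token. The following Python sketch is an assumption about one simple way to do this; the record layout (a dictionary with a "token" entry) and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def link_by_token(
    stored: Iterable[dict], received: Iterable[dict]
) -> Dict[str, List[dict]]:
    """Group stored and newly received de-identified records by token.

    Records that share a token are treated as belonging to the same
    (still unidentified) individual.
    """
    linked: Dict[str, List[dict]] = defaultdict(list)
    for record in stored:
        linked[record["token"]].append(record)
    for record in received:
        linked[record["token"]].append(record)
    return dict(linked)


# Invented example data; tokens are shortened for readability.
stored_records = [{"token": "a1", "drug": "X"}]
received_records = [
    {"token": "a1", "refill_number": 2},
    {"token": "b2", "refill_number": 0},
]
linked_records = link_by_token(stored_records, received_records)
```

Only the token and non-identifying fields appear in the grouped records, so linkage does not require access to the raw identifying data.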
`In a preferred and non-limiting embodiment, a cascading
hash process is used to generate a de-identified token. A
cascading hash process may increase token security against
attacks from crackers and hackers. Instead of concatenating
multiple fields, adding a secret seed, and then hashing to
form a token, the cascading hash process forms a token
through a series of hashes involving each individual field.
`This polyphasic operation works by hashing data fields or
`elements of a record individually in a chain, such that each
`subsequent hash depends upon a previous hash result.
`Referring now to FIGS. 2A-2C, a cascading hash process
is depicted according to a preferred and non-limiting
embodiment. A record 200 containing a number of data
fields or elements 201, 203, 204, 205 that include identifying
data is provided. Once these data fields or elements are
identified, generally with business rules customized to a
particular data supplier, the token creation process is started.
`Referring specifically to FIG. 2A, an initial key 221 is
`hashed with a first data field 201 to produce a first hash 211.
`The first hash 211 is then hashed with a second data field 203
`to produce a second hash 213. The second hash 213 is then
`hashed with a third data field 204 to produce a third hash
`215. This process may continue for as many data fields as
`required, resulting in a hashed token 219 derived directly
`from the last hashed data field and, as a result of the cascade,
`derived indirectly from the first hash 211, second hash 213,
`and any intervening hashes. In the example shown in FIG.
`2A, the fourth data field 205 is hashed with the third hash
`215 to produce the token 219.
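The chain of FIG. 2A can be sketched in a few lines of Python. This is a minimal illustration, assuming SHA3-256 as the hash and assuming that each round simply feeds the previous result followed by the next field into the hash; the specification does not fix the digest size or the way the two inputs are combined, and the function name and sample values are hypothetical.

```python
import hashlib
from typing import Sequence


def cascade_token(initial_key: bytes, fields: Sequence[str]) -> str:
    """Chain SHA3-256 over the fields, one field per round.

    Round 1 hashes the initial key with the first field; every later
    round hashes the previous digest with the next field.
    """
    if not fields:
        raise ValueError("at least one data field is required")
    previous = initial_key
    for field in fields:
        h = hashlib.sha3_256()
        h.update(previous)                # key, then prior hash result
        h.update(field.encode("utf-8"))   # next data field
        previous = h.digest()
    return previous.hex()


# Invented example values for the key and the four data fields.
token = cascade_token(
    b"example-initial-key", ["John", "Smith", "1970-01-01", "15222"]
)
```

Because each round depends on the one before it, changing any field, or the order of the fields, changes the final token.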
`With continued reference to FIGS. 2A-2C, it will be
`appreciated that the hash function 220 (shown in FIG. 2C)
`may include other inputs, keys, and/or the like, in addition
`to a previous hash result. For example, in the non-limiting
`example shown in FIG. 2B, an initial key 221 is used to hash
`the first data field 201, and subsequent data fields 203, 206
`are hashed with a previous hashed value as well as a hash
`key 223. In this example, the second data field 203 is hashed
`with the first hash 211 and the hash key 223 as inputs to a
`hashing function that results in the second hash 213.
`Depending on the number of data fields used, generally as
`defined by the business rules for a particular data source, the
process may be repeated. As shown in FIG. 2B, the Nth hash
217 is derived from the sequence of hashes preceding it and
is used, along with hash key 223, to hash the (N+1)th data field
206 to create the token 219.
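A keyed variant along the lines of FIG. 2B might look like the following Python sketch. It uses HMAC with SHA3-256 as one plausible way to combine a key with the other inputs; the specification says only that the key, the previous hash, and the data field are inputs to a hashing function, so the use of HMAC, the function name, and the sample values are all assumptions.

```python
import hashlib
import hmac
from typing import Sequence


def keyed_cascade_token(
    initial_key: bytes, hash_key: bytes, fields: Sequence[str]
) -> str:
    """Cascade in which every round after the first also takes a hash key."""
    if not fields:
        raise ValueError("at least one data field is required")
    first, *rest = fields
    # First round: initial key with the first data field.
    previous = hmac.new(
        initial_key, first.encode("utf-8"), hashlib.sha3_256
    ).digest()
    for field in rest:
        # Later rounds: hash key, previous hash, and the next data field.
        previous = hmac.new(
            hash_key, previous + field.encode("utf-8"), hashlib.sha3_256
        ).digest()
    return previous.hex()


# Invented example values.
keyed_token = keyed_cascade_token(
    b"example-initial-key", b"example-hash-key", ["John", "Smith", "15222"]
)
```

Passing the same value for both keys reproduces the case, noted in the description, in which the hash key and the initial key are the same.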
`Due to the nature of the cascading process, the final token
`219 produced is unique for the data fields 201, 203, 206 but,
unlike traditional concatenation-based methods, is not
merely a hashed version of all of the data fields combined.
`Rather, with the cascading token generation process, a
`nested or cascaded token is produced that can only be
`derived from the series of hashes and data fields in a record
`200. In the non-limiting embodiment shown in FIG. 2B, for
`example, an initial key 221 may differ from a hash key 223
`used in subsequent iterations of the sequence. However, it
`will be appreciated that the hash key 223 and the initial key
`221 may be the same and, in some embodiments, further
`hash keys 223 may not be used after the initial key 221.
`Those skilled in the art will appreciate that various other
`arrangements are possible.
`Referring to FIG. 2C, a cascading hash process is shown
according to a further preferred and non-limiting
embodiment. The hash function 220, not separately shown in FIGS.
2A-2B, is depicted in FIG. 2C as receiving inputs and
outputting results. The hash function 220 takes, as inputs, a
key 223 and a first data field 201. The output of the hash
function 220 in this example is input back into itself (i.e.,
recursively) along with a second data field 203. Similarly,
the next output of the hash function 220 is input back into
the hash function 220 again, along with a third data field
204. This is repeated as many times as necessary, depending
on how many data fields 201, 203, 204, 205 will be used in
creating the token 219. The final hash results in the token
219. It will be appreciated that the key 223, or a different
key, may also be used as an input to subsequent iterations of
`the hash function 220.
Referring to FIGS. 1 and 2C, in a preferred and
non-limiting embodiment, a SHA-3 algorithm is used as the hash
`function 220 to create tokens 219. However, through the use
`of the de-identification engine 107 and configuration file
`105, new and/or different algorithms and methodologies
`may be easily implemented. To increase security and data
`quality, the SHA-3 hashing algorithm may be configured to
`return spaces (fixed output) or null (delimited output)
`instead of a hash value if any of the component fields are not
`populated or contain all spaces.
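The blank-field rule can be sketched as a guard placed in front of the cascade. In this Python illustration, `None` stands in for the null of delimited output and a run of spaces for fixed output; the function names, the 64-character width, and the treatment of an empty field list as unpopulated are assumptions.

```python
import hashlib
from typing import Optional, Sequence


def token_or_null(seed: bytes, fields: Sequence[str]) -> Optional[str]:
    """Return a cascaded SHA3-256 token, or None if any field is blank."""
    # A field counts as unpopulated when it is empty or all spaces.
    if not fields or any(field.strip(" ") == "" for field in fields):
        return None
    previous = seed
    for field in fields:
        previous = hashlib.sha3_256(
            previous + field.encode("utf-8")
        ).digest()
    return previous.hex()


def fixed_width(token: Optional[str], width: int = 64) -> str:
    """Render a token for fixed-width output, using spaces for null."""
    return " " * width if token is None else token
```

Refusing to emit a token for incomplete records avoids linking unrelated individuals whose only common trait is a missing field.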
`In a preferred and non-limiting embodiment, and with
`reference to FIG. 1, it is envisioned that many clients 106
may be licensed to use the de-identification engine 107, and
that each client may have a number of data suppliers 103.
Therefore, it is desirable to provide unique tokens for each
of the clients 106 or, in other embodiments, each of the data
suppliers 103. This uniqueness may be provided, at least in
part, through the use of the configuration file 105. In
particular, the configuration file 105 may include a client tag
(e.g., a client code or client key) to use in the token creation
process. The client tag may be combined, incorporated,
XORed, or used as an input to a hashing function for each
data field. Alternatively, the client tag may be used as the
`initial input ( e.g., initial key) for the first hash operation, and
`subsequent hash operations may use the previous hash
`result.
`Through the use of client-specific tags, data records
processed for one client 106 will not produce the same
tokens as identical data records processed for a different
client. In a preferred and non-limiting embodiment, the
client name is stored in the configuration file 105 and, based
on the client name, the client tag is generated or created. In
this way, the actual value being used as the client tag will not
be discernible to the data supplier 103. However, it will be
appreciated that the client name itself may be used as a key
and that, in other embodiments, the client tag may be known
by the data supplier 103. Other arrangements and
configurations are possible.
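The effect of a client-specific tag can be shown with a small Python sketch in which the tag serves as the initial key of the cascade. The derivation of the tag from the client name shown here (a prefixed SHA3-256 digest) is purely hypothetical, as are the function names and client names; the specification does not say how the tag is generated, and a production derivation would presumably involve a secret so that the tag cannot be recomputed from the name alone.

```python
import hashlib
from typing import Sequence


def derive_client_tag(client_name: str) -> bytes:
    """Hypothetical derivation of a client tag from the client name."""
    return hashlib.sha3_256(
        b"client-tag:" + client_name.encode("utf-8")
    ).digest()


def client_token(client_name: str, fields: Sequence[str]) -> str:
    """Cascade the fields, using the client tag as the initial key."""
    previous = derive_client_tag(client_name)
    for field in fields:
        previous = hashlib.sha3_256(
            previous + field.encode("utf-8")
        ).digest()
    return previous.hex()


# The same invented record processed for two invented clients.
person = ["John", "Smith", "1970-01-01", "15222"]
token_for_client_a = client_token("Client A", person)
token_for_client_b = client_token("Client B", person)
```

The two tokens differ even though the underlying record is identical, so tokens issued for one client cannot be joined directly against another client's data.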
`In a preferred and non-limiting embodiment, and with
`continued reference to FIG. 1, once the de-identification
`engine 107 at the data supplier 103 creates a token, the token
`(as well as the remainder of the record) must then be
transmitted to the data processing entity 108 as one or more
output files. To do so, further layers of encryption (e.g.,
token masking) may be provided. For example, the data
supplier 103 may generate a transient encryption key and
initialization vector unique to the session. The transient
encryption key and initialization vector may be generated
randomly in any number of ways. In a non-limiting embodiment,
the transient encryption key may include a 128-bit key,
and the encryption algorithm for the transient layer of
encryption may include an Advanced Encryption Standard
(AES) algorithm. However, various other arrangements,
`algorithms, and configurations are possible.
`After encrypting the token with the transient encryption
`key, the encrypted token and the transient key may be
`encrypted together using, for example, a public key of the
data processing entity 108 that corresponds to a private key
held secretly by the data processing entity 108. In some
non-limiting embodiments, the generated transient encryption
key and initialization vector may be stored in a
de-identification log file after being encrypted using the public
key. Un-hashed output fields may remain unchanged so that
the data supplier 103 is able to verify the content and verify
that no personally identifiable data is being sent in the output
files. Yet another layer of data security may be applied by
transmitting the output files from the data supplier 103 to the
data processing entity 108 over a secure transmission protocol
such as SFTP or HTTPS.
`Once the public key is used to encrypt the encrypted
`token, the transient key, and the initialization vector, the
`encrypted data is transmitted to the data processing entity
`108 as one or more output files. Once received, the data
`processing entity 108 (and particularly the token processing
`engine 110 of the data processing entity 108) uses the private
`key corresponding to the public key used by the data
`supplier 103 to decrypt the last layer of encryption and to
obtain the encrypted token, the transient key, and the
initialization vector. The transient key is used to decrypt the
encrypted token, resulting in the original token that resulted
from the cascading hash process. Once the token 219 is
obtained, the data processing entity 108 may perform an
additional hash operation on the token 219 with a seed/key
that is unique to either the client 106 or the data supplier 103
of the client 106. In some non-limiting embodiments, the
data processing entity 108 may always perform the
additional ha



