`Stolfo
`
`11111 (cid:9)
`
`11111 (cid:9)
`
`11111 11111 (cid:9)
`005668897A
`[11] Patent Number: (cid:9)
`
`11113 11111
`
`5,668,897
`
`[54] METHOD AND APPARATUS FOR IMAGING,
`IMAGE PROCESSING AND DATA
`COMPRESSION MERGE/PURGE
`TECHNIQUES FOR DOCUMENT IMAGE
`DATABASES
`
`[76] Inventor: Salvatore J. Stolfo, 80 Kenilworth Rd.,
`Ridgewood, N.J. 07450
`
`[21] Appl. No.: 488,333
`
`[22] Filed: (cid:9)
`
`Jun. 7, 1995
`
`5,140,650 8/1992 Casey et al. (cid:9)
`5,182,656 1/1993 Chevion et al. (cid:9)
`5,204,756 4/1993 Chevion et al. (cid:9)
`5,394,487 2/1995 Burger et al. (cid:9)
`5,497,486 3/1996 Stolfo (cid:9)
`
` 382/283
` 382/284
` 382/284
` 382/209
` 395/600
`
`FOREIGN PATENT DOCUMENTS
`
`2263603 7/1993 United Kingdom (cid:9)
`
` 382/284
`
`Primary Examiner—Jose L. Couso
`Attorney, Agent, or Finn—Mitchell A. Stein; Stein &
`Associates, P.C.
`
`Related U.S. Application Data
`
`[57] (cid:9)
`
`ABSTRACT
`
`[63] Continuation-in-part of Ser. No. 259,527, Jun. 14, 1994,
`which is a continuation-in-part of Ser. No. 224,273, Apr. 7,
`1994, abandoned, and a continuation-in-part of Ser. No.
`213,795, Mar. 15, 1994, Pat. No. 5,497,486.
` GO6K 9/36
`[51] Int. C1.6 (cid:9)
` 382/283
`[52] U.S. Cl. (cid:9)
`382/135, 137,
`[58] Field of Search (cid:9)
`382/138, 139, 209, 218, 282, 232, 283,
`317, 284; 358/435, 436
`
`[56] (cid:9)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`4,542,378 9/1985 Suganuma et al. (cid:9)
`4,953,229 8/1990 Abe et al. (cid:9)
`5,001,769 3/1991 Reid-Green et al. (cid:9)
`
` 382/284
` 382/284
` 382/283
`
`A method for processing an image, consisting of a fore-
`ground and a background, to produce a highly compressed
`and accurate representation of the image, including the steps
`of scanning the image to create a digital image of the image,
`comparing the digital image against a codebook of stored
`digital images; matching the digital image with one of the
`stored digital images of the codebook; producing an index
`code identifying the background of the stored digital image
`as having matched the digital image; subtracting the stored
`digital image from the digital image to produce a second
`digital image representing the foreground of the stored
`digital image; and storing the second digital image with the
`index code. Techniques are also provided to enable merge/
`purge of the database(s) thereby created.
`
`18 Claims, 18 Drawing Sheets
`
`SCAN CHECK AND
`GENERATE IMAGE
`
`COMPARE IMAGE
`TO DATABASE
`
`—3
`
`AND MATCH
`
`7
`SUBTRACT MATCH TO —
`GENERATE SECOND IMAGE
`
`GENERATE NDEX CODE
`IDENTIFYING MATCHED IMAGE
`
`STORE CODE AND
`SECOND IMAGE
`
`—11
`
`Page 1 of 43
`
`MINDGEEK EXHIBIT 1005
`
`
`
`U.S. Patent (cid:9)
`
`Sep. 16, 1997 (cid:9)
`
`Sheet 1 of 18 (cid:9)
`
`5,668,897
`
`SCAN CHECK AND
`GENERATE IMAGE
`
`COMPARE IMAGE
`TO DATABASE
`
`FIND MATCH
`
`SUBTRACT MATCH TO
`GENERATE SECOND IMAGE
`
`(cid:9) 9
`GENERATE INDEX CODE
`IDENTIFYING MATCHED IMAGE
`
`STORE CODE AND
`SECOND IMAGE
`
`FIG. 1
`
`Page 2 of 43
`
`MINDGEEK EXHIBIT 1005
`
`
`
`U.S. Patent (cid:9)
`
`Sep. 16, 1997 (cid:9)
`
`Sheet 2 of 18 (cid:9)
`
`5,668,897
`
`SCAN CHECK
`
`DETERMINE
`BACKGROUND
`
`DETERMINE
`FOREGROUND
`
`/ 6
`
`ZEBRA?
`
`COMPARE BACKGROUND1,_8
`TO DATABASE
`
`FRAUDULENT
`DATA
`
`18
`
`22
`
`REJECT
`
`10 (cid:9) (12
`ADAPT
`N
`DATABASE
`
`MATCH?
`
`SUBTRACT
`BACKGROUND
`
`c-24
`TRANSMIT
`
`FIG. 2
`
`Page 3 of 43
`
`MINDGEEK EXHIBIT 1005
`
`(cid:9)
`
`
`lualud *S71
`
`81 JO c palls
`
`DA E—....._..\ (cid:9)
`
`DOLLA
`
`48
`
`34
`
`(...........-----------5o
`
`44 (cid:9)
`
`,-28 (cid:9)
`
`30 (cid:9)
`
`46
`
`32
`
`0001
`
`26
`
`---John Smith
`12 Main Street
`Anytown, USA
`
`42---.
`
`54--
`
`Pay To The
`Order of
`
`tY BANK
`4 Broadway
`Anytown, USA
`40 (cid:9)
`38___,MEMO (cid:9)
`
` SIGNATURE (cid:9)
`1:231406215- 0001 34 3:0001
`52
`
`36
`
`56
`FIG. 3
`
`Page 4 of 43
`
`MINDGEEK EXHIBIT 1005
`
`(cid:9)
`
`
`U.S. Patent (cid:9)
`
`Sep. 16, 1997 (cid:9)
`
`Sheet 4 of 18 (cid:9)
`
`5,668,897
`
`SCANCHECKI-58
`
`60
`
`IS THERE
`AN IDENTIFYING
`CODE
`
`Y
`
`(62
`CHECK CODE
`AGAINST DATABASE
`
`64
`
`CODE
`FOUND
`
`BACKGROUND
`PATTERN
`
`76
`N
`
`DOES
`BACKGROUND
`PATTERN
`REPEAT
`
`1
`MULTIPROCESSING TASKS
`SIGNATURE DATE
`PRE-PRINTED
`INFORMATION
`90
`IMAGE PROCESS
`THE DATE
`
`CHECK
`AMOUNT
`
`NET
`
`COMPRESS ALL PREPRINTED
`INFORMATION TO TEXT CHARACTERS
`
`STALE
`OR INVALID
`
`94
`
`INFORM
`OPERATOR
`
`RESTRICT DATABASE
`SEARCH TO PATTERN SEG
`
`COMPRESS SIGNATURE TO
`B-SPLINE CONTROL POINTS
`
`COMPARE SIGNATURE WITH
`THOSE IN DATABASE
`
`96
`
`N
`
`IS
`DATE
`HAND-
`WRITTEN
`
`98)
`
`Y 100
`)
`COMPRESS DATE
`COMPRESS DATE
`TOD-SPLINE
`TO TEXT
`CONTROL POINTS
`CHARACTERS
`
`VALID
`MATCH
`?
`
`INFORM OPERATOR OF POSSIBLE
`FRAUDULENT CHECK
`
`FIG. 4
`
`Page 5 of 43
`
`MINDGEEK EXHIBIT 1005
`
`
`
`U.S. Patent (cid:9)
`
`Sep. 16, 1997 (cid:9)
`
`Sheet 5 of 18 (cid:9)
`
`5,668,897
`
`LOOK UP IDENTIFYING CODE IN CODEBOOK
`DATABASE
`
`TAG BACKGROUND WITH CODEBOOK CODE (cid:9) -70
`
`REQUEST FULL BACKGROUND COMPARISON1-72
`
`RETURN
`
`FIG. 5
`
`Page 6 of 43
`
`MINDGEEK EXHIBIT 1005
`
`
`
`U.S. Patent (cid:9)
`
`Sep. 16, 1997 (cid:9)
`
`Sheet 6 of 18 (cid:9)
`
`5,668,897
`
`DETERMINE CHARACTERISTICS
`OF PATTERN
`
`_--102
`
`DEFINE SUBSET OF DATABASE SEARCH
`BASED ON CHARACTERISTICS
`
`104
`
`SEARCH DATABASE
`
`-106
`
`110
`
`108
`
`Y
`
`112
`
`WITHIN
`TOLERANCE
`
`Y (cid:9)
`
`REPLACE BACKGROUND
`WITH CODE FROM
`CODEBOOK
`
`11;4 11)6
`) (cid:9)
`ADD ANY
`ERROR
`OR DEVIANCES
`
`REPLACE BACKGROUND
`WITH CODE FROM (cid:9)
`CODEBOOK (cid:9)
`
`120
`
`REPLACE BACKGROUND
`y (cid:9)
`PATTERN WITH COMBINATION
`OF CODES FROM CODEBOOK
`
`(-122
`ADD ANY
`ERROR
`OR DEVIANCES
`
`124
`
`118
`
`MATCH
`COMBINATION OF
`PATTERNS
`
`N
`
`PEFORM STANDARD DATA
`COMPRESSION ON
`ENTIRE BACKGROUND
`
`FIG. 6
`
`(13ETURN
`
`.)
`
`
`
`Page 7 of 43
`
`MINDGEEK EXHIBIT 1005
`
`
`
`SORTED-NEIGHBORHOOD METHOD (1M RECORDS + 423644 DUPLICATES
`)
`
`KEY #1 (LAST NAME) -*-
`KEY #2 (FIRST NAME)
`KEY #3 (ST. ADDR.)
`X-CLOSURE OVER 3 KEYS
`x (cid:9)
`x (cid:9)
`............................. .................. x
`
`
`
`
`
`x_
`
`x...... (cid:9)
`
`....
`
`..... X ..... X ..... (cid:9)
`
`100
`
`90
`86.1
`
`80
`
`x
`
`DUPLICATES
`DETECTED 70
`(0/0)
`
`81 JO L WIN
`
`.......... .............. * ........
`............
`
`........ (cid:9)
`...........
`......
`...........
`
`................... 0 (cid:9)
`................
`
`
`
`20 (cid:9)
`40
`30 (cid:9)
`WINDOW SIZE (RECORDS)
`FIG. 7A
`
`50
`
`60
`
`* (cid:9)
`
`+-+- •
`
`60
`
`50
`
`40
`
`0 (cid:9)
`
`10 (cid:9)
`
`Page 8 of 43
`
`MINDGEEK EXHIBIT 1005
`
`(cid:9)
`(cid:9)
`(cid:9)
`(cid:9)
`(cid:9)
`
`
`lualud °S9f1
`
`81 Jo 8 hags
`
`100
`
`90
`
`80
`
`DUPLICATES
`DETECTED 70
`(%)
`
`60
`
`50
`
`40
`
`0 (cid:9)
`
`CLUSTERING METHOD (1M RECORDS + 423644 DUPLICATES)
`KEY #1 (LAST NAME)
`KEY #2 (FIRST NAME) --+---•
`
`KEY #3 (ST. ADDR.)
`
`X-CLOSURE OVER 3 KEYS
`
`x.........x........x.......x...,.....x,.......x....................x
`
`.................. ............................. ...........................
`
`...... D .............. a ...........................................
`
`...................... .................... .13
`
`0
`
`...... ..... (cid:9)
`
`..... (cid:9)
`
`..... (cid:9)
`
`..... (cid:9)
`
`10 (cid:9)
`
`20 (cid:9)
`30
`WINDOW SIZE (RECORDS)
`FIG. 7B
`
`40
`
`50
`
`Page 9 of 43
`
`MINDGEEK EXHIBIT 1005
`
`(cid:9)
`(cid:9)
`(cid:9)
`(cid:9)
`(cid:9)
`(cid:9)
`
`
`jualud °S11
`
`81 JO 6 paqs
`
`0.2 SORTED-NEIGHBORHOOD METHOD (1M RECORDS + 423644 DUPLICATES)
`
`0.15
`
`FALSE
`POSITIVES 0.1
`(%)
`
`0.05
`
`•
`
`•
`
`40
`
`0
`
`10 (cid:9)
`
`KEY #1 (LAST NAME) -•---
`-x
`KEY #2 (FIRST NAME) --+--
`
`-KEY #3 (ST. ADDR.) -)3,•••
`X-CLOSURE OVER 3 KEYS---x- -
`
`,x'
`
`..
`.......
`.... (cid:9)
`.... (cid:9)
`
`......... ..... .......
`
`30 (cid:9)
`20 (cid:9)
`WINDOW SIZE (RECORDS)
`FIG. 8
`
`40
`
`50
`
`Page 10 of 43
`
`MINDGEEK EXHIBIT 1005
`
`(cid:9)
`
`
`wawa
`
`81 JO in pails
`
`SORTED-NEIGHBORHOOD METHOD (1M RECORDS + 423644 DUPLICATES, P=1)
`
`KEY #1 (LAST NAME)
`KEY #2 (FIRST NAME) --+---
`KEY #3 (ST. ADDRS.) •••0••••
`THREE RUNS + X-CLOSURE ---x----
`
`TIME
`(S)
`
`12000
`
`10000
`
`8000
`
`6000
`
`4000
`
`2000
`
`40
`
`0 (cid:9)
`
`5 (cid:9)
`
`20
`15 (cid:9)
`10 (cid:9)
`WINDOW SIZE (RECORDS)
`FIG. 9A
`
`25
`
`30
`
`Page 11 of 43
`
`MINDGEEK EXHIBIT 1005
`
`
`
`wawa °Sil
`
`81 JO TT loatIS
`
`CLUSTERING METHOD (1M RECORDS + 423644 DUPLICATES, P=1)
`KEY #1 (LAST NAME) --•--
`KEY #2 (FIRST NAME) --+---
`KEY #3 (ST. ADDRS.)
`THREE RUNS + X-CLOSURE
`
`x
`
`x--
`
`.x•
`
`TIME
`(S)
`
`12000
`
`10000
`
`8000
`
`6000
`
`4000
`
`2000
`
`40
`
`0 (cid:9)
`
`5 (cid:9)
`
`20
`15 (cid:9)
`10 (cid:9)
`WINDOW SIZE (RECORDS)
`FIG. 9B
`
`25
`
`30
`
`Page 12 of 43
`
`MINDGEEK EXHIBIT 1005
`
`(cid:9)
`
`
`wawa °S11
`
`cl)
`1:3
`
`81 JO Z1 lam1S
`
`00
`.0
`
`100 (cid:9)
`
`I I Illf (cid:9)
`
`I I 111111 (cid:9)
`
`I 1 111.11 (cid:9)
`
`IDEAL AND REAL PERFORMANCE OF THE SORTED-NEIGHBORHOOD METHOD
`
`4c-xX)OCXX- ---x- -X -X-x-X x")001-x")001- - -X- -X -X-X-)00004- -'X- --X-X-X
`x
`
`111111 (cid:9)
`
`I
`
`90 -
`
`DUPLICATES _In
`FOUND u -
`(%)
`
`4-+ (cid:9)
`
`01:1
`
`.
`
`..a.
`...a-
`
`60
`
`50 -
`
`40
`
`0013Cla3 IDEAL PERFORMANCE KEY #1
`Q, - (cid:9)
`IDEAL PERFORMANCE KEY #2 --+---
`"
`:rf: se*.
`IDEAL PERFORMANCE KEY #3 —13 — •
`IDEAL X-CLOSURE OVER ALL 3 KEYS -*—
`REAL PERFORMANCE KEY #1 --b---- -
`REAL PERFORMANCE KEY #2 •-x----
`REAL PERFORMANCE KEY #3
`REAL X-CLOSURE OVER 3 KEYS
`
`I I tit!
`
`I I mil (cid:9)
`
`I (cid:9)
`
`III till (cid:9)
`
`I (cid:9)
`
`I (cid:9) 111'111
`
`I (cid:9)
`
`I
`
`10 (cid:9)
`
`100 (cid:9)
`1000 (cid:9)
`WINDOW SIZE (LOG SCALE)
`FIG. 10
`
`10000
`
`Page 13 of 43
`
`MINDGEEK EXHIBIT 1005
`
`
`
`waled °S'fl
`
`81 JO £1 PaqS
`
`2000
`
`1800
`
`1600
`
`1400
`
`1
`TIME 1200
`
`(S)
`
`1000
`
`800
`
`600
`
`400
`
`SORTED-NEIGHBORHOOD METHOD: 1000000 RECORDS (W=10)
`
`KEY #1 (LAST NAME)
`KEY #2 (FIRST NAME) --+---
`KEY #3 (ST. ADDRS.)
`XCLOSURE OVER 3 KEYS-
`
`2 (cid:9)
`3 (cid:9)
`4 (cid:9)
`5
`NUMBER OF PROCESSORS PER RUN
`FIG. 11A
`
`6
`
`Page 14 of 43
`
`MINDGEEK EXHIBIT 1005
`
`
`
`jtialUd •Sil
`
`L661 `9I "claS
`
`81 J0 171 PatIS
`
`2000
`
`1800
`
`1600
`
`1400
`
`1200
`TIME
`(S)
`
`1000
`
`800
`
`600
`
`400
`
`CLUSTERING METHOD: 1000000 RECORDS (W=10)
`
`KEY #1 (LAST NAME) -.-
`KEY #2 (FIRST NAME) --+---
`KEY #3 (ST. ADDRS.) --.0.-
`XCLOSURE OVER 3 KEYS
`
`"S.
`
`1
`
`5
`4 (cid:9)
`3 (cid:9)
`2 (cid:9)
`NUMBER OF PROCESSORS PER RUN
`FIG. 11B
`
`6
`
`................
`
`Page 15 of 43
`
`MINDGEEK EXHIBIT 1005
`
`
`
`ltlaind 'Sill
`
`81 JO ST WEIS
`
`CLUSTERING METHOD
`
`10% DUPLICATES
`30% DUPLICATES --+---
`50% DUPLICATES ••.0 (cid:9)
`
`
`.EI'
`
`.••••
`.•••••
`
`.•••••••••
`
`..-'••••••
`
`3500
`
`3000
`
`2500
`
`2000
`
`1500
`
`1000
`
`500
`
`TIME
`(S)
`
`0.4 (cid:9)
`
`0.6 (cid:9)
`
`1 (cid:9)
`0.8 (cid:9)
`1.2 (cid:9)
`1.4
`TOTAL NUMBER OF RECORDS (X 1M)
`FIG. 12A
`
`1.6
`
`1.8
`
`Page 16 of 43
`
`MINDGEEK EXHIBIT 1005
`
`
`
`luaTud 'S'fl
`
`81 JO 91 WIN
`
`SORTED-NEIGHBORHOOD METHOD
`
`10% DUPLICATES
`30% DUPLICATES --I--
`50% DUPLICATES
`
`.•
`
`•
`
`..e.
`
`3500
`
`3000
`
`2500
`
`2000
`
`1500
`
`1000
`
`500
`
`TIME
`(S)
`
`0.4 (cid:9)
`
`0.6 (cid:9)
`
`1.4 (cid:9)
`1.2 (cid:9)
`1 (cid:9)
`0.8 (cid:9)
`TOTAL NUMBER OF RECORDS (X 1M)
`FIG. 126
`
`1.6
`
`1.8
`
`Page 17 of 43
`
`MINDGEEK EXHIBIT 1005
`
`
`
`U.S. Patent (cid:9)
`
`Sep. 16, 1997 (cid:9)
`
`Sheet 17 of 18 (cid:9)
`
`5,668,897
`
`DATA
`BASE
`
`COMPUTING
`A KEY
`
`204-/
`
`PARALLEL
`MERGE
`SORTING
`
`COMPARING
`
`STORING
`IDENTIFIERS
`
`FIG. 13
`
`Page 18 of 43
`
`MINDGEEK EXHIBIT 1005
`
`
`
`U.S. Patent (cid:9)
`
`Sep. 16, 1997 (cid:9)
`
`Sheet 18 of 18 (cid:9)
`
`5,668,897
`
`STORED
`IDENTIFIERS
`1
`
`212
`
`STORED
`IDENTIFIERS
`2
`
`214
`
`UNION
`
`216--/ (cid:9)
`
`TRANSITIVE
`CLOSURE
`
`218
`
`FIG. 14
`
`Page 19 of 43
`
`MINDGEEK EXHIBIT 1005
`
`
`
`5,668,897
`
`1
`METHOD AND APPARATUS FOR IMAGING,
`IMAGE PROCESSING AND DATA
`COMPRESSION MERGE/PURGE
`TECHNIQUES FOR DOCUMENT IMAGE
`DATABASES
`
`This application is a continuation-in-part of copending
`U.S. patent application Ser. No. 08/259,527, filed Jun. 14,
`1994 pending, which is a continuation-in-part of U.S. patent
`application Ser. No. 08/224,273, filed Apr. 7, 1994 now 10
`abandoned, and a continuation-in-part of U.S. patent appli-
`cation Ser. No. 08/213,795, filed Mar. 15, 1994 now U.S.
`Pat. No. 5,479,486.
`
`15 (cid:9)
`
`2
`are indicated electronically on daily balance sheets, the
`document itself, under strictures of law and custom which
`originated hundreds of years ago, actually passes and is
`returned to the maker. Electronic funds transfers are also
`5 available, but these transfers do not necessarily require a
`written authorization granted to the recipient of the funds,
`and thus do not pose the same paper handling problems.
`Accordingly, in the area of check clearance, and, as well,
`with respect to the other instruments, items and documents,
`the physical document and its possession and transfer are
`important, since the funds are withdrawn from the drawer or
`maker's account while the check is returned to the maker or
`drawer. Thus, the paper check must generally be physically
`transferred through the banking system.
`The check typically originates from a printing house and
`contains customary preprinted information that is identical
`from one check of a given style to the next. The check
`becomes a legally operative document upon the inclusion of
`handwritten or post-printed information, which also renders
`20 the document unique and provides for its special place in the
`collection process.
`The originator of the check transmits the check to a point
`of collection (e.g., a lock-box operation that handles bulk
`25 mailings, or to the payee directly). The relevant information
`is verified by the payee and the check is endorsed and
`delivered to the bank for deposit to an account. At this point
`in the process, electronic notation of the transaction is
`performed, while the paper media is physically collected and
`30 sent through a clearing system. The paper check is then
`sorted and prepared for delivery to the originator's bank.
`The electronic information is used to net out and transfer
`funds between banks by the clearinghouse system. The
`paper check is then sent to the originator bank for sorting,
`35 debiting customer accounts (originator), microfilming, enve-
`lope stuffing and final delivery back by mail to the origina-
`tor. Errors can and do occur at every stage of the process. An
`error may result in a liability which equals or exceeds the
`value of the transaction, as well as subjects the maker of the
`40 error to regulatory sanctions. Thus, only a very low error rate
`is tolerated.
`In today's world, it is sometimes inconceivable that the
`cash itself never passes hands but can be electronically
`transferred or exchanged, while the document underlying
`45 the transfer must move from one bank to the next and cannot
`be electronically transferred. While this may not be the case
`for electronic funds transfers (which are controlled by
`special legislation and do not typically involve the use of
`checks), clearly check transfer processes are antiquated and
`50 cannot utilize the wealth of electronic data transfer mecha-
`nisms unless the integrity of the paper itself is maintained.
`Thus, any system which is employed to improve the effi-
`ciency of the check handling and clearing process should
`maintain the integrity of the information to legal standards,
`55 and also meet customer's demands for reliability, efficiency
`and acceptability.
`In 1988 the Board of Governors of the Federal Reserve
`System stated that "the benefits of a nationwide electronic
`presentment system would not be sufficient to outweigh the
`costs of a nationwide system." Furthermore, the Board
`recommended, that the focus of such a system should be on
`image processing to expedite moving the payment through
`the system. It was made clear that by the use of image
`interchange, the operational and transportational expenses
`would be greatly reduced. Likewise, through truncation
`processing, the amount of stored and transmitted informa-
`tion can be minimized. Thus, the benefits of a check trun-
`
`FIELD OF THE INVENTION
`
`The present invention relates to the field of automated
`image processing, image compression and pattern
`recognition, as well document-image storage and retrieval,
`and more particularly to financial instrument processing to
`provide efficient storage and retrieval of check image infor-
`mation and detection of fraudulent and authentic
`instruments, as well as to management of databases con-
`taining document-image information, and provision of
`merge/purge techniques to eliminate redundancies and
`errors.
`
`BACKGROUND OF THE INVENTION
`
`Financial Instruments
`
`In commercial and savings banking practice, monetary
`transfers often involve documents that include standard,
`preprinted information (backgrounds, logo's , icons, repeti-
`tive patterns, fields and the like) as well as post-printed
`information (handwritten entries, names, addresses and the
`like) that render the item negotiable or representative of a
`legal, binding contract. These items are documents that
`comprise forms with added information, and include, e.g.,
`checks, deposit and withdrawal slips, coupons, travelers'
`cheques, letters of credit, monetary instruments, food
`stamps, insurance forms, title documents, official govern-
`ment forms, tax forms, medical forms, real estate forms,
`inventory forms, brochures, information forms, application
`forms, questionnaire forms, laboratory data forms and the
`like. It is generally desirable to automatically extract rel-
`evant information from a form in order to assist in the
`processing of that information.
`A check is a negotiable instrument, which is signed by the
`maker or drawer, indicates a sum certain of money (or other
`specified value), a date to pay, and a direction to a bank or
`financial institution to pay to the order of the payee or to
`bearer on demand. The check thus generally has certain
`information or indications preprinted on it, information
`which is added to customize the check for the drawer and the
`payor bank, and information unique to each check written.
`In order for the bank to pay on the item, a check is generally
`first endorsed on the reverse side upon tender. Processing
`institutions in the international banking collection and settle-
`ment process will typically each stamp the check with
`identifying information, and also provide status relating to 60
`dishonor or abnormal circumstances. In normal banking
`procedures, the paper check passes from the maker or
`drawer to the payee, who then deposits the check with the
`payee bank The paper check is then cleared, for example,
`through a central clearinghouse of one or more banks, and 65
`is sent to the payor bank. While the funds themselves
`typically do not, in a physical sense, pass hands, but rather
`
`Page 20 of 43
`
`MINDGEEK EXHIBIT 1005
`
`
`
`5,668,897
`
`4
`3
`cation system are clearly taught in the art. However, past
`1) Correlation approach—a traditional approach encom-
`analyses have indicated that such systems are expensive. (cid:9)
`passing signal processing and statistical decision theory
`While the use of truncation processing can minimize the
`concepts.
`information to be transmitted and stored, past systems have
`2) Feature matching approach—whereby pixel-by-pixel
`generated a relatively large file for each document so 5
`intensity variations are ignored in favor of selected
`processed, so that this burden is not considered trivial.
`measurable features or attributes and relations of an
`It is known in the art of digital data storage and compres-
`image, e.g. texture or color.
`sion to compress data by compiling code libraries of infor-
`3) Relational matching—where detailed correspondences
`mation in a digital data source file, with a code library
`between the images include geometric relationships
`derived from the data to be compressed or with a code 10
`between selected components. This provides for mod-
`library having a content based on a predicted likely infor-
`elling of an entire image and leads to more efficient
`mation content of the source file, resulting in a compressed
`further processing by being able to prioritize the land-
`file if the source file is represented as a series of pointers to
`marks depending upon their particular semantic sig-
`portions of the code library, when the code library contains
`nificance. See Pratt, W. K., Digital Image Processing
`enough sequences in common with the source file. Thus, the 15
`and Fischler, M. & Firschein, 0. Intelligence, The Eye,
`series of pointers to the code library can be represented by
`The Brain and The Computer for more information on
`a smaller information content signal than the data source file
`image matching techniques.
`itself. Further, it is known that such code libraries may be (cid:9)
`When features are used as a means for matching, various
`adaptive and updated to include information from a digital
`te choose a with its bo oks can be cread, each (cid:9)
`
`own set of features.
`data source file, which may be repeated elsewhere, thereby
`
`20 Thus, the algorithm canThu
`codebook depending upon
`effecting a lower data storage requirement. Code libraries
`the features of the document and therefore dramatically
`may also be purged of information which does not appear in (cid:9)
`reduce the search time.
`files to be compressed. A limited size code library offers two (cid:9)
`Model-based data compression is also a known concept.
`advantages: first, it limits the search time to match a
`In a model-based compression system, certain characteris-
`sequence in the source file with a sequence in the code
`and second, it limits the size of an individual pointer 25 tics of the data to be compressed are presumed or predicted.
`library;
`In a model-based system. ordinarily, an expert studies the
`and therefore allows the compiled series of pointers to have
`characteristics of the type of data, and designs a codebook
`an optimum length. (cid:9)
`optimized for the expected data. The information content of
`A match between a stored template and a scanned image (cid:9)
`the data may be substantially reduced by taking into
`con-
`is rarely exact, e.g. because of noise introduced when
`sideration these characteristics common to a significant
`scanning or skew of the scanned image that may have been
`
`portion of the data stream. Therefore, in such a system, it is
`introduced when handling the paper document. This is so (cid:9)
`the difference between the data signal to be compressed and
`even when the template is only a portion of the input image
`the model or a selected model encompassed by the system
`or even when the match is based on features rather than pixel
`which forms the relevant data to be further processed. Of
`values. Consequently, a frequent method of matching a 35
`course, if a model completely describes the source data, then
`template is done by using a distance measure, d(m,n), (cid:9)
`the compressed data consists of merely an identification of
`between the template and the image at all points in the image
`the model. Various methods are also known to account for
`field. The input image is deemed to be a match whenever the
`deviations from the model which are insignificant, without
`distance is less than a preestablished threshold (X). The
`substantially increasing the amount of information which is
`distance function D(m,n,I,T) is computed at a variable 40
`necessary in order to describe the source data. Therefore, a
`starting point in the input image I against template image T.
`model-based system may include one or more models which
`Because of the skew or noise, the search of the input image
`characterize prototypic data, and an unknown signal is then
`may be at some localized area for a matching template.
`matched to a selected model, and processed to eliminate
`Thus, ftj,k) can denote the input image to be searched and
`information included in the selected model.
`T(j,k) the template image sought, where the search is
` It is further known that a large number of images or
`45
`constrained over some region of I(m,n), of the image where
`compressed images may be stored in a storage device. These
`0 (cid:9)
`and 0 (cid:9)
`for example. The pixels are then (cid:9)
`images may be used as templates for a pattern recognition
`index points of the image as a range over a matrix. By way (cid:9)
`system, for matching an unknown pattern against the images
`of example, the index can start at the lower left most pixel
`of an image as the position (0,0) in a typical coordinate 50 in the database. The storage medium may be RAM, ROM,
`EPROM, EEPROM, flash memory, magnetic storage
`system. One common distance function used is where the
`medium, magneto-optic storage medium, digital optical
`difference is defined as: D(in,n.I.TKALI(j+m,K+m)—T(j,
`
`storage medium holographic image storage medium an , (cid:9)
`,
`
`K)] 2. A template match exists at coordinate location (m,n) if: (cid:9)
`optical storage medium and other known systems. The
`D(m,n,I.T)<X.
`images stored in these databases may provide a very large
`Since many templates exist in the database, B(I) is 55 number of templates or models against which an image or
`donated as the closest matching template for a database of (cid:9)
`data pattern is to be matched, and statistical analysis may be
`templates, F={x1 where x is a template} and is defined as (cid:9)
`used to select a best match.
`B(I)=x, where xeF, and D(m,n,I,x) is a minimum The (cid:9)
`Automated handwriting extraction from documents and
`matching of the templates is complicated by a number of (cid:9)
`recognition thereof is also known. Handwriting recognition
`problems, e.g. shifts, rotational differences or scale 60 may be used for computer information input. Known optical
`differences, when pixel-by-pixel processing is necessary. It (cid:9)
`character recognition systems are available to read and
`is therefore often important to spatially register the two (cid:9)
`interpret handwriting. Systems are also available to extract
`images to correct for these problems. Many techniques are (cid:9)
`handwritten information from electronic images of forms.
`known in the art that deal with image registration. Such
`Database Management
`techniques improve upon the template matching process.
`A number of image matching techniques are known and (cid:9)
`Merging and coalescing multiple sources of information
`used in the art. Generally, these fall into three categories: (cid:9)
`into one unified database requires more than structurally
`
`65 (cid:9)
`
`30
`
`Page 21 of 43
`
`MINDGEEK EXHIBIT 1005
`
`(cid:9)
`(cid:9)
`(cid:9)
`(cid:9)
`
`
`5,668,897
`
`5
`integrating diverse database schema and access methods. In
`applications where the data is corrupted, i.e. is incorrect,
`ambiguous, having alternate forms or has changed over
`time, the problem of integrating multiple databases is par-
`ticularly challenging. This is known as the merge/purge
`problem. Merging information requires so-called semantic
`integration, which requires a means for identifying equiva-
`lent or similar data from diverse sources. The merging
`process must then determine whether two pieces of infor-
`mation or records are of sufficient similarity, and that they
`represent some aspect of the same domain entity, by means
`of sophisticated inference techniques and knowledge of the
`domain.
`A very large databAse is one in which it is unfeasible to
`compare each record with every other record in the database,
`for a given operation. Therefore, a simplifying presumption
`is necessary in order to ensure the integrity of the data
`records, such as when a batch of new records is added to the
`database. In general, this presumption is that a predeter-
`mined subset of the database records may be selected in
`which a cross comparison of the records within the subset
`will be effective to ensure the integrity of the entire database,
`to within a reasonable limit.
`In the field of mailing list verification, the database
`integrity is generally ensured by first sorting the database
`according to a criteria, then selecting a window of consecu-
`tive sorted records, and then comparing the records within
`the window with each other. The purpose is to eliminate
`duplicate records, so that within the window, records which
`appear to correspond are identified as such, and an algorithm
`is executed to select a single record as being accurate and to
`eliminate any other corresponding records. This known
`method, however, will not eliminate records which are
`corresponding and yet are not present within the window.
`Further, the comparison algorithm may not perfectly identify
`and eliminate duplicate records. This problem will exist with
`respect to financial documents, like checks and other items,
`as well.
`Known very large database systems may be maintained
`and processed on mainframe-class computers, which are
`maintained by service bureaus or data processing depart-
`ments. Because of the size of these databases, among other
`reasons, processing is generally not networked, e.g. the data
`storage subsystem is linked directly to the central processor
`on which it is processed and directly output.
`Other database processing methods are known, however
`these have not been applied to very large databases. This is
`not a matter of merely database size, but rather magnitude.
`In general, the reason for ensuring the integrity of a mailing
`list database is a matter of economics, e.g. the cost of
`allowing errors in the database as compared to the cost of
`correcting or preventing errors. Of course, when these
`databases are employed for other applications, the "cost" of
`errors may be both economic and non-economic. Often,
`databases are maintained for many purposes, including
`mailing list, and thus the costs may be indeterminate or
`incalculable.
`The semantic integration problem, see ACM SIGMOD
`record (December 1991), and the related so-called instance-
`identification problem, see Y. R. Wang and S. E. Madnick,
``The inter-database instance identification problem in inte-
`grating autonomous systems", Proceedings of the Sixth
`International Conference on Data Engineering (February
`1989), as applied to very large databases are ubiquitous in
`modern commercial and military organizations. As stated
`above, these problems are typically solved by using main-
`
`25
`
`6
`frame computing solutions. Further, since these organiza-
`tions have previously implemented mainframe class
`solutions, they typically have already made a substantial
`investment in hardware and software, and therefore will
`5 generally define the problem such that it will optimally be
`addressed with the existing database infrastructure.
`Routinely, large quantities of information, which may in
`some instances exceed one billion database records, are
`acquired and merged or added into a single database
`10 structure, often an existing database. Some of the new data
`or information to be merged from diverse sources or various
`organizations might, upon analysis, be found to contain
`irrelevant or erroneous information or be redundant with
`preexisting data. This irrelevant, erroneous or redundant
`15 information is purged from the combined database.
`Once the data is merged, other inferences may be applied
`to the newly acquired information; e.g. new information
`may be gleaned from the data set. The ability to fully analyze
`the data is expected to be of growing importance with the
`20 coming age of very large network computing architectures.
`The merge/purge problem is closely related to a multi-
`way join over a plurality of large database relations. The
`simplest known method of implementing database joins is
`by computing the Cartesian product, a quadratic time
`process, and selecting the relevant tuples. It is also known to
`optimize this process of completing the join processing by
`sort/merge and hash partitioning. These strategies, however,
`assume a total ordering over the domain of the join attributes
`30 or a "near perfect" hash function that provides the means of
`inspecting small partitions (windows) of tuples when com-
`puting the join. However, in practice, where data corruption
`is the norm, it is unlikely that there will be a total ordering
`of the data set, nor a perfect hash distribution. Known
`35 implemented methods nevertheless rely on these presump-
`tions. Therefore, to the extent these presumptions are
`violated, the join process will be defective.
`The fundamental problem is that the data supplied by the
`various sources typically includes identifiers or string data
`40 that are either erroneous or accurate but different in their
`expression from another existing record. The "equality" of
`two records over the domain of the common join attribute is
`not specified as a "simple" arithmetic predicate, but rather
`by a set of equational axioms that define equivalence, thus
`45 applying an equational theory. See S. Tsur, "PODS invited
`talk: Deductive databases in action", Proc. of the 1991
`ACM-PODS: Symposium on the Principles of Database