(12) Patent Application Publication
     Roman et al.
(10) Pub. No.: US 2006/0129523 A1
(43) Pub. Date: Jun. 15, 2006

(54) DETECTION OF OBSCURED COPYING USING KNOWN TRANSLATIONS FILES
     AND OTHER OPERATIONAL DATA

(76) Inventors: Kendyl Allen Roman, Sunnyvale, CA (US);
     Paul Raposo, San Francisco, CA (US)
`
`Correspondence Address:
`KENDYL A ROMAN
`730 BARTEY COURT
`SUNNYVALE, CA 94087 (US)
`
(21) Appl. No.: 11/299,529

(22) Filed: Dec. 12, 2005

Related U.S. Application Data

(60) Provisional application No. 60/635,908, filed on Dec. 10, 2004.
     Provisional application No. 60/635,562, filed on Dec. 13, 2004.

Publication Classification

(51) Int. Cl. G06F 7/30 (2006.01)
(52) U.S. Cl. 707/1
(57) ABSTRACT
Systems and methods that automatically compare sets of files to
determine what has been copied, even when sophisticated techniques
for hiding or obscuring the copying have been employed. The file
compare system comprises a file compare program that uses various
operational data and user interface options to detect illicit
copying, highlight and align matching lines, and produce a
formatted report. A known translations file is used to match
translated tokens. Other operational data files specify rules that
the file compare program then uses to improve its results. The
generated report contains statistics and full disclosures of the
known translations used and the other methods used in creating the
exhibits. The system includes a bulk compare program that
automatically detects likely file pairings and candidates for
validation as known translations, which can be used on iterative
runs. The user is given full control over the final output, and the
system automatically reformats the reports and recalculates the
statistics for consistent and accurate final presentation.
`
`Instacart, Ex. 1035
`Patent Application Publication Jun. 15, 2006 Sheet 1 of 27
`
`US 2006/0129523 A1
`
Fig. 1. Block diagram. Labels: User Interface Options, File A, File
Compare, Formatted Report, Operational Data. Reference numerals:
100, 110, 140, 150, 160, 180.

Fig. 2A (Sheet 2 of 27). File: jump.c

    #include <stdio.h>

    // The quick brown fox jumped over the lazy dog. How many tries did it take?
    void tries(int initialFoxJumpHeight,
               int dogHeight, int jumpIncrement)
    {
        int jumpHeight = initialFoxJumpHeight;
        int numTries = 0;

        while (jumpHeight < dogHeight)
        {
            jumpHeight += jumpIncrement;
            numTries++;
        }
        printf("Number of tries: %d\n", numTries);
    }

Fig. 2B (Sheet 3 of 27). File: leap.c

    #include <stdio.h>

    // A fast auburn wolf leaped above a passive canine. How many attempts did it take?
    void attempts(int startWolfLeapHeight, int canineHeight, int leapIncrement)
    {
        int leapHeight = startWolfLeapHeight;
        int numberOfAttempts = 0;
        while (leapHeight < canineHeight)
        {
            leapHeight += leapIncrement;
            numberOfAttempts++;
        }
        printf("Number of attempts: %d\n", numberOfAttempts);
    }

Fig. 2C (Sheet 4 of 27). Table; cell contents are not recoverable
from the scan. Reference numerals: 2300, 2300a, 2300b, 2310, 2338,
2340.

Fig. 2D-1 (Sheet 5 of 27). Sample formatted report, Exhibit 2D,
page 1 of 2: numbered listings of jump.c and leap.c (listing text
is not legible in the scan; see Figs. 2A and 2B), a Confidentiality
Legend, and a statistics block that includes a Filtered entry
(values not legible). The report states: "The following translation
equivalents were found and used in highlighting this file:"

    over=above
    tries=attempts
    brown=auburn
    dog=canine
    dogHeight=canineHeight
    quick=fast
    jumpHeight=leapHeight
    jumpIncrement=leapIncrement
    jumped=leaped
    numTries=numberOfAttempts

Reference numerals: 2400a, 2400b, 2402, 2404, 2406, 2410, 2430,
2432, 2436, 2438, 2450, 2452.

Fig. 2D-2 (Sheet 6 of 27). Sample formatted report, Exhibit 2D,
page 2 of 2. Remaining translation equivalents:

    lazy=passive
    initialFoxJumpHeight=startWolfLeapHeight
    fox=wolf

Note printed in the report: During formatting tabs are converted to
four spaces and all lines longer than 53 characters are wrapped.
All wrapped lines are denoted with a (symbol not legible in the
scan) character at the beginning of the line; however, highlighting
is based on the full line prior to formatting.

Reference numerals: 2400, 2400a, 2400b, 2402, 2404, 2406, 2408,
2410, 2450, 2460.

Fig. 3A (Sheet 7 of 27). Flowchart. Steps: Read File A; Read File
B; Read Operational Data Files; Compare Files; Calculate
Similarities; Similarity > Threshold? (a "No" branch is labeled);
Output Reports. Cross-references: See Fig 3B, See Fig 3D. Reference
numerals: 3100, 3102, 3104, 3106, 3108, 3110, 3112, 3114, 3116,
3118, 3120, 3122, 3126, 3132, 3134; two further numerals are
truncated in the scan ("324", "330").

Fig. 3B (Sheet 8 of 27). Flowchart. Steps: More Lines in File B?
(Yes); Find Next Match (See Fig 3C); Matches Found? (No); Mark
Matching Lines; Look Back & Identify Out of Order Matches; Mark
Pending Lines of Files A and B; Final Look Back & Identify Out of
Order; Do Remaining Lines of File A. Reference numerals: 3200,
3208, 3210, 3216, 3218, 3226, 3228, 3230, 3232, 3237.

Fig. 3C (Sheet 9 of 27). Flowchart. Steps and decisions: Increment
Offsets & Block Sizes; Offset > Start of File A; Get & Tokenize
Next Line of File B; Determine Significant Tokens; Any
Significant?; Get & Tokenize Previous Lines of Both Files; Do
Tokens Match?; Adjust Both Offsets & Block Sizes; Increment Block
Sizes; Get & Tokenize Next Lines of Both Files; Get & Tokenize Next
Line of File A; Any Tokens Match?. Reference numerals: 3300, 3302,
3308, 3312, 3314, 3320, 3326, 3330, 3334, 3338, 3340, 3342, 3344,
3346, 3348, 3350, 3352, 3356, 3358, 3364, 3372, 3374, 3376, 3378;
one further numeral is truncated in the scan ("336").

Fig. 3D (Sheet 10 of 27). Flowchart. Steps: Start; Append Stats
Line to Stats File; Open Output Files; Output Formatted Headers;
Output Formatted File A Body; Output Formatted File B Body; Output
Compare Statistics; Close Files; Finish. Reference numerals: 3400,
3402, 3404, 3406, 3408, 3410, 3412, 3414, 3416, 3418, 3420, 3422,
3424, 3426, 3428, 3430, 3432.

Fig. 4 (Sheet 11 of 27). Block diagram. Labels: User Interface
Options; File Compare; Formatted Report; File A; File B;
Statistics; New Possible Translations; Translation Used; Filtered
Translations; Known Translations; Suspected Translations;
Exclusions; Obscured Lines; Language Specific; Language Keywords.
Reference numerals: 110, 120, 150, 160, 162, 166, 400, 442, 448,
452, 454, 456, 458, 464, 470, 472, 480, 482; some numerals are not
legible in the scan.

Fig. 5A (Sheet 12 of 27). File: jumpVerify.c

    #include <stdio.h>

    // The quick brown fox jumped over the lazy dog. How many tries did it take?
    void tries(int initialFoxJumpHeight, int dogHeight,
               int jumpIncrement)
    {
        int jumpHeight = initialFoxJumpHeight;
        int numTries = 0;

        while (jumpHeight < dogHeight)
        {
            jumpHeight += jumpIncrement;
            numTries++;
        }
        printf("Number of tries: %d\n", numTries);

        // Verify jump
        while (numTries > 0)
        {
            jumpHeight -= jumpIncrement;
            numTries--;
        }
        if (jumpHeight == initialFoxJumpHeight)
        {
            printf(" - Verified \n");
        }
    }

Fig. 5B (Sheet 13 of 27). File: leapConfirm.pl (the listing is
shown as drawn, including the inserted "// MvP" lines; brace
placement is partly illegible in the scan)

    #!/usr/local/bin/perl5

    # A fast auburn wolf leaped above a passive canine. How many attempts did it take?
    sub Attempts ($startWolfLeapHeight, $canineHeight, $leapIncr)
    {
        $leapHeight = $startWolfLeapHeight;
        $numberOfAttempts = 0;

        while ($leapHeight < $canineHeight) // MvP
        // MvP
        // MvP
        // MvP
        // MvP
        {
            $leapHeight += $leapIncr;
            $numberOfAttempts++;
        }

        printf("Number of Attempts: %d\n", $numberOfAttempts);

        // Confirm leap
        for (; $numberOfAttempts > 0; $numberOfAttempts--)
        {
            $leapHeight -= $leapIncr;
        }
        print " - Verified \n" if ($leapHeight == $startWolfLeapHeight);
    }

Fig. 5C (Sheet 14 of 27). Known Translations table; cell contents
are not recoverable from the scan. Reference numerals: 5300, 5300a,
5300b, 5340.

Fig. 5D (Sheet 15 of 27). Suspected Translations table (5400).
Columns: Original Words, Translation Equivalents. Legible row:
Verify (5410a), Confirm (5410b). Reference numerals: 5400, 5400a,
5400b, 5410, 5410a, 5410b, 5412.

Fig. 5E (Sheet 16 of 27). Exclusions table (5500). Legible entries:
a pattern for a "// MvP" comment (5510a; exact pattern text is not
legible in the scan), described as "// MvP comment at the end of a
line" (5510b); and "int" followed by a space (5512a), described as
"int (sp) anywhere on the line" (5512b). Reference numerals: 5500,
5500a, 5500b, 5510, 5510a, 5510b, 5512, 5512a, 5512b.

Fig. 5F (Sheet 17 of 27). Obscured Lines table (5600). Legible
column labels: Block A, Block B; cell values are not legible in the
scan. Reference numerals: 5600, 5600a, 5600b, 5600c, 5600d, 5600e,
5610, 5610a, 5610b, 5610c, 5610d, 5610e, 5612.

Fig. 5G-1 (Sheet 18 of 27). Sample formatted report, Exhibit 5D,
page 1 of 2: numbered listings of jumpVerify.c and leapConfirm.pl
(listing text is not legible in the scan; see Figs. 5A and 5B), a
Confidentiality Legend, and a statistics block with copied,
obscured, and Filtered entries (values not legible). The report
states: "The following translation equivalents were found and used
in highlighting this file:". Reference numerals: 2400, 2400a,
2400b, 2402, 2404, 2406, 2408, 2410, 2430, 2432, 2434, 2436, 2438.

Fig. 5G-2 (Sheet 19 of 27). Sample formatted report, Exhibit 5D,
page 2 of 2, for jumpVerify.c and leapConfirm.pl. Legible
translation equivalents (some entries are not legible in the scan):

    dogHeight=$canineHeight
    jumpHeight=$leapHeight
    jumpIncrement=$leapIncr
    numTries=$numberOfAttempts
    initialFoxJumpHeight=$startWolfLeapHeight
    The=A
    tries=Attempts
    Verify=Confirm
    the=a
    over=above
    tries=attempts
    dog=canine
    quick=fast
    jump=leap
    jumped=leaped
    lazy=passive
    void=sub
    fox=wolf

Note printed in the report: During formatting tabs are converted to
four spaces and all lines longer than 53 characters are wrapped.
All wrapped lines are denoted with a (symbol not legible in the
scan) character at the beginning of the line; however, highlighting
is based on the full line prior to formatting. Also the following
tokens were ignored during comparison of these files:

    int (sp) anywhere on the line
    // MvP comment at the end of a line

Reference numerals: 2400, 2400a, 2400b, 2402, 2404, 2406, 2408,
2410, 2450, 2452, 2460, 5768, 5770, 5772, 5774.

Fig. 6 (Sheet 20 of 27). Block diagram. Legible labels: Bulk User
Interface Options; Bulk Statistics; Possible Translations; "100 or
400". Reference numerals: 600, 610, 630, 652, 668, 680.

Fig. 7 (Sheet 21 of 27). File Pair Combinations table (700),
reconstructed from a partly legible scan:

    Row   File A   File B
    710   A1       B1
    712   A1       B2
    714   A1       B3
    716   A2       B1
    718   A2       B2
    720   A2       B3
    722   A3       B1
    724   A3       B2
    726   A3       B3
    728   A4       B1
    730   A4       B2
    732   A4       B3

Other reference numerals: 700a, 700b, 710a, 710b, 740, 742, 744,
746.

Fig. 8 (Sheet 22 of 27). Flowchart. Legible steps: Perform Bulk
Compare; Analyze Statistics; Expert Review: Select Known
Translations, Determine File Pairing; a decision with Yes and No
branches (label not legible); Perform File Compare. Reference
numerals: 800, 810, 816, 818, 820, 822, 824, 826, 830, 832, 834,
850, 860; some numerals are truncated in the scan.

Fig. 9 (Sheet 23 of 27). Flowchart. Steps: Perform File Compare;
Manually Modify Markup; Reformat and Recalculate Statistics;
Finish. Reference numerals: 834, 900, 902, 906, 908, 910, 912, 914,
916.

Fig. 10 (Sheet 24 of 27). Block diagram. Labels: File A Listing;
File B Listing; Formatted Listing A (See Fig 11); Formatted Listing
B (See Fig 12). Reference numerals: 150, 150a, 150b, 1000, 1010.

Fig. 11 (Sheet 25 of 27). Sample single-file formatted listing,
Exhibit 2D-A, page 1 of 1: numbered listing of jump.c (listing text
is not legible in the scan; see Fig. 2A) with a Confidentiality
Legend. Legible reference numerals: 1100, 1100a, 1102, 1104.

Fig. 12 (Sheet 26 of 27). Sample single-file formatted listing,
page 1 of 1: numbered listing of leap.c (listing text is not
legible in the scan; see Fig. 2B) with a Confidentiality Legend.
The exhibit label is only partly legible. Legible reference
numerals: 1100a, 1102, 1104, 1106.

Fig. 13 (Sheet 27 of 27). Flowchart. Steps: Parse Compare File &
Calculate Statistics; Output File A Listing; Output File B Listing;
Output Compare File with Updated Stats. Cross-references: See Fig
11, See Fig 12. Reference numerals: 1300, 1304, 1306, 1308, 1310,
1314, 1316, 1318, 1320; one further numeral is truncated in the
scan ("132").

`DETECTION OF OBSCURED COPYING USING
`KNOWN TRANSLATIONS FILES AND OTHER
`OPERATIONAL DATA
`
`RELATED APPLICATIONS
[0001] This application claims priority under 35 U.S.C. § 119(e)
of the co-pending U.S. provisional application Ser. No. 60/635,908,
filed Dec. 10, 2004, entitled "DETECTION OF OBSCURED COPYING USING
KNOWN TRANSLATIONS FILES AND OTHER OPERATIONAL DATA," which is
hereby incorporated by reference.
[0002] This application claims priority under 35 U.S.C. § 119(e)
of the co-pending U.S. provisional application Ser. No. 60/635,562,
filed Dec. 11, 2004, entitled "DETECTION OF OBSCURED COPYING USING
KNOWN TRANSLATIONS FILES AND OTHER OPERATIONAL DATA," which is
hereby incorporated by reference.
`
`BACKGROUND FIELD OF THE INVENTION
[0003] This invention relates to systems and methods for comparing
files to detect the use of copied information, and more
particularly to such systems and methods that detect copying where
the copying has been obscured by various techniques.
`
`BACKGROUND THE PROBLEM
[0004] We are in the midst of the Information Age. More and more
people make their living as information workers. The technologies
fueling the Information Age are still being developed at an intense
rate. For example, during the last few decades there has been
unprecedented development and growth in the use of the Internet.
The Internet information space known as the World Wide Web has
become a significant tool for communications, commerce, research,
and education. Almost all of this information is stored
electronically in computer files, which can be easily copied,
transferred anywhere in the world, and modified. At the same time,
many have made extreme efforts to share in the fortunes to be made
in this new era of computer-based information and communication.
Some of this has been evidenced by the "irrational exuberance" of
the Internet boom.
[0005] Unfortunately, the ease of access to information and the
ease with which information can be copied and modified, combined
with both personal and corporate greed, have led to what appears to
be unprecedented levels of illegal copying of copyrighted
materials, including the computer programs that run on the
computers of the Information Age and the information found on the
World Wide Web. This illegal copying has led to numerous lawsuits
claiming Federal copyright infringement and both Federal and state
trade secret misappropriation. Significant trade secret theft can
also lead to criminal prosecution.
[0006] At the same time, computer equipment has become more
powerful and has increased in storage capacity, both primary memory
(RAM) and secondary storage (disk and tape drives). Computer
programs, likewise, have grown in size and complexity. Some
software projects comprise tens of thousands of source code files,
collectively containing millions of lines of code. The source
version control systems for those projects may contain billions of
lines of code. The version control systems may also include other
types of media including design documents, database schemas,
graphics files, and other data, all subject to copyright and trade
secret protection.
[0007] The courts are interested in the literal copying and use of
the literal lines of code that make up these computer programs.
Copyright extends to translations of the original work as well.
Trade secrets can be copied without copying the literal lines of
code. Literal copying and literal translation are direct evidence
of copying. The courts have also said, "Where there is no direct
evidence of copying, a plaintiff may establish an inference of
copying by showing (1) access to the allegedly-infringed work by
the defendant(s) and (2) a substantial similarity between the two
works at issue." In determining substantial similarity, the first
step is to filter out those elements that are not protectable,
namely those which are not original to the copyright holder or
which required minimal creativity.
[0008] Also, the courts have recognized that a significant portion
of the work and creative effort of developing computer programs is
found in tasks not limited to the actual writing of the lines of
source code, but including many layers of abstract design. This
work includes understanding customer and system requirements,
designing external interfaces, designing internal interfaces,
architecting the structure of the system and individual modules,
developing abstract algorithms, coding, integration, testing, bug
fixing, and maintenance. Because of this, the courts have
recognized copying of the non-literal aspects of the computer
program as well.
[0009] Because of the highly technical nature of computer
programming, the courts rely on technical experts to shed light on
what was copied, whether the copying was allowable, and whether the
copying was substantial. The courts have provided various
guidelines for determining non-literal copying. One guideline is to
analyze the sequence, structure, and organization of the computer
program. More recently, the courts are adopting an
"abstraction-filtration-comparison" test. In this test, first the
computer program is broken down into layers of abstraction; second,
the elements that are not protected are filtered out; and third,
the remaining elements are compared against the allegedly
infringing work (at each of the levels of abstraction). The courts
have been interested in the literal lines of code as well as more
abstract aspects of the computer program, such as the algorithms,
the parameter lists, the modules or files that make up each
program, the database architecture, and the system level
architecture.
[0010] The similarities at each of these levels can be shown by
creating side-by-side listings of the copied materials. The various
aspects of the comparison can be indicated with various types of
formatting.
[0011] In trade secret cases, information that was general
knowledge (as opposed to specific knowledge) or which is readily
ascertainable must also be filtered.
[0012] However, in order to prepare the side-by-side listings, the
expert must first determine which pairs of files from the
respective works to compare. Once a pair of files with some level
of copying has been found, the literal and non-literal aspects of
the copying must be indicated in some manner. This can be done
manually using a word processor, such as Microsoft Word brand or
FrameMaker brand word processors. However, when there are tens of
thousands of files and millions of lines of code it becomes almost
impossible for an expert or group of experts to accurately find all
instances of copying and to properly apply the filtering and
formatting required for presentation to the judge and jury.
Further, to qualify as a technical expert, the individual must have
recognized experience and expertise in computer science, as well as
the ability to present the information, testify, and overcome the
challenges and rigors of the courtroom. Qualified individuals, who
are at the peak of their careers and are in high demand, earn
relatively high hourly compensation. A typical case may require
hundreds or thousands of hours of analysis and exhibit preparation.
The cost of doing the work manually can be prohibitive. Further,
the volume of work can be difficult to perform error free. Any
errors in the analysis or presentation can be used to challenge the
reliability of the evidence and the credibility of the expert
witness.
`BACKGROUND PRIOR ART
[0013] Software developers are aware of a number of code
comparison tools associated with their development environment. For
example, the UNIX brand development environment has long had a
utility known as "diff", which compares lines of files for exact
matching. The diff utility will produce output that indicates which
blocks of lines are identical, which blocks of lines have been
added, and which blocks of lines have been deleted. It is typical
for an integrated development environment (IDE), such as Microsoft
Developer Studio brand, Microsoft SourceSafe brand, Metrowerks
CodeWarrior brand, or Apple Xcode brand IDEs, to include a file
compare utility. There are also stand-alone programs such as
WinDiff brand or Helios Software Solutions TextPad brand file
compare programs. Many of these programs provide the same
comparison features as the original Unix brand diff utility. Some
of these show lines added, changed, and deleted with colored
highlighting. Some include a graphical user interface that aligns
identically matching lines of code in a side-by-side format that
can be scrolled in a window.
[0014] However, all of these diff-like programs are limited in
detecting illegal copying because they only report lines that match
exactly. Small insignificant changes can easily be made to each
copied line and these diff-like programs will report that no lines
are identical, giving a false indication that there is no copying.
[0015] Editing programs, such as Microsoft Word and those found in
the various IDEs, have a feature that allows all the occurrences of
a certain word or phrase to be changed (or translated) to a
different word or phrase. For example, every occurrence of "dog"
could be translated to "canine". This is known as "Change All" or
"global query/replace". Software developers can easily generate a
list of the important names (or identifiers) in a computer program.
Software developers with nefarious intent can easily develop a list
of substitute words for each of those identifiers, and change every
important name wherever it occurs throughout a set of copied files.
In a matter of minutes the computer can make millions of changes to
tens of thousands of files. The program would still be structured
and behave identically even though none of the important lines of
code would match identically.
[0016] These diff-like programs cannot detect such global changes.
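
The effect of such a global change on exact line matching can be illustrated with a minimal sketch. It assumes a naive exact-match comparison as a stand-in for diff-like behavior (it is not the actual diff algorithm); the function name `exact_matches` and the `renames` list are hypothetical, and the sample lines are adapted from the jump.c listing of Fig. 2A.

```python
def exact_matches(lines_a, lines_b):
    """Return the lines of A that also appear verbatim in B."""
    seen_in_b = set(lines_b)
    return [line for line in lines_a if line in seen_in_b]


original = [
    "int numTries = 0;",
    "while (jumpHeight < dogHeight)",
    "jumpHeight += jumpIncrement;",
    "numTries++;",
]

# Hypothetical "Change All" list of substitute identifiers.
renames = {
    "numTries": "numberOfAttempts",
    "jumpHeight": "leapHeight",
    "dogHeight": "canineHeight",
    "jumpIncrement": "leapIncrement",
}

# Apply every substitution to every line, as an editor would.
copied = []
for line in original:
    for old, new in renames.items():
        line = line.replace(old, new)
    copied.append(line)

print(exact_matches(original, original))  # every line matches itself
print(exact_matches(original, copied))    # no line survives the renames
```

Because every line in this sample contains at least one renamed identifier, the exact-match comparison reports nothing in common even though the structure is unchanged.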
`
[0017] Further, the diff program algorithm is limited. It can get
confused in its comparison. If a block of code is copied but moved
out of order, the diff program may fail to detect the identical
lines simply because they have been rearranged within the file.
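
A minimal sketch of this effect follows. It uses Python's standard `difflib.SequenceMatcher` as a stand-in for an order-preserving comparison; it is not the Unix diff algorithm itself, and the helper names `in_order_matches` and `any_order_matches` and the sample lines are hypothetical.

```python
from difflib import SequenceMatcher


def in_order_matches(lines_a, lines_b):
    """Count lines matched when relative order must be preserved."""
    matcher = SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def any_order_matches(lines_a, lines_b):
    """Count lines of A that occur anywhere in B."""
    lines_in_b = set(lines_b)
    return sum(1 for line in lines_a if line in lines_in_b)


file_a = ["a = 1;", "b = 2;", "c = 3;", "d = 4;"]
file_b = ["c = 3;", "d = 4;", "a = 1;", "b = 2;"]  # same two blocks, swapped

print(in_order_matches(file_a, file_b))   # fewer than 4: one block is missed
print(any_order_matches(file_a, file_b))  # 4: every line is present
```

The order-preserving comparison can align only one of the two swapped blocks, which is why a look-back step for out-of-order matches is useful.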
[0018] A software developer with nefarious intent can easily
defeat the illegal copying detection capabilities of programs such
as diff.
`
`BACKGROUND MORE SOPHISTICATED
`COPYING
[0019] A software developer who is attempting to copy a set of
source code, and has some understanding that they cannot literally
copy the source code without detection, can employ various
techniques to avoid literal copying that can easily be detected,
while still effectively copying the source code. To avoid being
caught, an illicit copier can employ more sophisticated techniques
to hide or obscure the evidence of their illegal copying.
[0020] As discussed above, the easiest approach is to simply use
an editor to make global changes throughout the code to identifiers
such as variable and method names. This makes it difficult for
conventional comparison programs to detect the copying.
[0021] Another approach is to add spaces, tabs, carriage returns,
words, or comments that don't change the essential function of the
code, but will defeat diff-like programs.
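
One way to neutralize the padding described in the preceding paragraph is to normalize each line before comparing. The sketch below is illustrative only: it assumes C-style `//` line comments, does not handle `//` inside string literals or statements split across lines, and the names `normalize` and `significant_lines` are hypothetical.

```python
import re


def normalize(line):
    """Drop a trailing // comment and collapse runs of whitespace."""
    without_comment = re.sub(r"//.*$", "", line)
    return " ".join(without_comment.split())


def significant_lines(text):
    """Return normalized lines, skipping lines that end up empty."""
    normalized = (normalize(line) for line in text.splitlines())
    return [line for line in normalized if line]


original = "int numTries = 0;\njumpHeight += jumpIncrement;\n"
padded = (
    "int numTries = 0;\n"
    "\n"
    "\n"
    "jumpHeight   +=\tjumpIncrement;   // harmless note\n"
)

print(significant_lines(original) == significant_lines(padded))
```

Whitespace inserted inside an operator (for example `+ +` for `++`) is not covered by this normalization; handling that case calls for comparing tokens rather than characters.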
[0022] Another approach is to reorder the code so that the
sections work the same but have been moved around to avoid
side-by-side comparison.
[0023] Another approach is to re-write the same algorithms in a
different language, for example, translating from C to Visual
Basic, from C to C++, from Basic to C++, and so forth.
[0024] Another approach is to rewrite every line of code using
different but equivalent programming constructs. This makes
individual line-by-line comparison impossible because the
equivalent elements may be split across non-contiguous lines.
`
`BACKGROUND MY EARLIER TESTING
[0025] I conceived of a basic technique to overcome and detect
some of these techniques, such as the global change of important
identifiers. I developed custom file compare test programs that
read two files and broke the words and symbols of the files into
individual elements called tokens. As I manually compared the
files, I added special instructions and data into each different
custom test program to reverse the global changes that had been
made by the illicit copier. These programs also output a report
where the two programs were presented side-by-side with line
numbers. When these early test programs were successful in
identifying translated lines of code, the lines were lined up (or
aligned) side-by-side by inserting extra blank lines. Lines of code
that had been literally copied or translated were shown in red and
underlined. The lines were numbered with the original line numbers.
Lines that were too long were truncated (cut off) so that the lines
would still match up.
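
The token-based approach described in the preceding paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the early test programs themselves: the token pattern, the function names `tokenize`, `translate`, and `lines_match`, and the contents of the `known` mapping are all hypothetical, with the sample identifiers taken from Figs. 2A and 2B.

```python
import re

# Identifiers, integer literals, or any single non-space symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(line):
    """Break a line into word and symbol tokens, ignoring whitespace."""
    return TOKEN.findall(line)


def translate(tokens, known):
    """Replace each File A token with its File B equivalent, if one is known."""
    return [known.get(token, token) for token in tokens]


def lines_match(line_a, line_b, known):
    """True if line A equals line B once known translations are applied."""
    return translate(tokenize(line_a), known) == tokenize(line_b)


# Hypothetical mapping from File A words to File B words.
known = {
    "jumpHeight": "leapHeight",
    "jumpIncrement": "leapIncrement",
    "dogHeight": "canineHeight",
}

print(lines_match("while (jumpHeight < dogHeight)",
                  "while (leapHeight < canineHeight)", known))
print(lines_match("jumpHeight += jumpIncrement;",
                  "leapHeight  +=  leapIncrement;", known))
print(lines_match("numTries++;", "numberOfAttempts++;", known))
```

Keeping the mapping in a data structure rather than in program logic is a design choice in this sketch; a pair such as `numTries` and `numberOfAttempts` goes unmatched until it is added to `known`.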
[0026] While these situation-specific test programs validated this
basic approach, and saved a significant amount of time preparing
exhibits that could be edited by hand for completeness, it was
clear that I had not yet developed a complete solution that would
meet the needs of general use over a wide range of situations.
[0027] One problem was that the translation rules and terms were
built into each custom program. This required changes to the
program each time a new rule or new matching pair of translation
equivalents was found. The required repeated modification of the
program resulted in multiple versions and constant changing of the
program.
[0028] Another problem was that each project required its own
custom program so that the program could never be finished. Another
problem was maintaining a growing set of custom programs. It was
difficult to fix software defects or to add general enhancements. A
fix to one custom program might break another custom program that
had a different set of features.
[0029] Further, testing with a broader range of test cases
revealed that many techniques for hiding illicit copying were still
not covered by these simple test programs. For example, a situation
where the illicit copier added carriage returns, words, or comments
that didn't change the essential function of the code still
defeated my early test programs. Also, some programming
environments include unique numbers on every line in a file. The
simple act of copying the contents of a file into another file will
cause every line to no longer match because of the unique numbers.
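
The unique-number problem can be sketched by removing an excluded portion of each line before comparing. The example is a minimal sketch, assuming the unique number is a leading decimal line number; the pattern in `EXCLUSIONS`, the function name `apply_exclusions`, and the sample lines are hypothetical.

```python
import re

# Hypothetical exclusion rule: a leading line number and the space after it.
EXCLUSIONS = [re.compile(r"^\s*\d+\s+")]


def apply_exclusions(line, patterns=EXCLUSIONS):
    """Remove every excluded portion from a line before it is compared."""
    for pattern in patterns:
        line = pattern.sub("", line)
    return line


line_a = "100 LET X = 1"
line_b = "2300 LET X = 1"  # same statement, renumbered in the copy

print(line_a == line_b)                                      # False
print(apply_exclusions(line_a) == apply_exclusions(line_b))  # True
```

Keeping the patterns in a list lets further rules be added as data without changing the comparison code.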
[0030] In some situations, subsets of files appearing in the same
project were found to have been translated using different
translations for the same words. My early test programs could not
handle multiple translations of the same words.
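
Multiple translations of the same word can be accommodated by mapping each original word to a set of equivalents rather than to a single word. The sketch below is illustrative only: it splits lines on whitespace for brevity, and the names `tokens_equivalent` and `lines_equivalent` and the sample equivalents (including "hound") are hypothetical.

```python
def tokens_equivalent(token_a, token_b, known):
    """True if the tokens are identical or B is a known equivalent of A."""
    return token_a == token_b or token_b in known.get(token_a, set())


def lines_equivalent(line_a, line_b, known):
    """Compare two lines token by token, allowing any known equivalent."""
    tokens_a, tokens_b = line_a.split(), line_b.split()
    return len(tokens_a) == len(tokens_b) and all(
        tokens_equivalent(a, b, known) for a, b in zip(tokens_a, tokens_b)
    )


# One original word may have several translation equivalents.
known = {"dog": {"canine", "hound"}, "quick": {"fast"}}

print(lines_equivalent("the quick dog", "the fast canine", known))  # True
print(lines_equivalent("the quick dog", "the fast hound", known))   # True
print(lines_equivalent("the quick dog", "the fast cat", known))     # False
```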
[0031] Also, the process of finding pairs of files to be compared
was still a time-consuming manual process.
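
A minimal sketch of how candidate file pairs might be ranked automatically is shown below. The similarity measure (shared identifiers divided by total distinct identifiers), the 0.5 threshold, the function names, and the sample file contents are all assumptions made for illustration; they are not taken from this application.

```python
import re
from itertools import product


def token_set(text):
    """Collect the distinct identifier-like words in a file's text."""
    return set(re.findall(r"[A-Za-z_]\w*", text))


def similarity(text_a, text_b):
    """Shared words divided by all distinct words (0.0 when both are empty)."""
    tokens_a, tokens_b = token_set(text_a), token_set(text_b)
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def likely_pairs(files_a, files_b, threshold=0.5):
    """Score every A/B combination and keep those at or above the threshold."""
    scored = [
        (similarity(text_a, text_b), name_a, name_b)
        for (name_a, text_a), (name_b, text_b)
        in product(files_a.items(), files_b.items())
    ]
    return sorted((row for row in scored if row[0] >= threshold), reverse=True)


files_a = {
    "jump.c": "int numTries = 0; numTries++;",
    "util.c": "void log_msg(char *msg);",
}
files_b = {
    "copy.c": "int numTries = 0; numTries++; return numTries;",
    "other.c": "double area(double r);",
}

for score, name_a, name_b in likely_pairs(files_a, files_b):
    print(f"{name_a} <-> {name_b}: {score:.2f}")
```

This sketch compares raw words only; pairs whose identifiers have been renamed would score low until known translations are applied before scoring.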
[0032] Further, once I produced a side-by-side listing with
marking showing the lines that were copied, it was necessary to
filter out, for example, lines that were in the public domain or
which were generally known. In some cases, an employee of one of
the parties may be the best domain expert to review what should be
filtered versus what would be proprietary or trade secret
information. However, often that person may be prevented by
protective orders from seeing both sides of the comparison. There
is a need to prepare marked up listings of either side of a
side-by-side comparison that are identical in markup and
presentation to the side-by-side listings but which contain only
the code from one of the parties.
`
`BACKGROUND SOLUTION NEEDED
[0033] What is needed is a comprehensive system that will
automatically:
[0034] (a) find and mark literal copying
[0035] (b) find and mark literal translation
[0036] (c) filter material that should be filtered
[0037] (d) identify copied material that has been filtered
[0038] (e) calculate statistics on total lines, lines copied,
lines obscured, lines filtered, and percentages
[0039] (f) identify translations that have been used
[0040] (g) identify copying even when the code was translated from
one programming language to another
[0041] (h) identify copying even when words and comments have been
changed without changing the essential function of the code
[0042] (i) provide a mechanism to identify copying even when
carriage returns have been added
[0043] (j) provide a mechanism to exclude portions of each line
prior to comparing the more meaningful portions (e.g. exclude the
unique number of each line)
[0044] (k) determine which pairs of files should be compared
[0045] (l) skip pairs of files that have little or no similarity
so that those that do have similarity can be presented sooner and
with fewer resources
[0046] (m) identify possible translations that might not yet have
become known
[0047] (n) apply customized rules based on observed techniques for
obscuring copying
`technique for obscuring copying
`0048 (o