(12) United States Patent
Gilchrist et al.

(10) Patent No.: US 6,404,907 B1
(45) Date of Patent: Jun. 11, 2002
`
(54) METHOD FOR SEQUENCING NUCLEIC ACIDS WITH REDUCED ERRORS

(75) Inventors: Rodney D. Gilchrist, Oakville; James M. Dunn, Scarborough, both of (CA)
(73) Assignee: Visible Genetics Inc., Toronto (CA)
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/345,613
(22) Filed: Jun. 25, 1999

Related U.S. Application Data
(60) Provisional application No. 60/090,887, filed on Jun. 26, 1998.

(51) Int. Cl.: G06K 9/00
(52) U.S. Cl.: 382/129; 382/173; 382/190; 435/6
(58) Field of Search: 435/6; 436/86, 436/89; 364/500; 382/129, 173, 190
`
(56) References Cited

U.S. PATENT DOCUMENTS
4,811,218 A     3/1989   Hunkapiller et al.
4,865,968 A     9/1989   Orgel et al.
5,124,247 A     6/1992   Ansorge
5,273,632 A     12/1993  Stockham et al.
5,308,751 A     5/1994   Ohkawa et al.
5,710,628 A     1/1998   Waterhouse et al.
5,712,476 A     1/1998   Renfrew et al.
5,733,729 A *   3/1998   Lipshutz, 435/6
5,776,737 A     7/1998   Dunn
5,786,142 A     7/1998   Renfrew et al.
5,849,542 A *   12/1998  Reeve et al., 435/91.1
5,853,979 A     12/1998  Green et al.
5,916,747 A     6/1999   Gilchrist et al.
6,083,699 A *   7/2000   Leushner et al., 435/6
6,195,449 B1 *  2/2001   Bogden et al., 382/129

FOREIGN PATENT DOCUMENTS
EP   86.104822.1   10/1986
EP   911182442     3/1992
EP   93250264.4    4/1994
GB   892.5772.9    11/1989
WO   WO 97/41259   11/1997
WO   WO 98/00708   1/1998
WO   WO 98/11258   3/1998
`
OTHER PUBLICATIONS
Dear and Staden, A sequence assembly and editing program for efficient management of large projects, Nucleic Acids Research, vol. 19, No. 14, 3907–3911, Oxford University Press, 1991.
Plaschke, Voss, Hahn, Ansorge, and Schackert, Doublex Sequencing in Molecular Diagnosis of Hereditary Diseases, BioTechniques 24:838–841, May 1998.
John M. Bowling, et al., “Neighboring Nucleotide Interactions During DNA Sequencing Gel Electrophoresis”, 1991 Oxford University Press, Nucleic Acids Research, vol. 19, No. 11.
R.L. Miller and T. Ohkawa, “Chain Terminator Sequencing of Double-Stranded DNA With Build-in Error Correction”, General Atomics Project 4422, Jul. 1991.
C. Tibbits, et al., “Neural Networks For Automated Basecalling of Gel-based DNA Sequencing Ladders”.
Allan M. Maxam and Walter Gilbert, “A New Method For Sequencing DNA”, Proc. Natl. Acad. Sci. USA, vol. 74, No. 2, pp. 560-564, Feb. 1977, Biochemistry.
Stefan Wiemann, “Simultaneous On-Line DNA Sequencing on Both Strands with Two Fluorescent Dyes”, Analytical Biochemistry 224, 117-121 (1995).
Ulf Landegren, et al., “DNA Diagnostics-Molecular Techniques and Automation”, Science, vol. 242, Oct. 14, 1988.
Lance B. Koutney and Edward S. Yeung, “Automated Image Analysis for Distortion Compensation in Sequencing Gel Electrophoresis”, 1369 Applied Spectroscopy, 46 (1992) Jan., No. 1, Frederick, MD, US, vol. 46, No. 1, 1992.
James B. Golden, III, et al., “Pattern Recognition for Automated DNA Sequencing: I. On-line Signal Conditioning and Feature Extraction for Basecalling”.
Michael C. Giddings, et al., “An Adaptive, Objective Oriented Strategy For Base Calling In DNA Sequence Analysis”, 4530–4540, Nucleic Acids Research, 1993, vol. 21, No. 19, 1993 Oxford University Press.
* cited by examiner
Primary Examiner: Marianne P. Allen
(74) Attorney, Agent, or Firm: Oppedahl & Larson LLP

(57) ABSTRACT
Nucleic acid polymers are sequenced by obtaining forward and reverse data sets for forward and reverse strands of a sample nucleic acid polymer. The apparent base sequences of these forward and reverse sets are determined and the apparent sequences are compared to identify any deviations from perfect complementarity. Any such deviation presents a choice between two bases, only one of which is correct. A confidence algorithm is applied to the peaks in the data sets associated with a deviation to arrive at a numerical confidence value for each of the two base choices. These confidence values are compared to each other and to a predetermined threshold, and the base represented by the peak with the better confidence value is assigned as the “correct” base, provided that its confidence value is better than the threshold. The confidence value takes into account at least one, and preferably more than one, of several specific characteristics of the peaks in the data set that were not complementary.
`
`9 Claims, 6 Drawing Sheets
`
`Oxford, Exh. 1011, p. 1
`
`

`

[Drawing sheet 1 of 6: image content not recoverable from the text extraction.]
`
`

`

[Drawing sheet 2 of 6: plot; the only recoverable axis label reads “ACCURACY OF DETECTION”.]
`
`

`

[Drawing sheet 3 of 6: FIG. 3, aligned text listings labeled 3 PRIME, 5 PRIME, REFERENCE and CORRECTED; the sequence text is not recoverable from the extraction.]
`
`

`

[Drawing sheet 4 of 6: image content not recoverable from the text extraction.]
`
`

`

`
`
[Drawing sheet 5 of 6: FIG. 5, chart; recoverable labels read “CONFIDENCE THRESHOLD = 50”, “CONFIDENCE VALUE” and “REFERENCE”.]
`
`
`
`
`

`

`
`
`
`
`
`
`
`
[Drawing sheet 6 of 6: two plot panels; the recoverable axis label reads “Lane to Lane Absolute Shift in Seconds”, and one panel label reads “FORWARD”.]
`
`
`
`
`

`

METHOD FOR SEQUENCING NUCLEIC ACIDS WITH REDUCED ERRORS

This application claims priority from U.S. Provisional Appl. No. 60/090,887, filed Jun. 26, 1998, which is incorporated herein by reference.

BACKGROUND OF THE INVENTION
During routine sequencing of DNA from samples (such as HIV genotyping after RT-PCR conversion from RNA to DNA), normally only one strand (forward or reverse) of the DNA is actually sequenced. In this case, the researcher must decide whether the output signal, and the resulting basecall, is accurate based on their experience and skill in reading sequence signals. If the signal and resulting basecall are of questionable reliability, then the researcher must start the sequencing run again in the hope of obtaining a better signal.
In some cases, the forward and reverse strands are both sequenced, such as by using two dyes on a MICROGENE CLIPPER sequencer manufactured by Visible Genetics Inc. Forward and reverse strand sequencing provides the researcher with more information and allows the researcher to evaluate the quality and reliability of the data from both strands. If the bases on both strands complement each other as expected, then this helps to confirm the reliability of the sequence information. However, in some instances, after the signal data from sequencing is assigned a base (e.g. A, C, G or T), the corresponding base on the opposite strand does not match. If the signal and resulting basecall are of questionable reliability, then the researcher must start the sequencing run again in the hope of obtaining a better signal. Alternatively, the researcher might manually review (“eyeball” analysis) the signal data from both the forward and reverse strands and make a decision on which strand's data was more reliable. Unfortunately, any such decision will vary between individual researchers and can lead to inconsistent determination of reliability within the same sequencing run. Furthermore, this kind of eyeball analysis requires special training, which makes it poorly suited for routine diagnostic applications.
It would therefore be desirable to have a method for sequencing nucleic acid polymers in which discrepancies can be resolved using automated procedures, i.e. using computerized data analysis. It is an object of the present invention to provide such a method, and an apparatus for performing the method.

SUMMARY OF THE INVENTION
In accordance with the invention, nucleic acid polymers are sequenced in a method comprising the steps of:
(a) obtaining forward and reverse data sets for forward and reverse strands of the sample nucleic acid;
(b) determining the apparent sequence of bases for the forward and reverse data sets;
(c) comparing the apparent forward and reverse sequences of bases for perfect complementarity to identify any deviations from complementarity in the apparent sequence, any such deviation presenting a choice between two bases, only one of which is correct;
(d) applying a confidence algorithm to peaks in the data set associated with a deviation to arrive at a numerical confidence value; and
(e) comparing each numerical confidence value to a predetermined threshold and selecting as the correct base the base represented by the peak which has the better numerical confidence value, provided that the numerical confidence value is better than the threshold.
The confidence algorithm takes into account at least one, and preferably more than one, of several specific characteristics of the peaks in the data sets that were not complementary.

BRIEF DESCRIPTION OF THE DRAWING
The invention will be described with respect to a drawing in several figures, of which:
FIG. 1 shows four regions of the HIV-1 genome sequenced in the analysis of HIV according to the invention;
FIG. 2 shows the improvement in accuracy in selecting one of two HIV species using both forward and reverse strands;
FIGS. 3 and 4 show a comparison of text files representing apparent base sequences (3 Prime sequence in FIG. 3 = SEQ ID No. 1, 5 Prime sequence in FIG. 3 = SEQ ID No. 2, Reference sequence in FIG. 3 = SEQ ID No. 3, 3 Prime sequence in FIG. 4 = SEQ ID No. 4, 5 Prime sequence in FIG. 4 = SEQ ID No. 5, Reference sequence in FIG. 4 = SEQ ID No. 6, and Corrected sequence in FIG. 4 = SEQ ID No. 7);
FIG. 5 shows a schematic representation of the case in which a deviation between forward (SEQ ID No. 8) and reverse (SEQ ID No. 9) sequences is observed; and
FIG. 6 shows sequence data for forward and reverse strands in which regularity/evenness of peak separation can be used as a key characteristic in determining a numerical confidence value.

DETAILED DESCRIPTION OF THE INVENTION
The purpose of the present invention is to provide a novel method and system for the reduction of errors in sequencing data, and in particular to provide a method and system which can automate the process of reconciling forward and reverse strand sequences to readily provide sequencing results of improved quality.
In the present disclosure, the invention is illustrated using sequence data taken from the TruGene HIV-1 Assay manufactured by Visible Genetics Inc. In this case, data traces containing sequence information for one amplicon from the Protease region and three amplicons from the reverse transcriptase (RT) region as shown in FIG. 1 were considered. The reference to this sequence is provided for purposes of example only, however, and to demonstrate the efficacy of the invention. Thus, in a broader sense, the present invention may be applied to the sequencing and error correction of sequencing data for any polynucleotide, including DNA and RNA sequences for any gene or gene fragment.
Error rates in HIV mutation sequencing are in the range of 5 errors/1000 bases sequenced or higher for many home-brew sequencing methods (single strand). Using the method of the invention these rates are substantially reduced to provide error rates that routinely are as low as 5 errors/100,000 bases and may reach levels as low as 2.5 errors/1,000,000 bases for a 300 base call. FIG. 2 shows the improvement in accuracy in detecting one of two HIV species using both forward and reverse strands.
The method of determining the sequence of a sample polynucleotide in accordance with the invention involves the following basic steps:
(a) obtaining forward and reverse data sets for the sample polynucleotide;
(b) identifying the sequence of bases within the forward and reverse data sets;
`
`

`

(c) comparing the sequence of bases within the forward and reverse data sets to identify any deviations from perfect complementarity in the sequences as determined for the two sets; and
(d) applying a confidence algorithm to each deviation to select the correct base from between the choices presented by the identified forward and reverse sequence.
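The comparison for complementarity can be pictured with a short sketch. It assumes, purely for illustration, that both base-called reads are plain strings read 5' to 3', that they cover exactly the same region, and that no alignment gaps are needed; the function names are hypothetical and do not come from the patent.

```python
# Minimal sketch: flag positions where the forward read and the reverse read
# are not perfectly complementary. Assumes equal-length, gap-free reads.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_deviations(forward, reverse):
    """List (position, forward_base, base_implied_by_reverse) for each mismatch."""
    implied = reverse_complement(reverse)  # reverse read expressed on the forward strand
    if len(implied) != len(forward):
        raise ValueError("this sketch expects reads of equal length")
    return [
        (i, f, r)
        for i, (f, r) in enumerate(zip(forward, implied))
        if f != r
    ]


if __name__ == "__main__":
    print(find_deviations("GATTACA", "TGTAATC"))  # perfectly complementary reads
    print(find_deviations("GATTACA", "TGTCATC"))  # one deviation
```

Each tuple returned is one of the "choices between two bases" that the confidence algorithm then has to settle.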
A variety of procedures for obtaining the forward and reverse data sets for the sample polynucleotide are known, and all can be applied in the present invention. In general, the sample polynucleotide or a complementary copy of the sample polynucleotide is combined with a sequencing primer which is extended using a template-dependent polymerase enzyme in the presence of a chain-terminating nucleotide triphosphate (e.g. a dideoxynucleotide) to produce a set of sequencing fragments the lengths of which reflect the positions of the base corresponding to the dideoxynucleotide triphosphate in the extended primer. By preparing one set of fragments for each type of base (e.g. A, C, G and T), the complete sequence for the sample polynucleotide is determined. Forward and reverse sequences are obtained by utilizing two primers which hybridize to the two strands of a duplex DNA molecule.
The preparation of fragment mixtures providing forward and reverse sequencing data sets can be performed as individual reactions, or it can be concurrent. In a concurrent procedure, forward and reverse primers with different labels are extended in the same reaction mixture. This process can involve a single extension cycle as disclosed by Wiemann et al., Anal. Biochem. 224: 117-121 (1995), or multiple bi-directional cycles (preferably using CLIP™ sequencing chemistry, Visible Genetics Inc.) as described in International Patent Publication No. WO 97-41259 entitled “Method for sequencing of nucleic acid polymers”, each of which is incorporated herein by reference. The process can also involve multiple bi-directional cycles as described in U.S. patent application Ser. No. 09/009,483, now issued as U.S. Pat. No. 6,083,699, incorporated herein by reference to the extent permitted. Thus, fragment mixtures reflecting the sequence of the forward and reverse strands of the same polynucleotide are obtained by multiple cycles of a primer extension reaction in which two differently and distinguishably labeled primers are extended in the presence of chain terminator nucleotides in a single reaction mixture. Preferred fragment mixtures utilize fluorescent labels which are detected following electrophoretic separation to produce a forward and reverse data trace for each base position.
The next step in the method of the invention is the identification of the apparent base sequence for both the forward and reverse strands of the sample polynucleotide, a process sometimes referred to as “base-calling.” The process of base-calling is theoretically quite straightforward, requiring nothing more than the sequential reading of the bases from the overlapping data traces to produce a list of bases reflecting the sequence. In practice, the process is more complicated, because of departures of actual data from the theoretical ideal. As for the initial generation of the data traces, there are various methods known for dealing with these complications to facilitate automated base-calling from real data, including those disclosed in U.S. Pat. Nos. 5,365,455 and 5,502,773, which are incorporated herein by reference.
A preferred base-calling technique is that disclosed in U.S. Pat. No. 5,853,979 entitled “Method and system for DNA sequence determination and mutation detection with reference to a standard” and International Patent Publication WO 97-02488 entitled “Method and system for DNA sequence determination and mutation detection,” each of which is incorporated herein by reference. In this method, a fragment pattern representing the positions of a selected nucleic acid base within the polymer as a function of migration time or distance is evaluated to determine one or more “normalization coefficients.” These normalization coefficients reflect the displacement, stretching or shrinking, and rate of stretching or shrinking of the fragment pattern, or segments thereof, which are necessary to obtain a suitably high degree of correlation between the fragment pattern and a standard fragment pattern which represents the positions of the selected nucleic acid base within a standard polymer actually having the known sequence as a function of migration time or distance. The normalization coefficients are then applied to the fragment pattern to produce a normalized fragment pattern which is used for base-calling in a conventional manner.
The process of comparing the experimental fragment pattern and the standard fragment pattern to arrive at normalization coefficients can be carried out in any number of ways without departing from the present invention. In general, suitable processes involve consideration of a number of trial normalizations and selection of the trial normalization which achieves the best fit in the model being employed. It will be understood, however, that the theoretical goal of achieving an exact overlap between an experimental fragment pattern and a standard fragment pattern may not be realistically achievable in practice, nor are repetitive and time-consuming calculations to obtain perfect normalization necessary to the successful use of the invention. Thus, when employing this method to facilitate base-calling, the term “high degree of normalization” refers to the maximization of the normalization which is achievable within practical constraints. As a general rule, a point-for-point correlation coefficient calculated for normalized fragment patterns and the corresponding standard fragment pattern of at least 0.8 is desirable, while a correlation coefficient of at least 0.95 is preferred.
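The idea of trying several trial normalizations and keeping the one with the best point-for-point correlation can be sketched as follows. This is only an illustration under simplifying assumptions: traces are lists of intensities sampled at equal intervals, and the only normalization coefficients tried are a displacement (`shift`) and a linear stretch (`stretch`), whereas the cited method also covers the rate of stretching. All function and parameter names are hypothetical.

```python
import math


def pearson(x, y):
    """Point-for-point (Pearson) correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def resample(trace, shift, stretch, n):
    """Read the trace at positions shift + stretch * i using linear interpolation."""
    out = []
    for i in range(n):
        pos = shift + stretch * i
        lo = math.floor(pos)
        frac = pos - lo
        a = trace[lo] if 0 <= lo < len(trace) else 0.0
        b = trace[lo + 1] if 0 <= lo + 1 < len(trace) else 0.0
        out.append(a * (1 - frac) + b * frac)
    return out


def best_normalization(trace, standard, shifts, stretches):
    """Try every (shift, stretch) pair; return (correlation, shift, stretch) of the best fit."""
    best = None
    for shift in shifts:
        for stretch in stretches:
            r = pearson(resample(trace, shift, stretch, len(standard)), standard)
            if best is None or r > best[0]:
                best = (r, shift, stretch)
    return best


if __name__ == "__main__":
    standard = [0, 0, 1, 4, 1, 0, 0, 2, 5, 2, 0, 0]
    trace = [0, 0, 0] + standard  # same pattern, delayed by three samples
    r, shift, stretch = best_normalization(trace, standard, range(6), [0.9, 1.0, 1.1])
    print(r >= 0.95, shift, stretch)
```

A caller could then compare the returned correlation against the 0.8 or 0.95 figures mentioned above before accepting the normalized pattern for base-calling.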
The result of the base-calling is two lists of bases, one for the forward strand and one for the reverse strand. Each list contains an indication of the base at a particular location (e.g. one-letter indications A, C, G and T). In some instances, a list may include one or more blanks. Such blanks are inserted by the alignment program to maximize the extent of alignment and take into account the fact that insertions or deletions within one strand may result in a shift of one portion of the strand relative to the corresponding portion of the other strand. These two lists are suitably stored in a data processor performing the sequence analysis as text files. The next step is the comparison of these two text files to determine whether there are any deviations from the theoretically expected perfect complementarity. This comparison process can be performed by any of several methods. Common to these methods is the appropriate alignment of the text listings of bases to a common starting point. This alignment involves an iterative testing of various alignment options to arrive at the best alignment. Iterative routines for accomplishing this alignment have been disclosed by Needleman et al., “A general method applicable to the search for similarities in amino acid sequences of two proteins”, J. Mol. Biol. 48: 443–453 (1970) and Smith et al., “The identification of common molecular subsequences”, J. Mol. Biol. 147: 195-197 (1981).
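As an illustration of how an alignment routine inserts blanks, here is a compact global alignment in the style of Needleman et al. It is a sketch rather than the alignment program used with the invention: the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions, and a blank is written as "-".

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return both with '-' marking blanks."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        else:
            diag = None
        if diag is not None and score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    top, bottom = needleman_wunsch("GATTACA", "GATACA")
    print(top)
    print(bottom)
```

In the example, the shorter listing receives a single blank so that the remaining bases line up position by position with the longer one.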
In a first approach, the text file listings of the forward strand and the reverse strand are each aligned with a text file listing of a standard sequence for the sample polynucleotide being sequenced (e.g. HIV-1 wild-type sub-type B in the case of the HIV example discussed below). Alternatively, the text file listings for the forward and reverse strands can be aligned to each other. It will be appreciated that the use of text files is only one option, however, and that the alignment may occur between the experimental data sets, or between the experimental data sets and a reference data set.
The important aspect is that the alignment step produces information which will allow a determination of whether or not there is a deviation in the sequence of the forward and reverse strands from the expected complementarity. When such a deviation is detected, the method of the invention provides an automated system for selecting between the options presented and generating a “correct” sequence. This selection process can take place in several steps using a confidence algorithm.
The confidence algorithm is used to assign a confidence value to each base in the forward or reverse text listing that is not confirmed by the other listing. The confidence value is a measure of the likelihood that a particular base identified in a text listing is the correct base. The confidence algorithm determines the confidence value for a peak by taking into account a variety of factors which reflect the quality of the data traces. Specific factors include:
1. separation distance between peaks;
2. regularity/evenness of peak separation;
3. peak height compared to neighbors (higher confidence if similar);
4. peak area compared to neighbors (higher confidence if similar);
5. distance to neighbors compared to the local average distance to neighbors;
6. resolution of the peak (lower confidence for lower resolution); and
7. signal-to-noise ratio in the region around the peak (lower confidence as the peak's size is more similar to the noise level).
The number of characteristics and the particular characteristics considered are a matter of design choice which is driven by the performance of the combination of chemistry and instrumentation which is used. In some systems, it may be the case that a few characteristics (e.g. two) are particularly sensitive to the causes of error, in which case determination of a numerical confidence value based on these characteristics is sufficient.
In a preferred embodiment, all of these factors are included in a weighted combination to arrive at the confidence value, although the use of fewer than all of the factors may be considered, particularly where two factors are similar (such as peak height and peak area). The confidence value is also lowered in some recognized special cases:
the peak is a heterozygote;
there are more than two overlapping peaks; or
the peak is small compared to its neighbors.
The system evaluating the data traces may also attempt to fit groups of peaks to the signal when the peaks are low resolution. These fitted peaks are also assigned confidence values using the above factors.
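A weighted combination of this kind might look like the sketch below. Everything specific in it is an assumption made for illustration: the choice of five factors, the requirement that each factor be pre-scaled to the range 0 to 1 (1 being best), the weights, and the fixed multiplicative penalty for each special case. The patent does not give numerical values for any of these.

```python
from dataclasses import dataclass

# Hypothetical weights; in practice they depend on instrument and chemistry.
WEIGHTS = {
    "spacing_regularity": 3,
    "height_similarity": 2,
    "area_similarity": 1,
    "resolution": 2,
    "signal_to_noise": 2,
}


@dataclass
class PeakFeatures:
    """Per-peak quality factors, each assumed to be scaled to 0..1 (1 is best)."""
    spacing_regularity: float
    height_similarity: float
    area_similarity: float
    resolution: float
    signal_to_noise: float
    is_heterozygote: bool = False
    overlapping_peaks: int = 1
    small_vs_neighbors: bool = False


def confidence(peak, weights=WEIGHTS, penalty=0.5):
    """Weighted average of the factors on a 0..100 scale, lowered for special cases."""
    total = sum(weights.values())
    value = sum(w * getattr(peak, name) for name, w in weights.items()) / total
    special_cases = (
        peak.is_heterozygote,
        peak.overlapping_peaks > 2,
        peak.small_vs_neighbors,
    )
    for applies in special_cases:
        if applies:
            value *= penalty  # assumed penalty; the source only says "lowered"
    return 100.0 * value


if __name__ == "__main__":
    clean = PeakFeatures(1.0, 1.0, 1.0, 1.0, 1.0)
    het = PeakFeatures(1.0, 1.0, 1.0, 1.0, 1.0, is_heterozygote=True)
    print(confidence(clean), confidence(het))
```

Scaling the result to 0..100 is a convenience so that it can be compared with percentage-style thresholds such as those shown in the figures.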
The specific weighting applied to each factor will vary to some extent with the configuration of the sequencing instrument employed and the chemistry used, since each experimental combination will tend to produce different variability which affects the accuracy of the base call. Thus, for example, some sequencing chemistries are prone to greater variability in peak height than others, such that variations in peak height might be of less significance in the confidence algorithm. Initial determination of the appropriate weighting for a given system can be done using multiple calibration runs with a known sequence and varying the weight given to the different factors to arrive at the most consistent and error-free results. In a preferred embodiment, the weights given to the different factors are updated heuristically as experimental sequences are determined.
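One simple way to picture calibration against a known sequence is an exhaustive search over candidate weights, keeping the set that makes the fewest wrong choices. The sketch below assumes a hypothetical data layout in which each calibration example holds the factor scores for the forward peak, the factor scores for the reverse peak, and a flag saying whether the forward call was the correct one; neither the layout nor the search strategy is taken from the patent.

```python
from itertools import product


def weighted_score(features, weights):
    """Weighted sum of factor scores for one peak."""
    return sum(w * f for w, f in zip(weights, features))


def calibrate(examples, grid):
    """Search all weight combinations from grid; return (weights, error_count).

    examples: list of (forward_features, reverse_features, forward_is_correct).
    """
    n_factors = len(examples[0][0])
    best_weights, best_errors = None, None
    for weights in product(grid, repeat=n_factors):
        if sum(weights) == 0:
            continue  # all-zero weights cannot discriminate between strands
        errors = 0
        for fwd, rev, forward_is_correct in examples:
            picked_forward = weighted_score(fwd, weights) > weighted_score(rev, weights)
            if picked_forward != forward_is_correct:
                errors += 1
        if best_errors is None or errors < best_errors:
            best_weights, best_errors = weights, errors
    return best_weights, best_errors


if __name__ == "__main__":
    # Two hypothetical factors per peak: (height_similarity, spacing_regularity).
    examples = [
        ((0.9, 0.2), (0.1, 0.8), False),
        ((0.8, 0.9), (0.9, 0.3), True),
        ((0.2, 0.7), (0.9, 0.1), True),
        ((0.6, 0.1), (0.5, 0.9), False),
    ]
    print(calibrate(examples, grid=[0, 1, 2]))
```

In this toy data set only the second factor tracks correctness, so the search settles on weights that ignore the first factor, which mirrors the remark above that peak height may matter less for some chemistries.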
Once the weights to be given to the confidence factors are determined, an overall numerical confidence value is calculated for each peak which indicates deviations from the expected match between the forward and reverse sequences. This calculated confidence value is then compared to a predetermined threshold value to determine whether the confidence value is sufficiently high (assuming that the characteristics are combined such that a larger number is indicative of high confidence) to accept the base as being correct. It will be appreciated that the numerical value of this threshold will depend on many factors, including the units of the measurements used for the individual factors and the level of rigor which the individual user of the invention chooses to apply. Thus, it is not possible to give meaningful numerical examples of a threshold value. Preferably the threshold value should, however, be one which, when applied in combination with the selected weights for the various factors to a standard sequence (such as M13), produces error rates of less than 1/1,000 bases over the first 300 bases of the region sequenced.
It will be appreciated that the form of the confidence algorithm can be manipulated such that a “good” result will be either high or low. When the algorithm is such that a “good” result is large, then the numerical confidence value is “better” than the threshold if it exceeds the threshold. When the algorithm is such that a “good” result is small, then the numerical confidence value is better if it is less than the threshold. Similarly, in comparing the two numerical confidence values, the one that is “better” is one that is larger in the first circumstance and smaller in the second.
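The two conventions for “better” can be captured with a single flag, as in this small sketch. The function names and the `higher_is_better` parameter are illustrative assumptions.

```python
def is_better(value, other, higher_is_better=True):
    """True if value is 'better' than other under the chosen convention."""
    return value > other if higher_is_better else value < other


def pick_base(base_a, conf_a, base_b, conf_b, threshold, higher_is_better=True):
    """Return the base with the better confidence, or None if it fails the threshold."""
    if is_better(conf_b, conf_a, higher_is_better):
        base_a, conf_a = base_b, conf_b
    return base_a if is_better(conf_a, threshold, higher_is_better) else None


if __name__ == "__main__":
    # Larger-is-better scale, e.g. a percentage.
    print(pick_base("A", 90, "G", 60, threshold=80))
    # Smaller-is-better scale, e.g. an estimated error probability.
    print(pick_base("A", 0.20, "G", 0.01, threshold=0.05, higher_is_better=False))
```

Keeping the direction in one place avoids scattering comparisons that silently assume one convention.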
The application of the confidence values to the actual sequences is suitably performed in several successive steps. If the forward and reverse sequences do not confirm each other, then if a base exists (as opposed to a blank) in both the forward and reverse experimental sequences and the confidence measure of the better of the two is above the confidence threshold currently set, then the base with the higher confidence measure is assigned. If both bases are above the confidence threshold, an additional comparison to the reference sequence may be carried out, with the base which is the same as the reference being selected in this instance. Identity with the reference sequence is not a basis for selection as the correct base in an experimental sequence in the absence of a sufficient confidence value.
If only one of the experimental sequences has a base at the location of the deviation and there is a base (as opposed to a blank) in the reference sequence and the identified base in the experimental sequence is above the confidence threshold, then the base from the experimental sequence is used as the “correct” base (the base type in the reference sequence is ignored; only the spacing information is used). If there is a blank in the reference sequence and that is confirmed by either of the experimental sequences, then a blank should be put in the corrected sequence. If none of the above conditions apply, it is recommended to put an “N” in the output sequence (the standard letter denoting that all bases are present) and mark the location as uncorrected.
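The successive rules described above can be written down as one small decision function. This is a sketch under stated assumptions: both experimental calls are already expressed on the same strand, a blank is written as "-", a larger confidence value is better, and a tie between the two confidence values goes to the forward call. The function name `resolve` and its signature are hypothetical.

```python
BLANK = "-"


def resolve(fwd, fwd_conf, rev, rev_conf, ref, threshold):
    """Choose the output symbol for one aligned position.

    fwd, rev: experimental calls on the same strand (a base or BLANK).
    ref: reference call at this position (a base or BLANK).
    Returns a base, BLANK, or "N" when the position cannot be corrected.
    """
    if fwd == rev:
        return fwd  # the two strands confirm each other

    if fwd != BLANK and rev != BLANK:
        fwd_ok = fwd_conf > threshold
        rev_ok = rev_conf > threshold
        # Both calls are credible: the reference may be used as a tie-breaker.
        if fwd_ok and rev_ok and ref in (fwd, rev):
            return ref
        # At least one call is credible: take the one with the higher confidence.
        if fwd_ok or rev_ok:
            return fwd if fwd_conf >= rev_conf else rev
        return "N"

    # Exactly one experimental sequence has a base here.
    base, conf = (fwd, fwd_conf) if fwd != BLANK else (rev, rev_conf)
    if ref != BLANK:
        # The reference only supplies spacing; its base type is ignored.
        return base if conf > threshold else "N"
    return BLANK  # blank in the reference, confirmed by one experimental blank


if __name__ == "__main__":
    print(resolve("A", 90, "G", 60, "G", 80))  # only the forward call is credible
    print(resolve("A", 90, "G", 85, "G", 80))  # both credible, reference decides
    print(resolve("A", 50, "G", 60, "A", 80))  # neither credible
```

Note that the reference never overrides a confident experimental call on its own; it is consulted only when both calls pass the threshold or when it supplies spacing information.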
The method of the invention is preferably carried out in an apparatus or system running appropriate computer code. The apparatus or system comprises at least a data processor operably programmed to perform the steps of identifying the sequence of bases within the forward and reverse data sets; comparing the sequence of bases within the forward and reverse data sets to identify any deviations from perfect complementarity in the sequences as determined for the two sets; and applying a confidence algorithm to each deviation to select the correct base from between the choices presented by the identified forward and reverse sequence. The apparatus or system further comprises means for obtaining forward and reverse data sets for the sample polynucleotide. In the case of an integrated system, this may be a direct data feed from an electrophoresis apparatus connected to the data processor. In a distributed system, the data sets can be obtained via a connection on a local area network (LAN), a wide area network (WAN), by modem or cable modem transmission or by insertion of a portable storage medium (diskette, tape etc.) into a drive capable of reading the portable storage medium. The apparatus or system further comprises means for providing useful output of the determined sequence. This may be as a video display or as a sequence listing stored on a storage medium such as a disk drive or read/write CD-ROM.
FIGS. 3 and 4 show screen output from an Intel® processor-based Hewlett-Packard Vectra VL computer (running an OpenStep Mach operating system) in which one amplicon of the RT region is being corrected for sequencing errors by analyzing both strands (shown as 3 prime (text 1) and 5 prime (text 2)). The highlighted bases on the reference show places of disagreement between the two strands (text 1 and text 2). The highlighted text in the Corrected area represents corrections according to the above embodiment of the present invention. In this case, the software allows adjustment of the confidence threshold, which as shown is set to 80%.
FIG. 5 shows a chart indicating possible outcomes if the basecall for a particular base is not confirmed by each strand. In this case, if the bases in text 1 and text 2 are both “a”, then they have confirming sequence (e.g. forward strand is A and reverse strand is T). N/A means that the confidence value for a particular base for either strand was below threshold (50% in this figure) and correction was not possible.
FIG. 6 shows sequence data for the forward and reverse strand in which the parameter of “regularity/evenness of peak separation” has been measured as lane-to-lane shift in seconds (y-axis) with the base location on the x-axis. The reverse strand shows an erratic shift in lane timing and indicates a compression in signals which is unreliable. In contrast, this parameter shows more regular peak separation in the forward strand, which may provide for an overall higher confidence value for bases on the forward strand.
In the course of evaluating the method and system of the invention with HIV sequence data, it has been observed that errors in the sequence data for one strand tend to be random with respect to errors in the sequence data of the opposite strand. As a result, it is less likely that an error will occur in the sequence data of both strands at the same base position. These errors can thereby be corrected should the reliability of one strand (confidence) be of a sufficient level. This observation highlights the utility of the present invention to create corrected sequence data in HIV genotyping, and it is believed that comparable benefits will be obtained for sequencing in general.
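To see why independent errors help, a rough worked figure can be written down. It assumes, for illustration only, that errors on the two strands are statistically independent and that each strand has the single-strand error rate of 5 errors per 1000 bases quoted earlier; the symbols p_f and p_r are introduced here for the forward and reverse per-base error rates and do not appear in the patent.

```latex
\[
P(\text{both strands wrong at the same position})
  = p_f \, p_r
  = (5 \times 10^{-3})^2
  = 2.5 \times 10^{-5}
\]
```

Under these assumptions, about 2.5 positions per 100,000 bases would carry an error on both strands at once. Every other single-strand error shows up as a deviation from complementarity and is therefore a candidate for correction by the confidence algorithm.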
`
SEQUENCE LISTING

<160> NUMBER OF SEQ ID NOS: 9

<210> SEQ ID NO 1
<211> LENGTH: 66
<212> TYPE: DNA
<213> ORGANISM: Human immunodeficiency virus
<220> FEATURE:
<223> OTHER INFORMATION: 3 Prime Reverse Transcriptase region

<400> SEQUENCE: 1

ggctgtactg tccatttatc aggatggagt tcataaccat ccaatggaat ggaggctctt    60
gctgat                                                               66

<210> SEQ ID NO 2
<211> LENGTH: 67
<212> TYPE: DNA
<213> ORGANISM: Human immunodeficiency virus
<220> FEATURE:
<223> OTHER INFORMATION: 5 Prime Reverse Transcriptase region

<400> SEQUENCE: 2

ggctgtactg tccatttatc aggatggagt tcataaccca tccaatggaa tggaggctct    60
tgctgat                                                              67

<210> SEQ ID NO 3
<211> LENGTH: 67
<212> TYPE: DNA
<213> ORGANISM: Human immunodeficiency virus
<220> FEATURE:
<223> OTHER INFORMATION: Reference Reverse Transcriptas
