Gilchrist et al.

US006404907B1
(10) Patent No.: US 6,404,907 B1
(45) Date of Patent: Jun. 11, 2002

(54) METHOD FOR SEQUENCING NUCLEIC ACIDS WITH REDUCED ERRORS

(75) Inventors: Rodney D. Gilchrist, Oakville; James M. Dunn, Scarborough, both of (CA)
(73) Assignee: Visible Genetics Inc., Toronto (CA)
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/345,613
(22) Filed: Jun. 25, 1999

Related U.S. Application Data
(60) Provisional application No. 60/090,887, filed on Jun. 26, 1998.

(51) Int. Cl. G06K 9/00
(52) U.S. Cl. 382/129; 382/173; 382/190; 435/6
(58) Field of Search 435/6; 436/86, 89; 364/500; 382/129, 173, 190
`
(56) References Cited

U.S. PATENT DOCUMENTS
4,811,218 A      3/1989   Hunkapiller et al.
4,865,968 A      9/1989   Orgel et al.
5,124,247 A      6/1992   Ansorge
5,273,632 A      12/1993  Stockham et al.
5,308,751 A      5/1994   Ohkawa et al.
5,710,628 A      1/1998   Waterhouse et al.
5,712,476 A      1/1998   Renfrew et al.
5,733,729 A *    3/1998   Lipshutz, 435/6
5,776,737 A      7/1998   Dunn
5,786,142 A      7/1998   Renfrew et al.
5,849,542 A *    12/1998  Reeve et al., 435/91.1
5,853,979 A      12/1998  Green et al.
5,916,747 A      6/1999   Gilchrist et al.
6,083,699 A *    7/2000   Leushner et al., 435/6
6,195,449 B1 *   2/2001   Bogden et al., 382/129

FOREIGN PATENT DOCUMENTS
EP   86104822.1   10/1986
EP   91118244.2   3/1992
EP   93250264.4   4/1994
GB   8925772.9    11/1989
WO   WO 9741259   11/1997
WO   WO 9800708   1/1998
WO   WO 9811258   3/1998

`OTHER PUBLICATIONS
Dear and Staden, "A sequence assembly and editing program for efficient management of large projects", Nucleic Acids Research, vol. 19, No. 14, 3907-3911, Oxford University Press, 1991.
Plaschke, Voss, Hahn, Ansorge, and Schackert, "Doublex Sequencing in Molecular Diagnosis of Hereditary Diseases", BioTechniques 24:838-841, May 1998.
John M. Bowling, et al., "Neighboring Nucleotide Interactions During DNA Sequencing Gel Electrophoresis", 1991 Oxford University Press, Nucleic Acids Research, vol. 19, No. 11.
R.L. Miller and T. Ohkawa, "Chain Terminator Sequencing of Double-Stranded DNA With Built-in Error Correction", General Atomics Project 4422, Jul. 1991.
C. Tibbits, et al., "Neural Networks For Automated Basecalling of Gel-based DNA Sequencing Ladders".
Allan M. Maxam and Walter Gilbert, "A New Method For Sequencing DNA", Proc. Natl. Acad. Sci. USA, vol. 74, No. 2, pp. 560-564, Feb. 1977, Biochemistry.
Stefan Wiemann, "Simultaneous On-Line DNA Sequencing on Both Strands with Two Fluorescent Dyes", Analytical Biochemistry 224, 117-121 (1995).
Ulf Landegren, et al., "DNA Diagnostics: Molecular Techniques and Automation", Science, vol. 242, Oct. 14, 1988.
Lance B. Koutney and Edward S. Yeung, "Automated Image Analysis for Distortion Compensation in Sequencing Gel Electrophoresis", 1369 Applied Spectroscopy, 46 (1992) Jan., No. 1, Frederick, MD, US, vol. 46, No. 1, 1992.
James B. Golden, III, et al., "Pattern Recognition for Automated DNA Sequencing: I. On-line Signal Conditioning and Feature Extraction for Basecalling".
Michael C. Giddings, et al., "An Adaptive, Objective Oriented Strategy For Base Calling In DNA Sequence Analysis", 4530-4540, Nucleic Acids Research, 1993, vol. 21, No. 19, 1993 Oxford University Press.
`* cited by examiner
Primary Examiner: Marianne P. Allen
(74) Attorney, Agent, or Firm: Oppedahl & Larson LLP

(57) ABSTRACT
Nucleic acid polymers are sequenced by obtaining forward and reverse data sets for forward and reverse strands of a sample nucleic acid polymer. The apparent base sequences of these forward and reverse sets are determined and the apparent sequences are compared to identify any deviations from perfect complementarity. Any such deviation presents a choice between two bases, only one of which is correct. A confidence algorithm is applied to the peaks in the data sets associated with a deviation to arrive at a numerical confidence value for each of the two base choices. These confidence values are compared to each other and to a predetermined threshold, and the base represented by the peak with the better confidence value is assigned as the "correct" base, provided that its confidence value is better than the threshold. The confidence value takes into account at least one, and preferably more than one, of several specific characteristics of the peaks in the data set that were not complementary.
`
`9 Claims, 6 Drawing Sheets
`
`Oxford, Exh. 1011, p. 1
`
`
`
[Sheet 1 of 6: drawing not recoverable from the text extraction]
`
`
`
[Sheet 2 of 6: plot with axis label "ACCURACY OF DETECTION"; remaining content not recoverable from the text extraction]
`
`
`
[Sheet 3 of 6: FIG. 3, aligned text listings of base sequences, including rows labeled REFERENCE and CORRECTED; the sequence text is not recoverable from the text extraction]
`
`
`
[Sheet 4 of 6: drawing not recoverable from the text extraction]
`
`
`
[Sheet 5 of 6: FIG. 5, chart with labels CONFIDENCE VALUE, REFERENCE, TEXT 2, TEXT 1 and RESULT, and the note "*CONFIDENCE THRESHOLD 50%"; remaining content not recoverable from the text extraction]
`
`
`
`
`
`
`
[Sheet 6 of 6: plot with traces labeled FORWARD and REVERSE; remaining content not recoverable from the text extraction]
`
`
`
`
`
`
METHOD FOR SEQUENCING NUCLEIC ACIDS WITH REDUCED ERRORS

This application claims priority from U.S. Provisional Appl. No. 60/090,887, filed Jun. 26, 1998, which is incorporated herein by reference.

BACKGROUND OF THE INVENTION
During routine sequencing of DNA from samples (such as HIV genotyping after RT-PCR conversion from RNA to DNA), normally only one strand (forward or reverse) of the DNA is actually sequenced. In this case, the researcher must decide whether the output signal and the resulting basecall are accurate, based on their experience and skill in reading sequence signals. If the signal and resulting basecall are of questionable reliability, then the researcher must start the sequencing run again in the hope of obtaining a better signal.
In some cases, the forward and reverse strands are both sequenced, such as by using two dyes on a MICROGENE CLIPPER sequencer manufactured by Visible Genetics Inc. Forward and reverse strand sequencing provides the researcher with more information and allows the researcher to evaluate the quality and reliability of the data from both strands. If the bases on both strands complement each other as expected, then this helps to confirm the reliability of the sequence information. However, in some instances, after the signal data from sequencing is assigned a base (e.g. A, C, G or T), the corresponding base on the opposite strand does not match. If the signal and resulting basecall are of questionable reliability, then the researcher must start the sequencing run again in the hope of obtaining a better signal. Alternatively, the researcher might manually review ("eyeball" analysis) the signal data from both the forward and reverse strands and make a decision on which strand's data was more reliable. Unfortunately, any such decision will vary between individual researchers and can lead to inconsistent determination of reliability within the same sequencing run. Furthermore, this kind of eyeball analysis requires special training, which makes it poorly suited for routine diagnostic applications.
It would therefore be desirable to have a method for sequencing nucleic acid polymers in which discrepancies can be resolved using automated procedures, i.e. using computerized data analysis. It is an object of the present invention to provide such a method, and an apparatus for performing the method.

SUMMARY OF THE INVENTION
In accordance with the invention, nucleic acid polymers are sequenced in a method comprising the steps of:
(a) obtaining forward and reverse data sets for forward and reverse strands of the sample nucleic acid;
(b) determining the apparent sequence of bases for the forward and reverse data sets;
(c) comparing the apparent forward and reverse sequences of bases for perfect complementarity to identify any deviations from complementarity in the apparent sequence, any such deviation presenting a choice between two bases, only one of which is correct;
(d) applying a confidence algorithm to peaks in the data set associated with a deviation to arrive at a numerical confidence value; and
(e) comparing each numerical confidence value to a predetermined threshold and selecting as the correct base the base represented by the peak which has the better numerical confidence value, provided that the numerical confidence value is better than the threshold.
The confidence algorithm takes into account at least one, and preferably more than one, of several specific characteristics of the peaks in the data sets that were not complementary.

BRIEF DESCRIPTION OF THE DRAWING
The invention will be described with respect to a drawing in several figures, of which:
FIG. 1 shows four regions of the HIV-1 genome sequenced in the analysis of HIV according to the invention;
FIG. 2 shows the improvement in accuracy in selecting one of two HIV species using both forward and reverse strands;
FIGS. 3 and 4 show a comparison of text files representing apparent base sequences (3 Prime sequence in FIG. 3 = SEQ ID No. 1, 5 Prime sequence in FIG. 3 = SEQ ID No. 2, Reference sequence in FIG. 3 = SEQ ID No. 3, 3 Prime sequence in FIG. 4 = SEQ ID No. 4, 5 Prime sequence in FIG. 4 = SEQ ID No. 5, Reference sequence in FIG. 4 = SEQ ID No. 6, and Corrected sequence in FIG. 4 = SEQ ID No. 7);
FIG. 5 shows a schematic representation of the outcomes when a deviation between forward (SEQ ID No. 8) and reverse (SEQ ID No. 9) sequences is observed; and
FIG. 6 shows sequence data for forward and reverse strands in which regularity/evenness of peak separation can be used as a key characteristic in determining a numerical confidence value.

DETAILED DESCRIPTION OF THE INVENTION
The purpose of the present invention is to provide a novel method and system for the reduction of errors in sequencing data, and in particular to provide a method and system which can automate the process of reconciling forward and reverse strand sequences to readily provide sequencing results of improved quality.
In the present disclosure, the invention is illustrated using sequence data taken from the TruGene HIV-1 Assay manufactured by Visible Genetics Inc. In this case, data traces containing sequence information for one amplicon from the protease region and three amplicons from the reverse transcriptase (RT) region, as shown in FIG. 1, were considered. The reference to this sequence is provided for purposes of example only, however, and to demonstrate the efficacy of the invention. Thus, in a broader sense, the present invention may be applied to the sequencing and error correction of sequencing data for any polynucleotide, including DNA and RNA sequences for any gene or gene fragment.
Error rates in HIV mutation sequencing are in the range of 5 errors/1,000 bases sequenced or higher for many home-brew sequencing methods (single strand). Using the method of the invention, these rates are substantially reduced to provide error rates that routinely are as low as 5 errors/100,000 bases and may reach levels as low as 2.5/1,000,000 bases for a 300 base call. FIG. 2 shows the improvement in accuracy in detecting one of two HIV species using both forward and reverse strands.
The method of determining the sequence of a sample polynucleotide in accordance with the invention involves the following basic steps:
(a) obtaining forward and reverse data sets for the sample polynucleotide;
(b) identifying the sequence of bases within the forward and reverse data sets;
(c) comparing the sequence of bases within the forward and reverse data sets to identify any deviations from perfect complementarity in the sequences as determined for the two sets; and
(d) applying a confidence algorithm to each deviation to select the correct base from between the choices presented by the identified forward and reverse sequence.
A variety of procedures for obtaining the forward and reverse data sets for the sample polynucleotide are known, and all can be applied in the present invention. In general, the sample polynucleotide or a complementary copy of the sample polynucleotide is combined with a sequencing primer which is extended using a template-dependent polymerase enzyme in the presence of a chain-terminating nucleotide triphosphate (e.g. a dideoxynucleotide) to produce a set of sequencing fragments, the lengths of which reflect the positions of the base corresponding to the dideoxynucleotide triphosphate in the extended primer. By preparing one set of fragments for each type of base (e.g. A, C, G and T), the complete sequence for the sample polynucleotide is determined. Forward and reverse sequences are obtained by utilizing two primers which hybridize to the two strands of a duplex DNA molecule.
The preparation of fragment mixtures providing forward and reverse sequencing data sets can be performed as individual reactions, or it can be concurrent. In a concurrent procedure, forward and reverse primers with different labels are extended in the same reaction mixture. This process can involve a single extension cycle as disclosed by Wiemann et al., Anal. Biochem. 224: 117-121 (1995), or multiple bi-directional cycles (preferably using CLIP™ sequencing chemistry, Visible Genetics Inc.) as described in International Patent Publication No. WO 97-41259 entitled "Method for sequencing of nucleic acid polymers", each of which is incorporated herein by reference. The process can also involve multiple bi-directional cycles as described in U.S. patent application Ser. No. 09/009,483, now issued as U.S. Pat. No. 6,083,699, incorporated herein by reference to the extent permitted. Thus, fragment mixtures reflecting the sequence of the forward and reverse strands of the same polynucleotide are obtained by multiple cycles of a primer extension reaction in which two differently and distinguishably labeled primers are extended in the presence of chain terminator nucleotides in a single reaction mixture. Preferred fragment mixtures utilize fluorescent labels which are detected following electrophoretic separation to produce a forward and reverse data trace for each base position.
The next step in the method of the invention is the identification of the apparent base sequence for both the forward and reverse strands of the sample polynucleotide, a process sometimes referred to as "base-calling." The process of base-calling is theoretically quite straightforward, requiring nothing more than the sequential reading of the bases from the overlapping data traces to produce a list of bases reflecting the sequence. In practice, the process is more complicated, because of departures of actual data from the theoretical ideal. As for the initial generation of the data traces, there are various methods known for dealing with these complications to facilitate automated base-calling from real data, including those disclosed in U.S. Pat. Nos. 5,365,455 and 5,502,773, which are incorporated herein by reference.
A preferred base-calling technique is that disclosed in U.S. Pat. No. 5,853,979 entitled "Method and system for DNA sequence determination and mutation detection with reference to a standard" and International Patent Publication WO 97-02488 entitled "Method and system for DNA sequence determination and mutation detection," each of which is incorporated herein by reference. In this method, a fragment pattern representing the positions of a selected nucleic acid base within the polymer as a function of migration time or distance is evaluated to determine one or more "normalization coefficients." These normalization coefficients reflect the displacement, stretching or shrinking, and rate of stretching or shrinking of the fragment pattern, or segments thereof, which are necessary to obtain a suitably high degree of correlation between the fragment pattern and a standard fragment pattern which represents the positions of the selected nucleic acid base within a standard polymer actually having the known sequence as a function of migration time or distance. The normalization coefficients are then applied to the fragment pattern to produce a normalized fragment pattern which is used for base-calling in a conventional manner.
The process of comparing the experimental fragment pattern and the standard fragment pattern to arrive at normalization coefficients can be carried out in any number of ways without departing from the present invention. In general, suitable processes involve consideration of a number of trial normalizations and selection of the trial normalization which achieves the best fit in the model being employed. It will be understood, however, that the theoretical goal of achieving an exact overlap between an experimental fragment pattern and a standard fragment pattern may not be realistically achievable in practice, nor are repetitive and time-consuming calculations to obtain perfect normalization necessary to the successful use of the invention. Thus, when employing this method to facilitate base-calling, the term "high degree of normalization" refers to the maximization of the normalization which is achievable within practical constraints. As a general rule, a point-for-point correlation coefficient calculated for normalized fragment patterns and the corresponding standard fragment pattern of at least 0.8 is desirable, while a correlation coefficient of at least 0.95 is preferred.
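
The selection of a trial normalization by best fit can be illustrated with a minimal sketch. It assumes that the fragment patterns are equal-length lists of signal samples, that the point-for-point correlation is a Pearson coefficient, and that only displacement (not stretching or shrinking) is tried. The names `correlation`, `shift` and `best_shift` are hypothetical and do not come from the referenced patents.

```python
import math

def correlation(a, b):
    """Point-for-point Pearson correlation of two equal-length traces."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("traces must be non-empty and of equal length")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat trace carries no pattern to correlate
    return cov / math.sqrt(var_a * var_b)

def shift(trace, k):
    """Displace a trace by k samples (positive = later), padding with zeros."""
    n = len(trace)
    if k >= 0:
        return [0.0] * min(k, n) + list(trace[: max(n - k, 0)])
    return list(trace[-k:]) + [0.0] * min(-k, n)

def best_shift(experimental, standard, max_shift=5):
    """Try each trial displacement and keep the one with the best correlation."""
    trials = range(-max_shift, max_shift + 1)
    return max(trials, key=lambda k: correlation(shift(experimental, k), standard))

# Hypothetical data: the experimental pattern lags the standard by 2 samples.
standard = [0, 0, 1, 5, 1, 0, 0, 3, 0, 0]
experimental = shift(standard, 2)
k = best_shift(experimental, standard, max_shift=3)
fit = correlation(shift(experimental, k), standard)
```

In this sketch a fit of at least 0.8 (preferably 0.95) would be taken as a sufficiently high degree of normalization; a fuller model would also search over stretch factors.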
The result of the base-calling is two lists of bases, one for the forward strand and one for the reverse strand. Each list contains an indication of the base at a particular location (e.g. one-letter indications A, C, G and T). In some instances, a list may include one or more blanks. Such blanks are inserted by the alignment program to maximize the extent of alignment and take into account the fact that insertions or deletions within one strand may result in a shift of one portion of the strand relative to the corresponding portion of the other strand. These two lists are suitably stored as text files in a data processor performing the sequence analysis. The next step is the comparison of these two text files to determine whether there are any deviations from the theoretically expected perfect complementarity. This comparison process can be performed by any of several methods. Common to these methods is the appropriate alignment of the text listings of bases to a common starting point. This alignment involves an iterative testing of various alignment options to arrive at the best alignment. Iterative routines for accomplishing this alignment have been disclosed by Needleman et al., "A general method applicable to the search for similarities in amino acid sequences of two proteins," J. Mol. Biol. 48: 443-453 (1970) and Smith et al., "The identification of common molecular subsequences," J. Mol. Biol. 147: 195-197 (1981).
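
The insertion of blanks during alignment can be illustrated with a minimal sketch of a global alignment in the style of Needleman et al. The scoring values, the use of "-" as the blank, and the name `align` are assumptions for illustration, not parameters taken from the patent.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-2, blank="-"):
    """Globally align two base listings, inserting blanks where needed."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for placing seq_a[i-1] opposite seq_b[j-1].
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    # Fill the score table with the best of: pair, blank in b, blank in a.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append(blank)
            i -= 1
        else:
            out_a.append(blank)
            out_b.append(seq_b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

# Hypothetical listings: the second is missing one base relative to the first.
aligned_a, aligned_b = align("GATTACA", "GATACA")
```

Here both listings are assumed to be written in the same orientation; in practice the reverse-strand listing would first be converted to its reverse complement or aligned against a reference.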
In a first approach, the text file listings of the forward strand and the reverse strand are each aligned with a text file listing of a standard sequence for the sample polynucleotide being sequenced (e.g. HIV-1 wild-type sub-type B in the case of the HIV example discussed below). Alternatively, the text file listings for the forward and reverse strands can be aligned to each other. It will be appreciated that the use of text files is only one option, however, and that the alignment may occur between the experimental data sets, or between the experimental data sets and a reference data set.
The important aspect is that the alignment step produces information which will allow a determination of whether or not there is a deviation in the sequence of the forward and reverse strands from the expected complementarity. When such a deviation is detected, the method of the invention provides an automated system for selecting between the options presented and generating a "correct" sequence. This selection process can take place in several steps using a confidence algorithm.
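
Detecting deviations from complementarity in two aligned listings can be sketched as follows. The sketch assumes both listings are already aligned column by column, that the reverse listing holds the base called on the opposite strand at each column, and that "-" marks a blank; `COMPLEMENT` and `find_deviations` are hypothetical names.

```python
# Watson-Crick complements; a blank pairs only with a blank.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "-": "-"}

def find_deviations(forward, reverse):
    """Return (position, forward base, base implied by the reverse strand)
    for every column where the two strands are not complementary."""
    if len(forward) != len(reverse):
        raise ValueError("listings must be aligned to the same length")
    deviations = []
    for position, (f, r) in enumerate(zip(forward, reverse)):
        implied = COMPLEMENT[r]  # reverse call expressed as a forward-strand base
        if f != implied:
            deviations.append((position, f, implied))
    return deviations

# Hypothetical aligned listings with a mismatch at column 3 and a blank at column 4.
deviations = find_deviations("ACGT-A", "TGCTTT")
```

Each reported deviation presents the choice between two candidates that the confidence algorithm then has to resolve.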
The confidence algorithm is used to assign a confidence value to each base in the forward or reverse text listing that is not confirmed by the other listing. The confidence value is a measure of the likelihood that a particular base identified in a text listing is the correct base. The confidence algorithm determines the confidence value for a peak by taking into account a variety of factors which reflect the quality of the data traces. Specific factors include:
1. separation distance between peaks;
2. regularity/evenness of peak separation;
3. peak height compared to neighbors (higher confidence if similar);
4. peak area compared to neighbors (higher confidence if similar);
5. distance to neighbors compared to the local average distance to neighbors;
6. resolution of the peak (lower confidence for lower resolution); and
7. signal-to-noise ratio in the region around the peak (lower confidence as the peak's size is more similar to the noise level).
The number of characteristics and the particular characteristics considered are a matter of design choice which is driven by the performance of the combination of chemistry and instrumentation which is used. In some systems, it may be the case that a few characteristics (e.g. two) are particularly sensitive to the causes of error, in which case determination of a numerical confidence value based on these characteristics is sufficient.
In a preferred embodiment, all of these factors are included in a weighted combination to arrive at the confidence value, although the use of fewer than all of the factors may be considered, particularly where two factors are similar (such as peak height and peak area). The confidence value is also lowered in some recognized special cases:
the peak is a heterozygote;
there are more than two overlapping peaks; or
the peak is small compared to its neighbors.
The system evaluating the data traces may also attempt to fit groups of peaks to the signal when the peaks are low resolution. These fitted peaks are also assigned confidence values using the above factors.
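
A weighted combination of the listed factors might be sketched as follows. This is only an illustration under stated assumptions: each factor is taken to be pre-scaled to the range 0 to 1 with larger meaning better quality, the weights are equal, and the special cases are handled by a multiplicative penalty. None of these choices, nor the names `PeakFactors`, `WEIGHTS` and `confidence`, come from the patent.

```python
from dataclasses import dataclass

@dataclass
class PeakFactors:
    # Each factor is assumed to be scaled to 0..1, larger = better quality.
    separation: float
    regularity: float
    height_similarity: float
    area_similarity: float
    neighbor_distance: float
    resolution: float
    signal_to_noise: float

# Hypothetical equal weights; a real system would calibrate these.
WEIGHTS = {
    "separation": 1.0,
    "regularity": 1.0,
    "height_similarity": 1.0,
    "area_similarity": 1.0,
    "neighbor_distance": 1.0,
    "resolution": 1.0,
    "signal_to_noise": 1.0,
}

def confidence(factors, weights=WEIGHTS, heterozygote=False,
               overlapping_peaks=1, small_peak=False, penalty=0.5):
    """Weighted average of the factors, lowered for recognized special cases."""
    total = sum(weights.values())
    value = sum(w * getattr(factors, name) for name, w in weights.items()) / total
    if heterozygote:
        value *= penalty
    if overlapping_peaks > 2:
        value *= penalty
    if small_peak:
        value *= penalty
    return value

clean_peak = PeakFactors(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
clean_value = confidence(clean_peak)
heterozygote_value = confidence(clean_peak, heterozygote=True)
```

Dropping a factor, as the text allows when two factors are similar, corresponds to leaving it out of the weights dictionary.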
The specific weighting applied to each factor will vary to some extent with the configuration of the sequencing instrument employed and the chemistry used, since each experimental combination will tend to produce different variability which affects the accuracy of the base call. Thus, for example, some sequencing chemistries are prone to greater variability in peak height than others, such that variations in peak height might be of less significance in the confidence algorithm. Initial determination of the appropriate weighting for a given system can be done using multiple calibration runs with a known sequence and varying the weight given to the different factors to arrive at the most consistent and error-free results. In a preferred embodiment, the weights given to the different factors are updated heuristically as experimental sequences are determined.
Once the weights to be given to the confidence factors are determined, an overall numerical confidence value is calculated for each peak which indicates deviations from the expected match between the forward and reverse sequences. This calculated confidence value is then compared to a predetermined threshold value to determine whether the confidence value is sufficiently high (assuming that the characteristics are combined such that a larger number is indicative of high confidence) to accept the base as being correct. It will be appreciated that the numerical value of this threshold will depend on many factors, including the units of the measurements used for the individual factors and the level of rigor which the individual user of the invention chooses to apply. Thus, it is not possible to give meaningful numerical examples of a threshold value. Preferably the threshold value should, however, be one which, when applied in combination with the selected weights for the various factors to a standard sequence (such as M13), produces error rates of less than 1/1,000 bases over the first 300 bases of the region sequenced.
It will be appreciated that the form of the confidence algorithm can be manipulated such that a "good" result will be either high or low. When the algorithm is such that a "good" result is large, then the numerical confidence value is "better" than the threshold if it exceeds the threshold. When the algorithm is such that a "good" result is small, then the numerical confidence value is better if it is less than the threshold. Similarly, in comparing the two numerical confidence values, the one that is "better" is the one that is larger in the first circumstance and smaller in the second.
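
The direction-dependent meaning of "better" can be captured in a single comparison. The sketch below is an assumption-level illustration; `is_better` and its `higher_is_better` flag are hypothetical names.

```python
def is_better(value, other, higher_is_better=True):
    """True if value is 'better' than other under the chosen convention."""
    return value > other if higher_is_better else value < other

# Convention 1: a good result is large, so 0.9 beats a 0.8 threshold.
large_is_good = is_better(0.9, 0.8)
# Convention 2: a good result is small, so 0.1 beats a 0.2 threshold.
small_is_good = is_better(0.1, 0.2, higher_is_better=False)
```

The same function serves both for comparing a confidence value to the threshold and for comparing the two candidate values to each other.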
The application of the confidence values to the actual sequences is suitably performed in several successive steps. If the forward and reverse sequences do not confirm each other, then if a base exists (as opposed to a blank) in both the forward and reverse experimental sequences and the confidence measure of the better of the two is above the confidence threshold currently set, then the base with the higher confidence measure is assigned. If both bases are above the confidence threshold, an additional comparison to the reference sequence may be carried out, with the base which is the same as the reference being selected in this instance. Identity with the reference sequence is not a basis for selection as the correct base in an experimental sequence in the absence of a sufficient confidence value.
If only one of the experimental sequences has a base at the location of the deviation, there is a base (as opposed to a blank) in the reference sequence, and the identified base in the experimental sequence is above the confidence threshold, then the base from the experimental sequence is used as the "correct" base (the base type in the reference sequence is ignored; only the spacing information is used). If there is a blank in the reference sequence and that is confirmed by either of the experimental sequences, then a blank should be put in the corrected sequence. If none of the above conditions apply, it is recommended to put an "N" in the output sequence (the standard letter denoting that any base may be present) and mark the location as uncorrected.
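
The successive selection steps can be sketched as one function. This is one possible reading of the rules above, under these assumptions: a larger confidence value is better, "-" denotes a blank, both experimental bases are expressed in forward-strand terms, and when both bases pass the threshold but neither matches the reference the higher-confidence base is kept. The name `resolve` and the returned "corrected" flag are hypothetical.

```python
BLANK = "-"

def resolve(fwd, fwd_conf, rev, rev_conf, ref, threshold):
    """Return (base, corrected) for one aligned position."""
    if fwd == rev:
        return fwd, True  # the strands confirm each other
    if fwd != BLANK and rev != BLANK:
        fwd_ok = fwd_conf > threshold
        rev_ok = rev_conf > threshold
        if fwd_ok and rev_ok:
            # Both are credible: the reference may break the tie.
            if fwd == ref:
                return fwd, True
            if rev == ref:
                return rev, True
        if fwd_ok or rev_ok:
            return (fwd if fwd_conf >= rev_conf else rev), True
        return "N", False  # matching the reference alone is not sufficient
    # Exactly one experimental sequence has a base at this position.
    base, conf = (fwd, fwd_conf) if fwd != BLANK else (rev, rev_conf)
    if ref != BLANK:
        if conf > threshold:
            return base, True
        return "N", False
    return BLANK, True  # blank in the reference confirmed by one strand

# Hypothetical calls with a threshold of 0.8.
only_forward_passes = resolve("A", 0.9, "G", 0.6, "G", 0.8)
reference_breaks_tie = resolve("A", 0.9, "G", 0.85, "G", 0.8)
neither_passes = resolve("A", 0.5, "G", 0.6, "A", 0.8)
```

Positions returned with the flag set to False would be marked as uncorrected in the output sequence.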
The method of the invention is preferably carried out in an apparatus or system running appropriate computer code. The apparatus or system comprises at least a data processor operably programmed to perform the steps of identifying the sequence of bases within the forward and reverse data sets; comparing the sequence of bases within the forward and reverse data sets to identify any deviations from perfect complementarity in the sequences as determined for the two sets; and applying a confidence algorithm to each deviation to select the correct base from between the choices presented by the identified forward and reverse sequence. The apparatus or system further comprises means for obtaining forward and reverse data sets for the sample polynucleotide. In the case of an integrated system, this may be a direct data feed from an electrophoresis apparatus connected to the data processor. In a distributed system, the data sets can be obtained via a connection on a local area network (LAN), a wide area network (WAN), by modem or cable modem transmission, or by insertion of a portable storage medium (diskette, tape etc.) into a drive capable of reading the portable storage medium. The apparatus or system further comprises means for providing useful output of the determined sequence. This may be as a video display or as a sequence listing stored on a storage medium such as a disk drive or read/write CD-ROM.
FIGS. 3 and 4 show screen output from an Intel® processor-based Hewlett-Packard Vectra VL computer (running an OpenStep Mach operating system) in which one amplicon of the RT region is being corrected for sequencing errors by analyzing both strands (shown as 3 prime (text 1) and 5 prime (text 2)). The highlighted bases on the reference show places of disagreement between the two strands (text 1 and text 2). The highlighted text in the Corrected area represents corrections according to the above embodiment of the present invention. In this case, the software allows adjustment of the confidence threshold, which as shown is set to 80%.
FIG. 5 shows a chart indicating possible outcomes if the basecall for a particular base is not confirmed by each strand. In this case, if the base in text 1 and text 2 is a, then the strands have confirming sequence (e.g. forward strand is A and reverse strand is T). N/A means that the confidence value for a particular base for either strand was below threshold (50% in this figure) and correction was not possible.
FIG. 6 shows sequence data for the forward and reverse strand in which the parameter of "regularity/evenness of peak separation" has been measured as lane-to-lane shift in seconds (y-axis) with the base location on the x-axis. The reverse strand shows an erratic shift in lane timing and indicates a compression in signals which is unreliable. In contrast, this parameter shows more regular peak separation in the forward strand, which may provide for an overall higher confidence value for bases on the forward strand.
In the course of evaluating the method and system of the invention with HIV sequence data, it has been observed that errors in the sequence data for one strand tend to be random with respect to errors in the sequence data of the opposite strand. As a result, it is less likely that an error will occur in the sequence data of both strands at the same base position. These errors can thereby be corrected should the reliability of one strand (confidence) be of a sufficient level. This observation highlights the utility of the present invention to create corrected sequence data in HIV genotyping, and it is believed that comparable benefits will be obtained for sequencing in general.
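
The effect of errors being random between strands can be illustrated with simple arithmetic. The sketch assumes strict statistical independence between the strands and uses the single-strand rate of 5 errors per 1,000 bases quoted earlier for both strands; it is not a derivation of the error rates reported in this disclosure, and `coincident_error_rate` is a hypothetical name.

```python
def coincident_error_rate(p_forward, p_reverse):
    """Probability that both strands are wrong at the same position,
    assuming errors on the two strands are independent."""
    return p_forward * p_reverse

single_strand = 5 / 1000  # single-strand error rate quoted in the text
both_strands = coincident_error_rate(single_strand, single_strand)
# both_strands is about 2.5e-05, i.e. roughly 2.5 positions per 100,000 bases
```

Under this assumption, positions where both strands err are far rarer than positions where exactly one strand errs, which is why a confidence-based choice between the two calls can correct most deviations.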
`
SEQUENCE LISTING

<160> NUMBER OF SEQ ID NOS: 9

<210> SEQ ID NO 1
<211> LENGTH: 66
<212> TYPE: DNA
<213> ORGANISM: Human immunodeficiency virus
<220> FEATURE:
<223> OTHER INFORMATION: 3 Prime Reverse Transcriptase region

<400> SEQUENCE: 1

ggctgtactg tccatttatc aggatggagt tcataaccat ccaatggaat ggaggctctt    60
gctgat                                                               66

<210> SEQ ID NO 2
<211> LENGTH: 67
<212> TYPE: DNA
<213> ORGANISM: Human immunodeficiency virus
<220> FEATURE:
<223> OTHER INFORMATION: 5 Prime Reverse Transcriptase region

<400> SEQUENCE: 2

ggctgtactg tccatttatc aggatggagt tcataaccca tccaatggaa tggaggctct    60
tgctgat                                                              67

<210> SEQ ID NO 3
<211> LENGTH: 67
<212> TYPE: DNA
<213> ORGANISM: Human immunodeficiency virus
<220> FEATURE:
<223> OTHER INFORMATION: Reference Reverse Transcriptase region