(19) World Intellectual Property Organization
International Bureau

(43) International Publication Date: 26 September 2013 (26.09.2013)

(10) International Publication Number: WO 2013/142389 A1

(51) International Patent Classification: C04B 20/04 (2006.01)

(21) International Application Number: PCT/US2013/032665

(22) International Filing Date: 15 March 2013 (15.03.2013)

(25) Filing Language: English

(26) Publication Language: English

(30) Priority Data:
61/613,413, 20 March 2012 (20.03.2012), US
61/625,623, 17 April 2012 (17.04.2012), US
61/625,319, 17 April 2012 (17.04.2012), US

(71) Applicant: UNIVERSITY OF WASHINGTON THROUGH ITS CENTER FOR COMMERCIALIZATION [US/US]; 4311 11th Avenue NE, Seattle, WA 98105-4608 (US).

(72) Inventors: SCHMITT, Michael; Seattle, WA 98105 (US). SALK, Jesse; Seattle, WA 98105 (US). LOEB, Lawrence, A.; Bellevue, WA (US).

(74) Agent: DUEPPEN, Lara, J.; Perkins Coie LLP, P.O. Box 1208, Seattle, WA 98111-1208 (US).

(81) Designated States (unless otherwise indicated, for every kind of national protection available): AE, AG, AL, AM, AO, AT, AU, AZ, BA, BB, BG, BH, BN, BR, BW, BY, BZ, CA, CH, CL, CN, CO, CR, CU, CZ, DE, DK, DM, DO, DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN, HR, HU, ID, IL, IN, IS, JP, KE, KG, KM, KN, KP, KR, KZ, LA, LC, LK, LR, LS, LT, LU, LY, MA, MD, ME, MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI, NO, NZ, OM, PA, PE, PG, PH, PL, PT, QA, RO, RS, RU, RW, SC, SD, SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ, TM, TN, TR, TT, TZ, UA, UG, US, UZ, VC, VN, ZA, ZM, ZW.

(84) Designated States (unless otherwise indicated, for every kind of regional protection available): ARIPO (BW, GH, GM, KE, LR, LS, MW, MZ, NA, RW, SD, SL, SZ, TZ, UG, ZM, ZW), Eurasian (AM, AZ, BY, KG, KZ, RU, TJ, TM), European (AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LT, LU, LV, MC, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TR), OAPI (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, ML, MR, NE, SN, TD, TG).

(54) Title: METHODS OF LOWERING THE ERROR RATE OF MASSIVELY PARALLEL DNA SEQUENCING USING DUPLEX CONSENSUS SEQUENCING

[Continued on next page]
`
[Figure 1 (front-page drawing; recovered labels: SMI sequences; T-tailed DNA fragment; ligation and size selection)]
`
(57) Abstract: Next Generation DNA sequencing promises to revolutionize clinical medicine and basic research. However, while this technology has the capacity to generate hundreds of billions of nucleotides of DNA sequence in a single experiment, the error rate of approximately 1% results in hundreds of millions of sequencing mistakes. These scattered errors can be tolerated in some applications but become extremely problematic when "deep sequencing" genetically heterogeneous mixtures, such as tumors or mixed microbial populations. To overcome limitations in sequencing accuracy, a method termed Duplex Consensus Sequencing (DCS) is provided. This approach greatly reduces errors by independently tagging and sequencing each of the two strands of a DNA duplex. As the two strands are complementary, true mutations are found at the same position in both strands. In contrast, PCR or sequencing errors will result in errors in only one strand.
`
[Figure 1, continued (recovered labels: PCR with flow-cell adaptor sequence; capture target regions; second round PCR; massively parallel sequencing)]
`
`PGDX EX. 1006
`Page 1 of 70
`
`
`
`
`
`
`
`
Published:
- with sequence listing part of description (Rule 5.2(a))
- with international search report (Art. 21(3))
`
`
`
`
`WO2013/142389
`
`PCT/US2013/032665
`
`METHODS OF LOWERING THE ERROR RATE OF MASSIVELY PARALLEL DNA
`
`SEQUENCING USING DUPLEX CONSENSUS SEQUENCING
`
`PRIORITY CLAIM
`
`[0001]
`
This application claims priority to U.S. Provisional Patent Application No. 61/613,413, filed March 20, 2012; U.S. Provisional Patent Application No. 61/625,623, filed April 17, 2012; and U.S. Provisional Patent Application No. 61/625,319, filed April 17, 2012; the subject matter of all of which is hereby incorporated by reference as if fully set forth herein.
`
`STATEMENT OF GOVERNMENT INTEREST
`
`[0002]
`
The present invention was made with government support under Grant Nos. R01 CA115802 and R01 CA102029 awarded by the National Institutes of Health. The Government has certain rights in the invention.
`
`BACKGROUND
`
`[0003]
`
The advent of massively parallel DNA sequencing has ushered in a new era of genomic exploration by making simultaneous genotyping of hundreds of billions of base-pairs possible at a small fraction of the time and cost of traditional Sanger methods [1]. Because these technologies digitally tabulate the sequence of many individual DNA fragments, unlike conventional techniques which simply report the average genotype of an aggregate collection of molecules, they offer the unique ability to detect minor variants within heterogeneous mixtures [2].
`
`[0004]
`
This concept of “deep sequencing” has been implemented in a variety of fields including metagenomics [3, 4], paleogenomics [5], forensics [6], and human genetics [7, 8] to disentangle subpopulations in complex biological samples. Clinical applications, such as prenatal screening for fetal aneuploidy [9, 10], early detection of cancer [11] and monitoring its response to therapy [12, 13] with nucleic acid-based serum biomarkers, are rapidly being developed. Exceptional diversity within microbial [14, 15], viral [16-18] and tumor cell populations [19, 20] has been characterized through next-generation sequencing, and many low-frequency, drug-resistant variants of therapeutic importance have been so identified [12, 21, 22]. Previously unappreciated intra-organismal mosaicism in both the nuclear [23] and mitochondrial [24, 25] genome has been revealed by these technologies, and such somatic heterogeneity, along with that arising within the adaptive immune system [13], may be an important factor in phenotypic variability of disease.
`
`[0005]
`
Deep sequencing, however, has limitations. Although, in theory, DNA subpopulations of any size should be detectable when deep sequencing a sufficient number of molecules, a practical limit of detection is imposed by errors introduced during sample preparation and sequencing. PCR amplification of heterogeneous mixtures can result in population skewing due to stochastic and non-stochastic amplification biases and lead to over- or under-representation of particular variants [26]. Polymerase mistakes during pre-amplification generate point mutations resulting from base mis-incorporations and rearrangements due to template switching [26, 27]. Combined with the additional errors that arise during cluster amplification, cycle sequencing and image analysis, approximately 1% of bases are incorrectly identified, depending on the specific platform and sequence context [2, 28]. This background level of artifactual heterogeneity establishes a limit below which the presence of true rare variants is obscured [29].
`
`[0006]
`
A variety of improvements at the level of biochemistry [30-32] and data processing [19, 21, 28, 32, 33] have been developed to improve sequencing accuracy. The ability to resolve subpopulations below 0.1%, however, has remained elusive. Although several groups have attempted to increase sensitivity of sequencing, several limitations remain. For example, techniques whereby DNA fragments to be sequenced are each uniquely tagged [34, 35] prior to amplification [36-41] have been reported. Because all amplicons derived from a particular starting molecule will bear its specific tag, any variation in the sequence or copy number of identically tagged sequencing reads can be discounted as technical error. This approach has been used to improve counting accuracy of DNA [38, 39, 41] and RNA templates [37, 38, 40] and to correct base errors arising during PCR or sequencing [36, 37, 39]. Kinde et al. reported a reduction in error frequency of approximately 20-fold with a tagging method that is based on labeling single-stranded DNA fragments with a primer containing a 14 bp degenerate sequence. This allowed for an observed mutation frequency of ~0.001% mutations/bp in normal human genomic DNA [36]. Nevertheless, a number of highly sensitive genetic assays have indicated that the true mutation frequency in normal cells is likely to be far lower, with estimates of per-nucleotide mutation frequencies generally ranging from 10^-8 to 10^-11 [42]. Thus, the mutations seen in normal human genomic DNA by Kinde et al. are likely the result of significant technical artifacts.
`
`[0007]
`
Traditionally, next-generation sequencing platforms rely upon generation of sequence data from a single strand of DNA. As a consequence, artifactual mutations introduced during the initial rounds of PCR amplification are undetectable as errors, even with tagging techniques, if the base change is propagated to all subsequent PCR duplicates. Several types of DNA damage are highly mutagenic and may lead to this scenario. Spontaneous DNA damage arising from normal metabolic processes results in thousands of damaging events per cell per day [43]. In addition to damage from oxidative cellular processes, further DNA damage is generated ex vivo during tissue processing and DNA extraction [44]. These damage events can result in frequent copying errors by DNA polymerases: for example, a common DNA lesion arising from oxidative damage, 8-oxo-guanine, has the propensity to incorrectly pair with adenine during complementary strand extension with an overall efficiency greater than that of correct pairing with cytosine, and thus can contribute a large frequency of artifactual G→T mutations [45]. Likewise, deamination of cytosine to form uracil is a particularly common event which leads to the inappropriate insertion of adenine during PCR, thus producing artifactual C→T mutations with a frequency approaching 100% [46].
`
`[0008]
`
It would be desirable to develop an approach for tag-based error correction which reduces or eliminates artifactual mutations arising from DNA damage, PCR errors, and sequencing errors; allows rare variants in heterogeneous populations to be detected with unprecedented sensitivity; and capitalizes on the redundant information stored in complexed double-stranded DNA.
`
`SUMMARY
`
`[0009]
`
In one embodiment, a single molecule identifier (SMI) adaptor molecule for use in sequencing a double-stranded target nucleic acid molecule is provided. Said SMI adaptor molecule includes a single molecule identifier (SMI) sequence which comprises a degenerate or semi-degenerate DNA sequence; and an SMI ligation adaptor that allows the SMI adaptor molecule to be ligated to the double-stranded target nucleic acid sequence. The SMI sequence may be single-stranded or double-stranded. In some embodiments, the double-stranded target nucleic acid molecule is a double-stranded DNA or RNA molecule.
`
`[0010]
`
In another embodiment, a method of obtaining the sequence of a double-stranded target nucleic acid (also known as Duplex Consensus Sequencing or DCS) is provided. Such a method may include steps of ligating a double-stranded target nucleic acid molecule to at least one SMI adaptor molecule to form a double-stranded SMI-target nucleic acid complex; amplifying the double-stranded SMI-target nucleic acid complex, resulting in a set of amplified SMI-target nucleic acid products; and sequencing the amplified SMI-target nucleic acid products.
`
`[0011]
`
In some embodiments, the method may additionally include generating an error-corrected double-stranded consensus sequence by (i) grouping the sequenced SMI-target nucleic acid products into families of paired target nucleic acid strands based on a common set of SMI sequences; and (ii) removing paired target nucleic acid strands having one or more nucleotide positions where the paired target nucleic acid strands are non-complementary (or alternatively removing individual nucleotide positions in cases where the sequence at the nucleotide position under consideration disagrees among the two strands). In further embodiments, the method confirms the presence of a true mutation by (i) identifying a mutation present in the paired target nucleic acid strands having one or more nucleotide positions that disagree; (ii) comparing the mutation present in the paired target nucleic acid strands to the error corrected double-stranded consensus sequence; and (iii) confirming the presence of a true mutation when the mutation is present on both of the target nucleic acid strands and appears in all members of a paired target nucleic acid family.
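The per-position alternative in step (ii) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of Figure 10: the function name `duplex_consensus`, the use of `N` as a mask character, and the assumption that both strand sequences have already been written in the same orientation are all choices made for this example.

```python
def duplex_consensus(strand_a: str, strand_b: str, mask: str = "N") -> str:
    """Combine the sequences obtained from the two strands of one duplex.

    Both inputs are assumed to be written in the same orientation (the second
    strand has already been reverse-complemented), so a true mutation shows
    the same base at the same index in both.
    """
    if len(strand_a) != len(strand_b):
        raise ValueError("strand sequences must have equal length")
    # Keep a base only where the two strands agree; mask it otherwise.
    return "".join(a if a == b else mask for a, b in zip(strand_a, strand_b))


# Index 5 differs between the two strands, so it is masked in the result.
print(duplex_consensus("ACTTACGA", "ACTTAAGA"))  # ACTTANGA
```

Masking single positions keeps the rest of the fragment usable; the stricter option described in the text would instead discard the whole strand pair.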
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`[0012]
`
Figure 1 illustrates an overview of Duplex Consensus Sequencing. Sheared double-stranded DNA that has been end-repaired and T-tailed is combined with A-tailed SMI adaptors and ligated according to one embodiment. Because every adaptor contains a unique, double-stranded, complementary n-mer random tag on each end (n-mer = 12 bp according to one embodiment), every DNA fragment becomes labeled with two distinct SMI sequences (arbitrarily designated α and β in the single capture event shown). After size-selecting for appropriate length fragments, PCR amplification with primers containing Illumina flow-cell-compatible tails is carried out to generate families of PCR duplicates. By virtue of the asymmetric nature of adapted fragments, two types of PCR products are produced from each capture event. Those derived from one strand will have the α SMI sequence adjacent to flow-cell sequence 1 and the β SMI sequence adjacent to flow-cell sequence 2. PCR products originating from the complementary strand are labeled reciprocally.
`
`[0013]
`
Figure 2 illustrates Single Molecule Identifier (SMI) adaptor synthesis according to one embodiment. Oligonucleotides are annealed and the complement of the degenerate lower arm sequence (N’s) plus adjacent fixed bases is produced by polymerase extension of the upper strand in the presence of all four dNTPs. After reaction cleanup, complete adaptor A-tailing is ensured by extended incubation with polymerase and dATP.
`
`[0014]
`
Figure 3 illustrates error correction through Duplex Consensus Sequencing (DCS) analysis according to one embodiment. (a-c) show that sequence reads (brown) sharing a unique set of SMI tags are grouped into paired families with members having strand identifiers in either the αβ or βα orientation. Each family pair reflects one double-stranded DNA fragment. (a) shows mutations (spots) present in only one or a few family members, representing sequencing mistakes or PCR-introduced errors occurring late in amplification. (b) shows mutations occurring in many or all members of one family in a pair, representing mutations scored on only one of the two strands, which can be due to PCR errors arising during the first round of amplification such as might occur when copying across sites of mutagenic DNA damage. (c) shows that true mutations (* arrow) present on both strands of a captured fragment appear in all members of a family pair. While artifactual mutations may co-occur in a family pair with a true mutation, these can be independently identified and discounted when producing (d) an error-corrected consensus sequence (i.e., single stranded consensus sequence) (+ arrow) for each duplex. (e) shows that consensus sequences from all independently captured, randomly sheared fragments containing a particular genomic site are identified and (f) compared to determine the frequency of genetic variants at this locus within the sampled population.
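Panels (a) and (e)-(f) can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the code of Figure 10: the function names, the `min_agreement` threshold of 0.7, and the `N` mask character are hypothetical choices, and reads within a family are assumed to be pre-aligned and of equal length.

```python
from collections import Counter


def family_consensus(reads, min_agreement=0.7, mask="N"):
    """Collapse one tag family of equal-length reads into a consensus."""
    if not reads:
        raise ValueError("a family needs at least one read")
    if len({len(r) for r in reads}) != 1:
        raise ValueError("all reads in a family must have the same length")
    consensus = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        # Keep the majority base only if enough family members support it.
        consensus.append(base if count / len(reads) >= min_agreement else mask)
    return "".join(consensus)


def variant_counts(consensus_seqs, index, reference_base, mask="N"):
    """Return (variant, informative) consensus counts at one position."""
    informative = [s[index] for s in consensus_seqs if s[index] != mask]
    variants = sum(1 for base in informative if base != reference_base)
    return variants, len(informative)


family = ["ACGT", "ACGT", "ACGT", "ACTT"]  # one read carries a late error
print(family_consensus(family))  # ACGT
print(variant_counts(["ACGT", "ATGT", "ACGT", "ANGT"], 1, "C"))  # (1, 3)
```

An error present in a single family member is outvoted, while a position without sufficient agreement is masked and excluded from the frequency estimate.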
`
`[0015]
`
Figure 4 illustrates an example of how an SMI sequence with n-mers of 4 nucleotides in length (4-mers) is read by Duplex Consensus Sequencing (DCS) according to some embodiments. (A) shows the 4-mers with the PCR primer binding sites (or flow cell sequences) 1 and 2 indicated at each end. (B) shows the same molecules as in (A) but with the strands separated and the lower strand now written in the 5’-3’ direction. When these molecules are amplified with PCR and sequenced, they will yield the following sequence reads: The top strand will give a read 1 file of TAAC--- and a read 2 file of CGGA---. Combining the read 1 and read 2 tags will give TAACCGGA as the SMI for the top strand. The bottom strand will give a read 1 file of CGGA--- and a read 2 file of TAAC---. Combining the read 1 and read 2 tags will give CGGATAAC as the SMI for the bottom strand. (C) illustrates the orientation of paired strand mutations in DCS. In the initial DNA duplex shown in Figures 4A and 4B, a mutation “x” (which is paired to a complementary nucleotide “y”) is shown on the left side of the DNA duplex. The “x” will appear in read 1, and the complementary mutation on the opposite strand, “y,” will appear in read 2. Specifically, this would appear as “x” in both read 1 and read 2 data, because “y” in read 2 is read out as “x” by the sequencer owing to the nature of the sequencing primers, which generate the complementary sequence during read 2.
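The tag bookkeeping in this example can be sketched in Python. The helper names `combine_tags` and `partner_smi` are hypothetical and are not taken from Figure 10; the sketch only assumes, as the example above does, that the two strands of one duplex carry the same two n-mers in swapped order.

```python
def combine_tags(read1_tag: str, read2_tag: str) -> str:
    """Concatenate the read 1 and read 2 n-mers into one SMI."""
    return read1_tag + read2_tag


def partner_smi(smi: str) -> str:
    """Return the SMI expected for the opposite strand (halves swapped)."""
    if len(smi) % 2 != 0:
        raise ValueError("combined SMI must have an even length")
    half = len(smi) // 2
    return smi[half:] + smi[:half]


top_strand_smi = "TAACCGGA"  # combined tag given in the text for the top strand
bottom_strand_smi = combine_tags("CGGA", "TAAC")  # bottom-strand read 1 and read 2 tags
print(bottom_strand_smi)  # CGGATAAC
print(partner_smi(top_strand_smi) == bottom_strand_smi)  # True
```

Looking up `partner_smi(tag)` among the observed tags is one simple way to locate the family derived from the complementary strand.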
`
`[0016]
`
Figure 5 illustrates duplex sequencing of human mitochondrial DNA. (A) Overall mutation frequency as measured by a standard sequencing approach, SSCS, and DCS. (B) Pattern of mutation in human mitochondrial DNA by a standard sequencing approach. The mutation frequency (vertical axis) is plotted for every position in the ~16-kb mitochondrial genome. Due to the substantial background of technical error, no obvious mutational pattern is discernible by this method. (C) DCS analysis eliminates sequencing artifacts and reveals the true distribution of mitochondrial mutations to include a striking excess adjacent to the mtDNA origin of replication. (D) SSCS analysis yields a large excess of G→T mutations relative to complementary C→A mutations, consistent with artifacts from damage-induced 8-oxo-G lesions during PCR. All significant (P < 0.05) differences between paired reciprocal mutation frequencies are noted. (E) DCS analysis removes the SSCS strand bias and reveals the true mtDNA mutational spectrum to be characterized by an excess of transitions.
`
`[0017]
`
Figure 6 shows that consensus sequencing removes artifactual sequencing errors as compared to Raw Reads. Duplex Consensus Sequencing (DCS) results in an approximately equal number of mutations as the reference and single strand consensus sequencing (SSCS).
`
`[0018]
`
Figure 7 illustrates duplex sequencing of M13mp2 DNA. (A) Single-strand consensus sequences (SSCSs) reveal a large excess of G→A/C→T and G→T/C→A mutations, whereas duplex consensus sequences (DCSs) yield a balanced spectrum. Mutation frequencies are grouped into reciprocal mispairs, as DCS analysis only scores mutations present in both strands of duplex DNA. All significant (P < 0.05) differences between DCS analysis and the literature reference values are noted. (B) Complementary types of mutations should occur at approximately equal frequencies within a DNA fragment population derived from duplex molecules. However, SSCS analysis yields a 15-fold excess of G→T mutations relative to C→A mutations and an 11-fold excess of C→T mutations relative to G→A mutations. All significant (P < 0.05) differences between paired reciprocal mutation frequencies are noted.
`
`[0019]
`
Figure 8 shows the effect of DNA damage on the mutation spectrum. DNA damage was induced by incubating purified M13mp2 DNA with hydrogen peroxide and FeSO4. (A) SSCS analysis reveals a further elevation from baseline of G→T mutations, indicating these events to be the artifactual consequence of nucleotide oxidation. All significant (P < 0.05) changes from baseline mutation frequencies are noted. (B) Induced DNA damage had no effect on the overall frequency or spectrum of DCS mutations.
`
`[0020]
`
Figure 9 shows that duplex sequencing results in accurate recovery of spiked-control mutations. A series of variants of M13mp2 DNA, each harboring a known single-nucleotide substitution, were mixed together at known ratios and the mixture was sequenced to ~20,000-fold final depth. Standard sequencing analysis cannot accurately distinguish mutants present at a ratio of less than 1/100, because artifactual mutations occurring at every position obscure the presence of less abundant true mutations, rendering apparent recovery greater than 100%. Duplex consensus sequences, in contrast, accurately identify spiked-in mutations down to the lowest tested ratio of 1/10,000.
`
`[0021]
`
Figure 10 is Python code that may be used to carry out methods described herein according to one embodiment.
`
`DETAILED DESCRIPTION
`
`[0022]
`
Single molecule identifier adaptors and methods for their use are provided herein. According to the embodiments described herein, a single molecule identifier (SMI) adaptor molecule is provided. Said SMI adaptor molecule is double stranded, and may include a single molecule identifier (SMI) sequence, and an SMI ligation adaptor (Figure 2). Optionally, the SMI adaptor molecule further includes at least two PCR primer binding sites, at least two sequencing primer binding sites, or both.
`
`[0023]
`
The SMI adaptor molecule may form a “Y-shape” or a “hairpin shape.” In some embodiments, the SMI adaptor molecule is a “Y-shaped” adaptor, which allows both strands to be independently amplified by a PCR method prior to sequencing because both the top and bottom strands have binding sites for PCR primers FC1 and FC2 as shown in the examples below. A schematic of a Y-shaped SMI adaptor molecule is also shown in Figure 2. A Y-shaped SMI adaptor requires successful amplification and recovery of both strands of the SMI adaptor molecule. In one embodiment, a modification that would simplify consistent recovery of both strands entails ligation of a Y-shaped SMI adaptor molecule to one end of a DNA duplex molecule, and ligation of a “U-shaped” linker to the other end of the molecule. PCR amplification of the hairpin-shaped product will then yield a linear fragment with flow cell sequences on either end. Distinct PCR primer binding sites (or flow cell sequences FC1 and FC2) will flank the DNA sequence corresponding to each of the two SMI adaptor molecule strands, and a given sequence seen in Read 1 will then have the sequence corresponding to the complementary DNA duplex strand seen in Read 2. Mutations are scored only if they are seen on both ends of the molecule (corresponding to each strand of the original double-stranded fragment), i.e. at the same position in both Read 1 and Read 2. This design may be accomplished as described in the examples relating to double stranded SMI sequence tags.
`
`[0024]
`
In other embodiments, the SMI adaptor molecule is a “hairpin” shaped (or “U-shaped”) adaptor. A hairpin DNA product can be used for error correction, as this product contains both of the two DNA strands. Such an approach allows for reduction of a given sequencing error rate N to a lower rate of N*N*(1/3), as independent sequencing errors would need to occur on both strands, and the same error among all three possible base substitutions would need to occur on both strands. For example, the error rate of 1/100 in the case of Illumina sequencing [32] would be reduced to (1/100)*(1/100)*(1/3) = 1/30,000.
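The arithmetic above can be checked with exact fractions. The function name below is hypothetical, and the model is only the simple one stated in the text: errors on the two strands are independent, and the second error must be the same one of three possible substitutions.

```python
from fractions import Fraction


def paired_strand_error_rate(n: Fraction) -> Fraction:
    """Error rate when the same substitution must occur on both strands."""
    # n for the first strand, n for the second, and 1/3 for the chance that
    # the second error is the matching substitution.
    return n * n * Fraction(1, 3)


print(paired_strand_error_rate(Fraction(1, 100)))  # 1/30000
```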
`
`[0025]
`
An additional, more remarkable reduction in errors can be obtained by inclusion of a single-stranded SMI in either the hairpin adaptor or the “Y-shaped” adaptor, which will also function to label both of the two DNA strands. Amplification of hairpin-shaped DNA may be difficult as the polymerase must synthesize through a product containing significant regions of self-complementarity; however, amplification of hairpin-shaped structures has already been established in the technique of hairpin PCR, as described below. Amplification using hairpin PCR is further described in detail in U.S. Patent No. 7,452,699, the subject matter of which is hereby incorporated by reference as if fully set forth herein.
`
`[0026]
`
According to the embodiments described herein, the SMI sequence (or “tag”) may be a double-stranded, complementary SMI sequence or a single-stranded SMI sequence. In some embodiments, the SMI adaptor molecule includes an SMI sequence (or “tag”) of nucleotides that is degenerate or semi-degenerate. In some embodiments, the degenerate or semi-degenerate SMI sequence may be a random degenerate sequence. A double-stranded SMI sequence includes a first degenerate or semi-degenerate nucleotide n-mer sequence and a second n-mer sequence that is complementary to the first degenerate or semi-degenerate nucleotide n-mer sequence, while a single-stranded SMI sequence includes a first degenerate or semi-degenerate nucleotide n-mer sequence. The first and/or second degenerate or semi-degenerate nucleotide n-mer sequences may be any suitable length to produce a sufficiently large number of unique tags to label a set of sheared DNA fragments from a segment of DNA. Each n-mer sequence may be between approximately 3 and 20 nucleotides in length. Therefore, each n-mer sequence may be approximately 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 nucleotides in length. In one embodiment, the SMI sequence is a random degenerate nucleotide n-mer sequence which is 12 nucleotides in length. A 12 nucleotide SMI n-mer sequence that is ligated to each end of a target nucleic acid molecule, as described in the Example below, results in generation of up to 4^24 (i.e., 2.8 x 10^14) distinct tag sequences.
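The size of the tag space for a 12-nucleotide n-mer on each end of a fragment can be verified directly. The function name and its parameters are illustrative assumptions; the only inputs taken from the text are the n-mer length of 12, the two tagged ends, and a four-letter nucleotide alphabet.

```python
def distinct_tag_count(n_mer_length: int, ends: int = 2, alphabet_size: int = 4) -> int:
    """Number of possible combined tags for fully degenerate n-mers."""
    return alphabet_size ** (n_mer_length * ends)


count = distinct_tag_count(12)
print(count)           # 281474976710656
print(f"{count:.1e}")  # 2.8e+14
```

A semi-degenerate design would shrink this number, since fewer than four bases are allowed at some positions.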
`
`[0027]
`
In some embodiments, the SMI tag nucleotide sequence may be completely random and degenerate, wherein each sequence position may be any nucleotide (i.e., each position, represented by “X,” is not limited, and may be an adenine (A), cytosine (C), guanine (G), thymine (T), or uracil (U)) or any other natural or non-natural DNA or RNA nucleotide or nucleotide-like substance or analog with base-pairing properties (e.g., xanthosine, inosine, hypoxanthine, xanthine, 7-methylguanine, 7-methylguanosine, 5,6-dihydrouracil, 5-methylcytosine, dihydrouridine, isocytosine, isoguanine, deoxynucleosides, nucleosides, peptide nucleic acids, locked nucleic acids, glycol nucleic acids and threose nucleic acids). The term “nucleotide” as described herein refers to any and all nucleotides or any suitable natural or non-natural DNA or RNA nucleotide or nucleotide-like substance or analog with base pairing properties as described above. In other embodiments, the sequences need not contain all possible bases at each position. The degenerate or semi-degenerate n-mer sequences may be generated by a polymerase-mediated method described in the Example below, or may be generated by preparing and annealing a library of individual oligonucleotides of known sequence. Alternatively, any degenerate or semi-degenerate n-mer sequences may be a randomly or non-randomly fragmented double stranded DNA molecule from any alternative source that differs from the target DNA source. In some embodiments, the alternative source is a genome or plasmid derived from bacteria, an organism other than that of the target DNA, or a combination of such alternative organisms or sources. The random or non-random fragmented DNA may be introduced into SMI adaptors to serve as variable tags. This may be accomplished through enzymatic ligation or any other method known in the art.
`
`[0028]
`
In some embodiments, the SMI adaptor molecules are ligated to both ends of a target nucleic acid molecule, and then this complex is used according to the methods described below. In certain embodiments, it is not necessary to include n-mers on both adaptor ends; however, it is more convenient because it means that one does not have to use two different types of adaptors and then select for ligated fragments that have one of each type rather than two of one type. The ability to determine which strand is which is still possible in the situation wherein only one of the two adaptors has a double-stranded SMI sequence.
`
`[0029]
`
In some embodiments, the SMI adaptor molecule may optionally include a double-stranded fixed reference sequence downstream of the n-mer sequences to help make ligation more uniform and help computationally filter out errors due to ligation problems with improperly synthesized adaptors. Each strand of the double-stranded fixed reference sequence may be 4 or 5 nucleotides in length; however, the fixed reference sequence may be any suitable length including, but not limited to, 3, 4, 5 or 6 nucleotides in length.
`
`[0030]
`
The SMI ligation adaptor may be any suitable ligation adaptor that is complementary to a ligation adaptor added to a double-stranded target nucleic acid sequence including, but not limited to, a T-overhang, an A-overhang, a CG overhang, a blunt end, or any other ligatable sequence. In some embodiments, the SMI ligation adaptor may be made using a method for A-tailing or T-tailing with polymerase extension; creating an overhang with a different enzyme; using a restriction enzyme to create a single or multiple nucleotide overhang; or any other method known in the art.
`
`[0031]
`
According to the embodiments described herein, the SMI adaptor molecule may include at least two PCR primer or “flow cell” binding sites: a forward PCR primer binding site (or a “flow cell 1” (FC1) binding site); and a reverse PCR primer binding site (or a “flow cell 2” (FC2) binding site). The SMI adaptor molecule may also include at least two sequencing primer binding sites, each corresponding to a sequencing read. Alternatively, the sequencing primer binding sites may be added in a separate step by inclusion of the necessary sequences as tails to the PCR primers, or by ligation of the needed sequences. Therefore, if a double-stranded target nucleic acid molecule has an SMI adaptor molecule ligated to each end, each sequenced strand will have two reads: a forward and a reverse read.
`
`Double-stranded SMI sequences
`
`[0032]
`
Adaptor 1 (shown below) is a Y-shaped SMI adaptor as described above (the SMI sequence is shown as X’s in the top strand (a 4-mer), with the complementary bottom strand sequence shown as Y’s):

[Diagram: Adaptor 1]
`
`[0033]
`
Adaptor 2 (shown below) is a hairpin (or “U-shaped”) linker:

[Diagram: Adaptor 2]
`
[0034]

Following ligation of both adaptors to a double-stranded target nucleic acid, the following structure is obtained:

[Diagram: double-stranded target DNA with Adaptor 1 ligated to one end and Adaptor 2 ligated to the other]
`
`
`[0035]
`
When melted, the product will be of the following form (where “linker” is the sequence of adaptor 2):

FC1-------XXXX------DNA---------linker---------DNA'------YYYY------FC2
`
`[0036]
`
This product is then PCR amplified. The reads will yield:

Read 1:
XXXX-----DNA---

Read 2 (note that read 2 is seen as the complement of the bases sequenced):
XXXX-----DNA---
`
`[0037]
`
The sequences of the two duplex strands seen in the two sequence reads may then be compared, and sequence information and mutations will be scored only if the sequence at a given position matches in both of the reads.
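This scoring rule can be sketched in Python. The function name `score_mutations` and the example sequences are hypothetical, and the sketch assumes that Read 1, Read 2, and a reference sequence are already aligned to one another, with Read 2 expressed as the complement of the bases sequenced so that matching positions carry the same letter.

```python
def score_mutations(reference: str, read1: str, read2: str):
    """List (index, reference_base, observed_base) for confirmed mutations."""
    if not (len(reference) == len(read1) == len(read2)):
        raise ValueError("reference and both reads must have equal length")
    mutations = []
    for i, (ref, b1, b2) in enumerate(zip(reference, read1, read2)):
        if b1 != b2:
            continue  # the two reads disagree, so this position is not scored
        if b1 != ref:
            mutations.append((i, ref, b1))
    return mutations


reference = "ACGTACGT"
read1 = "ACTTACGA"  # change at index 2 and at index 7
read2 = "ACTTACGT"  # change at index 2 only
print(score_mutations(reference, read1, read2))  # [(2, 'G', 'T')]
```

The change at index 7 appears in only one read and is therefore dropped, while the change at index 2 is present in both reads and is scored.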
`
`[0038]
`
This approach does not strictly require the use of an SMI tag, as the sheared ends can