`massively parallel sequencing
`
Isaac Kinde, Jian Wu, Nick Papadopoulos, Kenneth W. Kinzler¹, and Bert Vogelstein¹
`
`The Ludwig Center for Cancer Genetics and Therapeutics and The Howard Hughes Medical Institute, Johns Hopkins Kimmel Cancer Center,
`Baltimore, MD 21231
`
`Contributed by Bert Vogelstein, April 19, 2011 (sent for review March 21, 2011)
`
The identification of mutations that are present in a small fraction of DNA templates is
essential for progress in several areas of biomedical research. Although massively parallel
sequencing instruments are in principle well suited to this task, the error rates in such
instruments are generally too high to allow confident identification of rare variants. We here
describe an approach that can substantially increase the sensitivity of massively parallel
sequencing instruments for this purpose. The keys to this approach, called the Safe-Sequencing
System (“Safe-SeqS”), are (i) assignment of a unique identifier (UID) to each template
molecule, (ii) amplification of each uniquely tagged template molecule to create UID families,
and (iii) redundant sequencing of the amplification products. PCR fragments with the same UID
are considered mutant (“supermutants”) only if ≥95% of them contain the identical mutation. We
illustrate the utility of this approach for determining the fidelity of a polymerase, the
accuracy of oligonucleotides synthesized in vitro, and the prevalence of mutations in the
nuclear and mitochondrial genomes of normal cells.

diagnostics | early diagnosis | biomarkers | genetics | cancer

Genetic mutations underlie many aspects of life and death, through evolution and disease,
respectively. Accordingly, their measurement is critical to several fields of research. Luria
and Delbrück’s classic fluctuation analysis is a prototypic example of the insights into
biological processes that can be gained simply by counting the number of mutations in carefully
controlled experiments (1). Counting de novo mutations in humans, those not present in their
parents, has likewise led to new insights into the rate at which our species can evolve (2, 3).
Similarly, counting genetic or epigenetic changes in tumors can inform fundamental issues in
cancer biology (4). Mutations lie at the core of current problems in managing patients with
viral diseases such as AIDS and hepatitis by virtue of the drug resistance they can cause
(5, 6). Detection of such mutations, particularly before they become dominant in the
population, will likely be essential to optimize therapy. Detection of donor DNA in the blood
of organ transplant patients is an important indicator of graft rejection, and detection of
fetal DNA in maternal plasma can be used for noninvasive prenatal diagnosis (7, 8). In
neoplastic diseases, which are all driven by somatic mutations, the applications of rare mutant
detection are manifold; they can be used to help identify residual disease at surgical margins
or in lymph nodes, to follow the course of therapy when assessed in plasma, and to identify
patients with early, surgically curable disease when evaluated in stool, sputum, plasma, and
other bodily fluids (9–11).
These examples highlight the importance of identifying rare mutations for both basic and
clinical research. Accordingly, innovative ways to assess them have been devised over the
years. The first methods involved biologic assays based on prototrophy, resistance to viral
infection or drugs, or biochemical assays (1, 12–18). Molecular cloning and sequencing provided
a new dimension to the field, as they allowed the type of mutation, rather than simply its
presence, to be identified (19–24). Some of the most powerful of these newer methods are based
on digital PCR, in which individual molecules are assessed one by one (25). Digital PCR is
conceptually identical to the analysis of individual clones of bacteria, cells, or virus, but
is performed entirely in vitro with defined, inanimate reagents. Several implementations of
digital PCR have been described, including the analysis of molecules arrayed in multiwell
plates, in polonies, in microfluidic devices, and in water-in-oil emulsions (25–30). In each of
these technologies, mutant templates are identified through their binding to oligonucleotides
specific for the potentially mutant base.
Massively parallel sequencing represents a particularly powerful form of digital PCR in that
hundreds of millions of template molecules can be analyzed one by one. It has the advantage
over conventional digital PCR methods in that multiple bases can be queried sequentially and
easily in an automated fashion. However, massively parallel sequencing cannot generally be used
to detect rare variants because of the high error rate associated with the sequencing process.
For example, with the commonly used Illumina sequencing instruments, this error rate varies
from ∼1% (31, 32) to ∼0.05% (33, 34), depending on factors such as the read length (35), use of
improved base-calling algorithms (36–38), and the type of variants detected (39). Some of these
errors presumably result from mutations introduced during template preparation, during the
preamplification steps required for library preparation, and during further solid-phase
amplification on the instrument itself. Other errors are due to base misincorporation during
sequencing and base-calling errors. Advances in base calling can enhance confidence (e.g.,
refs. 36–39), but instrument-based errors are still limiting, particularly in clinical samples
wherein the mutation prevalence can be ≤0.01% (11). In the work described herein, we show how
templates can be prepared and the sequencing data obtained from them more reliably interpreted,
so that relatively rare mutations can be identified with commercially available instruments.
`
`Results
Overview. Our approach, called the Safe-Sequencing System (“Safe-SeqS”), involves two basic
steps (Fig. 1). The first is the assignment of a unique identifier (UID) to each DNA template
molecule to be analyzed. The second is the amplification of each uniquely tagged template, so
that many daughter molecules with the identical sequence are generated (defined as a UID
family). If a mutation preexisted in the template molecule used for amplification, that
mutation should be present in every daughter molecule containing that UID (barring any
subsequent replication or sequencing errors). A UID family in which at least 95% of family
members have the identical mutation is called a “supermutant”. Mutations not occurring in the
original templates, such as those occurring during the amplification steps or through errors in
base calling, should not give rise to supermutants. Conceptual and
`
`Author contributions: I.K., N.P., K.W.K., and B.V. designed research; I.K., J.W., N.P., and B.V.
`performed research; I.K., J.W., N.P., K.W.K., and B.V. contributed new reagents/analytic tools;
`I.K., N.P., K.W.K., and B.V. analyzed data; and I.K. and B.V. wrote the paper.
`
`The authors declare no conflict of interest.
¹To whom correspondence may be addressed. E-mail: kinzlke@jhmi.edu or bertvog@gmail.com.

This article contains supporting information online at
www.pnas.org/lookup/suppl/doi:10.1073/pnas.1105422108/-/DCSupplemental.
`
`9530–9535 | PNAS |
`
`June 7, 2011 | vol. 108 | no. 23
`
`www.pnas.org/cgi/doi/10.1073/pnas.1105422108
`
`PGDX EX. 1004
`Page 1 of 16
`
`
`
[Fig. 1 diagram. Templates labeled Mutant and WT; steps shown: UID Assignment, Amplification,
Redundant Sequencing.]

[Fig. 2 diagram. Steps shown: Shear, Ligate Adapters, Solid Phase Capture, Library
Amplification, Redundant Sequencing.]

Fig. 2. Safe-SeqS with endogenous UIDs plus capture. The sequences of the ends of each fragment
produced by random shearing (variously colored bars) serve as the unique identifiers (UIDs).
These fragments are ligated to adapters (yellow and orange bars) so they can subsequently be
amplified by PCR. One uniquely identifiable fragment is produced from each strand of the
double-stranded template; only one strand is shown. Fragments of interest are captured on a
solid phase containing oligonucleotides complementary to the sequences of interest. Following
PCR amplification to produce UID families with primers containing 5′ “grafting” sequences
(black and red bars), sequencing is performed and supermutants are defined as in Fig. 1.
`
A strategy using endogenous UIDs was also used to reduce false-positive mutations upon deep
sequencing of a single region of interest. In this case, a library prepared as described above
from ∼1,750 normal cells was used as template for inverse PCR using primers complementary to a
gene of interest, so the PCR products could be directly used for sequencing (Fig. S1). With
conventional analysis, an average of 2.3 × 10⁻⁴ mutations/bp were observed, similar to that
observed in the capture experiment (Table 1). Given that only 1,057 independent molecules from
normal cells were assessed in this experiment, as determined through Safe-SeqS analysis, all
mutations observed with conventional analysis likely represented false positives (Table 1).
With Safe-SeqS analysis of the same data, no supermutants were identified at any position.
`
Table 1. Safe-SeqS with endogenous UIDs

                                               Capture          Inverse PCR
Conventional analysis
  High-quality base pairs                      106,958,863      1,041,346,645
  Mean high-quality base pairs read depth      38,620×          2,085,600×
  Mutations identified                         25,563           234,352
  Mutations/bp                                 2.4E-04          2.3E-04
Safe-SeqS analysis
  High-quality base pairs                      106,958,863      1,041,346,645
  Mean high-quality base pairs read depth      38,620×          2,085,600×
  UID families                                 69,505           1,057
  Average no. of members/UID family            40               21,688
  Median no. of members/UID family             19               4
  Supermutants identified                      8                0
  Supermutants/bp                              3.5E-06          0.0
`
Fig. 1. Essential elements of Safe-SeqS. In the first step, each fragment to be analyzed is
assigned a unique identification (UID) DNA sequence (green or blue bars). In the second step,
the uniquely tagged fragments are amplified, producing UID families, each member of which has
the same UID. A supermutant is defined as a UID family in which ≥95% of family members have the
same mutation.
`
`practical issues related to UID assignment and supermutants are
`discussed in detail in SI Materials and Methods.
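
The supermutant rule described in the Overview can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the read representation (a UID paired with a set of `(position, alt_base)` differences from the reference), the function name `call_supermutants`, and the `min_family_size` cutoff are all assumptions introduced here; only the 95% threshold comes from the text.

```python
from collections import Counter, defaultdict


def call_supermutants(reads, min_family_size=2, min_fraction=0.95):
    """Group reads into UID families and report mutations shared by a family.

    reads: iterable of (uid, variants), where variants is a frozenset of
           (position, alt_base) differences from the reference for one read.
           This representation is hypothetical.
    min_family_size: assumed cutoff for the redundancy a family needs.
    min_fraction: fraction of family members that must carry the mutation
                  (0.95 in the text).
    Returns a dict mapping uid -> set of (position, alt_base) supermutations.
    """
    families = defaultdict(list)
    for uid, variants in reads:
        families[uid].append(variants)

    supermutants = {}
    for uid, members in families.items():
        if len(members) < min_family_size:
            continue  # too few reads to judge this family
        # Count how many family members carry each observed variant.
        counts = Counter(v for variants in members for v in variants)
        shared = {v for v, n in counts.items() if n / len(members) >= min_fraction}
        if shared:
            supermutants[uid] = shared
    return supermutants


# Toy input: family AAA carries the variant in all reads, family CCC in only
# half of them (as expected for a PCR or sequencing error), GGG is a singleton.
reads = (
    [("AAA", frozenset({(15, "G")}))] * 20
    + [("CCC", frozenset({(15, "G")}))] * 10
    + [("CCC", frozenset())] * 10
    + [("GGG", frozenset({(3, "T")}))]
)
result = call_supermutants(reads)
```

In this toy input only family `AAA` qualifies, because the variant in `CCC` appears in half of the members and `GGG` has too few reads to be evaluated.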
`
Endogenous UIDs. UIDs, sometimes called barcodes or indexes, can be assigned to nucleic acid
fragments using a variety of methods. These methods include the introduction of exogenous
sequences through PCR (40, 41) or ligation (42, 43). Even more simply, randomly sheared genomic
DNA inherently contains UIDs consisting of the sequences of the two ends of each sheared
fragment (Fig. 2 and Fig. S1). Paired-end sequencing of these fragments yields UID families
that can be analyzed as described above. To use such endogenous UIDs in Safe-SeqS, we used two
separate approaches: one designed to evaluate many genes simultaneously and the other designed
to evaluate a single gene fragment in depth (Fig. 2 and Fig. S1, respectively).

For the evaluation of multiple genes, we ligated standard Illumina sequencing adapters to the
ends of sheared DNA fragments to produce a standard sequencing library and then captured genes
of interest on a solid phase (44). In this experiment, a library made from the DNA of ∼15,000
normal cells was used, and 2,594 bp from six genes were targeted for capture. After excluding
known single-nucleotide polymorphisms, 25,563 apparent mutations, corresponding to 2.4 × 10⁻⁴
mutations/bp, were also identified (Table 1). On the basis of previous analyses of mutation
rates in human cells, at least 90% of these apparent mutations were likely to represent
mutations introduced during template and library preparation or base-calling errors. Note that
the error rate determined here (2.4 × 10⁻⁴ mutations/bp) is considerably lower than usually
reported in experiments using the Illumina instrument because we used very stringent criteria
for base calling (SI Materials and Methods).

With Safe-SeqS analysis of the same data, we determined that 69,505 original template molecules
were assessed in this experiment (i.e., 69,505 UID families, with an average of 40 members per
family, were identified) (Table 1). All of the polymorphic variants identified by conventional
analysis were also identified by Safe-SeqS. However, only eight supermutants were observed
among these families, corresponding to 3.5 × 10⁻⁶ mutations/bp. Thus, Safe-SeqS decreased the
presumptive sequencing errors by at least 70-fold.
`
`
`
`
residual, unused UID assignment primers are removed by digestion with a single strand-specific
exonuclease, without further purification, and two new primers are added. The new primers,
complementary to the tails introduced in the UID assignment cycles, contain grafting sequences
at their 5′ ends, permitting solid-phase amplification on the Illumina instrument, and
phosphorothioate residues at their 3′ ends to make them resistant to any remaining exonuclease.
Following 25 additional cycles of PCR, the products are loaded on the Illumina instrument. As
shown below, this strategy allowed us to evaluate the majority of input fragments and was used
for several illustrative experiments.
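
To give a sense of how many distinct exogenous UIDs the two designs in this section provide, the following sketch computes the size of the UID space and a rough estimate of how often two templates would receive the same UID. It assumes UIDs are drawn uniformly at random, which is an idealization of real primer synthesis; the function names and the template count of 100,000 are hypothetical and chosen only for illustration.

```python
def uid_space(n_random_bases):
    """Number of distinct UIDs available from a stretch of random nucleotides."""
    return 4 ** n_random_bases


def expected_shared_uid_templates(n_templates, n_uids):
    """Expected number of templates whose UID is also carried by another template.

    Assumes each template draws its UID independently and uniformly from
    n_uids possibilities (an idealization, not a claim about real primers).
    """
    p_unique = (1 - 1 / n_uids) ** (n_templates - 1)
    return n_templates * (1 - p_unique)


random_12 = uid_space(12)    # 12 random bases
random_14 = uid_space(14)    # 14 random bases
array_design = 10_000 ** 2   # 10,000 forward x 10,000 reverse predetermined UIDs

# Hypothetical experiment: 100,000 templates tagged with 14 random bases.
collisions = expected_shared_uid_templates(100_000, random_14)
```

Under these assumptions only a small fraction of templates would share a UID with another template, which suggests why a stretch of 12 to 14 random bases is adequate for experiments of this size.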
`
Analysis of DNA Polymerase Fidelity. Measurement of the error rates of DNA polymerases is
essential for their characterization and dictates the situations in which these enzymes can be
used. We chose to measure the error rate of Phusion polymerase, as this polymerase has one of
the lowest reported error frequencies of any commercially available enzyme and therefore poses
a particular challenge for an in vitro-based approach. We first amplified a single human DNA
template molecule, comprising a segment of an arbitrarily chosen human gene, through 19 rounds
of PCR. The PCR products from these amplifications, in their entirety, were used as templates
for Safe-SeqS as described in Fig. 3. In seven independent experiments of this type, the number
of UID families identified by sequencing was 624,678 ± 421,274, which is consistent with an
amplification efficiency of 92 ± 9.6% per round of PCR.

The error rate of Phusion polymerase, estimated through cloning of PCR products encoding
β-galactosidase in plasmid vectors and transformation into bacteria, is reported by the
manufacturer to be 4.4 × 10⁻⁷ errors/bp/PCR cycle. Even with very high-stringency base calling,
conventional analysis of the Illumina sequencing data revealed an apparent error rate of
9.1 × 10⁻⁶ errors/bp/PCR cycle, more than an order of magnitude higher than the reported
Phusion polymerase error rate (Table 2, polymerase fidelity). In contrast, Safe-SeqS of the
same data revealed an error rate of 4.5 × 10⁻⁷ errors/bp/PCR cycle, nearly identical to that
measured for Phusion polymerase in biological assays (Table 2, polymerase fidelity). The vast
majority (>99%) of these errors were single-base substitutions (Table S1, polymerase fidelity),
consistent with previous data on the mutation spectra created by other prokaryotic DNA
polymerases (15, 46, 47).
Safe-SeqS also allowed a determination of the total number of distinct mutational events and an
estimation of the PCR cycle in which the mutation occurred. There were 19 cycles of PCR
performed in wells containing a single template molecule in these experiments. If a polymerase
error occurred in cycle 19, there would be only one supermutant produced (from the strand
containing the mutation). If the error occurred in cycle 18, there should be two supermutants
(derived from the mutant strands produced in cycle 19), etc. Accordingly, the cycle in which
the error occurred is related to the number of supermutants containing that error. The data
from seven independent experiments demonstrate a relatively consistent number of observed total
polymerase errors (2.2 ± 1.1 × 10⁻⁶ distinct mutations/bp), in reasonable agreement with the
number expected from simulations (1.5 ± 0.21 × 10⁻⁶ distinct mutations/bp, detailed in SI
Materials and Methods). The data also show a highly variable timing of occurrence of polymerase
errors among experiments (Table S2), as predicted from classic fluctuation analysis (1). This
kind of information is difficult to derive using conventional analysis of the same
next-generation sequencing data, in part because of the prohibitively high apparent mutation
rate noted above.
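
The relationship between the cycle in which an error arose and the number of supermutants carrying it can be written down directly. The sketch below assumes perfect doubling in every cycle and complete recovery of every mutant strand as a UID family; both are simplifications (the text reports an amplification efficiency of about 92% per round), so the estimate is only approximate. The function names are illustrative.

```python
import math

TOTAL_CYCLES = 19  # PCR cycles performed on the single template molecule


def expected_supermutants(error_cycle, total_cycles=TOTAL_CYCLES):
    """Supermutants expected from one error that arose in `error_cycle`.

    Assumes the mutant strand doubles in every later cycle (idealized).
    """
    return 2 ** (total_cycles - error_cycle)


def estimate_error_cycle(n_supermutants, total_cycles=TOTAL_CYCLES):
    """Invert the relationship above to estimate when the error occurred."""
    return total_cycles - math.log2(n_supermutants)


late_error = expected_supermutants(19)    # error in the last cycle
earlier_error = expected_supermutants(18)  # error one cycle earlier
cycle_for_four = estimate_error_cycle(4)   # four supermutants observed
```

An error seen in many supermutants therefore points to an early cycle, while an error seen once points to the final cycle, which is the fluctuation behavior noted in the paragraph above.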
`
Analysis of Oligonucleotide Composition. A small number of mistakes during the synthesis of
oligonucleotides from phosphoramidite precursors are tolerable for most applications, such as
routine PCR or cloning. However, for synthetic biology, wherein many oligonucleotides must be
joined together, such mistakes present a major
Exogenous UIDs. Although the results described above show that Safe-SeqS can increase the
reliability of massively parallel sequencing, the number of different molecules that can be
examined using endogenous UIDs is limited. For fragments sheared to an average size of 150 bp
(range 125–175), 36-base paired-end sequencing can evaluate a maximum of ∼7,200 different
molecules containing a specific mutation (2 reads × 2 orientations × 36 bases/read × 50-base
variation on either end of the fragment). In practice, the actual number of UIDs is smaller
because the shearing process is not entirely random.
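
The upper bound quoted above is a simple product, reproduced here as a worked calculation. The parameter names are illustrative; the values are those given in the text.

```python
def max_endogenous_uids(reads_per_fragment=2, orientations=2,
                        read_length=36, end_variation=50):
    """Upper bound on distinguishable fragments covering a given mutation.

    Multiplies the four factors listed in the text; real libraries yield
    fewer UIDs because shearing is not entirely random.
    """
    return reads_per_fragment * orientations * read_length * end_variation


upper_bound = max_endogenous_uids()
```

With the values from the text the product is 7,200, matching the stated maximum.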
To make more efficient use of the original templates, we developed a Safe-SeqS strategy that
used a minimum number of enzymatic steps. This strategy also permitted the use of degraded or
damaged DNA, such as found in clinical specimens or after bisulfite treatment for the
examination of cytosine methylation (45). As depicted in Fig. 3, this strategy employs two sets
of PCR primers. The first set was synthesized with standard phosphoramidite precursors and
contained sequences complementary to the gene of interest on the 3′ end and different tails at
the 5′ ends of both the forward and reverse primers. The different tails allowed universal
amplification in the next step. Finally, there was a stretch of 12–14 random nucleotides
between the tail and the sequence-specific nucleotides in the forward primer (40). The random
nucleotides form the UIDs. An equivalent way to assign UIDs to fragments, not used in this
study, would employ 10,000 forward primers and 10,000 reverse primers synthesized on a
microarray. Each of these 20,000 primers would have gene-specific primers at their 3′ ends and
one of 10,000 specific, predetermined, nonoverlapping UID sequences at their 5′ ends, allowing
for 10⁸ [i.e., (10⁴)²] possible UID combinations. In either case, two cycles of PCR are
performed with the primers and a high-fidelity polymerase, producing a uniquely tagged,
double-stranded DNA fragment from each of the two strands of each original template molecule
(Fig. 3). The
[Fig. 3 diagram. Steps shown: UID Assignment Cycle #1, UID Assignment Cycle #2, Library
Amplification, Redundant Sequencing.]

Fig. 3. Safe-SeqS with exogenous UIDs. DNA (sheared or unsheared) is amplified with a set of
gene-specific primers. One of the primers has a random DNA sequence (e.g., a set of 14 Ns) that
forms the unique identifier (UID) (variously colored bars), located 5′ to its gene-specific
sequence, and both have sequences that permit universal amplification in the next step (yellow
and orange bars). Two UID assignment cycles produce two fragments, each with a different UID,
from each double-stranded template molecule, as shown. Subsequent PCR with universal primers,
which also contain “grafting” sequences (black and red bars), produces UID families that are
directly sequenced. Supermutants are defined as in the legend to Fig. 1.
`
`
`
`
`
S3), which were distributed in the expected stochastic pattern among replicate experiments. The
number of errors in the oligonucleotides synthesized with phosphoramidites was ∼60 times higher
than that in the equivalent products synthesized by Phusion polymerase. These data, in toto,
indicate that the vast majority of errors in the former were generated during their synthesis
rather than during the Safe-SeqS procedure.

Does Safe-SeqS preserve the ratio of mutant:normal sequences in the original templates? To
address this question, we synthesized two 31-base oligonucleotides of identical sequence with
the exception of nucleotide 15 (50:50 C/G instead of T) and mixed them at nominal mutant/normal
fractions of 3.3% and 0.33%. Through Safe-SeqS analysis of the oligonucleotide mixtures, we
found that the ratios were 2.8% and 0.27%, respectively. We conclude that the UID assignment
and amplification procedures used in Safe-SeqS do not greatly alter the proportion of variant
sequences and thereby provide a reliable estimate of that proportion when unknown. This
conclusion is also supported by the reproducibility of variant fractions when analyzed in
independent Safe-SeqS experiments (Fig. S2A).
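
Because each UID family corresponds to one original template, a variant fraction can be estimated as the number of supermutant families divided by the total number of families. The sketch below adds a Wilson score interval to indicate sampling uncertainty; the interval is an addition made here for illustration and is not described in the source, and the family counts in the example are hypothetical.

```python
import math


def mutant_fraction(n_supermutant_families, n_total_families, z=1.96):
    """Return (fraction, lower, upper) for the share of mutant templates.

    The bounds are a Wilson score interval (z=1.96 for roughly 95%
    coverage); this choice of interval is an assumption of this sketch.
    """
    if n_total_families <= 0:
        raise ValueError("need at least one UID family")
    n = n_total_families
    p = n_supermutant_families / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 28 supermutant families among 1,000 UID families.
fraction, lower, upper = mutant_fraction(28, 1000)
```

With few UID families the interval is wide, so the number of templates analyzed matters as much as the read depth when judging agreement between a nominal and a measured fraction.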
`
Analysis of DNA Sequences from Normal Human Cells. The exogenous UID strategy (Fig. 3) was then
used to determine the prevalence of rare mutations in a small region of the CTNNB1 gene
isolated from ∼100,000 normal human cells from three unrelated individuals. Through comparison
with the number of UID families obtained in the Safe-SeqS experiments (Table 2, CTNNB1
mutations in DNA from normal human cells), we calculated that the majority (78 ± 9.8%) of the
input fragments were converted into UID families. There was an average of 68 members/UID
family, easily fulfilling the required redundancy for Safe-SeqS (Fig. S3). Conventional
analysis of the Illumina sequencing data revealed an average of 118,488 ± 11,357 mutations
among the ∼560 Mb of sequence analyzed per sample, corresponding to an apparent mutation
prevalence of 2.1 ± 0.16 × 10⁻⁴ mutations/bp (Table 2, CTNNB1 mutations in DNA from normal
human cells). Only an average of 99 ± 78 supermutants were observed in the Safe-SeqS analysis.
The vast majority (>99%) of supermutants were single-base substitutions and the calculated
mutation rate was 9.0 ± 3.1 × 10⁻⁶ mutations/bp (Table S1, CTNNB1 mutations in DNA from normal
human cells). Safe-SeqS thereby reduced the apparent frequency of mutations in genomic DNA by
at least 24-fold (Fig. 4).

We applied the identical strategy to a short segment of mitochondrial DNA isolated from ∼1,000
cells from each of seven unrelated individuals. Conventional analysis of the Illumina
sequencing libraries produced with the Safe-SeqS procedure (Fig. 3) revealed an average of
30,599 ± 12,970 mutations among the ∼150 Mb of sequence analyzed per sample, corresponding to
an apparent mutation prevalence of 2.1 ± 0.94 × 10⁻⁴ mutations/bp (Table 2, mitochondrial
mutations in DNA from normal human cells). Only 135 ± 61 supermutants were observed in the
Safe-SeqS analysis. As with the CTNNB1 gene, the vast majority of mutations were single-base
substitutions, although occasional single-base deletions were also observed (Table S1,
mitochondrial mutations in DNA from normal human cells). The calculated mutation rate in the
analyzed segment of mtDNA was 1.4 ± 0.68 × 10⁻⁵ mutations/bp (Table 2, mitochondrial mutations
in DNA from normal human cells). Thus, Safe-SeqS reduced the apparent frequency of mutations in
mitochondrial DNA by at least 15-fold.
`
`Discussion
The results described above demonstrate that the Safe-SeqS approach can substantially improve
the accuracy of massively parallel sequencing (Tables 1 and 2). It can be implemented through
either endogenous or exogenously introduced UIDs and can be applied to virtually any sample
preparation workflow or sequencing platform.
As demonstrated here, the approach can easily be used to identify
`rare mutants in a population of DNA templates, to measure poly-
`
Table 2. Safe-SeqS with exogenous UIDs

                                                     Mean             SD
Polymerase fidelity
  Conventional analysis of seven replicates
    High-quality base pairs                          996,855,791      64,030,757
    Total mutations identified                       198,638          22,515
    Mutations/bp                                     2.0E-04          1.7E-05
    Calculated Phusion error rate (errors/bp/cycle)  9.1E-06          7.7E-07
  Safe-SeqS analysis of seven replicates
    High-quality base pairs                          996,855,791      64,030,757
    UID families                                     624,678          421,274
    Members/UID family                               107              122
    Total supermutants identified                    197              143
    Supermutants/bp                                  9.9E-06          2.3E-06
    Calculated Phusion error rate (errors/bp/cycle)  4.5E-07          1.0E-07
CTNNB1 mutations in DNA from normal human cells
  Conventional analysis of three individuals
    High-quality base pairs                          559,334,774      66,600,749
    Total mutations identified                       118,488          11,357
    Mutations/bp                                     2.1E-04          1.6E-05
  Safe-SeqS analysis of three individuals
    High-quality base pairs                          559,334,774      66,600,749
    UID families                                     374,553          263,105
    Members/UID family                               68               38
    Total supermutants identified                    99               78
    Supermutants/bp                                  9.0E-06          3.1E-06
Mitochondrial mutations in DNA from normal human cells
  Conventional analysis of seven individuals
    High-quality base pairs                          147,673,456      54,308,546
    Total mutations identified                       30,599           12,970
    Mutations/bp                                     2.1E-04          9.4E-05
  Safe-SeqS analysis of seven individuals
    High-quality base pairs                          147,673,456      54,308,546
    UID families                                     515,600          89,985
    Members/UID family                               15               6
    Total supermutants identified                    135              61
    Supermutants/bp                                  1.4E-05          6.8E-06

`
obstacle to success. Clever strategies for making the gene construction process more efficient
have been devised (48, 49), but all such strategies would benefit from more accurate synthesis
of the oligonucleotides themselves. Determining the number of errors in synthesized
oligonucleotides is difficult because the fraction of oligonucleotides containing errors can be
lower than the sensitivity of conventional next-generation sequencing analyses.

To determine whether Safe-SeqS could be used for this determination, we used standard
phosphoramidite chemistry to synthesize an oligonucleotide containing 31 bases that were
designed to be identical to that analyzed in the polymerase fidelity experiment described
above. In the synthetic oligonucleotide, the 31 bases were surrounded by sequences
complementary to primers that could be used for the UID assignment steps of Safe-SeqS (Fig. 3).
By performing Safe-SeqS on ∼300,000 oligonucleotide templates, we found that there were
8.9 ± 0.28 × 10⁻⁴ supermutants/bp and that these errors occurred throughout the sequence of the
oligonucleotides (Fig. S2A). The oligonucleotides contained a large number of insertion and
deletion errors, representing 8.2 ± 0.63% and 25 ± 1.5% of the total supermutants,
respectively. Importantly, both the position and the nature of the errors were highly
reproducible among seven independent replicates of this experiment performed on the same batch
of oligonucleotides (Fig. S2A). This nature and distribution of errors had little in common
with that of the errors produced by Phusion polymerase (Fig. S2B and Table
`
`
`
UID approaches (Fig. 2 and Fig. S1) and the one described by Travers et al. are not ideally
suited for this purpose because of the inevitable losses of template molecules during the
ligation and other preparative steps.

How do we know that the mutations identified by conventional analyses in the current study
represent artifacts rather than true mutations in the original templates? Strong evidence
supporting this is provided by the observation that the mutation prevalence in all but one
experiment was similar: 2.0 × 10⁻⁴ to 2.4 × 10⁻⁴ mutations/bp (Tables 1 and 2). The exception
was the experiment with oligonucleotides synthesized from phosphoramidites, in which the error
of the synthetic process was apparently higher than the error rate of conventional Illumina
analysis when used with stringent base-calling criteria. In contrast, the mutation prevalence
of Safe-SeqS varied much more, from 0.0 to 1.4 × 10⁻⁵ mutations/bp, depending on the template
and experiment. Moreover, the mutation prevalence measured by Safe-SeqS in the most controlled
experiment, in which polymerase fidelity was measured (Table 2, polymerase fidelity), was
almost identical to that predicted from previous experiments in which polymerase fidelity was
measured by biological assays. Our measurements of mutation prevalence in the DNA from normal
cells are consistent with some previous experimental data. However, estimates of these
prevalences vary widely and may depend on cell type and sequence analyzed (SI Materials and
Methods). We therefore cannot be certain that the relatively low number of mutations revealed
by Safe-SeqS represented errors occurring during the sequencing process rather than true
mutations present in the original DNA templates. Potential sources of error in the Safe-SeqS
process are described in SI Materials and Methods.
Like all techniques, Safe-SeqS has limitations. For example, we have demonstrated that the
exogenous UIDs strategy can be used to analyze a single amplicon in depth. This technology may
not be applicable to situations wherein multiple amplicons must be analyzed from a sample
containing a limited number of templates. Multiplexing in the UID assignment cycles (Fig. 3)
may provide a solution to this challenge. A second limitation is that the efficiency of
amplification in the UID assignment cycles is critical for the success of the method. Clinical
samples can contain inhibitors that reduce the efficiency of this step. This problem can
presumably be overcome by performing more than two cycles in the UID assignment PCR step
(Fig. 3), although this would complicate the determination of the number of templates analyzed.
The specificity of Safe-SeqS is currently limited by the fidelity of the polymerase used in the
UID assignment PCR step, i.e., 8.8 × 10⁻⁷ mutations/bp in its current implementation with two
cycles. Increasing the number of cycles in the UID assignment PCR step to five would decrease
the overall specificity to ∼2 × 10⁻⁶ mutations/bp. However, this specificity can be increased
by requiring more than one supermutant for mutation identification: the probability of
introducing the same artifactual mutation twice or three times would be exceedingly low
[(2 × 10⁻⁶)² or (2 × 10⁻⁶)³, respectively]. In sum, there are several simple ways to vary the
Safe-SeqS procedure and analysis to realize the needs of specific experiments.
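
The specificity figures in this paragraph follow from simple arithmetic, sketched below. The per-cycle error rate is the manufacturer's value quoted earlier in the text; the function names are illustrative, and the coincidence calculation mirrors the simplified independence assumption used in the text rather than a full statistical model.

```python
PHUSION_ERRORS_PER_BP_PER_CYCLE = 4.4e-7  # manufacturer's figure quoted in the text


def uid_assignment_error_floor(n_cycles, error_rate=PHUSION_ERRORS_PER_BP_PER_CYCLE):
    """Artifactual mutations/bp introduced during the UID assignment cycles.

    Errors made in these cycles are copied into every family member, so
    they cannot be filtered out by the supermutant rule.
    """
    return n_cycles * error_rate


def coincidence_probability(per_bp_error, n_supermutants_required):
    """Chance the same artifact arises independently in several families."""
    return per_bp_error ** n_supermutants_required


two_cycle_floor = uid_assignment_error_floor(2)    # current implementation
five_cycle_floor = uid_assignment_error_floor(5)   # the five-cycle variant
twice = coincidence_probability(2e-6, 2)
three_times = coincidence_probability(2e-6, 3)
```

Two cycles give 8.8 × 10⁻⁷ and five cycles give 2.2 × 10⁻⁶, which the text rounds to ∼2 × 10⁻⁶; requiring two or three independent supermutants lowers the coincidence probability to the order of 10⁻¹² or 10⁻¹⁸.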
Luria and Delbrück, in their classic paper in 1943, wrote that their “prediction cannot be
verified directly, because what we observe, when we count the number of resistant bacteria in a
culture, is not the number of mutations which have occurred but the number of resistant
bacteria which have arisen by multiplication of those which mutated, the amount of
multiplication depending on how far back the mutation occurred” (ref. 1, p. 495). The Safe-SeqS
procedure described here can verify such predictions because the number as well as the time of
occurrence of each mutation can be estimated from the data, as noted in the experiments on
polymerase fidelity. In addition to templates generated by polymerases in vitro, the same
approach can be applied to DNA from bacteria, viruses, and mammalian cells. We therefore expect
that this
[Fig. 4 plots. Panels A and B with an inset; x-axis: Mutation number; y-axis: Frequency per
10,000 bp; series: Indiv. 1, Indiv. 2, Indiv. 3.]

Fig. 4. Single-base substitutions identified by conventional and Safe-SeqS analysis. The
exogenous UID strategy depicted in Fig. 3 was used to produce PCR fragments from the CTNNB1
gene of three normal, unrelated individuals. Mutation numbers represent one of 87 possible
single-base substitutions (3 possible substitutions/base × 29 bases analyzed). These fragments
were sequenced on an Illumina GA IIx instrument and analyzed in the conventional manner (A) or
with Safe-SeqS (B). Safe-SeqS results are displayed on the same scale as conventional analysis
for direct comparison; the Inset is a magnified view. Note that most of the variants identified
by conventional analysis are likely to represent sequencing errors, as indicated by their high
frequency relative to Safe-SeqS and their consistency among unrelated samples.
`
`merase error rates, and to judge the reliability of oligonucleotide
`syntheses. One of the advantages of the strategy is that it yields the
`number of templates analyzed as well as the fraction of templates
`containing variant bases. Previously described in vitro methods for
`the detection of small numbers of template molecules (e.g., refs. 29
`and 50) allow the fraction of mutant templates to be determined but
`cannot d