The Use of Coded PCR Primers Enables High-Throughput Sequencing of Multiple Homolog Amplification Products
by 454 Parallel Sequencing
Jonas Binladen1☯, M. Thomas P. Gilbert1☯, Jonathan P. Bollback2, Frank Panitz3, Christian Bendixen3, Rasmus Nielsen2, Eske Willerslev1*
`
`1 Center for Ancient Genetics, Institute of Biology, University of Copenhagen, Copenhagen, Denmark, 2 Center for Bioinformatics and Institute of
`Biology, University of Copenhagen, Copenhagen, Denmark, 3 Department of Genetics and Biotechnology, Danish Institute of Agricultural Sciences
`Research Centre Foulum, Tjele, Denmark
`
Background. The invention of the Genome Sequencer 20™ DNA Sequencing System (454 parallel sequencing platform) has
enabled the rapid and high-volume production of sequence data. Until now, however, individual emulsion PCR (emPCR)
reactions and subsequent sequencing runs have been unable to combine template DNA from multiple individuals, as
homologous sequences cannot be subsequently assigned to their original sources. Methodology. We use conventional PCR
with 5′-nucleotide tagged primers to generate homologous DNA amplification products from multiple specimens, followed by
sequencing through the high-throughput Genome Sequencer 20™ DNA Sequencing System (GS20, Roche/454 Life Sciences).
Each DNA sequence is subsequently traced back to its individual source through 5′ tag analysis. Conclusions. We demonstrate
that this new approach enables the assignment of virtually all the generated DNA sequences to the correct source once
sequencing anomalies are accounted for (mis-assignment rate <0.4%). The method therefore enables accurate sequencing
and assignment of homologous DNA sequences from multiple sources in a single high-throughput GS20 run. We observe a bias
in the distribution of the differently tagged primers that is dependent on the 5′ nucleotide of the tag. In particular, primers
5′ labelled with a cytosine are heavily overrepresented among the final sequences, while those 5′ labelled with a thymine are
strongly underrepresented. A weaker bias also exists with regard to the distribution of the sequences as sorted by the second
nucleotide of the dinucleotide tags. As the results are based on a single GS20 run, the general applicability of the approach
requires confirmation. However, our experiments demonstrate that 5′ primer tagging is a useful method by which the
sequencing power of the GS20 can be applied to PCR-based assays of multiple homologous PCR products. The new approach
will be of value to a broad range of research areas, such as comparative genomics, complete mitochondrial analyses,
population genetics, and phylogenetics.
`
Citation: Binladen J, Gilbert MTP, Bollback JP, Panitz F, Bendixen C, et al. (2007) The Use of Coded PCR Primers Enables
High-Throughput Sequencing of Multiple Homolog Amplification Products by 454 Parallel Sequencing. PLoS ONE 2(2): e197.
doi:10.1371/journal.pone.0000197
`
Academic Editor: Matthew Hahn, Indiana University, United States of America

Received November 27, 2006; Accepted January 16, 2007; Published February 14, 2007

Copyright: © 2007 Binladen et al. This is an open-access article distributed under the terms of the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.

Funding: JB and EW were supported by the Wellcome Trust, UK, the Carlsberg Foundation, DK, and the National Science
Foundation, DK. MTPG acknowledges the Marie Curie Actions FP6-MEIF-CT-2005-025002 ‘FORMAPLEX’ grant for funding
his research. JPB and RN were funded by the Danish FSS and the National Science Foundation, DK. None of the sponsors
or funders have had any influence on the data or manuscript presented here.

Competing Interests: The authors have declared that no competing interests exist.

* To whom correspondence should be addressed. E-mail: ewillerslev@bi.ku.dk

☯ These authors contributed equally to this work.

PLoS ONE | www.plosone.org | February 2007 | Issue 2 | e197

Ariosa Exhibit 1005

5′ Primer Tags on the GS20

INTRODUCTION
The arrival of the Genome Sequencer 20™ DNA Sequencing System (GS20, Roche/454 Life Sciences) and the associated
‘Sequencing-by-Synthesis’ protocol has enabled pyrosequencing of up to 25 million nucleotides in a single four-hour
reaction [1]. The method employs single molecule amplification prior to sequencing and therefore eliminates the need
for prior cloning. In initial implementations of the technology, random fragments from DNA extracts have been
sequenced without a priori selection of specific genetic regions. As such, all DNA that is present in the sample has
a chance of being amplified and sequenced that approximately corresponds to its frequency within the DNA extract.
The method has proven an efficient tool for use in a number of specific cases, such as the rapid sequencing of
relatively small genomes [1,2].
For purposes such as comparative genomics, mitochondrial sequencing, and population genetics, it is of interest to
combine the selectivity of primer-based PCR with the sequencing power of the GS20 platform. The simplest way to
achieve this is to use the GS20 to emulsion PCR (emPCR) and then pyrosequence the products of individual PCR
reactions. Due to the sequencing power of the GS20, this approach results in hundreds of thousands of individual
sequences from a single PCR reaction, each derived directly from a single original template within the reaction [1].
As such, the result is similar to the generation of sequence data through conventional cloning. We henceforth refer
to the GS20-derived sequences as single molecule sequences. In many studies, however, the number of single molecule
sequences produced by a single GS20 run is unnecessary, and the approach is economically unfeasible unless several
PCR products can be processed simultaneously and correctly assigned.
Thomas and co-authors [3] recently took the first step in making this possible by pooling together eleven PCR
products, each targeting different regions of the genome, into single sequencing-by-synthesis reactions using the
GS20. In this case, the authors could easily sort the sequence data due to the unique genetic sequence of each
original target. Furthermore, by sequencing the combined PCR products from separate individual specimens on specially
partitioned fragments (1/8 sections) of the GS20 PicoTitrePlate™, they were rapidly able to generate large numbers of
sequences from each of the eleven PCR products (≈1,000 per product) [3].
While this represents an excellent advance in the exploitation of the GS20, in theory the combined “primer specific
PCR/GS20” approach can be enhanced even further. For example, the number of sequences generated in even a 1/8th run
of the GS20 using a 40×75 PicoTitrePlate™ (currently the smallest commercially available subdivision of a single GS20
reaction) is large; in our experience such a run routinely generates at least 6,000, and more commonly over 10,000,
sequences. With an estimated 10-fold coverage, the method of Thomas et al. [3] could thus enable the pooling of
600 PCR products in a single reaction. However, the subsequent identification of the sequence reads would require
either the pooling of 600 PCR products targeting unique genetic regions or, if multiple homologous PCR products
were to be co-sequenced (i.e. multiple different products amplified using a single identical primer pair), a priori
knowledge of the exact sequence of each target.
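The pooling estimate above is a simple division of the reads per run by the coverage wanted per product. A minimal sketch of that arithmetic follows; the function name is hypothetical and only the figures quoted in the text are used.

```python
def max_pooled_products(reads_per_run: int, coverage: int) -> int:
    """Number of PCR products that can be pooled if each needs `coverage` reads."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return reads_per_run // coverage


# Lower bound and more typical yield of a 1/8th plate run, as quoted in the text.
print(max_pooled_products(6_000, 10))   # 600
print(max_pooled_products(10_000, 10))  # 1000
```

The estimate assumes reads are spread evenly over the pooled products, which is an idealisation.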
In this paper we overcome this problem by presenting a method in which the initial PCR primers are 5′-tagged with
short nucleotide sequences (tags) in such a way that a unique tagged primer combination can be applied to each
specific DNA template source. As sequences generated by the GS20 commence at the very first position of the source
DNA fragment, the tags are observed in the generated sequences. Sequences can therefore rapidly be sorted by their
original template source using the tags (Figure 1). Currently, the method provides a means for the simultaneous
generation, sequencing, and assignment of short (<120 bp) single molecule sequences from homologous PCR products
obtained from multiple individuals. However, as the GS20 sequencing-by-synthesis technologies are developed to
increase both the number and the length of the sequences generated, the power of this technique will likewise increase.
`
METHODS
In theory, a GS20 reaction that has been performed on a pool of different PCR products at equimolar concentration
should generate an equal number of sequences from each PCR product. However, in practice it can be expected that
random processes occurring during the procedure will result in a Poisson distributed relative frequency of the final
products. In addition to this, the different 5′ tags used on the primers for the initial PCR might potentially bias
the final sequence distribution. As a result, the incorporation of too many different PCR products in a single
GS20 reaction could result in some of them not being sequenced. In contrast, the incorporation of too few PCR
products in a single 454 parallel sequencing run reduces the efficiency and cost efficacy of the method.
Furthermore, as one advantage of the approach is the generation of single molecule sequences, it is useful to
empirically determine how many sequences can be expected from each of a set of PCR products that are pooled in
equimolar concentration.
We performed a test involving the analysis of a single genetic marker in DNA extracts from multiple different
individuals to investigate the effectiveness of this method (i.e. how many individual PCR products can be expected
to be represented among a set number of sequences). This was achieved using a single conventional pair of mammalian
mitochondrial DNA (mtDNA) 16S primers [4]. The primers were originally designed as mammalian generic, and amplify
an 89–97 bp fragment (133–141 bp including primers) that is discriminatory between mammalian species. The study is
thus an analogue to a likely use of the technique: the PCR amplification and sequencing of specific genetic regions
from multiple individuals of a single species.

Figure 1. The application of 5′ primer tags to the GS20 sequencing-by-synthesis process.
doi:10.1371/journal.pone.0000197.g001

5′ primer tagging
The original primers were modified into sixteen unique forward and sixteen unique reverse primers through the
addition of 5′ dinucleotide tags (Table 1). In contrast to most conventional sequencing platforms, pyrosequencing
methods (such as that used by the GS20) generate data from the first base of the fragment sequenced. Thus, the 5′
tags on each primer will be apparent in the final sequence. The sixteen unique forward and reverse primers can be
combined to make 16 × 16 = 256 unique sequence tags. In this way, an investment of thirty-two initial primers could
in theory enable the subsequent discrimination of 256 different products. However, under the current status of the
sequencing technology, GS20 sequencing reads are limited to approximately 120 bases; thus in this experiment the
full sequence (133–141 bp including primers, species dependent) was not returned and our analyses were limited to
simply discriminating using the primer at the sequenced end of the product. Furthermore, during the GS20 process,
single DNA fragments are immobilised on beads in either orientation (cf. [1] for details). The implication of this
is that approximately 50% of each PCR product will be sequenced from the orientation of the forward primer, and 50%
from the orientation of the reverse primer. Hence, it was necessary to label both the forward and the reverse end of
each PCR product.
In addition to the above experiments, three further unique primer pairs containing tetranucleotide tags (Table 1)
were designed and used for PCR, in order to investigate whether an increased tag length affects the efficiency of
the method. Increasing the tag length would exponentially increase the number of possible unique primer
combinations, and thus of PCR reactions that can be incorporated into a single GS20 sequencing run.

DNA samples analysed
DNA from thirteen species was used as PCR template (Table 2). The target species and size of the PCR insert
(excluding primers) were as follows: impala (Aepyceros melampus) 92 bp; grey wolf (Canis lupus) 91 bp; cheetah
(Acinonyx jubatus) 91 bp; hippopotamus (Hippopotamus amphibius) 91 bp; lion (Panthera leo) 95 bp; saiga antelope
(Saiga tatarica) 93 bp; Mueller’s Bornean gibbon (Hylobates muelleri) 94 bp; narwhal (Monodon monoceros) 90 bp;
domestic mouse (Mus domesticus) 97 bp; musk ox (Ovibos moschatus) 93 bp; human 94 bp; Burchell’s zebra (Equus
burchelli) 89 bp; and African buffalo (Syncerus caffer) 94 bp. The DNA was extracted from frozen specimens using
the DNeasy tissue extraction kit (Qiagen) following the manufacturer’s protocol. To increase the number of different
PCR products that we could pool into the GS20 reaction beyond a single product from each of the thirteen available
extractions, we used individual primer pairs on several different extractions each (Table 2).

PCR conditions
We generated 64 differently labelled 16S mtDNA PCR fragments (Table 2). PCRs were performed in 25 µl reactions
containing 1× PCR Buffer, 2.5 mM MgCl2 solution, 0.2 mM dNTP Mix, 1 U Taq DNA Polymerase, 1 µM each primer and 1 µl
DNA extract. Cycling was performed using a Mastercycler Gradient Thermal Cycler (Eppendorf) with the following cycle
program: initial denaturation at 94°C for 2 minutes, followed by 25 cycles of 94°C for 30 seconds, 56°C for
30 seconds and 72°C for 30 seconds, followed by a final extension of 8 minutes at 72°C. Five µl of each PCR product
was visualised on 2% agarose gels using ethidium bromide staining and UV light transillumination. Positive PCR
products were purified using the Invisorb Spin PCRapid kit (Invitek) and quantified using a Nanodrop ND-1000
(Nanodrop Technologies). Quantification was performed directly on the purified PCR products (that is, without
dilution). Several duplicate measurements indicated that intrasample measurement variation was negligible. Purified
yields were between 3.8 and 26.1 ng/µl (Supplementary Table S1). Subsequently the PCR products were pooled together.
The PCR products were at equimolar concentrations (26.1 ng each) with two exceptions: amplification products from
the buffalo were added at double concentration (52.2 ng), and PCR products generated from the zebra template used
twice the number of
Table 1. 5′ tagged PCR primers

Forward primers                        Reverse primers
Name     Sequence (5′–3′)              Name     Sequence (5′–3′)
16Faa    aacggttggggtgacctcgga         16Raa    aagctgttatccctagggtaact
16Fac    accggttggggtgacctcgga         16Rac    acgctgttatccctagggtaact
16Fag    agcggttggggtgacctcgga         16Rag    aggctgttatccctagggtaact
16Fat    atcggttggggtgacctcgga         16Rat    atgctgttatccctagggtaact
16Fca    cacggttggggtgacctcgga         16Rca    cagctgttatccctagggtaact
16Fcc    cccggttggggtgacctcgga         16Rcc    ccgctgttatccctagggtaact
16Fcg    cgcggttggggtgacctcgga         16Rcg    cggctgttatccctagggtaact
16Fct    ctcggttggggtgacctcgga         16Rct    ctgctgttatccctagggtaact
16Fga    gacggttggggtgacctcgga         16Rga    gagctgttatccctagggtaact
16Fgc    gccggttggggtgacctcgga         16Rgc    gcgctgttatccctagggtaact
16Fgg    ggcggttggggtgacctcgga         16Rgg    gggctgttatccctagggtaact
16Fgt    gtcggttggggtgacctcgga         16Rgt    gtgctgttatccctagggtaact
16Fta    tacggttggggtgacctcgga         16Rta    tagctgttatccctagggtaact
16Ftc    tccggttggggtgacctcgga         16Rtc    tcgctgttatccctagggtaact
16Ftg    tgcggttggggtgacctcgga         16Rtg    tggctgttatccctagggtaact
16Ftt    ttcggttggggtgacctcgga         16Rtt    ttgctgttatccctagggtaact
16SF4a   gctacggttggggtgacctcgga       16SR4a   gtacgctgttatccctagggtaact
16SF4b   tcagcggttggggtgacctcgga       16SR4b   tgacgctgttatccctagggtaact
16SF4c   ctagcggttggggtgacctcgga       16SR4c   tagcgctgttatccctagggtaact

doi:10.1371/journal.pone.0000197.t001
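
The dinucleotide primers in Table 1 follow a regular pattern: every possible two-base tag is prepended to a fixed core primer. The sketch below, offered as an illustration rather than as the authors' own procedure, regenerates those names and sequences and counts the forward/reverse combinations. The core sequences are read off Table 1 by removing the tag; the function and variable names are hypothetical.

```python
from itertools import product

# Core (untagged) primer sequences, read off Table 1 by removing the 5' tag.
FORWARD_CORE = "cggttggggtgacctcgga"
REVERSE_CORE = "gctgttatccctagggtaact"


def all_tags(length: int) -> list[str]:
    """Every possible tag of the given length over the alphabet a, c, g, t."""
    return ["".join(bases) for bases in product("acgt", repeat=length)]


def tagged_primers(core: str, prefix: str, tag_length: int = 2) -> dict[str, str]:
    """Map primer names such as '16Faa' to the tag followed by the core sequence."""
    return {f"{prefix}{tag}": tag + core for tag in all_tags(tag_length)}


forward = tagged_primers(FORWARD_CORE, "16F")
reverse = tagged_primers(REVERSE_CORE, "16R")

# Each forward/reverse pairing can label one PCR product.
pairs = [(f, r) for f in forward for r in reverse]

print(forward["16Faa"])          # aacggttggggtgacctcgga
print(len(forward), len(pairs))  # 16 256
print(len(all_tags(4)) ** 2)     # 65536 pairings if every 4-base tag were used
```

The last line illustrates the exponential growth mentioned in the text: moving from two-base to four-base tags raises the number of possible tags per primer from 16 to 256.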
`
Table 2. Assigned sequence distribution
[Table body not recoverable from the extracted text; only its layout survives.]
Rows: the 38 tagged primers of Table 1 (16FAA to 16FTT, 16RAA to 16RTT, 16SF4A to 16SF4C and 16SR4A to 16SR4C).
Columns: Primer; Wolf (Canis lupus); Cheetah (Acinonyx jubatus); Hippopotamus (Hippopotamus amphibius); Lion
(Panthera leo); Saiga (Saiga tatarica); Gibbon (Hylobates muelleri); Narwhal (Monodon monoceros); Domestic Mouse
(Mus domesticus); Musk Ox (Ovibos moschatus); Human (Homo sapiens); Zebra (Equus burchelli); African Buffalo
(Syncerus caffer); Impala (Aepyceros melampus); Total; Correctly assigned; Incorrectly assigned; Assignment error.
`
different tags. The pooled PCR products were subsequently analysed on the GS20 platform using the complete sample
preparation and analytical process, as recommended by the manufacturer (Roche). The initial sample concentration was
9.33 ng/µl and 21 ng (23 µl) was used for the reaction. No nebulization was performed and the average concentration
of the single stranded library was 75 pg/µl. The calculated dilution factor was 5.25 and sequencing was performed as
a full titration run without bead enrichment, i.e. the run was performed on a 40×75 plate divided into 8 sectors
(a titration run uses 4 of these sectors with different numbers of DNA molecules per bead, i.e. 1, 4, 16, and 64
respectively).
`
`Conventional sequencing of the targets
Although the complete 16S mtDNA sequences for most of the species analysed are available in the public domain, we
regenerated the target sequences for the thirteen mammal species using conventional dye-labelled sequencing
(sequencing reactions and analyses were performed on Applied Biosystems platforms by Macrogen, Korea). This was to
ensure that subsequent analyses did not mistake natural sequence variation for sequencing errors. The thirteen
individual 16S mtDNA sequences are deposited in GenBank under the accession numbers EF152485–EF152497.
`
`Initial assignment of the sequence data
As the correct association of tags and sequences is crucial to the approach, we followed very conservative criteria
post sequencing with regard to acceptance of the sequence data. Initially, we discarded all sequence reads without
an exact match to any of the primers used in the study (Primer Mismatched Sequences). Subsequently, the remaining
sequences were identified by global alignment to the thirteen reference sequences (generated by Sanger sequencing)
in both direct and reverse-complement orientation. The global alignment was performed using ClustalW [5] with the
following scoring scheme: match (+5), mismatch (−4), gap penalty (−10), and gap extension penalty (−10). The gap
penalties were not applied to end gaps. For each alignment a percent identity score was calculated to determine the
best match in the following way: the number of mismatches was counted, excluding end gaps, ambiguous states (Ns) in
the 454 sequence, and gaps introduced in the reference sequence during alignment.
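
The two screening steps described above (exact primer matching, then mismatch counting on a gapped alignment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the alignment itself is assumed to have been produced elsewhere (for example by ClustalW), all function names are hypothetical, and the rule that a gap in the read opposite a reference base counts as one mismatch is an interpretation of the text.

```python
from typing import Optional


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string (N stays N, case is preserved)."""
    table = str.maketrans("acgtnACGTN", "tgcanTGCAN")
    return seq.translate(table)[::-1]


def match_primer(read: str, primers: dict[str, str]) -> Optional[str]:
    """Name of the tagged primer the read starts with, or None.

    A read returning None would be set aside as a Primer Mismatched Sequence.
    """
    read = read.lower()
    for name, primer in primers.items():
        if read.startswith(primer.lower()):
            return name
    return None


def count_mismatches(read_aln: str, ref_aln: str) -> int:
    """Count mismatching columns of a gapped pairwise alignment.

    Skipped columns: end gaps of either sequence, Ns in the read, and gaps
    in the reference (an assumed reading of the rule given in the text).
    """
    if len(read_aln) != len(ref_aln):
        raise ValueError("aligned sequences must have equal length")

    def interior(aln: str) -> tuple[int, int]:
        # Half-open column range left after removing leading/trailing gaps.
        return len(aln) - len(aln.lstrip("-")), len(aln.rstrip("-"))

    (r_start, r_stop), (f_start, f_stop) = interior(read_aln), interior(ref_aln)
    lo, hi = max(r_start, f_start), min(r_stop, f_stop)

    mismatches = 0
    for a, b in zip(read_aln[lo:hi].lower(), ref_aln[lo:hi].lower()):
        if a == "n" or b == "-":
            continue
        if a != b:
            mismatches += 1
    return mismatches


primers = {"16Faa": "aacggttggggtgacctcgga", "16Fac": "accggttggggtgacctcgga"}
print(match_primer("ACCGGTTGGGGTGACCTCGGAtttt", primers))  # 16Fac
print(match_primer("cggttggggtgacctcggatttt", primers))    # None (tag missing)
print(reverse_complement("aacg"))                          # cgtt
print(count_mismatches("--acgtnacg-a", "ttacgaaa-gta"))    # 2
```

Under the criterion given in the next paragraph, a read would be kept as Assigned when the count for its best-scoring reference is at most one.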
If a sequence differed at more than one nucleotide from the reference in its highest scoring alignment, it was set
aside in a separate dataset. We refer to these sequences as Non-Assigned sequences, and to the remaining sequences
as Assigned sequences. The per nucleotide error rate estimated from this type of data is 7.5×10⁻⁴ [6]. With reads of
a length of ≈100 bp excluding primers, and primers of length 22 bp, the expected proportion of Non-Assigned sequences
is then 2.7×10⁻³ and the expected proportion of Primer Mismatched Sequences is 1.6×10⁻². Any excess of Non-Assigned
or Primer Mismatched Sequences above these levels is then due to experimental errors other than sequencing errors,
such as contamination.
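
The two expected proportions can be reproduced with a simple binomial model. The sketch below assumes, as a simplification not stated in the text, that sequencing errors occur independently at each position with the quoted per-nucleotide rate; a read is then Primer Mismatched if its 22 bp primer carries at least one error, and Non-Assigned if its ≈100 bp insert carries at least two. The function name is hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed via the complement."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


ERROR_RATE = 7.5e-4  # per-nucleotide error rate quoted from [6]

primer_mismatched = prob_at_least(1, 22, ERROR_RATE)  # any error in a 22 bp primer
non_assigned = prob_at_least(2, 100, ERROR_RATE)      # two or more errors in ~100 bp

print(f"{primer_mismatched:.1e}")  # about 1.6e-02
print(f"{non_assigned:.1e}")       # about 2.7e-03
```

Both values agree with the proportions quoted above to the precision given there.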
The identity of the Non-Assigned sequences is of some interest, as it may provide information regarding these other
sources of experimental error. Thus the Non-Assigned sequences were subsequently subjected to BLAST [7] analyses
against the NCBI GenBank DNA database in order to determine their identity. During this (and other) BLAST analyses,
when two or more hits with identical E-values were reported, we prioritised any that matched our 13 target sequences
over others.
`
Table 2 (cont.). Analysis by column (species)

Species                                   Correctly assigned   Incorrectly assigned   Species assignment error
Wolf (Canis lupus)                        398                  0                      0.0000
Cheetah (Acinonyx jubatus)                431                  1                      0.0023
Hippopotamus (Hippopotamus amphibius)     188                  0                      0.0000
Lion (Panthera leo)                       147                  0                      0.0000
Saiga (Saiga tatarica)                    422                  2                      0.0047
Gibbon (Hylobates muelleri)               220                  1                      0.0045
Narwhal (Monodon monoceros)               470                  2                      0.0043
Domestic Mouse (Mus domesticus)           424                  1                      0.0024
Musk Ox (Ovibos moschatus)                279                  0                      0.0000
Human (Homo sapiens)                      127                  11                     0.0866
Zebra (Equus burchelli)                   988                  0                      0.0000
African Buffalo (Syncerus caffer)         782                  2                      0.0026
Impala (Aepyceros melampus)               746                  0                      0.0000
Sum                                       5622                 20
Mean                                      432.46               1.5385                 0.0083

Summary over the 38 primers: total assigned 5642 (mean 148.5 per primer; 83.4% of GS20 sequences); correctly
assigned 5622 (mean 147.9474; 83.1% of GS20 sequences); incorrectly assigned 20 (mean 0.5263); assignment error,
sum 0.1525, mean 0.0040. Overall mis-assignment rate: 0.003557453.

Italic numbers indicate mis-assigned sequences.
doi:10.1371/journal.pone.0000197.t002
`
`
`
`RESULTS
`GS20 sequences generated
A total of 6765 DNA sequences were generated using the GS20 platform (Sequence Data S1, provided in the supplemental
information). The sequencing was performed as a titration run with no bead enrichment and different DNA/bead
ratios; therefore the number of sequences is lower than previously reported for PCR products (8,000–12,000, [3]).
As such, the calculations of the sequencing efficiency in this study provide a conservative estimate of the
potential power of the technique.
`
`Sequence analysis
Primer Mismatch Sequences. Due to the stringent screening criteria applied in this study, 458 (6.8%) of the 6765
initial sequences generated from 1/8th of a plate run on the GS20 were identified as Primer Mismatch Sequences (see
above). These grouped as follows: 377 sequences (5.6%) have sequencing errors in the primer sequence, 54 sequences
(0.8%) have the primer sequence starting one position off, 3 sequences (0.04%) have the primer sequence starting
two positions off, and 24 sequences (0.4%) have the primer sequence starting more than two positions off. As the
theoretically expected proportion of mismatches based on the sequencing error rate is 1.6%, other sources of error
(such as damage to the original DNA template, sequencing errors or mtDNA heteroplasmy) may be affecting the results.
The 458 Primer Mismatch Sequences were identified using BLAST, revealing that 395 of the sequences (86.2%) match the
reference sequences of the study. The Primer Mismatch Sequences include 81 sequences in which the primers are as
expected but positioned one or more base pairs off the 5′ end of the sequence. Among these, 80 sequences match DNA
sequences from species used in this study (Supplementary Table S2). Of the sequences containing errors in the
primers, 313 of 377 (83.0%) matched species used in this study (Supplementary Table S3).
That so many sequences contained sequencing errors in the primers (n = 377) was surprising, and warranted further
investigation. The sequences could be divided into four broad categories as follows: those that failed to show any
match to the primer sequences in general (n = 2); those that were an exact match to the core primer but lacked the
5′ tag sequence (n = 121); those that contained at least one mismatch and no indels (insertions/deletions) (n = 53);
and those that contained at least one indel (n = 201), 21 of which also contained a mismatch. We subsequently
investigated whether the errors may have arisen during the primer synthesis itself, and not during the
sequencing-by-synthesis process. This was tested under the assumptions that a) errors arising during the primer
synthesis process would be randomly distributed along the primer sequence, and that b) primers containing errors in
the four 3′ nucleotides would bind poorly to the template DNA, and thus not enable PCR amplification. If this were
the case, then although prior to PCR a random distribution of sequence errors should be observed across the primer
sequences, post PCR significantly fewer errors should be observed at the 3′ end of the primer. A χ² test on the
distribution of the sequencing errors between the five 3′ terminal nucleotides, the next five (middle) nucleotides,
and the remaining 5′ nucleotides confirms that there are significantly fewer sequencing errors in the five terminal
3′ nucleotides of the primers (Pearson’s χ² test, χ² = 17.506, p = 0.00001). Therefore the data suggest that at
least some of the primer-related errors can be explained by errors during the primer synthesis itself. (We note,
however, that this test was only performed on the primers that contained mismatches without indels, due to the
difficulty of accurately aligning the primers that contained indels.)
Assigned Sequences. The remaining 6307 sequences were identified through a global alignment to the 13 reference
sequences. Of these, 5642 sequences (89%) diverged by no more than 1 bp from one of the reference sequences, and
could thus be assigned to one of the taxa analysed in the study (Table 2). Twenty sequences (0.4%) were mis-assigned
to an incorrect identity. Strikingly, more than half of the mis-assigned sequences (n = 11) are of human origin.
Given the omnipresent nature of human DNA in most laboratory settings, this bias is likely due to contamination
during extractions and/or PCR setup. Ignoring all human sequences (n = 138), only 9 sequences were mis-assigned out
of a total of 5504 non-human GS20 sequences (a mis-assignment proportion of 0.00163, i.e. 0.16%). Based on a GS20
sequencing error rate of 7×10⁻⁴ [6], the expected number of mis-assignments due to sequencing errors in the
dinucleotide tag is 2 × 5504 × (7×10⁻⁴) ≈ 7.4 mismatches. Thus, the obtained result is remarkably close to the
expected value, and mis-assignments of non-human sequences can be explained by sequencing errors alone. This result
shows that despite the possibility of sequencing (and other) errors, the assignment based on 5′ tagging is
remarkably reliable.
Non-Assigned Sequences. Of the 6307 sequences that did not contain a primer error, 665 sequences diverge from the
reference sequences by more than 1 bp. However, the expected number of such sequences based on the known sequencing
error rate is only 6307 × (2.7×10⁻³) ≈ 17, suggesting a significant impact of other factors. Obvious candidates
include the amplification of non-targeted genomic sequences (for example laboratory contamination), DNA damage or
heteroplasmy in the original template, and errors introduced into the DNA during the initial PCR stage (where
a non-proofreading polymerase was used). Of the 665 Non-Assigned sequences, 491 (73.8%) have one of the 13 original
taxa as their highest BLAST hit (Supplementary Table S4).
Sequence distribution. On average each of the 64 amplicons (grouping forward and reverse reads) had 85× coverage
with a standard deviation of 32 (Table 2). The coverage variation is thus very large. At the extremes, the zebra DNA
amplified with the TA tag generated a single forward read and no reverse read, while the zebra DNA amplified with
the CC tag resulted in more than 100 forward and reverse reads. There is no evidence that forward and reverse
strands are unequally distributed within the sequence dataset (Pearson’s χ² test, χ² = 27.2793, df = 18,
p = 0.0739).
5′ tag distribution. A Pearson’s χ² test strongly rejects an equal distribution among the different tags
(χ² = 1725.28, df = 18, p = 0.0). The divergence from the expected numbers is primarily caused by an excess of 5′CN
tagged amplicons (N representing A, T, G or C) and a depletion of 5′TN tagged amplicons (Table 3), with a small
surplus of 5′GN and a small depletion of 5′AN tags. We also investigated whether the identity of the second
nucleotide within each tag led to a non-random distribution of the final sequences. This was achieved using χ²
analysis on the 4 independent datasets constituted by the 5′ nucleotide A, C, G and T respectively (i.e. the
4 primer groups AN, CN, GN and TN). The results indicate that, with the exception of the 5′ T labelled tags, the
sequences were non-randomly distributed (AN, χ² = 60.0, d.f. = 3, p = 0.0; CN, χ² = 10.0, d.f. = 3, p = 0.0186;
GN, χ² = 16.3, d.f. = 3, p = 0.0009; TN, χ² = 2.35, d.f. = 3, p = 0.5039). Due to the limited number of
tetranucleotide tags analysed, we were unable to investigate the effect of the identity of the 3rd and 4th position
nucleotides.
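
For readers unfamiliar with the statistic used above, the following is a minimal sketch of a Pearson χ² goodness-of-fit calculation against an equal (uniform) expectation. The counts are hypothetical and chosen only to make the arithmetic easy to follow; the exact grouping of reads behind the reported values is not specified in the text, so the sketch is not claimed to reproduce them. The function name is likewise hypothetical.

```python
def pearson_chi_square(observed: list[float]) -> tuple[float, int]:
    """Pearson statistic and degrees of freedom against a uniform expectation."""
    if len(observed) < 2:
        raise ValueError("need at least two categories")
    expected = sum(observed) / len(observed)
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, len(observed) - 1


# Hypothetical read counts for four tags sharing the same first nucleotide.
toy_counts = [10, 20, 30, 40]
chi2, df = pearson_chi_square(toy_counts)
print(chi2, df)  # 20.0 3
```

Converting the statistic into a p-value requires the χ² distribution with the stated degrees of freedom, for instance through `scipy.stats.chi2.sf(chi2, df)` if SciPy is available.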
`
`
(Table 3). Although the small number of tetranucleotide tagged primers tested makes statistically supported
comparisons difficult, our observations on the data indicate that overall the rate of sequence mis-assignment for
these primers was lower than for the dinucleotide tags.

Table 3. Observed and expected sequence distributions sorted by 5′ tag composition

5′ tag   Sequences from    Sequences from    Total       Expected sequence   Deviation
         forward primer    reverse primer    sequences   frequency
AA       141               115               256         274.75              −18.75
AC       179               113               292         274.75              17.25
AG       68                65                133         274.75              −141.75
AT       95                93                188         274.75              −86.75
CA       237               231               468         274.75              193.25
CC       305               326               631         274.75              356.25
CG       291               286               577         274.75              302.25
CT       263               258               521         274.75              246.25
GA       171               153               324         274.75              49.25
GC       114               93                207         274.75              −67.75
`
`37.25
`
`DISCUSSION
`Caveats
`In this study we present data describing the viability and limitations
`of a pooled-PCR based approach to GS20 sequencing. Naturally
`the specific results of this study may be dependent on the genetic
`region targeted and the PCR primers/target