throbber
NIH Public Access
`Author Manuscript
`Nat Methods. Author manuscript; available in PMC 2011 September 12.
`Published in final edited form as:
`Nat Methods. 2008 October ; 5(10): 887–893. doi:10.1038/nmeth.1251.
`
`IDENTIFICATION OF GENETIC VARIANTS USING BARCODED
`MULTIPLEXED SEQUENCING
`
`David W. Craig1,*, John V. Pearson1,*, Szabolcs Szelinger1,*, Aswin Sekar1, Redma Margot1,
`Jason J. Corneveaux1, Traci L. Pawlowski1, Trisha Laub1, Gary Nunn2, Dietrich A.
`Stephan1, Nils Homer, and Matthew J. Huentelman1
`1The Translational Genomics Research Institute, 445 N. 5th St, 5th Floor, Phoenix, AZ 85004
`2Illumina, 9885 Town Centre Drive, San Diego, CA 92121
`
`Abstract
`We developed a generalized framework for multiplexed resequencing of targeted regions of the
`human genome on the Illumina Genome Analyzer using degenerate indexed DNA sequence
`barcodes ligated to fragmented DNA prior to sequencing. Using this method, the DNA of multiple
`HapMap individuals was simultaneously sequenced at several ENCODE (ENCyclopedia of DNA
`Elements) regions. We then evaluated the use of Bayes factors for discovering and genotyping
`polymorphisms from aligned sequenced reads. If we required that predicted polymorphisms be
`either previously identified by dbSNP or be visually evident upon reinspection of archived
`ENCODE traces, we observed a false-positive rate of 11.3% using strict thresholds (Ks>1,000) for
`predicting variants and 69.6% for lax thresholds (Ks>10). Conversely, false-negative rates ranged
`from 10.8% to 90.8%, with those at stricter cut-offs occurring at lower coverage (< 10 aligned
`reads). These results suggest that >90% of genetic variants are discoverable using multiplexed
`sequencing provided sufficient coverage at the polymorphic base.
`
`Introduction
`Genome-wide association (GWA), candidate gene, and linkage studies have identified
`thousands of moderately sized genomic regions that are associated with human disease but
`where comprehensive resequencing is needed to identify the genetic variant causing the
`association. In particular, GWA studies have identified hundreds of disease-associated
`haplotypes, typically spanning 5 to 100kb1–3. A logical next step is to identify and
`resequence all genetic variants within the associated haplotype in order to identify the
`functional variants among the many non-functional evolutionarily linked neighboring
`polymorphisms. Next-generation DNA sequencing technologies are in principle well-suited
`to this task due to their capabilities for high-throughput low-cost sequencing. While these
`technologies offer massive sequencing capacity, it is still difficult, time-consuming, and/or
`expensive to resequence large numbers of samples across moderately sized genomic regions
`(5kb-1Mb).
`
`Corresponding Author: David W. Craig, Ph.D., Investigator, Neurogenomics Division, The Translational Genomics Research
`Institute, 445 North Fifth Street, Phoenix, AZ 85004, (602) 343-8747 (phone), (602) 343-8844 (fax), dcraig@tgen.org, www.tgen.org.
`*These authors contributed equally
`Author Contributions
`D.W.C., J.V.P., M.J.H., G.N. and D.A.S. contributed to initial experimental design. S.S., A.S., M.R., J.J.C., and T.L.P. contributed to
`development and execution of exact experimental protocols. J.V.P., D.W.C., and N.H. contributed to the development of
`bioinformatics and analysis pipelines.
`Competing Financial Interests
`G. Nunn has a commercial interest as an employee of Illumina.
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`Ariosa Exhibit 1007, p. 1
`
`

`

`Craig et al.
`
`Page 2
`
`Simultaneous resequencing of large numbers of individuals for a targeted region is possible
`by bar-coding or indexing the reads from each individual with a short identifying
`oligonucleotide4–7. While indexing has the obvious benefit of multiplexing samples within
`a run, DNA indexing offers two key additional advantages: direct measure of base-by-base
`error rate and reduction of array-to-array or day-to-day variability. Previous pioneering
`efforts to develop DNA indexing have shown considerable promise, however adoption is
`still in its infancy and considerable challenges remain, including the development of
`practical and cost-effective approaches for short-read platforms. Beyond these experimental
`challenges, there exist few analytical frameworks that are characterized for discovering and
`genotyping genetic variants across a targeted interval using multiplexed short-read sequence
`data from multiple individuals.
`
`In this manuscript we report an experimental and analytical approach for simultaneous
`sequencing of multiple individuals using DNA indexes on the Illumina Genome Analyzer
`(GA). We use a degenerate six-base index to evaluate optimal index size and we assess
`performance of the method by resequencing HapMap individuals across ENCODE regions
`that have previously been capillary sequenced. We develop a Bayesian analytical framework
`that leverages the inherent ability of indexing to measure error and to reflect variability in
`sequencing coverage.
`
`Results
`Experimental Design
`Our experimental protocol for indexing is summarized in figure 1 and further detailed in the
`supplementary methods. We amplified multiple 5kb regions (supplementary table 1 and 2)
`by long-range PCR, for 46 individuals genotyped by the ENCODE projects1,8. Amplicons
`were equimolar pooled for each individual, digested, blunt end-repaired, flanked by an
`adenosine overhang, and ligated to one of the 46 indexed adapters (supplementary table 3
`and 4). Following ligation, samples from all individuals were pooled into a single sample
`(referred to as an indexed library), purified, enriched by PCR, and sequenced on the Illumina
`GA on a single lane of an 8 lane flow-cell. We prepared two libraries, Library A, consisting
`of 10 5kb amplicons covering 50 kb, and Library B, consisting of 14 5kb amplicons
`covering 70kb (supplementary table 2). Library A contains both regions that were previously
`capillary sequenced and regions that were not sequenced within the ENCODE project,
`whereas Library B contains only regions previously sequenced within the ENCODE project.
`
`Index Design
`We used a six-base design, which allows us to control, tolerate, and measure error during
`base-calling of the index. The six-base index provides substantial degeneracy: only 46 of the
`4096 possible nucleotide combinations were utilized (see supplementary table 4 for
`indexes). Moreover, indexes were chosen so that 1, and in some cases 2, sequencing errors
`could be tolerated without an index being incorrectly identified as being a different valid
`index. While not implemented in this study, utilizing each of the four nucleotides within an
`index may provide for higher accuracy base-calling since each base would have to be
`correctly called at least once within a sequenced read.
`
`Using this design strategy, 48 of the 4096 possible 6-mers were synthesized and used as
`indexes for multiplexed sequencing. Perfect alignment of any index should occur at ~0.1%
`by chance. The 6th base of the index was an obligate thymidine necessary for ligation of the
`adenosine overhang. The first and fifth bases were identical to detect biases during
`normalization and calculation of the deconvolution matrix. In practice, we used 46 of the 48
`indexes to allow for plate layouts that included positive and negative controls.
`
`Nat Methods. Author manuscript; available in PMC 2011 September 12.
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`Ariosa Exhibit 1007, p. 2
`
`

`

`Craig et al.
`
`Page 3
`
`Index Performance
`Typically, 3–10 million short-read (32 or 42 base) sequences were generated for each lane of
`an 8-lane flow cell, though early sequencing runs exhibited greater variability in the number
`of sequenced reads. After filtering using Illumina analysis pipeline defaults, approximately
`45–50% of the reads remained. We observed a large spread in the number of counts per
`index (figure 2). Although a systematic reason for the initial spread in index performance
`was not identified, weaknesses in index design were obvious in some cases. For example,
`‘AAAAAT’ which was frequently read as ‘AAAAAAT’, perhaps due to an oligonucleotide
`synthesis bias. A few indexes that were not well represented were complementary to other
`sections of the adapter sequence, possibly hindering adapter formation. Resequencing the
`same library gave nearly the identical distribution of reads regardless of run performance,
`indicating that the distribution is likely not due to a post-PCR enrichment step. Furthermore,
`recreating libraries and sequencing different individuals in additional sequencing runs did
`not substantially alter performance for indexes that were substantially under-represented or
`over-represented. Of the 46 initial indexes, 19 indexes varied by less than a factor of 5
`between the most and least common index and 13 indexes varying by less than a factor of 2.
`While some of the initial index variability was consistent between sequencing runs,
`retrospective analysis of gel images suggests that a portion of the index variance may be due
`to subtle differences in DNAse digestion of pooled amplicons, whereby the number of
`available ligation targets is higher for samples that are digested with higher efficiency. In
`runs subsequent to these initial libraries (data not shown), we observed that using gel-images
`of the PCR-enriched products or qPCR, to quantify the ligated adapter prior to pooling,
`reduced index variability such that the best covered index was observed 5-fold more
`frequently than the least covered index. By comparison the same ratio was 11-fold without
`quantification of the ligated primers prior to pooling. While future studies may improve
`index variability still further, it may be effectively managed without substantially affecting
`workflow, by requiring higher average coverage within a study, by sequencing on two lanes
`with different indexes, or by sequestering samples with deficient coverage for later runs.
`
`Index-level coverage
`As shown for a subset of library A, coverage across individual 5kb amplicons was even and
`generally free of large gaps (figure 3). We did observe base-to-base variability in the
`coverage, as expected from alignment of short reads. Both between amplicons and within an
`amplicon, some deviation from the expected Poisson distribution was observed. Clearly
`amplicon-to-amplicon variability contributes to some extent to the departure from the
`expected Poisson distribution. For a given index, we observed approximately a 1.5 to 2.0
`fold difference between the amplicons with the most and fewest number of reads. Inspecting
`gel images for selected amplicons confirmed that these observed differences within regions
`were largely due to uneven pooling of amplicons. The observed amplicon-to-amplicon
`variability is likely to be due to the fact that we utilized median concentrations across the
`plate when pooling amplicons for an individual, rather than individually pipetting each
`amplicon based on its specific concentration.
`
`Comparing a given amplicon across indexes (i.e. across individuals), there is clearly some
`base-level correlation in coverage based on the positions of spikes and valleys within the
`coverage plots (figure 3). Within a single amplicon there was also departure from a Poisson
`distribution, evident from the fact that the same bases had little or no coverage across
`individuals. Indeed, there is consistency between individuals with regard to bases that are
`under or over-represented. The rank correlation coefficient between indexes at a given base
`averaged 0.408, suggesting that local sequence (or base order) accounts for slightly less than
`half of the base-to-base variability in coverage.
`
`Nat Methods. Author manuscript; available in PMC 2011 September 12.
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`Ariosa Exhibit 1007, p. 3
`
`

`

`Craig et al.
`
`Page 4
`
`Error reduction/Alignment strategy
`Depending on alignment rules, aligning a short read to a reference sequence reduces the
`sequencing error rate at the cost of limiting discovery. We aligned 35-base pair sequences,
`allowing for only a single error. We are thus essentially limited to identifying single base
`substitutions in an aligned read, while limiting error to 1/35 or 2.8% as explained below. We
`further required that two stretches of 11 or more consecutive bases match the reference
`sequence or that the read have at least one stretch of 15 consecutive matches to the reference
`sequence. In both cases, our aligner required that the final 2 bases match the reference
`sequence to insure we did not over-align an error at the final base. The rules for alignment
`were largely chosen to control error, and would falsely align a randomly generated sequence
`in less than 0.1% of alignments in a 100kb region. Given our tolerance for 1 error in
`alignment, we expect a maximum per-base error rate of 2-3% (1 error in 35bases ≈ 2.8%).
`
`Alignment of short-reads has advantages. For example, one would expect that we would
`have greater difficulty detecting closely neighboring single nucleotide polymorphisms
`(SNPs) since we mostly limit our aligner to 1 non-consecutive mismatch. However, the
`short-reads stochastically overlap and these various types of neighboring genetic variants are
`observed by alignment of multiple sequences not spanning both variants.
`
`Polymorphism discovery
`Polymorphism discovery is a primary goal for resequencing an association interval for a
`GWA study, particularly under the common variant hypothesis. Indeed, in some cases one
`may only wish to know which bases are polymorphic for custom-genotyping on a separate
`platform.
`
`We first provide an intuitive explanation of our analysis approach for polymorphism
`discovery, noting that detailed equations are provided in the methods. We utilized a Bayes
`factors to compare the probability that the distribution of mismatched bases arises from
`sequencing error to the probability that the distribution of mismatches arises from diploid
`polymorphism. For example, if 20% of reads for a given base were non-concordant with the
`reference sequence across all individuals, and the non-concordant bases were due to the
`presence of a SNP, one would expect each individual to be homozygous (0% or 100%
`concordance with reference) or heterozygous (concordance split 50/50). On the other hand,
`if the 20% non-concordant bases were due to sequencing error, then the number of non-
`concordant bases for each individual would follow a binomial distribution around 20% (e.g.
`person 1~20.5%, person 2~19.3%, person 3~20.7%, etc). As described below, the error
`estimates required to calculate the probability of a genetic variant being a true variant are
`readily obtainable when individuals are indexed and multiplex-sequenced. Further, indexed
`and multiplexed sequencing removes run-to-run biases which would confound these
`estimates if all aspects of experimental design were not properly randomized. Bayes factors
`are particularly effective for the uneven coverage inherent to short read sequencing, and
`provide a mechanism to control false positives in light of more or less evidence.
`
`Sequenced regions were analyzed base-by-base for all individuals by calculating a
`polymorphism discovery Bayes factor (defined as Ks in equation 2). An example plot of Ks
`across each base (of 50kb) is shown in figure 4 for Library A; a similar analysis was
`conducted for Library B (supplemental figure 1).
`
`We next evaluated false-positive and false negative rates to assess our experimental and
`analytical framework for variant discovery (table 1 for Library B and figure 5 for both
`Library A and B). False positives are particularly difficult to quantify since not all
`polymorphic sites are known, even in previously resequenced regions. In our analysis, to be
`defined as a false positive, a variant must not exactly match the location of variants within
`
`Nat Methods. Author manuscript; available in PMC 2011 September 12.
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`Ariosa Exhibit 1007, p. 4
`
`

`

`Craig et al.
`
`Page 5
`
`dbSNP, and must not have trace sequencing data indicating a previously missed variant. In
`some cases trace sequence data was not available or unreliable. Consequently, the false
`positive rate is expected to be an upper estimate since the exact position must be validated as
`polymorphic by an existing database. False negative rates were determined by calculating if
`a base known to be polymorphic in our library of HapMap individuals reached previously
`specified Ks thresholds. This calculation of false-negative rates does have some bias, since it
`does not take into account coverage of the polymorphic base. Figure 5 plots the dependence
`of Ks on coverage.
`
`As expected, setting a higher threshold for Ks gives fewer false positives. In table 1 for
`Library A, as Ks increases from 10 to 1,000 the false positive rate decreases from 69.6% to
`11.3%. Likewise, with fixed coverage we observe the false negative rate increasing from
`10.8% to 90.8% as Ks increases from 10 to 1,000. A more detailed discussion of false-
`negative and false-positive rates is provided in the supplementary methods. Referring to
`figure 5, all the false negatives occur when the cumulative coverage of individuals with the
`rarer variant is less than 10 reads. Further highlighting the dependence of false-negatives on
`coverage, all polymorphisms that were covered by 20+ reads (summed across individuals
`known to differ from the reference) have a Ks >1,000. Overall, we observed that 90% of
`variants were detectable, though designing for >20 reads will be essential for controlling
`false negatives.
`
`Through the course of analyzing bases with a Ks > 100 for false positives, using NCBI
`archived ENCODE traces, new SNPs were discovered that were evident in visual
`reinspection of capillary traces, but that had not been annotated in dbSNP (figure 4f–h).
`These examples demonstrate that index-based resequencing can identify novel variants even
`in heavily sequenced and heavily annotated regions. Within Library B, it is intriguing to
`note that two variants with a Ks>100 were not SNPs but actually insertions (rs11279266 is a
`1bp insertion and rs10555419 is a 6bp insertion). Thus it is possible to identify genetic
`variants explicitly not allowed within the alignment scheme.
`
`Genotyping individuals at known polymorphisms
`Since false negatives are clearly tied to coverage, we explored the influence of coverage
`further by analyzing the above regions in an individual-by-individual analysis. Derived in
`Equation 3 within the methods, Ki is an analogous Bayes-factor for an individual having the
`rarer allele at a known polymorphic base. Conceptually, it can be thought of as a specific
`individual’s contribution to Ks. Shown in high granularity (figure 5c), we calculate the
`percentage of variants correctly identified in an individual given a certain number of reads.
`For example, when the coverage for a base was ~20 reads (averaging from 16 to 24), we
`detected >80–90% of the bases at Ki>10, with a false-positive rate of 1.6%. In comparison
`to polymorphism discovery, the low false-positive rates of genotyping at a known
`polymorphic base are due to the fact that we are no longer assessing thousands of bases for a
`rare event, but rather assessing a few dozen individuals for a more frequent event.
`
`Discussion
`Our experience suggests that achieving adequate coverage is one of the most important
`factors in the design of a multiplexed targeted resequencing experiment. Depending on
`assumptions made within the experiment, the desired coverage (and as a consequence, the
`cost) can vary substantially. Key considerations include whether the objective is (1)
`discovering genetic variants for genotyping by a separate method such as custom SNP
`genotyping, (2) conducting polymorphism discovery and variant calling within one
`sequencing experiment, and/or (3) exhaustively resequencing for all common and rare
`variants.
`
`Nat Methods. Author manuscript; available in PMC 2011 September 12.
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`Ariosa Exhibit 1007, p. 5
`
`

`

`Craig et al.
`
`Page 6
`
`Exhaustive polymorphism discovery is the next major phase for GWA studies. Shown in
`Figure 4, indexing of short-reads is surprisingly robust at polymorphism identification. For
`example, even when a highly restrictive error alignment scheme was used, we were able to
`identify a novel coding SNP three bases from an annotated SNP. Additionally, it is
`encouraging to see the discovery of insertions using an alignment scheme only allowing
`substitutions. Finally, it is a highly encouraging that an automated analysis strategy for
`short-read data can uncover novel variants, even in regions that were previously sequenced
`for variant discovery using these HapMap individuals.
`
`Based on the false-positive and false-negative rates, a critical factor will be how to balance
`coverage and cost. In table 1, utilizing a threshold of Ks > 1,000, we observe a low false-
`positive rate (11.3%) and a high false-negative rate (90.8%). Since we required that the
`exact base must be polymorphic in an existing database, the actual false-positive rate may be
`lower. As evident in figure 5, false negatives are due to the additional coverage (>10 reads)
`is required for overcoming higher Ks thresholds. Considering the substantial base-to-base
`variability in figure 3, one would not simply want to design for an average coverage of 10
`reads. Rather, by designing for ~50 reads or more, one may minimize both the false negative
`and false-positive rates, given coverage variability of short-read sequencing.
`
`While whole-genome sequencing may be the primary motivator for improvements in
`sequencing technology, it is clear that next-generation technologies are immediately useful
`for focused hypothesis-driven sequencing of linkage peaks, groupings of candidate genes, or
`sequencing the entire known coding sequence of the human genome. In this report, we
`developed per-individual indexing of pooled PCR amplicons to carry out targeted
`sequencing. However, it is straightforward to envision using other sample preparation
`methods, such as genome partitioning 9–12. One could replace the pooled amplicons from
`our experimental outline (figure 1) with total genomic DNA and complete the partitioning
`by a hybridization approach after pooling ligated amplicons. Indeed, variation discovery
`through the resequencing of all candidate regions implicated in a disease across dozens, and
`possibly hundreds, of individuals could be significantly accelerated by merging multiplex
`capture, indexing, and next-generation sequencing approaches into a single protocol.
`
`Methods
`Amplification and pre-ligation sample preparation
`Two primary amplicon libraries (Library A and B, specific targeted regions listed in
`supplementary table 2) were constructed from individually amplified 5kb regions using
`long-range PCR. Regions composing a library were chosen from the ENCODE project and
`selected to provide a sampling of different genomic region types. Only a portion of the
`overall ENCODE regions have been previously sequenced as part of the SNP discovery
`portion of the ENCODE project. Library A was a composite of previously sequenced
`regions and regions not sequenced in ENCODE. Library B was entirely composed of
`regions that were previously sequenced. Flanking primers for each amplicon (supplementary
`table 5) were manually selected. While we did not screen for the existence of known
`polymorphisms within the primer sequences, such effort would be advisable in future
`efforts. A detailed description of region amplification, amplicon pooling, fragmentation,
`preparation of 48 adapters, end-repair, ligation, PCR enrichment, cluster generation and
`sequencing on the Illumina Genome Analyzer are provided in the supplementary methods.
`
`Analysis – Base calling and alignment
`Illumina GA images were analyzed using a modified Illumina GA (0.2.2.6) processing
`pipeline. Descriptions of the modifications and all scripts are available at
`
`Nat Methods. Author manuscript; available in PMC 2011 September 12.
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`Ariosa Exhibit 1007, p. 6
`
`

`

`Craig et al.
`
`Page 7
`
`bioinformatics.tgen.org. A default matrix deconvolution file was used for base-calling by
`‘Bustard’ based on a control phi-X library provided by Illumina. Following base-calling a
`script was used to access all sequences, regardless of quality score. We deviated from the
`default cut-offs provided by Illumina’s ‘GERALD’ process since it was found that sequence
`quality was better controlled by matching to an index (46/4096 or 0.1% by chance) or
`matching with 1 or fewer errors to the reference sequence. Bases were aligned by
`progressively truncating the sequence at the 5’ end until a unique alignment was obtained
`with a probability of stochastic alignment of less than 1%. This approach is distinct from
`recently described short-read alignment schemes13, but clearly has some of the same
`features.
`
`Aligned sequences were summarized into a binomial of counts agreeing (ai) or disagreeing
`(bi) with the reference sequence for the ith individual of n total individuals for each base (s).
`The error rate (θs) is the percentage of reads disagreeing with the reference sequence across
`all samples. Model 1 (M1) assumes that the error rate for an individual equals the error rate
`for all individuals, or θi = θs. Therefore we can estimate θs as:
`
`Equation 1
`
`Model 2 (M2) assumes that for some individual i we have θi ≠ θs. To calculate this
`likelihood we use a hyperprior offset (σs) for three possible genotypes. The hyperprior can
`be thought of as conditioning on the zygosity of the individual and thus reflecting the
`uncertainty of the genotype of the individual at a given base. In this analysis we focus on the
`detection of biallelic SNPs but triallelic or other types of SNPs could be considered under
`more complex models. The Bayes factor for a base position is:
`
`Equation 2
`
`In equation 2, Ks is the Bayes factors across all individuals and is calculated for each SNP.
`The value for Ks is effectively identifying polymorphisms at a base, and determination of
`the individuals that show the rare variant is accomplished by Ki:
`
`Equation 3
`
`The analysis for rare variants in the above equation uses the prior probability of p(σs)
`derived from the distribution across all individuals and all bases initially assuming all other
`individuals are homozygous for the reference allele, which is a reasonable assumption given
`that novel variants are assumed to be rare. By successive iterations, p(σs) is then
`recalculated by calling variants AA, AB, and BB under a given threshold (e.g.
`Ki={3,10,100}), and recalculating p(σs) based on these variant calls. Plots of Bayes factors
`vs. error rates are provided in supplementary figure 2 to show distribution of Bayes factors
`and their correlation with error rate.
`
`Nat Methods. Author manuscript; available in PMC 2011 September 12.
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`Ariosa Exhibit 1007, p. 7
`
`

`

`Craig et al.
`
`Page 8
`
`If a SNP is known it is also reasonable to use other prior information at a base, such as allele
`frequency in the population. However, prior information on the location and allele frequency
`of known SNPs was not used in this study in order to better evaluate the effectiveness of the
`underlying framework.
`
`Supplementary Material
`Refer to Web version on PubMed Central for supplementary material.
`
`References
`1. Frazer KA, Ballinger DG, Cox DR, et al. Nature. 2007; 449(7164):851. [PubMed: 17943122]
`2. Nature. 2007; 447(7145):661. [PubMed: 17554300]
`3. Zondervan KT, Cardon LR. Nat Protoc. 2007; 2(10):2492. [PubMed: 17947991]
`4. Meyer M, Stenzel U, Myles S, et al. Nucleic Acids Res. 2007; 35(15):e97. [PubMed: 17670798]
`5. Parameswaran P, Jalili R, Tao L, et al. Nucleic Acids Res. 2007; 35(19):e130. [PubMed: 17932070]
`6. Milosavljevic A, Harris RA, Sodergren EJ, et al. Genome Res. 2005; 15(2):292. [PubMed:
`15687293]
`7. Hamady M, Walker JJ, Harris JK, et al. Nat Methods. 2008; 5(3):235. [PubMed: 18264105]
`8. Birney E, Stamatoyannopoulos JA, Dutta A, et al. Nature. 2007; 447:7146–799.
`9. Albert TJ, Molla MN, Muzny DM, et al. Nat Methods. 2007; 4(11):903. [PubMed: 17934467]
`10. Hodges E, Xuan Z, Balija V, et al. Nat Genet. 2007; 39(12):1522. [PubMed: 17982454]
`11. Porreca GJ, Zhang K, Li JB, et al. Nat Methods. 2007; 4(11):931. [PubMed: 17934468]
`12. Okou DT, Steinberg KM, Middle C, et al. Nat Methods. 2007; 4(11):907. [PubMed: 17934469]
`13. Jeck WR, Reinhardt JA, Baltrus DA, et al. Bioinformatics. 2007; 23(21):2942. [PubMed:
`17893086]
`
`Acknowledgments
`We wish to acknowledge funding from the state of Arizona, National Heart Lung and Blood Institute (U01
`HL086528) and the Stardust foundation.
`
`Nat Methods. Author manuscript; available in PMC 2011 September 12.
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`Ariosa Exhibit 1007, p. 8
`
`

`

`Craig et al.
`
`Page 9
`
`Figure 1.
`Schematic describing the preparation of indexed libraries. The red box indicates the
`indexing step, where for each person a unique indexed adapter was ligated to the fragmented
`genomic DNA.
`
`Nat Methods. Author manuscript; available in PMC 2011 September 12.
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`Ariosa Exhibit 1007, p. 9
`
`

`

`Craig et al.
`
`Page 10
`
`Figure 2. Comparison of index performance
`Index variability in initial sequencing runs (Library A) used for evaluating index
`performance are shown (top graph). Percentages of reads aligning to the reference sequence
`are listed by index, without introduction of normalization methods. A total of 30 indexes
`were present in >0.05% of all aligned reads. Highlighted in the blue box are 19 indexes with
`less than 5 fold difference in index frequencies, used in subsequence studies. Indexes
`matching with 0 errors are in blue bars and indexes with 1 error are in magenta bars. The
`bottom graph shows the location of errors by base, for each index.
`
`Nat Methods. Author manuscript; available in PMC 2011 September 12.
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`Ariosa Exhibit 1007, p. 10
`
`

`

`Craig et al.
`
`Page 11
`
`Figure 3. Relationship between mean and local coverage
`Example coverage of 4 individuals sequenced within a single line of an 8-lane flow-cell for
`10 pooled amplicons as part of Library A. Amplicons are shown consecutively for each
`individual by the alternating shaded background. Index sequence and mean coverage for that
`individual are shown above each graph. The maximum and minimum coverage is shown for
`each amplicon in the top of the graph. Overlaying pie charts show the observed distribution
`of bases across all amplicons and the expected distribution determined from a Poisson
`distribution of the mean coverage, binned by 0 reads, 1–4 reads, 5–9 reads, 10–19 reads, and
`>20 reads.
`
`Nat Methods. Author manuscript; available in PMC 2011 September 12.
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`Ariosa Exhibit 1007, p. 11
`
`

`

`Craig et al.
`
`Page 12
`
`Figure 4. Discovery of variant bases by simultaneous analysis of all individuals
`(a.) The Bayes-factor for polymorphism discovery(Ks) is plotted for each of the10
`sequenced 5kb amplicons from Library A. Exact positions matching known polymorphisms
`are colored as red spheres and the dbSNP identifier is provided for the most significant
`SNPs. Black bars at top indicate locations of documented SNPs. A magnified view of
`amplicon 1 (b.) and amplicon 6 (c.) is provided to compare variants predicted by indexed-
`multiplexed sequencing to previous deep capillary sequencing results for the same
`individuals as part of the ENCODE project. (d–e.) Examples of false-positives arising from
`sequence homology to elsewhere in the genome. (f–i.) Examples of sequence traces
`
`Nat Methods. Author manuscript; available in PMC 2011 September 12.
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`Ariosa Exhibit 1007, p. 12
`
`

`

`Craig et al.
`
`Page 13
`
`validating the discovery of novel SNPs not previously annotated in ENCODE capillary
`sequencing traces. Similar analysis was conducted on Library B (shown in the
`supplementary figure 1).
`
`Nat Methods. Author manuscript; available in PMC 2011 September 12.
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`NIH-PA Author Manuscript
`
`Ariosa Exhibit 1007, p. 13
`
`

`

`Craig et al.
`
`Page 14
`
`Figure 5. Relationship between base-level coverage and Bayes-factor for polymorphism
`discovery and

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket