Method
`Systematic comparison of three genomic enrichment
`methods for massively parallel DNA sequencing
`Jamie K. Teer,1,3 Lori L. Bonnycastle,1,3 Peter S. Chines,1 Nancy F. Hansen,1
`Natsuyo Aoyama,2 Amy J. Swift,1 Hatice Ozel Abaan,1 Thomas J. Albert,2
`NISC Comparative Sequencing Program,1 Elliott H. Margulies,1 Eric D. Green,1
`Francis S. Collins,1,4 James C. Mullikin,1,4 and Leslie G. Biesecker1,4
`1National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA; 2Roche NimbleGen
`Inc., Madison, Wisconsin 53719, USA
`
`Massively parallel DNA sequencing technologies have greatly increased our ability to generate large amounts of se-
`quencing data at a rapid pace. Several methods have been developed to enrich for genomic regions of interest for targeted
`sequencing. We have compared three of these methods: Molecular Inversion Probes (MIP), Solution Hybrid Selection
`(SHS), and Microarray-based Genomic Selection (MGS). Using HapMap DNA samples, we compared each of these
`methods with respect to their ability to capture an identical set of exons and evolutionarily conserved regions associated
`with 528 genes (2.61 Mb). For sequence analysis, we developed and used a novel Bayesian genotype-assigning algorithm,
`Most Probable Genotype (MPG). All three capture methods were effective, but sensitivities (percentage of targeted bases
`associated with high-quality genotypes) varied for an equivalent amount of pass-filtered sequence: for example, 70%
`(MIP), 84% (SHS), and 91% (MGS) for 400 Mb. In contrast, all methods yielded similar accuracies of >99.84% when
`compared to Infinium 1M SNP BeadChip-derived genotypes and >99.998% when compared to 30-fold coverage whole-
`genome shotgun sequencing data. We also observed a low false-positive rate with all three methods; of the heterozygous
`positions identified by each of the capture methods, >99.57% agreed with 1M SNP BeadChip, and >98.840% agreed with
`the whole-genome shotgun data. In addition, we successfully piloted the genomic enrichment of a set of 12 pooled samples
`via the MGS method using molecular bar codes. We find that these three genomic enrichment methods are highly
`accurate and practical, with sensitivities comparable to that of 30-fold coverage whole-genome shotgun data.
`
`[Supplemental material is available online at http://www.genome.org. The sequence data from this study have been
`submitted to the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) under accession no.
`SRA022076. Bam2mpg is freely available at http://research.nhgri.nih.gov/software/bam2mpg.]
`
`The ability to identify genetic alterations underlying disease and
`phenotypic variation is a major goal of the ongoing genomics
`revolution. Large-scale medical sequencing holds the promise of
`elucidating the genetic architecture of virtually all diseases, in-
`cluding the relative role of rare and common genetic variants. Such
`information should inform our understanding of the pathways
`involved in disease pathogenesis. Additionally, the ability to apply
`variant discovery to other organisms with annotated genomes is
`also of great value. So-called next-generation DNA sequencing
`technologies that involve massively parallel clonal ensemble se-
`quencing are creating new opportunities for comprehensive ge-
`nomic interrogation, bypassing some of the limitations associated
`with high-throughput electrophoresis-based sequencing methods
`(Margulies et al. 2005; Shendure et al. 2005; Bentley et al. 2008).
`Although advances in these new technologies are being made at
`a rapid pace, the cost of sequencing a whole human genome and
`the associated costs and technical hurdles of data storage and
analysis remain high for most large projects. As a result, methods
are needed for efficient comprehensive sequencing of targeted
genomic regions.

3These authors contributed equally to this work.
4Corresponding authors.
E-mail collinsf@mail.nih.gov; fax (301) 402-2218.
E-mail mullikin@mail.nih.gov; fax (301) 480-0634.
E-mail leslieb@helix.nih.gov; fax (301) 402-2170.
Article published online before print. Article and publication date are at
http://www.genome.org/cgi/doi/10.1101/gr.106716.110.

`A number of genomic enrichment (or targeted capture)
`methods have been developed and used with varying levels of
`success (for reviews, see Garber 2008; Summerer 2009; Turner et al.
`2009b; Mamanova et al. 2010). Not surprisingly, each method has
`both common and unique issues related to the required input
`DNA, specificity and coverage, sensitivity and accuracy, scalability
`and potential for automation, and cost-effectiveness. Although
`multiple publications describe the individual methods (Albert et al.
`2007; Hodges et al. 2007; Okou et al. 2007; Porreca et al. 2007;
`Craig et al. 2008; Krishnakumar et al. 2008; Bau et al. 2009; Gnirke
`et al. 2009; Herman et al. 2009; Hodges et al. 2009; Li et al. 2009;
`Summerer et al. 2009; Tewhey et al. 2009), numerous variables are
`inherent to each experiment, including the nature of the targeted
`genomic region(s), sequencing platform, analysis software, and
`performance metrics; these variables make objective comparisons
`of the different methods challenging to perform.
`Here, we report the testing, optimizing, and rigorous com-
`paring of three genomic enrichment methods: Molecular Inver-
`sion Probes (MIP) (Porreca et al. 2007), Solution Hybrid Selection
`(SHS; Agilent) (Gnirke et al. 2009), and Microarray-based Genomic
`Selection (MGS; Roche-NimbleGen) (Albert et al. 2007; Okou et al.
`2007). All three methods were tested for their ability to capture the
same 2.61 Mb of noncontiguous DNA sequence from an overlapping
set of two HapMap DNA samples. We also report, for the first time,
a novel Bayesian genotype-assigning algorithm, Most Probable
Genotype (MPG), which was used to analyze the sequence data from
all three capture methods. Furthermore, we introduce a pre-capture
bar-coding strategy to allow pooling of samples for capture and
sequencing. In contrast to previous reviews of genomic enrichment
methods, we have directly compared these methods by evaluating
genotype sensitivity and accuracy of variant detection, providing
insights about overall performance and relative costs.

1420 Genome Research
www.genome.org
20:1420–1431; ISSN 1088-9051/10; www.genome.org

PGDX EX. 1018
Page 1 of 12
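To make concrete what a Bayesian genotype assigner computes, the following is a minimal generic sketch, not the MPG implementation. It assumes a simple binomial read model with a fixed per-base error rate and illustrative priors; the function names, the error rate, the priors, and the definition of the score as the log-probability gap between the best and second-best genotype are all assumptions made for illustration. Only the acceptance threshold of 10 is taken from the Results.

```python
import math

def genotype_log_posteriors(ref_count, alt_count, error_rate=0.01, het_prior=0.001):
    """Unnormalized log posteriors for three diploid genotypes.

    Assumed model (illustrative only): each read shows the alternate
    allele with probability error_rate (hom_ref), 0.5 (het) or
    1 - error_rate (hom_alt). The binomial coefficient is omitted
    because it is identical for all three genotypes.
    """
    priors = {
        "hom_ref": 1.0 - 1.5 * het_prior,
        "het": het_prior,
        "hom_alt": het_prior / 2.0,
    }
    p_alt = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1.0 - error_rate}
    log_post = {}
    for genotype, prior in priors.items():
        p = p_alt[genotype]
        log_post[genotype] = (
            math.log(prior)
            + alt_count * math.log(p)
            + ref_count * math.log(1.0 - p)
        )
    return log_post

def call_genotype(ref_count, alt_count, min_score=10.0):
    """Return (genotype or None, score).

    The score is the natural-log gap between the most probable and the
    second most probable genotype (an assumed definition); calls below
    min_score are left unassigned.
    """
    log_post = genotype_log_posteriors(ref_count, alt_count)
    ranked = sorted(log_post, key=log_post.get, reverse=True)
    score = log_post[ranked[0]] - log_post[ranked[1]]
    return (ranked[0] if score >= min_score else None), score
```

Under these toy parameters, a position with very few reads does not reach the threshold and stays unassigned, which is the behavior that ties genotype sensitivity to depth of coverage.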
`
`Results
`
`Both of the solution-based methods—MIP (Fig. 1A) and SHS (Fig.
`1B)—are highly multiplexed capture methods that can enrich for
`a large number of genomic regions. We have incorporated pub-
`lished improvements to the original MIP-based method (Li et al.
`2009; Turner et al. 2009a), including improvements to facilitate
`automated and effective probe design; accuracy of the MIP-based
`method has also been improved by segregating overlapping probes
`to two separate probe sets. The 384K MGS array (Fig. 1C) also al-
`lows for highly multiplexed capture of genomic regions in a single
`hybridization but uses a solid-phase DNA-oligonucleotide hy-
`bridization step; we adapted the original MGS-based protocol
`(Albert et al. 2007) to increase capture efficiency and to allow for
sample pooling to reduce costs. Considerations specific to individual
methods led us to modify or adapt them for this study; these are
described first, followed by the overall results.
`
`Molecular Inversion Probes (MIPs)
`
`In the process of implementing an MIP-based genomic enrich-
`ment method, we found that our sequence data included the probe
`arm sequences that were designed to hybridize to DNA flanking
`the region of interest (ROI) (data not shown). This occurred when
`probes were required to overlap each other in order to capture a
`long region. Because these arm sequences of the MIP probes were
`sequenced and aligned, they could falsely increase the proportion
`of the reference allele and yield false-negative results. In addition
`to the probe arm sequences, some MIP backbone sequence (non-
`human) was also present and aligned to the human reference se-
`quence with several mismatches, which generated significant
`numbers of false-positive variants.
`One approach for avoiding such artifacts involves end-se-
`quencing the captured DNA fragments, thus avoiding shotgun li-
`brary preparation and the inclusion of probe backbone sequence.
`The increasing capabilities of sequencers to generate longer read
`lengths make this approach more feasible, which may improve
`alignment rates and provide better coverage and higher sensitiv-
`ities (Turner et al. 2009a). We tested this strategy by designing
`probes to target ROI fragments of about 105 bases with 76-base
`end-sequences; however, only 55% of the ROI bases could be
`covered by this design, which we concluded was unacceptable.
`We avoided these artifacts by dividing the capture MIPs into
`two probe sets. Although the two probe sets overlapped each other,
`the probes within each set were nonoverlapping. Within a non-
`overlapping set, the targeting arm and probe backbone sequences
`are distinguishable from the captured DNA and thus can be ex-
`cluded following alignment. For the entire capture-to-sequencing
`process, each of the two probe sets was treated independently.
`After sequencing and alignment, the targeting arm and backbone
`bases were computationally removed, and the data were combined.
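The computational removal of targeting-arm bases can be sketched as follows. This is a minimal illustration, assuming gapless alignments and 0-based, half-open reference coordinates for the probe arms; the function name and the interval representation are hypothetical and not taken from the study's pipeline.

```python
def mask_probe_bases(read_start, read_seq, arm_intervals):
    """Replace read bases that fall inside probe-arm intervals with 'N'.

    read_start    -- reference position of the first read base (0-based)
    read_seq      -- aligned read sequence (gapless alignment assumed)
    arm_intervals -- list of (start, end) reference intervals, half-open,
                     covered by the targeting arms of the probe set
    """
    masked = []
    for offset, base in enumerate(read_seq):
        pos = read_start + offset
        in_arm = any(start <= pos < end for start, end in arm_intervals)
        masked.append("N" if in_arm else base)
    return "".join(masked)

# Example: arms cover reference positions 100-102 and 108-119.
print(mask_probe_bases(100, "ACGTACGTAC", [(100, 103), (108, 120)]))
```

Because the probes within one set do not overlap, every masked base is known to be probe-derived; masking within a single set therefore cannot discard genuinely captured sequence, which is why each set is processed separately before the data are combined.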
`
`Genomic enrichment methods comparison
`
`Figure 1. Three genomic enrichment methods. (A) (MIP) Molecular
`Inversion Probe: 70 base probes are prepared and hybridized to genomic
`DNA. Capture occurs by filling in sequence between the probe-targeting
`arms with polymerase and then sealing the circle with ligase. Total ge-
`nomic DNA is removed with nucleases. The remaining closed circles un-
`dergo shotgun library and sequencing library preparation, followed by
`sequencing. (B) (SHS) Solution Hybrid Selection: A sequencing library is
`prepared from genomic DNA. This library is hybridized to biotinylated
`RNA probes in solution and recovered with streptavidin beads. Eluted
products are amplified prior to cluster generation and sequencing. (C)
`(MGS) Microarray-based Genomic Selection: A sequencing library is pre-
`pared from genomic DNA and hybridized to a capture array. Eluted
`products are amplified prior to cluster generation and sequencing.
`
`

`

`Teer et al.
`
`As this strategy required the separate sequencing of the DNA
`captured from each probe set, we needed two lanes, which gener-
`ated twice as much sequence data for the MIP method compared
to the other two; we used this full amount of sequence (~1.1 Gb)
in our comparison analyses throughout this study. Future application
of this strategy could include indexing of each probe set
`cation of this strategy could include indexing of each probe set
`to allow merging of the individually captured DNA from each set,
`eliminating the need for two separate sequencing lanes (and the
`extra sequence) for one sample.
`
`Solution Hybrid Selection (SHS)
`
`Following the initial description of SHS (Gnirke et al. 2009),
`Agilent Technologies released a commercial version (SureSelect).
`Due to improvements communicated by the vendor, we chose to
`implement SHS using these kits and performed the captures as
`described by the manufacturer.
`
`MGS and pooled bar-coded samples
`
`Considering the high cost of MGS (Roche NimbleGen) arrays and
`the increasing single-lane sequencing capacity of next-generation
`sequencing platforms, we developed a novel strategy that enables
`the efficient capture of a pool of 12 individually bar-coded Illu-
`mina paired-end (PE) libraries. In contrast to most methods used
`for bar-coding samples for pool sequencing, this strategy is opti-
`mized for use in targeted hybrid capture, followed by sequencing.
`Specifically, we made a 6-base indexed paired-end library for each
`of 12 genomic DNA samples (including the same two HapMap
`samples as were used in the MIP and SHS methods) and simulta-
`neously performed MGS of all 12 samples with one array. The in-
`dex served as a bar code and provided individual-specific sequence
`data for the array-enriched targets after sequencing. Reads were
`assigned to each bar code if the index sequence had no more than
`one mismatch. Of the reads that passed the quality filter (Illumina
GERALD chastity filter), we were able to assign ~99% to a unique
`sample using this indexing strategy. A narrow distribution of se-
`quencing depth across samples, or sample uniformity, is critical
`when sequencing bar-coded pools to avoid over-representation
`
Table 1. Indexed samples pooled for MGS capture and sequencing
`
`of some samples at the expense of others. Poor sample unifor-
`mity would require additional sequencing to provide adequate
`depth of coverage of the samples with lower representation. Per-
`fect sample uniformity would yield 8.3% of filtered reads origi-
`nating from each of the 12 samples. The difference between the
`bar-coded samples with the highest (11.4%) and lowest (5.0%)
`number of filtered reads was slightly over twofold (Table 1). We
`conclude that this level of sample uniformity is acceptable for
`many potential applications of this target enrichment method.
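The read-assignment rule and the uniformity check described above can be sketched as follows. This is a minimal illustration: the 6-base index length and the one-mismatch tolerance come from the text, while the barcode sequences, sample names, and function names are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_sample(read, barcodes, index_len=6, max_mismatch=1):
    """Return the sample whose barcode matches the read's leading index
    with at most max_mismatch mismatches, or None if zero or several match."""
    index = read[:index_len]
    hits = [name for name, bc in barcodes.items()
            if hamming(index, bc) <= max_mismatch]
    return hits[0] if len(hits) == 1 else None

def read_shares(reads, barcodes):
    """Fraction of reads assigned to each sample (None = unassigned)."""
    counts = Counter(assign_sample(r, barcodes) for r in reads)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

# Hypothetical barcodes for two samples.
barcodes = {"sample_1": "ACGTAC", "sample_2": "TGCATG"}
reads = ["ACGTACGGGG", "ACGTAAGGGG", "TGCATGCCCC", "AAAAAAGGGG"]
print(read_shares(reads, barcodes))
```

With 12 samples, perfect uniformity corresponds to a share of 1/12 (about 8.3%) per sample; the reported extremes of 11.4% and 5.0% give a ratio of 11.4 / 5.0, or about 2.3.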
`Another important aspect of sequencing pooled samples is
`to have each ROI base covered to a similar level across all samples
`in the pool. This allows for consistent interrogation of the same
`ROI bases for all samples and facilitates follow-up attempts to
`re-sequence only the smaller subset of missed regions. We found
`that samples were fairly uniform with respect to well-covered and
`poorly covered bases. Approximately 78% of the ROI bases had
good coverage (≥20-fold redundancy in all 12 samples), 20% had
`variable coverage (one or more of the indexed samples had one- to
`19-fold coverage), and 2% had no coverage in any sample. Bases
`with no coverage were mainly those for which probes could not
`be designed.
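The three coverage categories above can be computed per ROI base from a depth matrix (one row per base, one column per pooled sample). The sketch below is illustrative only; the function names are hypothetical, and it assumes that any base that is neither "good" in all samples nor uncovered in all samples counts as "variable", which is consistent with the three percentages summing to 100%.

```python
from collections import Counter

def classify_base(depths, good_depth=20):
    """Classify one ROI base from its depth in each pooled sample."""
    if all(d >= good_depth for d in depths):
        return "good"       # at least good_depth-fold in every sample
    if all(d == 0 for d in depths):
        return "none"       # no coverage in any sample
    return "variable"       # assumed: everything in between

def summarize(depth_matrix, good_depth=20):
    """Fraction of ROI bases in each category."""
    counts = Counter(classify_base(row, good_depth) for row in depth_matrix)
    total = len(depth_matrix)
    return {label: counts[label] / total
            for label in ("good", "variable", "none")}

# Toy example: four ROI bases across three samples.
matrix = [[20, 35, 50], [25, 5, 40], [0, 0, 0], [30, 30, 30]]
print(summarize(matrix))
```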
`
`Input DNA requirements
`
`Sequencing projects often involve analyzing samples with limited
DNA resources. We performed MIP capture with 1 μg of input
DNA, the lowest amount of the three methods. The SHS and MGS
protocols required 3 μg and 4 μg of DNA, respectively. Of note, the
modified MGS protocol described here required 4 μg of starting
DNA to generate a library suitable for capture and subsequent
sequencing on the Illumina GAII platform. Although this reflects an
80% reduction in the amount of input DNA recommended by the
Roche/NimbleGen MGS protocol (4 vs. 20 μg), we believe that an
even smaller amount of DNA may work as well. We started by
shearing 4 μg of input DNA to account for the variability and
inaccuracies of measured DNA concentration values; the subsequent
library preparation steps (end repair and adapter ligation)
proceeded with only 2 μg of sheared DNA. Thus, to further reduce
input DNA requirement, more attention could be placed on
re-quantifying DNA with more reliable, but more time-consuming,
methods.

Table 1.

            Pass filter                 Aligned             Aligned to ROI      Coverage depth (% of ROI bases)
Index       Reads   % of all  Sequence  % of all  Sequence  % of all  Sequence  1×     10×    20×
            (M)     reads     (Mb)      reads     (Mb)      reads     (Mb)
1           17.1    11.2      720       11.4      560       12.3      341       96.6   94.3   91.9
2           15.7    10.2      658       10.2      503       9.3       258       96.5   93.7   90.6
3           14.3    9.3       601       9.4       464       9.7       268       96.5   93.7   90.5
4           17.5    11.4      737       11.4      562       10.5      290       96.5   93.7   90.4
5           13.7    8.9       577       8.9       441       8.6       238       96.3   93.4   89.9
6           13.3    8.6       558       8.7       431       9.1       251       96.4   93.5   89.9
7           12.5    8.2       526       8.2       404       8.3       230       96.3   93.2   89.5
8           11.2    7.3       469       7.3       360       7.2       200       96.3   92.7   88.2
9           9.8     6.4       411       6.4       318       7.0       193       96.2   92.3   87.2
10          10.1    6.6       423       6.6       323       6.3       175       96.2   92.1   86.6
11          9.7     6.3       409       6.4       315       6.5       181       96.0   91.9   86.2
12          7.6     5.0       321       5.0       248       5.3       146       96.1   91.2   83.8
Unassigned  1.0     0.7       44        0.6       29        0.7       19
All         153.6             6453                4961                2789
Average     12.7              534                 411                 231

Both 36 and 51 base reads were generated for MGS. Thirty-six bases of each MGS 51 base read were used. Total Illumina chastity filtered sequence counts
for MGS include the 6-base index at the start of each read.
`
`Comparison of methods
`
`Our regions of interest comprised 2.61 Mb of noncontiguous hu-
`man genome sequence consisting of exons and conserved ele-
`ments associated with 528 genes (see Methods). Successful probe
`designs were achieved for 95.9% (2.506/2.612 Mb) of the ROI bases
`for the MIP, 92.6% (2.419/2.612 Mb) for the SHS, and 93.8%
`(2.450/2.612 Mb) for the MGS methods. Since sequencing projects
`aim to interrogate the entirety of the ROI, we assessed the perfor-
`mance metrics based on the entire 2.61 Mb of ROI rather than only
`on the regions for which suitable probes could be designed. Fur-
`thermore, analysis of the ROI across all three methods allowed
`us to make more appropriate comparisons for this study.
`We performed targeted enrichment and sequencing on mul-
`tiple samples for each of the methods. Although the majority of
`the samples from the MIP and SHS methods were distinct from
`those of the indexed samples pooled for the MGS capture, there
`were two HapMap samples in common across all three methods
`(NA18507 and NA12878). The amount of sequence generated for
`each of the indexed samples was comparable to that for the single
`sample experiments (Table 2). The data presented here compare
only these two samples unless otherwise noted (i.e., "extended
sample sets" refers to all samples).
`
`Fraction of sequences that align to ROI
`
`The first metric examined was the fraction of the captured DNA
`that aligned to the ROI. This metric reflects the ability of the
`method to enrich for appropriate targets and greatly influences the
`cost (as more sequencing is required for a lower aligned fraction).
`Approximately 10 to 14 million filtered reads were obtained for
`each of the two samples captured with either the SHS or MGS
`methods; in contrast, about 30 million filtered reads were obtained
`with the MIP method due to the need to generate two non-
`overlapping captures (see above) (Table 2). We used ELAND (Illu-
`mina, Inc.) to align the sequence reads to the reference human
`genome and then calculated depth of coverage for our ROI. The
`sequencing was performed at different times during the project,
`and the capability of the sequencer evolved during this time pe-
`riod. Consequently, the amount of generated sequence varied for
`a given number of reads among the methods. To equalize the
`comparisons, we only used a maximum of 36 bases of each se-
`quence read when calculating the various metrics in this analysis.
`
`Table 2. Genotype sensitivity
`
`
`Furthermore, the total filtered sequence data count for MGS in-
`cluded the 6-base index at the beginning of each sequence read,
`accounting for the extra nontarget sequence. Although the pro-
`portion of filtered reads that aligned uniquely to the genome
`varied (76%–90%), the amount that aligned uniquely to the ROI
`was more consistent (52%–59%) for all methods and samples.
`A smaller fraction of MIP-generated reads aligned to the ge-
`nome as compared to SHS- or MGS-generated reads. Much of this
`difference was due to the MIP probe-backbone sequence, which
`is still in the library at this stage; it can only be computationally
`removed once the alignment is completed, and the captured and
`designed/backbone bases can be differentiated. As the backbone
`sequence is not human, it added many mismatches to a given
`alignment and prevented these reads from aligning to the genome.
`Even though a smaller fraction of MIP-generated reads aligned to the
`genome, a higher fraction of the genome-aligned reads (compared
`to the other two methods) overlapped the ROI, resulting in a total
`fraction of ROI-aligning reads that was similar to that seen with the
`other methods.
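The ROI-aligned percentages in Table 2 appear to be reproducible from the tabulated read counts once each read is limited to the 36 bases actually used. The sketch below is a hedged reconstruction of that arithmetic, not the authors' script: the function name is hypothetical, and the 42-base MGS read length (36 used bases plus the 6-base index) is inferred from the table notes.

```python
def roi_fraction(roi_mb, reads_millions, used_bases_per_read=36):
    """Fraction of usable filtered sequence that aligned to the ROI.

    Usable sequence is reads (in millions) times the number of bases
    per read used in the analysis, which gives megabases directly.
    """
    usable_mb = reads_millions * used_bases_per_read
    return roi_mb / usable_mb

# Values for NA18507 taken from Table 2 (reads in millions, ROI bases in Mb).
print(round(100 * roi_fraction(601, 29.3), 1))  # MIP
print(round(100 * roi_fraction(193, 9.8), 1))   # MGS

# For MGS the tabulated filtered sequence (412 Mb) includes the index:
# 9.8 M reads x 42 bases is about 412 Mb, whereas only 9.8 x 36 Mb is usable.
print(round(9.8 * 42, 1))
```

Dividing the MGS ROI-aligned bases by the tabulated 412 Mb instead would understate the on-target fraction, which is why the index bases have to be excluded from the denominator.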
`
`Depth and uniformity of coverage
`
`Important metrics of the sequence generated by a capture method
`include the number of sequence reads that overlap each base across
`the ROI and the uniformity of that coverage depth. This is because
`accurate genotype assignments depend on a minimum coverage
`depth (typically 10- to 20-fold) at each base. Uneven depth of cov-
`erage across ROI, or poor uniformity, creates the need for more se-
`quencing to provide sufficient data for poorly represented regions.
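Both metrics can be summarized from a per-base depth vector over the ROI. The following minimal sketch uses made-up depth values; the function name is hypothetical, and ROI positions with no reads are assumed to be present in the vector with a depth of 0.

```python
from statistics import median

def fraction_at_depth(depths, threshold):
    """Fraction of ROI bases covered by at least `threshold` reads."""
    return sum(d >= threshold for d in depths) / len(depths)

# Toy per-base depths for an eight-base ROI.
depths = [0, 5, 10, 15, 20, 25, 30, 100]
print(fraction_at_depth(depths, 10))  # share of bases at 10-fold or more
print(fraction_at_depth(depths, 20))  # share of bases at 20-fold or more
print(median(depths))                 # central depth; a wide range around it
                                      # indicates poor uniformity
```

A single very deep base (100-fold here) consumes many reads without raising either fraction, which is the cost of poor uniformity.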
The proportion of ROI bases with adequate depth of coverage
was consistent across samples for a given method but varied
among the methods (ranging from ~79%–94% with ≥10-fold and
~74%–91% with ≥20-fold depth of coverage) (Table 2). A greater
proportion of the ROI bases had at least 10-fold coverage depth for
MGS, especially compared to the MIP method. This was likely due
to differences in the uniformity of coverage for the different
methods (Fig. 2). The distribution for MIP-derived sequence depth
of coverage was spread over a wide range (one- to 1937-fold,
median at 157-fold); this was less the case for SHS (one- to 675-fold,
median at 101-fold) and MGS (one- to 481-fold, median at 66-fold).
In general, as the depth of coverage distribution widens,
bases with excessively high read depth account for a large share
of the sequence reads, necessitating more sequence data to ensure
that less efficiently captured regions reach the threshold for a
high-quality genotype assignment. Due to its more uniform capture,
MGS achieved high-quality genotype assignments at more ROI
bases. The uniformity for SHS was similar to that of MGS, and its
10-fold and 20-fold coverage depth statistics were almost as high as
for MGS. The coverage distribution for MIP was very broad, resulting
in fewer high-quality genotype assignments.

Table 2.

                 Filtered       Aligned            Aligned to ROI     Coverage of ROI (%)   Sensitivity (%)
Sample   Method  Reads  Bases   Filtered   Bases   Filtered   Bases   10×        20×        ROI called
                 (M)    (Mb)    reads (%)  (Mb)    reads (%)  (Mb)
NA18507  MIP     29.3   1055    78.2       824     57.0       601     79.2       74.1       78.0
NA12878  MIP     29.7   1069    76.4       817     56.2       601     79.0       73.8       77.9
NA18507  SHS     14.2   511     90.8       464     59.2       302     87.2       83.0       86.2
NA12878  SHS     13.9   500     89.9       450     55.4       277     86.5       82.0       85.3
NA18507  MGS     9.8    412     90.1       318     54.7       193     92.3       87.2       91.3
NA12878  MGS     14.3   601     89.7       462     52.0       268     93.7       90.6       93.0

Thirty-six base reads were generated for MIP and SHS, whereas both 36 and 51 base reads were generated for MGS. To make MGS values more
comparable with MIP and SHS data and consistent with other data reported in this paper, only 36 bases of each MGS 51 base read were used. Total
Illumina chastity filtered sequence counts for MGS include the 6-base index at the start of each read.

Figure 2. Depth of coverage distribution. Distributions of depth of coverage at each ROI position. Scales have been standardized for comparison
purposes, and maximum coverage depth values are indicated above the arrow. (A) (MIP) Molecular Inversion Probe. (B) (SHS) Solution Hybrid Selection.
(C) (MGS) Microarray-based Genomic Selection.

Genotype sensitivity

We used a new Bayesian genotype-assigning algorithm, Most
Probable Genotype (MPG), to determine genotypes. Specifically,
we only accepted genotype assignments with an MPG prediction
score of 10 or greater, which we have empirically found to yield
a balance between sensitivity and accuracy (see Methods and
Supplemental Methods). An MPG score of 10 corresponded roughly
with 10× to 20× depth of coverage. We defined genotype
sensitivity as the percentage of ROI bases with MPG assignment scores
of ≥10. The MIP method had the lowest genotype sensitivity at
~78% even with twice the amount of sequence data (Table 2). The
other two methods were similar to each other, with MGS slightly
more sensitive (~93%) than SHS (~86%). This high genotype
sensitivity was consistently achieved, even when examining an
extended set of samples (Fig. 3). We have applied each of these
methods to other projects with multiple samples and have
observed a similarly low variance in genotype sensitivity among
samples within each project (data not shown). These results are
especially encouraging since the probes could be designed to
capture only 93%–96% of the ROI bases for each of the latter two
methods. For example, the MGS probe design covered 94.0% of the
ROI bases, and this method had genotype sensitivity of up to
93.0% overall, which suggests that almost all of the designed target
bases yielded high-quality genotypes. Similarly, the SHS probe
design covered 92.6% of the ROI bases and had a genotype
sensitivity of 86.2%, assigning a high-quality genotype to the vast
majority of the targeted bases. Thus, we can conclude that
genotype sensitivity was good and consistent for all three methods,
but that MGS had the highest genotype sensitivity.

Figure 3. Genotype sensitivity across multiple samples. Boxplots
showing distribution of genotype call sensitivities across multiple samples
(extended sample set) for each capture method. (N) Number of samples.

Bases with no genotype assignments

We also determined the characteristics of the ROI bases with no
genotype assignments. Such information may prove useful for
choosing a genomic enrichment method (or combination of
strategies) for a specific project. We analyzed the genotype assignments
for all three methods at each ROI base and determined the number
of positions in NA18507 covered by all, two, or one method. There
was considerable overlap among the three methods. Specifically,
the same 70.33% of the ROI bases were assigned a genotype by all
three methods, an additional 25.02% were assigned by one or two
of the methods, and 4.65% were not assigned by any method (Fig. 4).
Overall, SHS and MGS had the greatest overlap, sharing genotype
assignments for ~83% of the bases of the ROI. We would expect by
random chance that ~0.2% of our ROI would be unassigned in all
three methods, which is much less than the observed value
(4.65%). Similarly, ~61% of the ROI is expected by chance to have
genotype assignments in all three methods, which is also less
than the observed value (70.33%). This suggests that similarities
among the three methods contribute to the observed overlap. One
similarity is the avoidance of repetitive sequence during probe
design. Our ROI contains ~60 kb that overlaps with the Segmental
Duplication Track from the UCSC Genome Browser (Bailey et al.
2001). Of these bases, we were unable to assign genotypes for
~19 kb with SHS, ~22 kb with MIP, and ~31 kb with MGS. The
high percentage of bases without genotype assignments in these
regions confirms that all methods are avoiding repeats as part of
probe design, with MGS being slightly more stringent.
Unassigned genotypes specific to each method may have
resulted from limitations of probe design, low depth of coverage
(due to capture and/or sequencing issues), or a combination of
these problems. We analyzed the GC content of the bases that were
targeted by the capture probes, but where no genotypes were
assigned. The GC content of unassigned SHS bases was 67%, MIP
was 61%, and MGS was 58%, compared to 50.5% for the overall
ROI. Thus, for all methods, genotypes were not assigned for a
similar set of targeted bases, and these bases were slightly biased
toward a high GC content.

Figure 4. ROI regions with genotype assignments. The Venn diagram of overlapping genotype
coverage is area proportional. Colored rectangles identify the proportion of genotype assignments in
the ROI for each method: (red) MIP; (green) SHS; (blue) MGS. Note that the greatest overlap is among
all three methods, and the second greatest is between SHS and MGS. The numbers sum to 95.35%
because 4.65% of the ROI was not assigned a genotype in any method.

Genotype concordance

One of the most critical metrics we assessed was accuracy of the
assigned genotypes. Since there is no "gold standard" data set of
known genotypes at every genomic position for a sample that we
sequenced, we evaluated accuracy by measuring concordance of our
genotype assignments with two different data sets. The first
comparison data set was genotypes derived from analysis with the
Infinium 1M SNP BeadChip (1M chip). The second was genotypes
derived by aligning the 30-fold coverage whole-genome shotgun
(WG) sequence data generated from HapMap sample NA18507
(Bentley et al. 2008). We used ELAND and called genotypes using
MPG at the same thresholds as those used with our capture data.
In each case, we compared genotype assignments at ROI bases
where the assessed quality of the data from the capture method
and the comparison data set was high.
For the comparison with the 1M chip data, we compared
genotypes for the two samples: NA18507 and NA12878. Each of the
three capture methods achieved a genotype concordance of >99.84%
(Table 3). Similar concordances were seen at heterozygous sites,
which are notably more challenging to assign than homozygous
sites; specifically, the concordance rate at positions found to be
heterozygous with the 1M chip was >99.57% for all methods
(Table 3). Thus genotype assignments from targeted capture
sequence data are highly accurate, even at variant positions.
Since only about 2600 genotyped bases from the 1M chip data
were represented in our ROI, we also compared the genotypes
from the various capture methods to those derived from the WG
NA18507 data, which had genotype assignments for 67.3%
(1.76/2.61 Mb) of the ROI bases. These WG data provided a larger
basis for comparison, including bases not known to be
polymorphic. All three methods yielded genotypes with >99.998%
concordance with the WG genotypes. When considering only the
heterozygous sites in the WG data, the concordance was lower
than it was for all sites: 98.111% for MIP, 98.735% for MGS, and
99.609% for SHS.
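The by-chance figures quoted in the discussion of bases with no genotype assignments (~0.2% of the ROI unassigned by all three methods, ~61% assigned by all three) can be approximated by treating the three methods as independent. The sketch below is an assumed reconstruction of that calculation using the NA18507 "ROI called" sensitivities from Table 2; the paper does not state the exact inputs, so small differences from the quoted values are expected.

```python
from math import prod

# Genotype sensitivity for NA18507 (Table 2, "ROI called"), as fractions.
sensitivity = {"MIP": 0.780, "SHS": 0.862, "MGS": 0.913}

# If the methods failed independently, a base would be assigned by all
# three with probability equal to the product of the sensitivities ...
expected_all_three = prod(sensitivity.values())

# ... and missed by all three with the product of the miss rates.
expected_none = prod(1.0 - s for s in sensitivity.values())

print(f"expected assigned by all three: {100 * expected_all_three:.1f}%")
print(f"expected assigned by none:      {100 * expected_none:.2f}%")
```

Both observed values (70.33% and 4.65%) exceed these independence estimates, which is the basis for the conclusion that the methods tend to succeed and fail on the same bases.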
`
Table 3. Genotype concordance

Column groups: 1M chip = 1M SNP BeadChip genotypes concordant with capture genotypes (%);
WG = 30-fold coverage whole-genome genotypes concordant with capture genotypes (%).

                 Filtered       1M chip                          WG
Sample   Method  Reads  Bases   All        Hets in    Hets in    All        Hets in    Hets in
                 (M)    (Mb)    genotypes  1M chip    capture    genotypes  WG         capture
NA18507  MIP     29.3   1055    99.955     100.000    99.756     99.998     98.111     99.296
NA12878  MIP     29.7   1069    100.000    100.000    100.000    -          -          -
NA18507  SHS     14.2   511     99.839     99.566     99.566     99.999     99.609     98.840
NA12878  SHS     13.9   500     99.960     99.807     100.000    -          -          -
NA18507  MGS     9.8    412     99.920     99.780     99.781     99.999     98.735     98.928
NA12878  MGS     14.3   601     99.922     99.625     100.000    -          -          -

Concordance among capture methods and both 1M BeadChip and 30-fold coverage whole-genome standards. Concordance was calculated at all
positions; heterozygous positions (Hets) in BeadChip and whole genome, and heterozygous positions in capture methods.
`
`
`To assess the false-positive rate, we compared bases found to
`be heterozygous in the three capture methods with 1M chip and
`WG data (Table 3). Depending on the method, 99.57%–99.78% of
`the heterozygous genotypes derived from capture data agreed with
`the 1M chip genotypes for NA18507 (one to two discordant posi-
`tions in each method), while 100% concordant genotypes were
`seen for NA12878 over the 400–500 positions compared. Further
`comparison with the more comprehensive WG data for NA18507
`showed slightly lower concordance: 98.840% for SHS, 98.928% for
`MGS, and 99.296% for MIP over the approximately 1000 positions
`compared. There were six to 12 discordant positions in each
`method when compared to WG data. Thus, the false-positive rate
`was low (<1.2%) for all three methods. We conclude that all three
`capture methods, when coupled with adequate sequence depth of
`coverage, yield highly accurate genotype calls.
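
The false-positive check described above can be sketched as follows: among positions called heterozygous in the capture data, count how many agree with a reference genotype set. This is a minimal sketch; the function name `het_concordance`, the dictionary layout (position mapped to a two-letter genotype string), and the toy genotypes are assumptions made for illustration, not the authors' implementation.

```python
def het_concordance(capture: dict, reference: dict):
    """Compare heterozygous capture calls with reference genotypes.

    Returns (positions compared, discordant positions, percent concordant).
    Only positions that are heterozygous in capture AND genotyped in the
    reference are compared.
    """
    compared = 0
    discordant = 0
    for pos, genotype in capture.items():
        if len(set(genotype)) < 2:
            continue  # homozygous capture call: not part of this check
        ref_genotype = reference.get(pos)
        if ref_genotype is None:
            continue  # no reference genotype at this position
        compared += 1
        # Allele order is irrelevant: "AG" and "GA" are the same genotype.
        if sorted(genotype) != sorted(ref_genotype):
            discordant += 1
    if compared == 0:
        raise ValueError("no heterozygous positions shared with reference")
    percent = 100.0 * (compared - discordant) / compared
    return compared, discordant, percent


# Toy genotypes (hypothetical), keyed by position.
capture_calls = {1: "AG", 2: "AA", 3: "CT", 4: "GT", 5: "AC"}
reference_calls = {1: "GA", 2: "AA", 3: "CC", 4: "GT"}

n, bad, pct = het_concordance(capture_calls, reference_calls)
print(f"{n} positions compared, {bad} discordant, {pct:.3f}% concordant")
```

By the same arithmetic, the lowest concordance reported above (98.840% for SHS against WG data) corresponds to a discordance of 1.160%, which is the basis for the <1.2% bound.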
`
`Alignment artifacts affecting genotype assignments
`
`The data presented here were generated using standard ELAND
`alignments. One significant limitation of the ELAND versions we
`used (v1.3.4 and v1.4.0) is that they are gapless aligners; they do
`not consider insertions or deletions. Thus, not only are deletion/
`insertion variants (DIVs) not detected, but errors can also be
`introduced at these sites. When a sequence read spans a DIV, every
`base after the start of the DIV can appear as a mismatch. When this
`occurs within the first 32 bases of a sequence read, it will fail to
`align due to the high number of mismatches. However, when this
`occurs beyond the first 32 bases, ELAND will align the read and
`note the position(s) of the mismatch(es). These mismatches then
`appear to the genotype-assigning software as closely clustered
`nonreference variants (in reality, false-positives), not as DIVs.
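
The artifact described above can be reproduced with a toy gapless comparison. The sketch below is not ELAND's algorithm; it simply places a read on a reference without allowing gaps and lists the mismatching read positions. The function name `gapless_mismatches`, the reference sequence, and the read are all hypothetical.

```python
def gapless_mismatches(reference: str, read: str, start: int) -> list:
    """Return 0-based read positions that mismatch the reference when the
    read is laid down at `start` with no insertions or deletions allowed."""
    window = reference[start:start + len(read)]
    return [i for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base]


reference = "ACGTTGCAAGCTTAGGCATCGA"

# Hypothetical read carrying a 2-base deletion: reference bases 8 and 9
# are absent, so everything downstream is shifted by two positions.
read_with_deletion = reference[:8] + reference[10:20]

mismatches = gapless_mismatches(reference, read_with_deletion, 0)
print("first mismatch at read position:", mismatches[0])
print("number of mismatching positions:", len(mismatches))
```

In this toy case the bases before the deletion match and the mismatches begin at the deletion site, which is how a single DIV comes to resemble a cluster of nonreference variants. In real data some downstream bases will match by chance, so the cluster need not be contiguous.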
`To detect DIVs, we used a gapped aligner, cross_match (http://
`www.phrap.org), to align those reads not aligned by ELAND. We
`then
