`Systematic comparison of three genomic enrichment
`methods for massively parallel DNA sequencing
`Jamie K. Teer,1,3 Lori L. Bonnycastle,1,3 Peter S. Chines,1 Nancy F. Hansen,1
`Natsuyo Aoyama,2 Amy J. Swift,1 Hatice Ozel Abaan,1 Thomas J. Albert,2
`NISC Comparative Sequencing Program,1 Elliott H. Margulies,1 Eric D. Green,1
`Francis S. Collins,1,4 James C. Mullikin,1,4 and Leslie G. Biesecker1,4
`1National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA; 2Roche NimbleGen
`Inc., Madison, Wisconsin 53719, USA
`
`Massively parallel DNA sequencing technologies have greatly increased our ability to generate large amounts of se-
`quencing data at a rapid pace. Several methods have been developed to enrich for genomic regions of interest for targeted
`sequencing. We have compared three of these methods: Molecular Inversion Probes (MIP), Solution Hybrid Selection
`(SHS), and Microarray-based Genomic Selection (MGS). Using HapMap DNA samples, we compared each of these
`methods with respect to their ability to capture an identical set of exons and evolutionarily conserved regions associated
`with 528 genes (2.61 Mb). For sequence analysis, we developed and used a novel Bayesian genotype-assigning algorithm,
`Most Probable Genotype (MPG). All three capture methods were effective, but sensitivities (percentage of targeted bases
`associated with high-quality genotypes) varied for an equivalent amount of pass-filtered sequence: for example, 70%
`(MIP), 84% (SHS), and 91% (MGS) for 400 Mb. In contrast, all methods yielded similar accuracies of >99.84% when
`compared to Infinium 1M SNP BeadChip-derived genotypes and >99.998% when compared to 30-fold coverage whole-
`genome shotgun sequencing data. We also observed a low false-positive rate with all three methods; of the heterozygous
`positions identified by each of the capture methods, >99.57% agreed with 1M SNP BeadChip, and >98.840% agreed with
`the whole-genome shotgun data. In addition, we successfully piloted the genomic enrichment of a set of 12 pooled samples
`via the MGS method using molecular bar codes. We find that these three genomic enrichment methods are highly
`accurate and practical, with sensitivities comparable to that of 30-fold coverage whole-genome shotgun data.
`
`[Supplemental material is available online at http://www.genome.org. The sequence data from this study have been
`submitted to the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) under accession no.
`SRA022076. Bam2mpg is freely available at http://research.nhgri.nih.gov/software/bam2mpg.]
`
`The ability to identify genetic alterations underlying disease and
`phenotypic variation is a major goal of the ongoing genomics
`revolution. Large-scale medical sequencing holds the promise of
`elucidating the genetic architecture of virtually all diseases, in-
`cluding the relative role of rare and common genetic variants. Such
`information should inform our understanding of the pathways
`involved in disease pathogenesis. Additionally, the ability to apply
`variant discovery to other organisms with annotated genomes is
`also of great value. So-called next-generation DNA sequencing
`technologies that involve massively parallel clonal ensemble se-
`quencing are creating new opportunities for comprehensive ge-
`nomic interrogation, bypassing some of the limitations associated
`with high-throughput electrophoresis-based sequencing methods
`(Margulies et al. 2005; Shendure et al. 2005; Bentley et al. 2008).
`Although advances in these new technologies are being made at
`a rapid pace, the cost of sequencing a whole human genome and
`the associated costs and technical hurdles of data storage and
analysis remain high for most large projects. As a result, methods
are needed for efficient comprehensive sequencing of targeted
genomic regions.

3These authors contributed equally to this work.
4Corresponding authors.
E-mail collinsf@mail.nih.gov; fax (301) 402-2218.
E-mail mullikin@mail.nih.gov; fax (301) 480-0634.
E-mail leslieb@helix.nih.gov; fax (301) 402-2170.
Article published online before print. Article and publication date are at
http://www.genome.org/cgi/doi/10.1101/gr.106716.110.

`A number of genomic enrichment (or targeted capture)
`methods have been developed and used with varying levels of
`success (for reviews, see Garber 2008; Summerer 2009; Turner et al.
`2009b; Mamanova et al. 2010). Not surprisingly, each method has
`both common and unique issues related to the required input
`DNA, specificity and coverage, sensitivity and accuracy, scalability
`and potential for automation, and cost-effectiveness. Although
`multiple publications describe the individual methods (Albert et al.
`2007; Hodges et al. 2007; Okou et al. 2007; Porreca et al. 2007;
`Craig et al. 2008; Krishnakumar et al. 2008; Bau et al. 2009; Gnirke
`et al. 2009; Herman et al. 2009; Hodges et al. 2009; Li et al. 2009;
`Summerer et al. 2009; Tewhey et al. 2009), numerous variables are
`inherent to each experiment, including the nature of the targeted
`genomic region(s), sequencing platform, analysis software, and
`performance metrics; these variables make objective comparisons
`of the different methods challenging to perform.
Here, we report the testing, optimization, and rigorous comparison
of three genomic enrichment methods: Molecular Inversion
Probes (MIP) (Porreca et al. 2007), Solution Hybrid Selection
`(SHS; Agilent) (Gnirke et al. 2009), and Microarray-based Genomic
`Selection (MGS; Roche-NimbleGen) (Albert et al. 2007; Okou et al.
`2007). All three methods were tested for their ability to capture the
same 2.61 Mb of noncontiguous DNA sequence from an overlapping
set of two HapMap DNA samples. We also report, for the first time,
a novel Bayesian genotype-assigning algorithm, Most Probable
Genotype (MPG), which was used to analyze the sequence data from
all three capture methods. Furthermore, we introduce a pre-capture
bar-coding strategy to allow pooling of samples for capture and
sequencing. In contrast to previous reviews of genomic enrichment
methods, we have directly compared these methods by evaluating
genotype sensitivity and accuracy of variant detection, providing
insights about overall performance and relative costs.

1420 Genome Research
www.genome.org

20:1420–1431; ISSN 1088-9051/10; www.genome.org

PGDX EX. 1018
Page 1 of 12
`
`Results
`
`Both of the solution-based methods—MIP (Fig. 1A) and SHS (Fig.
`1B)—are highly multiplexed capture methods that can enrich for
`a large number of genomic regions. We have incorporated pub-
`lished improvements to the original MIP-based method (Li et al.
`2009; Turner et al. 2009a), including improvements to facilitate
`automated and effective probe design; accuracy of the MIP-based
`method has also been improved by segregating overlapping probes
`to two separate probe sets. The 384K MGS array (Fig. 1C) also al-
`lows for highly multiplexed capture of genomic regions in a single
`hybridization but uses a solid-phase DNA-oligonucleotide hy-
`bridization step; we adapted the original MGS-based protocol
`(Albert et al. 2007) to increase capture efficiency and to allow for
sample pooling to reduce costs. Considerations specific to individual
methods led us to modify or adapt them for this study; these
modifications are described below, before the overall results of
the study.
`
`Molecular Inversion Probes (MIPs)
`
`In the process of implementing an MIP-based genomic enrich-
`ment method, we found that our sequence data included the probe
`arm sequences that were designed to hybridize to DNA flanking
`the region of interest (ROI) (data not shown). This occurred when
`probes were required to overlap each other in order to capture a
`long region. Because these arm sequences of the MIP probes were
`sequenced and aligned, they could falsely increase the proportion
`of the reference allele and yield false-negative results. In addition
`to the probe arm sequences, some MIP backbone sequence (non-
`human) was also present and aligned to the human reference se-
`quence with several mismatches, which generated significant
`numbers of false-positive variants.
`One approach for avoiding such artifacts involves end-se-
`quencing the captured DNA fragments, thus avoiding shotgun li-
`brary preparation and the inclusion of probe backbone sequence.
`The increasing capabilities of sequencers to generate longer read
`lengths make this approach more feasible, which may improve
`alignment rates and provide better coverage and higher sensitiv-
`ities (Turner et al. 2009a). We tested this strategy by designing
`probes to target ROI fragments of about 105 bases with 76-base
`end-sequences; however, only 55% of the ROI bases could be
`covered by this design, which we concluded was unacceptable.
`We avoided these artifacts by dividing the capture MIPs into
`two probe sets. Although the two probe sets overlapped each other,
`the probes within each set were nonoverlapping. Within a non-
`overlapping set, the targeting arm and probe backbone sequences
`are distinguishable from the captured DNA and thus can be ex-
`cluded following alignment. For the entire capture-to-sequencing
`process, each of the two probe sets was treated independently.
`After sequencing and alignment, the targeting arm and backbone
`bases were computationally removed, and the data were combined.
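
A minimal sketch of how overlapping probes might be divided into two internally nonoverlapping sets is given below. It assumes each probe is represented by a hypothetical (start, end) interval on a single chromosome and that no more than two probes overlap at any position; the function name and coordinates are illustrative and are not taken from the probe-design software used in this study.

```python
def split_into_two_sets(probes):
    """Assign (start, end) probe intervals to two sets so that probes
    within a set never overlap. Intervals are treated as half-open
    [start, end). Raises ValueError if two sets are not enough."""
    sets = ([], [])
    last_end = [None, None]  # end coordinate of the last probe placed in each set
    for start, end in sorted(probes):
        for i in (0, 1):
            if last_end[i] is None or start >= last_end[i]:
                sets[i].append((start, end))
                last_end[i] = end
                break
        else:
            raise ValueError(f"probe {(start, end)} overlaps probes in both sets")
    return sets


# Hypothetical tiling of a long region with overlapping probes.
probes = [(100, 205), (180, 285), (260, 365), (340, 445), (600, 705)]
set_a, set_b = split_into_two_sets(probes)
print(set_a)
print(set_b)
```

Because probes are processed in order of start coordinate and placed in the first set where they fit, neighboring overlapping probes alternate between the two sets, which mirrors the idea of capturing and sequencing each set independently.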
`
`Genomic enrichment methods comparison
`
`Figure 1. Three genomic enrichment methods. (A) (MIP) Molecular
`Inversion Probe: 70 base probes are prepared and hybridized to genomic
`DNA. Capture occurs by filling in sequence between the probe-targeting
`arms with polymerase and then sealing the circle with ligase. Total ge-
`nomic DNA is removed with nucleases. The remaining closed circles un-
`dergo shotgun library and sequencing library preparation, followed by
`sequencing. (B) (SHS) Solution Hybrid Selection: A sequencing library is
`prepared from genomic DNA. This library is hybridized to biotinylated
`RNA probes in solution and recovered with streptavidin beads. Eluted
`products are amplified prior to cluster generation and sequencing. (C )
`(MGS) Microarray-based Genomic Selection: A sequencing library is pre-
`pared from genomic DNA and hybridized to a capture array. Eluted
`products are amplified prior to cluster generation and sequencing.
`
`Teer et al.
`
`As this strategy required the separate sequencing of the DNA
`captured from each probe set, we needed two lanes, which gener-
`ated twice as much sequence data for the MIP method compared
to the other two; we used this full amount of sequence (~1.1 Gb)
`in our comparison analyses throughout this study. Future appli-
`cation of this strategy could include indexing of each probe set
`to allow merging of the individually captured DNA from each set,
`eliminating the need for two separate sequencing lanes (and the
`extra sequence) for one sample.
`
`Solution Hybrid Selection (SHS)
`
`Following the initial description of SHS (Gnirke et al. 2009),
`Agilent Technologies released a commercial version (SureSelect).
`Due to improvements communicated by the vendor, we chose to
`implement SHS using these kits and performed the captures as
`described by the manufacturer.
`
`MGS and pooled bar-coded samples
`
`Considering the high cost of MGS (Roche NimbleGen) arrays and
`the increasing single-lane sequencing capacity of next-generation
`sequencing platforms, we developed a novel strategy that enables
`the efficient capture of a pool of 12 individually bar-coded Illu-
`mina paired-end (PE) libraries. In contrast to most methods used
`for bar-coding samples for pool sequencing, this strategy is opti-
`mized for use in targeted hybrid capture, followed by sequencing.
`Specifically, we made a 6-base indexed paired-end library for each
`of 12 genomic DNA samples (including the same two HapMap
`samples as were used in the MIP and SHS methods) and simulta-
`neously performed MGS of all 12 samples with one array. The in-
`dex served as a bar code and provided individual-specific sequence
`data for the array-enriched targets after sequencing. Reads were
`assigned to each bar code if the index sequence had no more than
`one mismatch. Of the reads that passed the quality filter (Illumina
GERALD chastity filter), we were able to assign ~99% to a unique
sample using this indexing strategy. A narrow distribution of
sequencing depth across samples, or sample uniformity, is critical
when sequencing bar-coded pools to avoid over-representation
of some samples at the expense of others. Poor sample uniformity
would require additional sequencing to provide adequate
depth of coverage of the samples with lower representation. Perfect
sample uniformity would yield 8.3% of filtered reads originating
from each of the 12 samples. The difference between the
bar-coded samples with the highest (11.4%) and lowest (5.0%)
number of filtered reads was slightly over twofold (Table 1). We
conclude that this level of sample uniformity is acceptable for
many potential applications of this target enrichment method.

Table 1. Indexed samples pooled for MGS capture and sequencing

`Another important aspect of sequencing pooled samples is
`to have each ROI base covered to a similar level across all samples
`in the pool. This allows for consistent interrogation of the same
`ROI bases for all samples and facilitates follow-up attempts to
`re-sequence only the smaller subset of missed regions. We found
`that samples were fairly uniform with respect to well-covered and
poorly covered bases. Approximately 78% of the ROI bases had
good coverage (≥20-fold redundancy in all 12 samples), 20% had
variable coverage (one or more of the indexed samples had 1- to
19-fold coverage), and 2% had no coverage in any sample. Bases
`with no coverage were mainly those for which probes could not
`be designed.
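
The bar-code assignment rule described in this section (an index sequence with no more than one mismatch) could be implemented along the following lines. The 6-base index sequences and sample names are hypothetical, and treating a read that matches more than one bar code as unassigned is an assumption of this sketch rather than a documented detail of the pipeline.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, barcodes, max_mismatches=1):
    """Return the sample whose bar code is within max_mismatches of the
    index read, or None if no bar code (or more than one) qualifies."""
    hits = [
        sample
        for sample, barcode in barcodes.items()
        if hamming(index_read, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical 6-base indexes; the real index sequences are not given here.
barcodes = {"sample_1": "ACGTAC", "sample_2": "TGCATG", "sample_3": "GATCGA"}

print(assign_sample("ACGTAC", barcodes))  # exact match
print(assign_sample("ACGTAA", barcodes))  # one mismatch
print(assign_sample("AAAAAA", barcodes))  # too many mismatches
```

A one-mismatch tolerance is only unambiguous when every pair of bar codes differs at three or more positions, which is one reason index sets are usually designed with a minimum pairwise distance.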
`
`Input DNA requirements
`
`Sequencing projects often involve analyzing samples with limited
DNA resources. We performed MIP capture with 1 µg of input
DNA, the lowest amount of the three methods. The SHS and MGS
protocols required 3 µg and 4 µg of DNA, respectively. Of note, the
modified MGS protocol described here required 4 µg of starting
DNA to generate a library suitable for capture and subsequent
sequencing on the Illumina GAII platform. Although this reflects an
80% reduction in the amount of input DNA recommended by the
Roche/NimbleGen MGS protocol (4 vs. 20 µg), we believe that an
even smaller amount of DNA may work as well. We started by
shearing 4 µg of input DNA to account for the variability and
inaccuracies of measured DNA concentration values; the subsequent
library preparation steps (end repair and adapter ligation)
proceeded with only 2 µg of sheared DNA. Thus, to further reduce
input DNA requirement, more attention could be placed on
re-quantifying DNA with more reliable, but more time-consuming,
methods.

Table 1 (continued)

            Pass filter                         Aligned                 Aligned to ROI          Coverage depth (% of bases in ROI)
            Reads       Percent of  Sequence    Percent of  Sequence    Percent of  Sequence
Index       (M)         all reads   (Mb)        all reads   (Mb)        all reads   (Mb)        1×      10×     20×
1           17.1        11.2        720         11.4        560         12.3        341         96.6    94.3    91.9
2           15.7        10.2        658         10.2        503         9.3         258         96.5    93.7    90.6
3           14.3        9.3         601         9.4         464         9.7         268         96.5    93.7    90.5
4           17.5        11.4        737         11.4        562         10.5        290         96.5    93.7    90.4
5           13.7        8.9         577         8.9         441         8.6         238         96.3    93.4    89.9
6           13.3        8.6         558         8.7         431         9.1         251         96.4    93.5    89.9
7           12.5        8.2         526         8.2         404         8.3         230         96.3    93.2    89.5
8           11.2        7.3         469         7.3         360         7.2         200         96.3    92.7    88.2
9           9.8         6.4         411         6.4         318         7.0         193         96.2    92.3    87.2
10          10.1        6.6         423         6.6         323         6.3         175         96.2    92.1    86.6
11          9.7         6.3         409         6.4         315         6.5         181         96.0    91.9    86.2
12          7.6         5.0         321         5.0         248         5.3         146         96.1    91.2    83.8
Unassigned  1.0         0.7         44          0.6         29          0.7         19
All         153.6                   6453                    4961                    2789
Average     12.7                    534                     411                     231

Both 36 and 51 base reads were generated for MGS. Thirty-six bases of each MGS 51 base read were used. Total Illumina chastity filtered sequence counts
for MGS include the 6-base index at the start of each read.
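
The sample-uniformity figures quoted for the bar-coded pool can be reproduced from the pass-filter read counts in Table 1. The sketch below is only a worked check of that arithmetic; the variable names are illustrative.

```python
# Pass-filter reads (millions) for indexes 1-12, copied from Table 1.
reads_m = [17.1, 15.7, 14.3, 17.5, 13.7, 13.3, 12.5, 11.2, 9.8, 10.1, 9.7, 7.6]
unassigned_m = 1.0

total_m = sum(reads_m) + unassigned_m
shares = [100.0 * r / total_m for r in reads_m]  # percent of all filtered reads
ideal_share = 100.0 / len(reads_m)               # perfect uniformity across 12 samples
fold_range = max(reads_m) / min(reads_m)         # highest vs. lowest sample

print(f"ideal share per sample: {ideal_share:.1f}%")
print(f"highest share: {max(shares):.1f}%, lowest share: {min(shares):.1f}%")
print(f"fold difference: {fold_range:.2f}")
```

The highest and lowest shares come out at about 11.4% and 5.0%, and their ratio is a little above two, consistent with the "slightly over twofold" difference described in the text.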
`
`Comparison of methods
`
`Our regions of interest comprised 2.61 Mb of noncontiguous hu-
`man genome sequence consisting of exons and conserved ele-
`ments associated with 528 genes (see Methods). Successful probe
`designs were achieved for 95.9% (2.506/2.612 Mb) of the ROI bases
`for the MIP, 92.6% (2.419/2.612 Mb) for the SHS, and 93.8%
`(2.450/2.612 Mb) for the MGS methods. Since sequencing projects
`aim to interrogate the entirety of the ROI, we assessed the perfor-
`mance metrics based on the entire 2.61 Mb of ROI rather than only
`on the regions for which suitable probes could be designed. Fur-
`thermore, analysis of the ROI across all three methods allowed
`us to make more appropriate comparisons for this study.
`We performed targeted enrichment and sequencing on mul-
`tiple samples for each of the methods. Although the majority of
`the samples from the MIP and SHS methods were distinct from
`those of the indexed samples pooled for the MGS capture, there
`were two HapMap samples in common across all three methods
`(NA18507 and NA12878). The amount of sequence generated for
`each of the indexed samples was comparable to that for the single
`sample experiments (Table 2). The data presented here compare
only these two samples unless otherwise noted (i.e., “extended
sample sets” refers to all samples).
`
`Fraction of sequences that align to ROI
`
`The first metric examined was the fraction of the captured DNA
`that aligned to the ROI. This metric reflects the ability of the
`method to enrich for appropriate targets and greatly influences the
`cost (as more sequencing is required for a lower aligned fraction).
`Approximately 10 to 14 million filtered reads were obtained for
`each of the two samples captured with either the SHS or MGS
`methods; in contrast, about 30 million filtered reads were obtained
`with the MIP method due to the need to generate two non-
`overlapping captures (see above) (Table 2). We used ELAND (Illu-
`mina, Inc.) to align the sequence reads to the reference human
`genome and then calculated depth of coverage for our ROI. The
`sequencing was performed at different times during the project,
`and the capability of the sequencer evolved during this time pe-
`riod. Consequently, the amount of generated sequence varied for
`a given number of reads among the methods. To equalize the
comparisons, we only used a maximum of 36 bases of each sequence
read when calculating the various metrics in this analysis.
Furthermore, the total filtered sequence data count for MGS included
the 6-base index at the beginning of each sequence read,
accounting for the extra nontarget sequence. Although the proportion
of filtered reads that aligned uniquely to the genome
varied (76%–90%), the amount that aligned uniquely to the ROI
was more consistent (52%–59%) for all methods and samples.

Table 2. Genotype sensitivity

`A smaller fraction of MIP-generated reads aligned to the ge-
`nome as compared to SHS- or MGS-generated reads. Much of this
`difference was due to the MIP probe-backbone sequence, which
`is still in the library at this stage; it can only be computationally
`removed once the alignment is completed, and the captured and
`designed/backbone bases can be differentiated. As the backbone
`sequence is not human, it added many mismatches to a given
`alignment and prevented these reads from aligning to the genome.
`Even though a smaller fraction of MIP-generated reads aligned to the
`genome, a higher fraction of the genome-aligned reads (compared
`to the other two methods) overlapped the ROI, resulting in a total
`fraction of ROI-aligning reads that was similar to that seen with the
`other methods.
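
The relationship described here can be checked with the percentages reported in Table 2: dividing the ROI-aligned fraction by the genome-aligned fraction gives the share of aligned reads that overlap the ROI. This is a small worked example, not part of the original analysis, and the dictionary layout is an arbitrary choice.

```python
# (aligned %, aligned-to-ROI %) of filtered reads, copied from Table 2.
table2 = {
    ("NA18507", "MIP"): (78.2, 57.0),
    ("NA12878", "MIP"): (76.4, 56.2),
    ("NA18507", "SHS"): (90.8, 59.2),
    ("NA12878", "SHS"): (89.9, 55.4),
    ("NA18507", "MGS"): (90.1, 54.7),
    ("NA12878", "MGS"): (89.7, 52.0),
}

# Share of genome-aligned reads that also overlap the ROI.
roi_given_aligned = {
    key: 100.0 * roi / aligned for key, (aligned, roi) in table2.items()
}

for (sample, method), pct in roi_given_aligned.items():
    print(f"{sample} {method}: {pct:.1f}% of aligned reads overlap the ROI")
```

For MIP roughly 73% of genome-aligned reads overlap the ROI, compared with about 58% to 65% for SHS and MGS, which is how a lower genome-alignment rate still yields a similar overall ROI-aligned fraction.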
`
`Depth and uniformity of coverage
`
`Important metrics of the sequence generated by a capture method
`include the number of sequence reads that overlap each base across
`the ROI and the uniformity of that coverage depth. This is because
`accurate genotype assignments depend on a minimum coverage
`depth (typically 10- to 20-fold) at each base. Uneven depth of cov-
`erage across ROI, or poor uniformity, creates the need for more se-
`quencing to provide sufficient data for poorly represented regions.
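
As an illustration of the coverage metrics used in this section, the fraction of ROI bases reaching a given depth can be computed directly from per-base depth values. The depth list below is hypothetical and far shorter than a real 2.61-Mb ROI; the function name is likewise illustrative.

```python
def percent_at_depth(depths, threshold):
    """Percentage of positions whose read depth is at least `threshold`."""
    return 100.0 * sum(d >= threshold for d in depths) / len(depths)


# Hypothetical per-base read depths across ten ROI positions.
depths = [0, 3, 12, 25, 40, 8, 19, 150, 22, 10]

for threshold in (1, 10, 20):
    print(f">={threshold}-fold: {percent_at_depth(depths, threshold):.0f}% of bases")
```

In this toy example a single position at 150-fold depth consumes a large share of the reads while several positions fall below 10-fold, the pattern that forces additional sequencing when uniformity is poor.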
`The proportion of ROI bases with adequate depth of coverage
`was consistent across samples for a given method but varied
among the methods (ranging from ~79%–94% with ≥10-fold and
~74%–91% with ≥20-fold depth of coverage) (Table 2). A greater
`proportion of the ROI bases had at least 10-fold coverage depth for
`MGS, especially compared to the MIP method. This was likely due
`to differences in the uniformity of coverage for the different
`methods (Fig. 2). The distribution for MIP-derived sequence depth
`of coverage was spread over a wide range (one- to 1937-fold, me-
`dian at 157-fold); this was less the case for SHS (one- to 675-fold,
`median at 101-fold) and MGS (one- to 481-fold, median at 66-
`fold). In general, as the depth of coverage distribution widens,
`bases with excessively high read depth account for a large share
`of the sequence reads, necessitating more sequence data to ensure
that less efficiently captured regions reach the threshold for a
high-quality genotype assignment. Due to its more uniform capture,
MGS achieved high-quality genotype assignments at more ROI
bases. The uniformity for SHS was similar to that of MGS, and its
10-fold and 20-fold coverage depth statistics were almost as high as
for MGS. The coverage distribution for MIP was very broad, resulting
in fewer high-quality genotype assignments.

Table 2 (continued)

                  Filtered                Aligned                 Aligned to ROI          Coverage of ROI (%)   Sensitivity (%)
                  Reads       Bases       Filtered    Bases       Filtered    Bases
Sample    Method  (M)         (Mb)        reads (%)   (Mb)        reads (%)   (Mb)        10×       20×         ROI called
NA18507   MIP     29.3        1055        78.2        824         57.0        601         79.2      74.1        78.0
NA12878   MIP     29.7        1069        76.4        817         56.2        601         79.0      73.8        77.9
NA18507   SHS     14.2        511         90.8        464         59.2        302         87.2      83.0        86.2
NA12878   SHS     13.9        500         89.9        450         55.4        277         86.5      82.0        85.3
NA18507   MGS     9.8         412         90.1        318         54.7        193         92.3      87.2        91.3
NA12878   MGS     14.3        601         89.7        462         52.0        268         93.7      90.6        93.0

Thirty-six base reads were generated for MIP and SHS, whereas both 36 and 51 base reads were generated for MGS. To make MGS values more
comparable with MIP and SHS data and consistent with other data reported in this paper, only 36 bases of each MGS 51 base read were used. Total
Illumina chastity filtered sequence counts for MGS include the 6-base index at the start of each read.

Figure 2. Depth of coverage distribution. Distributions of depth of coverage at each ROI position. Scales have been standardized for comparison
purposes, and maximum coverage depth values are indicated above the arrow. (A) (MIP) Molecular Inversion Probe. (B) (SHS) Solution Hybrid Selection.
(C) (MGS) Microarray-based Genomic Selection.

Genotype sensitivity

We used a new Bayesian genotype-assigning algorithm, Most
Probable Genotype (MPG), to determine genotypes. Specifically,
we only accepted genotype assignments with an MPG prediction
score of 10 or greater, which we have empirically found to yield
a balance between sensitivity and accuracy (see Methods and
Supplemental Methods). An MPG score of 10 corresponded roughly
with 10× to 20× depth of coverage. We defined genotype sensitivity
as the percentage of ROI bases with MPG assignment scores
of ≥10. The MIP method had the lowest genotype sensitivity at
~78% even with twice the amount of sequence data (Table 2). The
other two methods were similar to each other, with MGS slightly
more sensitive (~93%) than SHS (~86%). This high genotype
sensitivity was consistently achieved, even when examining an
extended set of samples (Fig. 3). We have applied each of these
methods to other projects with multiple samples and have observed
a similarly low variance in genotype sensitivity among
samples within each project (data not shown). These results are
especially encouraging since the probes could be designed to
capture only 93%–96% of the ROI bases for each of the latter two
methods. For example, the MGS probe design covered 94.0% of the
ROI bases, and this method had genotype sensitivity of up to
93.0% overall, which suggests that almost all of the designed target
bases yielded high-quality genotypes. Similarly, the SHS probe
design covered 92.6% of the ROI bases and had a genotype sensitivity
of 86.2%, assigning a high-quality genotype to the vast
majority of the targeted bases. Thus, we can conclude that genotype
sensitivity was good and consistent for all three methods,
but that MGS had the highest genotype sensitivity.

Figure 3. Genotype sensitivity across multiple samples. Boxplots showing
distribution of genotype call sensitivities across multiple samples
(extended sample set) for each capture method. (N) Number of samples.

Bases with no genotype assignments

We also determined the characteristics of the ROI bases with no
genotype assignments. Such information may prove useful for
choosing a genomic enrichment method (or combination of strategies)
for a specific project. We analyzed the genotype assignments
for all three methods at each ROI base and determined the number
of positions in NA18507 covered by all, two, or one method. There
was considerable overlap among the three methods. Specifically,
the same 70.33% of the ROI bases were assigned a genotype by all
three methods, an additional 25.02% were assigned by one or two
methods, and 4.65% were not assigned by any method (Fig. 4).
Overall, SHS and MGS had the greatest overlap, sharing genotype
assignments for ~83% of the bases of the ROI. We would expect by
random chance that ~0.2% of our ROI would be unassigned in all
three methods, which is much less than the observed value
(4.65%). Similarly, ~61% of the ROI is expected by chance to have
genotype assignments in all three methods, which is also less
than the observed value (70.33%). This suggests that similarities
among the three methods contribute to the observed overlap. One
similarity is the avoidance of repetitive sequence during probe
design. Our ROI contains ~60 kb that overlaps with the Segmental
Duplication Track from the UCSC Genome Browser (Bailey et al.
2001). Of these bases, we were unable to assign genotypes for
~19 kb with SHS, ~22 kb with MIP, and ~31 kb with MGS. The
high percentage of bases without genotype assignments in these
regions confirms that all methods are avoiding repeats as part of
probe design, with MGS being slightly more stringent.
Unassigned genotypes specific to each method may have
resulted from limitations of probe design, low depth of coverage
(due to capture and/or sequencing issues), or a combination of
these problems. We analyzed the GC content of the bases that were
targeted by the capture probes, but where no genotypes were
assigned. The GC content of unassigned SHS bases was 67%, MIP
was 61%, and MGS was 58%, compared to 50.5% for the overall
ROI. Thus, for all methods, genotypes were not assigned for a
similar set of targeted bases, and these bases were slightly biased
toward a high GC content.

Genotype concordance

One of the most critical metrics we assessed was accuracy of the
assigned genotypes. Since there is no “gold standard” data set of
known genotypes at every genomic position for a sample that we
sequenced, we evaluated accuracy by measuring concordance of our
genotype assignments with two different data sets. The first
comparison data set was genotypes derived from analysis with the
Infinium 1M SNP BeadChip (1M chip). The second was genotypes
derived by aligning the 30-fold coverage whole-genome shotgun
(WG) sequence data generated from HapMap sample NA18507
(Bentley et al. 2008). We used ELAND and called genotypes using
MPG at the same thresholds as those used with our capture data. In
each case, we compared genotype assignments at ROI bases where
the assessed quality of the data from the capture method and the
comparison data set was high.
For the comparison with the 1M chip data, we compared
genotypes for the two samples: NA18507 and NA12878. Each of the
three capture methods achieved a genotype concordance of >99.84%
(Table 3). Similar concordances were seen at heterozygous sites,
which are notably more challenging to assign than homozygous
sites; specifically, the concordance rate at positions found to be
heterozygous with the 1M chip was >99.57% for all methods
(Table 3). Thus genotype assignments from targeted capture
sequence data are highly accurate, even at variant positions.
Since only about 2600 genotyped bases from the 1M chip data
were represented in our ROI, we also compared the genotypes
from the various capture methods to those derived from the WG
NA18507 data, which had genotype assignments for 67.3%
(1.76/2.61 Mb) of the ROI bases. These WG data provided a larger
basis for comparison, including bases not known to be polymorphic.
All three methods yielded genotypes with >99.998%
concordance with the WG genotypes. When considering only the
heterozygous sites in the WG data, the concordance was lower
than it was for all sites: 98.111% for MIP, 98.735% for MGS, and
99.609% for SHS.

Figure 4. ROI regions with genotype assignments. The Venn diagram of overlapping genotype
coverage is area proportional. Colored rectangles identify the proportion of genotype assignments in
the ROI for each method: (red) MIP; (green) SHS; (blue) MGS. Note that the greatest overlap is among
all three methods, and the second greatest is between SHS and MGS. The numbers sum to 95.35%
because 4.65% of the ROI was not assigned a genotype in any method.
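
The chance expectations quoted for the overlap in genotype assignments can be approximated by treating the three methods as independent. The sketch below assumes the per-method probabilities are the NA18507 genotype sensitivities from Table 2 (78.0%, 86.2%, and 91.3%); the text does not state which inputs were used, so this is an illustrative reconstruction rather than the authors' calculation.

```python
from math import prod

# Assumed inputs: NA18507 genotype sensitivities (fraction of ROI called), Table 2.
sensitivity = {"MIP": 0.780, "SHS": 0.862, "MGS": 0.913}

# Under independence, a base is called by all three methods with probability
# equal to the product of the sensitivities, and missed by all three with
# probability equal to the product of the complements.
p_all_called = prod(sensitivity.values())
p_none_called = prod(1.0 - s for s in sensitivity.values())

print(f"expected called by all three: {100 * p_all_called:.1f}%")
print(f"expected called by none:      {100 * p_none_called:.2f}%")
```

Under these assumptions the expected values are about 61% and roughly a quarter of a percent, in line with the ~61% and ~0.2% given in the text and well below the observed 70.33% and 4.65%.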
`
Table 3. Genotype concordance

                  Filtered                1M SNP BeadChip genotypes concordant  30-fold coverage whole-genome genotypes concordant
                                          with capture genotypes (%)            with capture genotypes (%)
                  Reads       Bases       All         Hets in     Hets in       All         Hets in whole   Hets in
Sample    Method  (M)         (Mb)        genotypes   1M chip     capture       genotypes   genome          capture
NA18507   MIP     29.3        1055        99.955      100.000     99.756        99.998      98.111          99.296
NA12878   MIP     29.7        1069        100.000     100.000     100.000       -           -               -
NA18507   SHS     14.2        511         99.839      99.566      99.566        99.999      99.609          98.840
NA12878   SHS     13.9        500         99.960      99.807      100.000       -           -               -
NA18507   MGS     9.8         412         99.920      99.780      99.781        99.999      98.735          98.928
NA12878   MGS     14.3        601         99.922      99.625      100.000       -           -               -

Concordance among capture methods and both 1M BeadChip and 30-fold coverage whole-genome standards. Concordance was calculated at all
positions; heterozygous positions (Hets) in BeadChip and whole genome, and heterozygous positions in capture methods.
`
`To assess the false-positive rate, we compared bases found to
`be heterozygous in the three capture methods with 1M chip and
`WG data (Table 3). Depending on the method, 99.57%–99.78% of
`the heterozygous genotypes derived from capture data agreed with
`the 1M chip genotypes for NA18507 (one to two discordant posi-
`tions in each method), while 100% concordant genotypes were
`seen for NA12878 over the 400–500 positions compared. Further
`comparison with the more comprehensive WG data for NA18507
`showed slightly lower concordance: 98.840% for SHS, 98.928% for
`MGS, and 99.296% for MIP over the approximately 1000 positions
`compared. There were six to 12 discordant positions in each
`method when compared to WG data. Thus, the false-positive rate
`was low (<1.2%) for all three methods. We conclude that all three
`capture methods, when coupled with adequate sequence depth of
`coverage, yield highly accurate genotype calls.
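
The false-positive bound can be checked from the concordance values in Table 3. The sketch below converts the concordance of capture-derived heterozygous calls with the WG data into a discordance rate and, assuming exactly 1000 compared positions (the text gives only "approximately 1000"), an implied number of discordant positions.

```python
# Concordance (%) of heterozygous capture calls with WG genotypes, NA18507 (Table 3).
het_concordance_wg = {"MIP": 99.296, "SHS": 98.840, "MGS": 98.928}

# Discordance is the complement of concordance.
discordance = {m: 100.0 - c for m, c in het_concordance_wg.items()}

# Assumed round number of compared positions; the exact count is not given.
positions_compared = 1000
implied_discordant = {
    m: round(d / 100.0 * positions_compared) for m, d in discordance.items()
}

for method in discordance:
    print(f"{method}: {discordance[method]:.3f}% discordant, "
          f"about {implied_discordant[method]} positions")
```

All three discordance rates fall below 1.2%, and the implied counts of roughly 7 to 12 positions are close to the six to 12 discordant positions reported.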
`
`Alignment artifacts affecting genotype assignments
`
`The data presented here were generated using standard ELAND
`alignments. One significant limitation of the versions of ELAND
`we used (v1.3.4 and v1.4.0) is that it is a gapless aligner; it does not
consider insertions or deletions. Thus, not only are deletion/insertion
variants (DIVs) not detected, but errors can also be introduced
at these sites. When a sequence read spans a DIV, every
`base after the start of the DIV can appear as a mismatch. When this
`occurs within the first 32 bases of a sequence read, it will fail to
`align due to the high number of mismatches. However, when this
`occurs beyond the first 32 bases, ELAND will align the read and
`note the position(s) of the mismatch(es). These mismatches then
`appear to the genotype-assigning software as closely clustered
`nonreference variants (in reality, false-positives), not as DIVs.
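
The artifact can be illustrated with a toy gapless comparison. The reference sequence, the 2-base deletion, and the helper function below are all hypothetical; the sketch does not reproduce ELAND's seeding or scoring and only shows how a deletion near the end of a read becomes a cluster of apparent mismatches when gaps are not allowed.

```python
def gapless_mismatches(read, reference, offset=0):
    """0-based read positions that differ from the reference when the read
    is laid against it at `offset` with no gaps allowed."""
    window = reference[offset:offset + len(read)]
    return [i for i, (r, g) in enumerate(zip(read, window)) if r != g]


# Hypothetical 44-base reference segment.
reference = "ACGTTGCAAGCTTAGGCTAACGTCGATCCGATTGACCGTAAGCT"

# Sample carrying a 2-base deletion of reference positions 33-34.
sample_late = reference[:33] + reference[35:]
read_late = sample_late[:36]   # 36-base read; deletion falls after base 32

# Same read length, but the deletion falls early in the read.
sample_early = reference[:10] + reference[12:]
read_early = sample_early[:36]

print(gapless_mismatches(read_late, reference))        # mismatches clustered at the read end
print(len(gapless_mismatches(read_early, reference)))  # many mismatches along the read
```

In the first case the read still matches the reference over its first 33 bases, so a gapless aligner can place it and report the trailing mismatches, which a genotype caller would then see as closely spaced nonreference bases rather than as a deletion.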
`To detect DIVs, we used a gapped aligner, cross_match (http://
`www.phrap.org), to align those reads not aligned by ELAND. We
`then