Published OnlineFirst May 22, 2012; DOI: 10.1158/1940-6207.CAPR-11-0432
`
`Review
`
`Key Principles and Clinical Applications of "Next-Generation"
`DNA Sequencing
`
`Cancer
`Prevention
`Research
`
`Jason M. Rizzo and Michael J. Buck
`
`Abstract
`
`Demand for fast, inexpensive, and accurate DNA sequencing data has led to the birth and dominance of
`a new generation of sequencing technologies. So-called "next-generation" sequencing technologies enable
`rapid generation of data by sequencing massive amounts of DNA in parallel using diverse methodologies
`which overcome the limitations of Sanger sequencing methods used to sequence the first human genome.
Despite opening new frontiers of genomics research, the fundamental shift away from Sanger
sequencing that next-generation technologies have created has also left many unaware of the capabilities
and applications of these new technologies, especially those in the clinical realm. Moreover, the brisk
`evolution of sequencing technologies has flooded the market with commercially available sequencing
`platforms, whose unique chemistries and diverse applications stand as another obstacle restricting the
`potential of next-generation sequencing. This review serves to provide a primer on next-generation
`sequencing technologies for clinical researchers and physician scientists. We provide an overview of the
`capabilities and clinical applications of DNA sequencing technologies to raise awareness among researchers
about the power of these novel genomic tools. In addition, we discuss key sequencing principles, provide
a comparison between existing and near-term technologies, and outline key advantages and disadvantages
of different sequencing platforms to help researchers choose an appropriate platform for their
research interests. Cancer Prev Res; 5(7); 887–900. ©2012 AACR.
`
`Introduction
Initial sequencing of the human genome took more than a decade and cost an estimated
$70 million (1). Sequencing for the Human Genome Project (HGP) relied primarily on
automation of sequencing methods first introduced by Sanger in 1977 (2). Despite the
successful use of this technology to generate early maps of the human genome (3–5), the
limitations of Sanger sequencing created a high demand for more robust sequencing
technologies capable of generating large amounts of data quickly and at lower costs.
Recognizing this need, the National Human Genome Research Institute (NHGRI) initiated a
funding program in 2004 aimed at catalyzing sequencing technology development, with the
goal of reducing the cost of genome sequencing to $100,000 in 5 years and, ultimately,
$1,000 in 10 years (6–8). The initiative has been widely successful to date, and a bevy of
new technologies has emerged in the sequencing marketplace over the past 5 years. New
technologies offer radically different approaches that enable the sequencing of large
amounts of DNA in parallel and at substantially lower costs than conventional methods. The
terms "next-generation sequencing" and "massive-parallel sequencing" have been used
loosely to collectively refer to these new high-throughput technologies.

Authors' Affiliation: Department of Biochemistry and Center of Excellence
in Bioinformatics and Life Sciences, State University of New York at Buffalo,
Buffalo, New York

Corresponding Author: Michael J. Buck, State University of New York at
Buffalo, 701 Elicott St., Buffalo, NY 14203. Phone: 716-881-7569; Fax: 716-
849-6655; E-mail: mjbuck@buffalo.edu

©2012 American Association for Cancer Research.
`
`First-Generation Sequencing
Automated Sanger sequencing is now considered the "first-generation" of DNA sequencing
technologies. Technically, standard Sanger sequencing identifies linear sequences of
nucleotides by electrophoretic separation of randomly terminated extension products (2).
Automated methods use fluorescently labeled terminators, capillary electrophoresis
separation, and automated laser signal detection for improved nucleotide sequence detection
[ref. 9; for reviews, see the studies of Hutchinson (ref. 10) and Metzker (ref. 11)]. As a
key strength, Sanger sequencing remains the most available technology today, and its
well-defined chemistry makes it the most accurate sequencing method currently available.
Sanger sequencing reactions can read DNA fragments of 500 bp to 1 kb in length, and this
method is still used routinely for sequencing small numbers of DNA fragments and is the
gold standard for clinical cytogenetics (12).
Despite strong availability and accuracy, however, Sanger sequencing has restricted
applications because of technical limitations of its workflow. The main limitation of Sanger
sequencing is one of throughput, that is, the amount of DNA sequence that can be read with
each sequencing reaction. Throughput is a function of sequencing reaction time, the number
of sequencing reactions that can be run in parallel, and the lengths of sequences read by
each reaction. The requirement for electrophoretic separation of DNA fragments to read DNA
sequence content in Sanger-based sequencing is the primary bottleneck for throughput with
this method, increasing time and limiting the number of reactions that can be run in
parallel (13). Despite efficient automation, each Sanger instrument can only read 96
reactions in parallel, and this restricts the technology's throughput to approximately
115 kb/day (1,000 bp; ref. 14). Current estimates suggest a cost of approximately $5 to
$30 million USD to sequence an entire human genome using Sanger-based methods, and on one
machine, it would take around 60 years to accomplish this task (8, 13). Together, these
cost and time constraints limit access to and application of genome sequencing efforts on
this platform.

[Figure 1 schematic. Panels: sample collection (source material, e.g., normal tissue or
tumor sample; WGS or targeted genomic/epigenomic sequencing by RIP-seq, MNase-seq, 5C-seq,
DNase-seq, FAIRE-seq, bisulfite-seq, ChIP-seq, RNA-seq, and "exome" sequencing); template
generation (fragmentation and size selection, adapter ligation, amplification, template
immobilization and spatial separation); sequencing reactions and detection (single-end and
paired-end reads); data analysis (de novo assembly of NGS reads or alignment of NGS reads
to a reference genome).]
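The time constraint described above can be checked with a rough back-of-the-envelope calculation. The sketch below is illustrative only: the function name and the round 3 billion bp genome size are assumptions introduced here, while the 115 kb/day figure is the per-instrument throughput quoted in the text. A single pass at that rate works out to several decades, the same order of magnitude as the roughly 60 years cited above.

```python
def years_to_sequence(genome_bp, bp_per_day, coverage=1.0):
    """Years one instrument needs to read `genome_bp` bases `coverage` times over."""
    days = genome_bp * coverage / bp_per_day
    return days / 365.25


# ~115 kb/day is the per-instrument Sanger throughput quoted in the text.
SANGER_BP_PER_DAY = 115_000
# Assumption for illustration: a haploid human genome of roughly 3 billion bp.
HUMAN_GENOME_BP = 3_000_000_000

sanger_years = years_to_sequence(HUMAN_GENOME_BP, SANGER_BP_PER_DAY)
print(f"Single-pass Sanger estimate: about {sanger_years:.0f} years")
```

Because the estimate scales linearly with coverage, any requirement to read each base more than once multiplies the time by the same factor.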
`
`Next-Generation Sequencing
Overview
"Next-generation" and "massive-parallel" DNA sequencing are blanket terms used to refer
collectively to the high-throughput DNA sequencing technologies now available that are
capable of sequencing large numbers of different DNA sequences in a single reaction (i.e.,
in parallel). All next-generation sequencing (NGS) technologies monitor the sequential
addition of nucleotides to immobilized and spatially arrayed DNA templates but differ
substantially in how these templates are generated and how they are interrogated to reveal
their sequences (15). A basic workflow for NGS technologies is presented in Fig. 1.
`
`Template generation
`In general, the starting material for all NGS experiments is
`double-stranded DNA; however, the source of this material
`may vary (e.g., genomic DNA, reverse-transcribed RNA or
`cDNA, immunoprecipitated DNA). All starting material
must be converted into a library of sequencing reaction templates (sequencing library),
which requires common steps of fragmentation, size selection, and adapter ligation (15).
Fragmentation and size selection steps serve to break the DNA templates into smaller
sequenceable fragments, the size of which depends on each sequencing platform's
specifications. Adapter ligation adds platform-specific synthetic DNAs to the ends of
library fragments, which serve as
`primers for downstream amplification and/or sequencing
`reactions. Ideally, the above steps create an unbiased
`sequencing library that accurately represents the sample’s
`DNA population. Depending on the NGS technology used,
`a library is either sequenced directly (single-molecule tem-
`plates) or is amplified then sequenced (clonally amplified
`templates). Template generation also serves to spatially
`separate and immobilize DNA fragment populations for
`sequencing, typically by attachment to solid surfaces or
`beads. This allows the downstream sequencing reaction to
`operate as millions of microreactions carried out in parallel
`on each spatially distinct template (16).
`
`Clonally amplified versus single-molecule templates
`Most sequencing platforms cannot monitor single-mol-
`ecule reactions and template amplification is therefore
`required to produce sufficient signal for detection of nucle-
`otide addition by the instrument’s system (1, 14). Ampli-
`fication strategies vary between platforms and commonly
`include emulsion PCR or bridging amplification strategies
`(Table 1). Importantly, all amplification steps can introduce
`sequencing errors into experiments as reagents (DNA
`polymerases) are not 100% accurate and can introduce
`mutations into the clonally amplified template popula-
`tions, which subsequently masquerade as sequence variants
`in downstream analysis (1). Amplification steps also
`reduce the dynamic range of sequence detection and poten-
`tially remove low-abundance variants from sequenced
`populations.
`Single-molecule template sequencing bypasses the need
`for amplification steps and requires far less starting material
`for sequence detection (17). The ability to sequence
`extremely small quantities of DNA without manipulation
`gives single-molecule sequencing approaches a greater
`(potentially unlimited) dynamic range of sequence detec-
`tion, including the possibility of sequencing DNA from only
a single cell (18, 19). Currently, single-molecule sequencers are just beginning to enter
the market; however, despite promises to improve signal quality and expand the types of
data produced, the availability of this platform is limited, and downstream analysis
pipelines are very immature compared with those of clonally amplified signals. In addition,
advantages over amplification-based platforms have yet to be realized and remain equally
uncertain. Amplification-based and single-molecule sequencing technologies have been
referred to as "second-generation" and "third-generation" sequencing technologies,
respectively, in the literature (ref. 16; Table 1).

Figure 1. Basic workflow for NGS experiments. NGS experiments consist of 4 phases: sample
collection (purple), template generation (blue), sequencing reactions and detection (green),
and data analysis (orange). Experiments can have broad applications, depending on the source
and nature of input DNA used for sequencing. Source materials include normal and diseased
tissues, from which NGS experiments can sequence the whole genome (WGS) or targeted
genomic/epigenetic elements. Table 2 lists experimental approaches, descriptions, and key
references for the sample collection strategies illustrated. Illustration of sample
collection is modified from the study of Myers and colleagues (82). Illustration of
sequencing reactions and detection is a generalized schematic and differs substantially
based on the platform used (Table 1). NGS experiments can be grouped broadly into 2 general
categories: de novo assembly and resequencing. Assembled genomes are built from scratch,
without the use of an existing scaffold, whereas resequencing experiments align sequence
reads back to a reference genome (orange). Since the HGP, all human genome sequencing
efforts have been resequencing, as reassembling a human genome is not cost-effective, is
extremely difficult, and is of limited (immediate) value (95, 96). Conversely, smaller
genomes such as those of novel bacteria are routinely assembled de novo (97). dsDNA,
double-stranded DNA; TFBS, transcription factor–binding sites.
`
`Sequencing reactions
`Each sequencing platform uses a series of repeating
`chemical reactions that are carried out and detected auto-
`matically. Reactions typically use a flow cell that houses the
`immobilized templates and enables standardized addition
`and detection of nucleotides, washing/removal of reagents,
`and repetition of this cyclical process on a nucleotide-by-
`nucleotide basis to sequence all DNA templates (i.e.,
`sequencing library) in parallel. While all sequencing plat-
`forms differ in their proprietary chemistries (Table 1), the
`use of DNA polymerase or DNA ligase enzyme is a common
`feature, and these methods have been referred to collectively
`as "sequencing by synthesis" (SBS) strategies in the literature
`(20).
Overall, the stepwise, nucleotide-by-nucleotide reading of sequence content used by NGS
overcomes the limitations of discrete separation and detection requirements of
first-generation methods and has radically improved the throughput of sequencing reactions
by several orders of magnitude. Such improvements have allowed the per-base cost of
sequencing to decrease by more than 100,000-fold in the past 5 years with further
reductions expected (21). Cost reductions in sequencing technologies have enabled
widespread use and diverse applications of these methods (see Clinical Applications of NGS
and Preclinical Applications of NGS).
`
`Paired-end and mate-paired sequencing
`Typically, NGS methods sequence only a single end of the
`DNA templates in their libraries, with all DNA fragments
`afforded an equal probability of occurring in forward or
`reverse direction reads. Depending on the instrument and
`library construction protocol used, however, forward and
`reverse reads can be paired to map both ends of linear DNA
`fragments during sequencing ("paired-end sequencing") or
`both ends of previously circularized DNA fragments
("mate-pair sequencing"). The choice of paired-end or mate-pair sequencing depends on the
clinical application and is discussed later. It is important to note that both paired-end
and mate-pair approaches still only sequence the ends of DNA fragments included in
sequencing libraries and therefore do not provide sequence information for the internal
portion of longer templates (Fig. 1).
`
`Limitations
The increased throughput of NGS reactions comes at the cost of read length, as the most
readily available sequencing platforms (Illumina, Roche, SOLiD) offer shorter average read
lengths (30–400 bp) than conventional Sanger-based methods (500 bp–1 kb; ref. 13). Several
third-generation technologies hold the promise of longer read lengths; however, these are
not widely available and, as mentioned, are still very immature technology platforms
(ref. 22; Table 1).
`
Shorter read lengths place restrictions on the types of experiments NGS methods can
conduct. For instance, it is difficult to assemble a genome de novo using such short
fragment lengths (23); therefore, most applications of these technologies focus on
comparing the density and sequence content of shorter reads to that of an existing
reference genome (known as genome "resequencing"; Fig. 1). In addition, shorter reads may
not align or "map" back to a reference genome uniquely, often leaving repetitive regions of
the genome unmappable in these types of experiments. Sequence alignment is also challenging
for regions
`with higher levels of diversity between the reference genome
`and the sequenced genome, such as structural variants
`(e.g., insertions, deletions, translocations; ref. 24). These
`issues are combated through the use of longer read
`lengths or paired-end/mate-pair approaches (Fig. 2A and
`B). Given the relative immaturity of third-generation NGS
`platforms, nearly all human genome resequencing con-
`ducted today relies on the paired-end or mate-paired
`approaches of second-generation platforms. Paired-end
`sequencing is much easier than mate-paired sequencing
`and requires less DNA, making it the standard means by
`which human genomes are resequenced (14). Although
`more expensive and technically challenging, mate-paired
`libraries can sample DNA sequence over a larger distance
`(1.5–20 kb) than paired-end approaches (300–500 bp)
`and are therefore better suited for mapping very large
`structural changes (14).
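The way paired reads expose structural changes can be illustrated with a small sketch. It assumes each read pair has already been aligned and is summarized by the reference coordinates of its outer ends; the function name, the 400 bp expected fragment size, and the 100 bp tolerance are hypothetical values chosen for illustration, not parameters of any particular platform or aligner. A pair that spans much more of the reference than the library fragment size allows suggests that the sequenced genome is missing the intervening sequence (a deletion).

```python
def flag_discordant_pairs(pairs, expected_insert, tolerance):
    """Return read pairs whose mapped span deviates from the library fragment size.

    pairs: list of (left_start, right_end) reference coordinates for the two mates.
    """
    discordant = []
    for left_start, right_end in pairs:
        span = right_end - left_start
        if abs(span - expected_insert) > tolerance:
            discordant.append((left_start, right_end, span))
    return discordant


# Hypothetical alignments from a library with ~400 bp fragments.
pairs = [(1_000, 1_410), (2_500, 2_895), (5_000, 7_400), (9_000, 9_405)]
hits = flag_discordant_pairs(pairs, expected_insert=400, tolerance=100)

for left_start, right_end, span in hits:
    # A span far above the fragment size is consistent with a deletion
    # of roughly (span - expected_insert) bp in the sequenced genome.
    print(f"pair {left_start}-{right_end}: span {span} bp, candidate deletion")
```

The same comparison applies to mate-pair libraries, where the expected distance between mates is simply larger.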
`
`Data analysis
Ironically, one of the key limitations of NGS also serves as
its greatest strength: the high volume of data generation.
`NGS reactions generate huge sequence data sets in the range
`of megabases (millions) to gigabases (billions), the inter-
`pretation of which is no trivial task (16). Moreover, the scale
`and nature of data produced by all NGS platforms place
`substantial demands on information technology at all
`stages of sequencing, including data tracking, storage, and
`quality control (25). Together, these extensive data gather-
`ing capabilities now double as constraints, shifting the
`bottlenecks in genomics research from data acquisition to
`those of data analysis and interpretation (26). NGS
`machines are generating data at such a rapid pace that
`supply cannot keep up with demand for new analytic
`approaches capable of mining NGS data sets (see Future
`Directions and Challenges). Data analysis is a critical feature
`of any NGS project and will depend on the goal and type of
`project. The initial analysis or base calling is typically con-
`ducted by proprietary software on the sequencing platform.
After base calling, the sequencing data are aligned to a reference genome if available or
a de novo assembly is conducted. Sequence alignment and assembly are active areas of
computational research with new methods being developed (see review by Flicek and Birney;
ref. 27). Once the sequence is aligned to a reference genome, the data need to be analyzed
in an experiment-specific fashion [for reviews, see the studies of Martin and Wang for
RNA-seq (ref. 28), Park for ChIP-seq (ref. 29), Bamshad and colleagues for exome sequencing
(ref. 30), Medvedev and colleagues for whole-genome resequencing (ref. 31), and Wooley and
colleagues for metagenomics (ref. 32)].

Table 1. Comparison of NGS platforms

| Sequencing platform | Library/template preparation | Sequencing reaction chemistry | Maximum read length, bp | Run time, d | Maximum throughput per run (total bp sequenced) | Strengths | Limitations |
| Illumina HiSeq 2000 | Bridging amplification | Reverse terminator | 100 | 2 (a), 11 (b) | 95–600 Gb (d) | Most widely used platform; large throughput | All samples on flow cell sequenced at same read length |
| Illumina MiSeq | Bridging amplification | Reverse terminator | 250 | 0.17 (a), 1.1 (b) | 440 Mb–7 Gb | Short run times | Low number of total reads (15 million) |
| Roche Genome Sequencer FLX | Emulsion PCR | Pyrosequencing | 400 | 0.4 | 0.5–0.6 Gb | Longer reads; fast run times | High reagent cost; lowest number of total reads (1 million) |
| SOLiD/ABI 5500 | Emulsion PCR | Ligation sequencing | 75 | 2 (a), 7 (b) | 90–300 Gb | Independent flow cell lanes; high capacity for multiplexing | Short read lengths |
| Ion Personal Genome Machine | Emulsion PCR | Ion sequencing | 200 | 0.1 | 1 Gb | Short run times; low-cost scalable machine | Low number of total reads (11 million) |
| Complete Genomics | PCR on DNA nanoballs | Ligation sequencing | 70 | 12 | 20–60 Gb | Complete service for human sequencing | High cost per sample; only available for human resequencing |
| Helicos | Single molecule | Reverse terminator | 55 | 8 | 21–35 Gb | No amplification bias | Machine not widely used; sequencing service available through company |
| PacBio RS | Single molecule | Real-time | 1,000 | <0.1 (c) | N/A | Potential for longest read lengths and shorter run times | Highest error rates |

NOTE: Amplification-based and single-molecule sequencing technologies have been referred to
as second- and third-generation sequencing technologies, respectively, in the literature
(16). The term third-generation sequencing has also been used to refer to near-term nanopore
sequencing technologies. Nanopore sequencing is not covered in this review, and readers are
directed to an article by Branton and colleagues (90).
Abbreviation: N/A, not applicable.
(a) Single-end sequencing.
(b) Paired-end sequencing.
(c) Company estimate.
(d) Two flow cells.

[Figure 2 schematic. Panels: A and B, identification of structural variants (A, mapping a
translocation between Ch17 (q21.33), COL1A1, and Ch22 (q13.1), PDGFB; B, mapping deletions),
each comparing single-end reads with paired-end or mate-pair reads; C, sequence coverage and
error rates (uniform vs. uneven NGS read distribution; equal coverage of ~4× vs. zero
reads); D, SNP detection (low coverage and uneven read distribution vs. increased coverage
correcting errors and uneven distribution).]

Figure 2. Key NGS principles. A and B, identification of structural variants: Longer
(paired-end or mate-pair) sequencing reads are more adept at mapping large structural
variations (e.g., translocations and deletions) because they provide added information
concerning which sequences co-occur on the same template. A, illustration of a
well-documented translocation between chromosomes 17 and 22 which places the
platelet-derived growth factor-β (PDGFB) gene under control of the highly active collagen
type 1A1 promoter (COL1A1). This translocation is implicated in the pathogenesis of the rare
cutaneous malignancy dermatofibrosarcoma protuberans (DFSP; ref. 98). Notice how alignment
of short reads does not distinguish the mutated sequence (translocation) from the normal
reference genome. B, illustration of a hypothetical chromosomal deletion mutant. Again,
notice how alignment of short reads does not distinguish the mutated sequence (deletion)
from the normal reference genome. C, sequence coverage and error rates. Illustration of
uniform and uneven NGS read distributions for resequencing experiments. An uneven read
distribution can leave regions of the genome uncovered (black circle). D, SNP detection.
Left, illustration of how low coverage, uneven read distributions, and high error rates can
interact to confound genotype detection, including SNP calling (as illustrated). Right,
illustration of how uneven read distributions and errors can be overcome by higher coverage
rates.
`
`Choosing a Sequencing Technology
`Genomics experiments are largely descriptive and afford
`researchers the opportunity to explore a biologic question
`in a comprehensive manner. Experimental design is para-
`mount for the success of all genomics experiments, and
`choice of sequencing strategy should be informed by the
`goal(s) of the project.
`
`Experimental design and biases
`It is important to recognize that bias can be introduced at
`all steps in an experimental NGS protocol. This principle is
`best illustrated by the example of template amplification
`steps, which can introduce mutations into clonally ampli-
`fied DNA templates that subsequently masquerade as
`sequence variants. Amplification steps also reduce the
`dynamic range of sequence detection and potentially
`remove low-abundance variants from sequenced popula-
`tions. Much like amplification steps, any sample manipu-
`lation can cause quantitative and qualitative artifacts in
`downstream analysis (22, 33). Therefore, it is important to
`design your experiment in a way that maximizes the col-
`lection of sequence information you covet and minimizes
`the biases against it. For example, if your sample is being
`extensively manipulated before sequencing, a reference
`DNA sample with known sequence content and similar
`size/quantity should also be carried through your sequenc-
`ing protocol and analyzed in parallel as a control.
`Another important consideration is the quantity and
`quality of the DNA you choose to sequence. Most NGS
`platforms have proprietary library preparations that are
`optimized for a specific DNA quantity and quality. These
input metrics are typically easy to achieve using unmodified fresh or fresh-frozen
samples; however, this may be more challenging with clinical specimens, especially archived
formalin-fixed, paraffin-embedded (FFPE) samples. Thankfully, the use of clinical specimens
with limited DNA
`quality/quantity for NGS is an area of active research, and
`work has shown that NGS platforms can handle this input
`material (34, 35). Despite this evidence, it is highly likely
`that any deviation in sample quality/quantity from an NGS
`platform’s optimized protocol will still require extensive
`troubleshooting by the user. An appropriate control under
`such circumstances would again be the sequencing of
`reference DNA treated with the same conditions (e.g., FFPE)
`before sequencing.
`
`Sequencing coverage and error rates
Both the quality and quantity of sequence data derived from NGS experiments will
ultimately determine how comprehensive and accurate downstream analyses can be (36).
Qualitatively, individual base calling error rates vary between NGS platforms (Table 1).
All NGS platforms provide confidence scores for each individual base call, enabling
researchers to use different quality filters when mining their sequence data. More
generally, the chemistries of most NGS reactions are such that the initial portion of each
sequencing read is typically more accurate than the latter portion (due to signal decay).
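A quality filter of the kind described above can be sketched in a few lines. The example assumes that per-base confidence scores are Phred-scaled (Q = -10 log10 of the error probability), which is the common convention in FASTQ files but is not stated in the text; the function names and the Q20 cutoff are likewise illustrative choices. Because accuracy tends to decay toward the end of a read, the sketch trims low-quality bases from the tail only.

```python
from typing import List, Tuple


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred-scaled quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def trim_low_quality_tail(bases: str, quals: List[int], min_q: int = 20) -> Tuple[str, List[int]]:
    """Drop trailing bases until the last remaining base has quality >= min_q."""
    end = len(bases)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return bases[:end], quals[:end]


# Hypothetical read whose quality decays toward its end.
read = "ACGTTGCAAT"
quals = [38, 37, 36, 35, 30, 28, 22, 15, 12, 8]

trimmed, kept = trim_low_quality_tail(read, quals)
print(trimmed, kept)
print(phred_to_error_prob(20))  # error probability at the Q20 cutoff
```

Stricter cutoffs discard more data but leave fewer erroneous base calls to masquerade as variants downstream.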
`Quantitatively, the amount of sequence data can be
`assessed by the metric of sequencing "coverage." Generally
`speaking, sequence coverage (also called "depth") refers to
`the average number of times a base pair is sequenced in a
`given experiment. More specifically, this coverage metric is
`best viewed in the context of the physical locations (distri-
`bution) of these reads, as NGS reactions may not represent
`all genomic locations uniformly (due to handling, platform
`biases, run-to-run variation; ref. 36). For example, local
`sequence content has been shown to exert a bias on the
`coverage of short read NGS platforms, whereby higher read
`densities are more likely to be found at genomic regions
`with elevated GC content (37). Such coverage biases can
`interfere with quantitative applications of NGS, including
`gene expression profiling (RNA-seq) or copy number var-
`iation analysis. Several methods have been developed to
`account for the nonuniformity of coverage and adjust
`signals for GC bias to improve the accuracy of quantitative
`analysis (37–39). Qualitatively, uneven sequence coverage
`can also interfere with the analysis of sequence variants. For
`example, a deeply sequenced sample with nonuniform read
`distribution can still leave a substantial portion of the
`genome unsequenced or undersequenced, and analysis of
`these regions will not be able to identify sequence variations
`such as single-nucleotide polymorphisms (SNP), point
`mutations, or structural variants, because these locations
`will either be unsequenced or confounded by sequencing
`errors (Fig. 2C and D).
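The distinction between average coverage and the distribution of reads can be made concrete with a toy example. The sketch below computes expected coverage as (number of reads × read length) / genome length, then compares two hypothetical read sets on a 100 bp "genome"; all names and numbers are assumptions chosen for illustration. Both read sets have the same mean depth, yet the uneven set leaves almost half of the positions unsequenced.

```python
def expected_coverage(n_reads, read_len, genome_len):
    """Average depth implied by the total number of sequenced bases."""
    return n_reads * read_len / genome_len


def per_base_depth(genome_len, reads):
    """Count how many reads cover each position; reads are (start, length), 0-based."""
    depth = [0] * genome_len
    for start, length in reads:
        for pos in range(start, min(start + length, genome_len)):
            depth[pos] += 1
    return depth


def summarize(depth):
    """Return (mean depth, fraction of positions with no reads)."""
    mean_cov = sum(depth) / len(depth)
    uncovered = sum(1 for d in depth if d == 0) / len(depth)
    return mean_cov, uncovered


GENOME_LEN, READ_LEN = 100, 20
uniform_reads = [(s, READ_LEN) for s in range(0, 81, 5)]  # 17 evenly spaced reads
uneven_reads = [(s, READ_LEN) for s in range(0, 34, 2)]   # 17 reads piled at one end

uniform_mean, uniform_gap = summarize(per_base_depth(GENOME_LEN, uniform_reads))
uneven_mean, uneven_gap = summarize(per_base_depth(GENOME_LEN, uneven_reads))
print(uniform_mean, uniform_gap)
print(uneven_mean, uneven_gap)
```

Reporting the uncovered fraction alongside mean depth is one simple way to avoid overstating how much of a genome an experiment actually interrogated.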
`Ultimately, coverage depth, distribution, and sequence
`quality determine what information can be retrieved from
`each sequencing experiment. In theory, an experiment with
`100% accuracy and uniform coverage distribution would
`provide all sequence content information (including iden-
`tification of SNPs and complex structural variants) with just
`1 coverage depth. In reality, however, accuracy is never
`100% and coverage is not uniform; therefore, deeper
`sequence coverage is needed to enable correction of
`sequencing errors and to compensate for uneven coverage
`(Fig. 2D). For discovery of structural variants (e.g., inser-
`tions, deletions, translocations), accurate identification of a
`complete human genome sequence with current (second-
`generation) platforms, requires approximately 20 to 30
`sequence coverage to overcome the uneven read distribu-
`tions and sequencing errors (22, 24). Higher coverage levels
`are required to make accurate SNP calls from an individual
`genome sequence, as these experiments are powered dif-
`ferently (36). Standards are evolving and current recom-
`mendations range from 30 to 100 coverage, depending
`on both platform error rate and the analytic sensitivity and
`specificity desired (12, 24, 36, 40). These higher coverage
`requirements are why the cost of whole-genome sequencing
`(WGS) still remains above $1,000 for many sequencing
`applications
`(16). Newer
`single-molecule sequencers
`
`www.aacrjournals.org
`
`Cancer Prev Res; 5(7) July 2012
`
`893
`
`Downloaded from
`
`on November 19, 2018. © 2012 American Association forcancerpreventionresearch.aacrjournals.org
`
`Cancer Research.
`
`Page 893
`
`

`

`Published OnlineFirst May 22, 2012; DOI: 10.1158/1940-6207.CAPR-11-0432
`
`Rizzo and Buck
`
`promise more evenly distributed and longer reads, poten-
`tially providing a complete genome sequence at lower costs;
`however, this ability often comes at the cost of higher error
`rates (1, 22). To save costs, population-scale sequencing
`projects, such as the 1000 Genomes Project, have used
`low-coverage pooled data sets and are able to detect SNP variants
`with frequencies as low as 1% in DNA populations with 4×
`coverage and error rates common to second-generation
`platforms (41). With this approach, an investigator probing
`for high-frequency variants (>1%) in a select population
`could also succeed with lower coverage sequencing experi-
`ments to reduce costs.
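
The relationship between read count, mean coverage depth, and the fraction of bases that are actually covered can be sketched numerically. The example below is an illustration only and is not taken from the review: it assumes reads land uniformly at random on the genome, so that per-base depth follows a Poisson distribution (the Lander-Waterman idealization), which real platforms only approximate. The function names, the 3 Gb genome size, and the 100 bp read length are assumptions chosen for the example.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth = total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def fraction_covered_at_least(mean_depth: float, min_depth: int) -> float:
    """Probability that a base is covered by at least `min_depth` reads,
    assuming per-base depth is Poisson-distributed with the given mean."""
    # P(X >= k) = 1 - sum_{i=0}^{k-1} e^{-c} * c^i / i!
    below = sum(
        math.exp(-mean_depth) * mean_depth ** i / math.factorial(i)
        for i in range(min_depth)
    )
    return 1.0 - below


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target mean depth."""
    return math.ceil(target_depth * genome_size / read_length)


if __name__ == "__main__":
    genome_size = 3_000_000_000  # assumed human genome size (approximate)
    read_length = 100            # assumed short-read length in bp

    for depth in (1, 4, 30):
        n = reads_needed(depth, read_length, genome_size)
        covered = fraction_covered_at_least(depth, 1)
        deep = fraction_covered_at_least(depth, 10)
        print(f"{depth:>3}x: {n:,} reads, "
              f"{covered:.3f} of bases seen at least once, "
              f"{deep:.3f} seen at least 10 times")
```

Under this idealized model, a mean depth of 1× leaves roughly a third of bases unsequenced (1 - e^-1 is about 0.63 covered). This is one way to see why 1× suffices only if coverage is perfectly uniform and error-free, and why deeper sequencing is needed in practice.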
`
`Clinical Applications of NGS
`Investment in the development of NGS technologies by
`the NHGRI was made with the goal of expediting the use of
`genome sequencing data in the clinical practice of medicine.
`As the cost of DNA sequencing continues to drop, the actual
`translation of base pair reads to bedside clinical applica-
`tions has finally begun. Several small-scale studies have laid
`the foundation for personalized genome-based medicine,
`showing the value of both whole-genome and targeted
`sequencing approaches in the diagnosis and treatment of
`diseases. These findings are the first of many to follow, and
`this progression, coupled to the expanding definition of
`genetic influences on clinical phenotypes identified by
`preclinical studies, will place NGS machines among the
`most valuable clinical tools available to modern medicine.
`
`Exome and targeted sequencing
`At present, only a small percentage (<10%) of the human genome’s
`sequence is characterized, and limited clinically
`valuable information can be gained immediately from
`having a patient’s complete genome sequence.
`Therefore, it is often more cost-effective for clinical research-
`ers to sequence only the exome (the 2% of the genome
`represented by protein-coding regions or exons), the "Men-
`delianome" (coding regions of 2,993 known disease genes),
`or targeted disease gene panels to screen for relevant mu
