`
`Review
`
Key Principles and Clinical Applications of "Next-Generation" DNA Sequencing
`
Cancer Prevention Research
`
`Jason M. Rizzo and Michael J. Buck
`
`Abstract
`
`Demand for fast, inexpensive, and accurate DNA sequencing data has led to the birth and dominance of
`a new generation of sequencing technologies. So-called "next-generation" sequencing technologies enable
`rapid generation of data by sequencing massive amounts of DNA in parallel using diverse methodologies
`which overcome the limitations of Sanger sequencing methods used to sequence the first human genome.
Despite opening new frontiers of genomics research, the fundamental shift away from Sanger
sequencing that next-generation technologies have created has also left many unaware of the capabilities
`and applications of these new technologies, especially those in the clinical realm. Moreover, the brisk
`evolution of sequencing technologies has flooded the market with commercially available sequencing
`platforms, whose unique chemistries and diverse applications stand as another obstacle restricting the
`potential of next-generation sequencing. This review serves to provide a primer on next-generation
`sequencing technologies for clinical researchers and physician scientists. We provide an overview of the
`capabilities and clinical applications of DNA sequencing technologies to raise awareness among researchers
about the power of these novel genomic tools. In addition, we discuss key sequencing principles, provide
a comparison between existing and near-term technologies, and outline key advantages and disadvantages
of different sequencing platforms to help researchers choose an appropriate platform for their
research interests. Cancer Prev Res; 5(7); 887–900. ©2012 AACR.
`
`Introduction
`Initial sequencing of the human genome took more than
a decade and cost an estimated $70 million (1).
`Sequencing for the Human Genome Project (HGP) relied
`primarily on automation of sequencing methods first intro-
duced by Sanger in 1977 (2). Despite the successful use of this
`technology to generate early maps of the human genome
`(3–5), the limitations of Sanger sequencing created a high
`demand for more robust sequencing technologies capable
`of generating large amounts of data, quickly, and at lower
`costs.
`Recognizing this need, the National Human Genome
`Research Institute (NHGRI) initiated a funding program
`in 2004 aimed at catalyzing sequencing technology devel-
`opment and with a goal of reducing the cost of genome
sequencing to $100,000 in 5 years and, ultimately, $1,000
`in 10 years (6–8). The initiative has been widely successful
`to date, and a bevy of new technologies has emerged in the
sequencing marketplace over the past 5 years.
`
`Authors' Affiliation: Department of Biochemistry and Center of Excellence
`in Bioinformatics and Life Sciences, State University of New York at Buffalo,
`Buffalo, New York
`
Corresponding Author: Michael J. Buck, State University of New York at
Buffalo, 701 Ellicott St., Buffalo, NY 14203. Phone: 716-881-7569; Fax: 716-849-6655; E-mail: mjbuck@buffalo.edu

doi: 10.1158/1940-6207.CAPR-11-0432
©2012 American Association for Cancer Research.
`
New technologies offer radically different approaches that enable the
`sequencing of large amounts of DNA in parallel and at
`substantially lower costs than conventional methods. The
`terms "next-generation sequencing" and "massive-parallel
`sequencing" have been used loosely to collectively refer to
`these new high-throughput technologies.
`
`First-Generation Sequencing
`Automated Sanger sequencing is now considered the
`"first-generation" of DNA sequencing technologies. Techni-
cally, standard Sanger sequencing identifies linear sequences of nucleotides by electrophoretic separation of
`randomly terminated extension products (2). Automated
`methods use fluorescently labeled terminators, capillary
`electrophoresis separation, and automated laser signal
`detection for improved nucleotide sequence detection
`[ref. 9; for reviews, see the studies of Hutchinson (ref. 10)
`and Metzker (ref. 11)]. As a key strength, Sanger sequencing
remains the most widely available technology today, and its well-defined chemistry makes it the most accurate method of sequencing currently available. Sanger sequencing reactions can
`read DNA fragments of 500 bp to 1 kb in length, and this
`method is still used routinely for sequencing small amounts
`of DNA fragments and is the gold-standard for clinical
`cytogenetics (12).
`Despite strong availability and accuracy, however, Sanger
`sequencing has restricted applications because of technical
limitations of its workflow. The main limitation of Sanger
`sequencing is one of throughput, that is, the amount of
`DNA sequence that can be read with each sequencing
`reaction. Throughput is a function of sequencing reaction
`time, the number of sequencing reactions that can be run in
`parallel, and lengths of sequences read by each reaction. The
`requirement for electrophoretic separation of DNA frag-
`ments for reading DNA sequence content in Sanger-based
`sequencing is the primary bottleneck for throughput with
`this method, increasing time and limiting the number of
`reactions that can be run in parallel (13). Despite efficient
`automation, each Sanger instrument can only read 96
`reactions in parallel, and this restricts the technology’s
`throughput to approximately 115 kb/day (1,000 bp; ref. 14).
`Current estimates suggest a cost of approximately $5 to 30
`million USD to sequence an entire human genome using
`Sanger-based methods, and on one machine, it would take
`around 60 years to accomplish this task (8, 13). Together,
`these cost and time constraints limit access to and applica-
`tion of genome sequencing efforts on this platform.
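To make the throughput gap concrete, the back-of-the-envelope arithmetic below restates the figures quoted above in a minimal Python sketch; the ~3.1-Gb genome size is an assumption, and the result lands in the same range as the roughly 60-year estimate cited in the text (the exact number depends on the assumed genome size and instrument duty cycle).

```python
# Order-of-magnitude arithmetic for the Sanger throughput bottleneck.
# Throughput and read-length figures are those quoted above; genome size is assumed.

READS_PER_RUN = 96              # reactions read in parallel per instrument
READ_LENGTH_BP = 1_000          # ~1 kb per Sanger read (upper end quoted above)
DAILY_THROUGHPUT_BP = 115_000   # ~115 kb/day per instrument
GENOME_SIZE_BP = 3.1e9          # assumed haploid human genome size

runs_per_day = DAILY_THROUGHPUT_BP / (READS_PER_RUN * READ_LENGTH_BP)
years_single_pass = GENOME_SIZE_BP / DAILY_THROUGHPUT_BP / 365

print(f"~{runs_per_day:.1f} full 96-capillary runs per day")
print(f"~{years_single_pass:.0f} years for one pass over a human genome on a single machine")
```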
`
`Next-Generation Sequencing
`Overview
`"Next-generation" and "massive-parallel" DNA sequencing
`are blanket terms used to refer collectively to the high-
`throughput DNA sequencing technologies available which
`are capable of sequencing large numbers of different DNA
`sequences in a single reaction (i.e., in parallel). All next-
`generation sequencing (NGS) technologies monitor the
`sequential addition of nucleotides to immobilized and
`spatially arrayed DNA templates but differ substantially in
`how these templates are generated and how they are inter-
`rogated to reveal their sequences (15). A basic workflow for
NGS technologies is presented in Fig. 1.
`
`Template generation
`In general, the starting material for all NGS experiments is
`double-stranded DNA; however, the source of this material
`may vary (e.g., genomic DNA, reverse-transcribed RNA or
`cDNA, immunoprecipitated DNA). All starting material
must be converted into a library of sequencing reaction templates (a sequencing library), which requires the common steps of fragmentation, size selection, and adapter ligation
`(15). Fragmentation and size selection steps serve to break
the DNA templates into smaller sequence-able fragments, the size of which depends on each sequencing platform's
specifications. Adapter ligation adds platform-specific syn-
`thetic DNAs to the ends of library fragments, which serve as
`primers for downstream amplification and/or sequencing
`reactions. Ideally, the above steps create an unbiased
`sequencing library that accurately represents the sample’s
`DNA population. Depending on the NGS technology used,
`a library is either sequenced directly (single-molecule tem-
`plates) or is amplified then sequenced (clonally amplified
`templates). Template generation also serves to spatially
`separate and immobilize DNA fragment populations for
`sequencing, typically by attachment to solid surfaces or
`beads. This allows the downstream sequencing reaction to
`operate as millions of microreactions carried out in parallel
`on each spatially distinct template (16).
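As a purely illustrative sketch of these template-generation steps (the adapter sequences, size window, and fragmentation model below are hypothetical, not any vendor's actual protocol), the following Python snippet fragments a DNA string, size-selects the pieces, and ligates synthetic adapters to their ends:

```python
import random

random.seed(0)

# Hypothetical adapter sequences and size-selection window; real platforms define their own.
ADAPTER_5 = "ACACTCTT"
ADAPTER_3 = "AGATCGGA"
SIZE_MIN, SIZE_MAX = 150, 300   # size-selection window in bp

def fragment(dna, mean_size=200):
    """Randomly shear a DNA string into fragments (crude stand-in for shearing/sonication)."""
    fragments, start = [], 0
    while start < len(dna):
        size = max(50, int(random.gauss(mean_size, 60)))
        fragments.append(dna[start:start + size])
        start += size
    return fragments

def size_select(fragments):
    """Keep only fragments inside the preferred size window."""
    return [f for f in fragments if SIZE_MIN <= len(f) <= SIZE_MAX]

def ligate_adapters(fragments):
    """Append synthetic adapters to both ends of each fragment."""
    return [ADAPTER_5 + f + ADAPTER_3 for f in fragments]

genomic_dna = "".join(random.choice("ACGT") for _ in range(10_000))
library = ligate_adapters(size_select(fragment(genomic_dna)))
print(f"{len(library)} sequencing-ready templates in the library")
```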
`
`Clonally amplified versus single-molecule templates
`Most sequencing platforms cannot monitor single-mol-
`ecule reactions and template amplification is therefore
`required to produce sufficient signal for detection of nucle-
`otide addition by the instrument’s system (1, 14). Ampli-
`fication strategies vary between platforms and commonly
`include emulsion PCR or bridging amplification strategies
`(Table 1). Importantly, all amplification steps can introduce
`sequencing errors into experiments as reagents (DNA
`polymerases) are not 100% accurate and can introduce
`mutations into the clonally amplified template popula-
`tions, which subsequently masquerade as sequence variants
`in downstream analysis (1). Amplification steps also
`reduce the dynamic range of sequence detection and poten-
`tially remove low-abundance variants from sequenced
`populations.
`Single-molecule template sequencing bypasses the need
`for amplification steps and requires far less starting material
`for sequence detection (17). The ability to sequence
`extremely small quantities of DNA without manipulation
`gives single-molecule sequencing approaches a greater
`(potentially unlimited) dynamic range of sequence detec-
`tion, including the possibility of sequencing DNA from only
`a single cell (18, 19). Currently, single-molecule sequencers
`are just beginning to enter the market; however, despite
`promises to improve signal quality and expand the types of
`data produced, the availability of this platform is limited,
`and downstream analysis pipelines are very immature com-
`pared with those of clonally amplified signals. In addition,
`advantages over amplification-based platforms have yet to
`be realized and remain equally uncertain. Amplification-
`based and single-molecule sequencing technologies have
been referred to as "second-generation" and "third-generation" sequencing technologies, respectively, in the literature (ref. 16; Table 1).

Figure 1. Basic workflow for NGS experiments. NGS experiments consist of 4 phases: sample collection (purple), template generation (blue), sequencing reactions and detection (green), and data analysis (orange). Experiments can have broad applications, depending on the source and nature of the input DNA used for sequencing. Source materials include normal and diseased tissues, from which NGS experiments can sequence the whole genome (WGS) or targeted genomic/epigenetic elements. Table 2 lists experimental approaches, descriptions, and key references for the sample collection strategies illustrated. The illustration of sample collection is modified from the study of Myers and colleagues (82). The illustration of sequencing reactions and detection is a generalized schematic and differs substantially based on the platform used (Table 1). NGS experiments can be grouped broadly into 2 general categories: de novo assembly and resequencing. Assembled genomes are built from scratch, without the use of an existing scaffold, whereas resequencing experiments align sequence reads back to a reference genome (orange). Since the HGP, all human genome sequencing efforts have been resequencing, as it is not cost-effective, is extremely difficult, and is of limited (immediate) value to reassemble a human genome (95, 96). Conversely, smaller genomes, such as those of novel bacteria, are routinely assembled de novo (97). dsDNA, double-stranded DNA; TFBS, transcription factor–binding sites.
`
`Sequencing reactions
`Each sequencing platform uses a series of repeating
`chemical reactions that are carried out and detected auto-
`matically. Reactions typically use a flow cell that houses the
`immobilized templates and enables standardized addition
`and detection of nucleotides, washing/removal of reagents,
`and repetition of this cyclical process on a nucleotide-by-
`nucleotide basis to sequence all DNA templates (i.e.,
`sequencing library) in parallel. While all sequencing plat-
`forms differ in their proprietary chemistries (Table 1), the
`use of DNA polymerase or DNA ligase enzyme is a common
`feature, and these methods have been referred to collectively
`as "sequencing by synthesis" (SBS) strategies in the literature
`(20).
Overall, by reading sequence content in a stepwise, nucleotide-by-nucleotide fashion, NGS overcomes the separation and detection requirements that limit first-generation methods and has radically improved the throughput of sequencing reactions by several
`orders of magnitude. Such improvements have allowed the
`per-base cost of sequencing to decrease by more than
`100,000-fold in the past 5 years with further reductions
`expected (21). Cost reductions in sequencing technologies
`have enabled widespread use and diverse applications of
`these methods (see Clinical Applications of NGS and Pre-
clinical Applications of NGS).
`
`Paired-end and mate-paired sequencing
`Typically, NGS methods sequence only a single end of the
`DNA templates in their libraries, with all DNA fragments
`afforded an equal probability of occurring in forward or
`reverse direction reads. Depending on the instrument and
`library construction protocol used, however, forward and
`reverse reads can be paired to map both ends of linear DNA
`fragments during sequencing ("paired-end sequencing") or
both ends of previously circularized DNA fragments ("mate-pair sequencing"). The choice of paired-end or
`mate-pair depends on the clinical application and is dis-
`cussed later. It is important to note that both paired-end and
`mate-pair approaches still only sequence the ends of DNA
`fragments included in sequencing libraries and therefore do
`not provide sequence information for the internal portion
`of longer templates (Fig. 1).
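The sketch below (hypothetical fragment and read lengths) illustrates that point: a paired-end protocol reads a fixed number of bases from each end of a template, so a short fragment is covered end to end while a longer one keeps an unsequenced interior, even though the two reads remain linked as a pair.

```python
READ_LENGTH = 100   # bp read from each end (assumed; platform dependent)

def paired_end_reads(fragment):
    """Return (forward read, reverse read, unsequenced interior length)."""
    fwd = fragment[:READ_LENGTH]
    rev = fragment[-READ_LENGTH:]          # reverse read (orientation ignored here)
    interior = max(0, len(fragment) - 2 * READ_LENGTH)
    return fwd, rev, interior

short_fragment = "A" * 180                 # shorter than two reads: fully covered
long_fragment = "C" * 600                  # longer template: interior left unsequenced

for frag in (short_fragment, long_fragment):
    fwd, rev, interior = paired_end_reads(frag)
    print(f"fragment {len(frag)} bp -> reads {len(fwd)}+{len(rev)} bp, "
          f"{interior} bp interior unsequenced")
```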
`
`Limitations
`The increased throughput of NGS reactions comes at
`the cost of read length, as the most readily available
`sequencing platforms (Illumina, Roche, SoLiD) offer
`shorter average read lengths (30–400 bp) than conven-
tional Sanger-based methods (500 bp–1 kb; ref. 13). Several
`third-generation technologies hold the promise of longer
`read lengths; however, these are not widely available and,
`as mentioned, are exceedingly immature technology plat-
`forms (ref. 22; Table 1).
`
`Shorter read lengths place restrictions on the types of
`experiments NGS methods can conduct. For instance, it is
`difficult to assemble a genome de novo using such short
fragment lengths (23); therefore, most applications of these
`technologies focus on comparing the density and sequence
`content of shorter reads to that of an existing reference
`genome (known as genome "re-sequencing"; Fig. 1). In
`addition, shorter read lengths may not align or "map" back
`to a reference genome uniquely, often leaving repetitive
`regions of the genome unmappable to these types of experi-
`ments. Sequence alignment is also challenging for regions
`with higher levels of diversity between the reference genome
`and the sequenced genome, such as structural variants
`(e.g., insertions, deletions, translocations; ref. 24). These
`issues are combated through the use of longer read
`lengths or paired-end/mate-pair approaches (Fig. 2A and
`B). Given the relative immaturity of third-generation NGS
`platforms, nearly all human genome resequencing con-
`ducted today relies on the paired-end or mate-paired
`approaches of second-generation platforms. Paired-end
`sequencing is much easier than mate-paired sequencing
`and requires less DNA, making it the standard means by
`which human genomes are resequenced (14). Although
`more expensive and technically challenging, mate-paired
`libraries can sample DNA sequence over a larger distance
`(1.5–20 kb) than paired-end approaches (300–500 bp)
`and are therefore better suited for mapping very large
`structural changes (14).
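The toy example below (hypothetical reference, reads, and insert size) shows both issues in miniature: a short read drawn from a repeated element maps to more than one reference position and is therefore ambiguous, whereas a read pair whose mapped separation greatly exceeds the expected insert size is the kind of signal used to flag a deletion.

```python
# Hypothetical reference with a repeated element ("ACGTACGT" occurs twice).
reference = "TTTTACGTACGTGGGGCCCCACGTACGTAAAATTTTCCCCGGGGAAAA"
EXPECTED_INSERT = 12   # assumed library insert size in bp

def map_positions(read, ref):
    """Exact-match 'aligner': return every position where the read occurs."""
    return [i for i in range(len(ref) - len(read) + 1) if ref[i:i + len(read)] == read]

# 1) A short read from the repeat maps to two places -> ambiguous alignment.
print("repeat read maps at positions:", map_positions("ACGTACGT", reference))

# 2) A read pair from a sample carrying a deletion between its two reads:
#    the mates map much farther apart on the reference than the insert size predicts.
fwd_pos = map_positions("TTTTACGT", reference)[0]
rev_pos = map_positions("CCCCGGGG", reference)[0]
separation = rev_pos - fwd_pos
print(f"mate separation {separation} bp vs expected ~{EXPECTED_INSERT} bp"
      + (" -> possible deletion" if separation > 2 * EXPECTED_INSERT else ""))
```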
`
`Data analysis
Ironically, one of the key limitations of NGS also serves as its greatest strength: the high volume of data generated.
`NGS reactions generate huge sequence data sets in the range
`of megabases (millions) to gigabases (billions), the inter-
`pretation of which is no trivial task (16). Moreover, the scale
`and nature of data produced by all NGS platforms place
`substantial demands on information technology at all
`stages of sequencing, including data tracking, storage, and
`quality control (25). Together, these extensive data gather-
`ing capabilities now double as constraints, shifting the
`bottlenecks in genomics research from data acquisition to
`those of data analysis and interpretation (26). NGS
`machines are generating data at such a rapid pace that
`supply cannot keep up with demand for new analytic
`approaches capable of mining NGS data sets (see Future
`Directions and Challenges). Data analysis is a critical feature
`of any NGS project and will depend on the goal and type of
`project. The initial analysis or base calling is typically con-
`ducted by proprietary software on the sequencing platform.
`After base calling, the sequencing data are aligned to a
`reference genome if available or a de novo assembly is
`conducted. Sequence alignment and assembly is an active
`area of computational research with new methods being
`developed (see review by Flicek and Birney; ref. 27). Once
`the sequence is aligned to a reference genome, the data need
`to be analyzed in an experiment-specific fashion [for
`reviews, see the studies of Martin and Wang for RNA-seq
(ref. 28), Park for ChIP-seq (ref. 29), Bamshad and colleagues for exome sequencing (ref. 30), Medvedev and colleagues for whole-genome resequencing (ref. 31), and Wooley and colleagues for metagenomics (ref. 32)].
`
`890
`
`Cancer Prev Res; 5(7) July 2012
`
`Cancer Prevention Research
`
`Downloaded from
`
`on November 19, 2018. © 2012 American Association forcancerpreventionresearch.aacrjournals.org
`
`Cancer Research.
`
`Page 890
`
`
`
`Published OnlineFirst May 22, 2012; DOI: 10.1158/1940-6207.CAPR-11-0432
`
`Review of "Next-Generation" DNA Sequencing
`
Table 1. Comparison of NGS platforms

| Sequencing platform | Library/template preparation | Sequencing reaction chemistry | Maximum read length, bp | Run time, d | Maximum throughput per run (total bp sequenced) | Strengths | Limitations |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Illumina HiSeq 2000 | Bridging amplification | Reverse terminator | 100 | 2^a, 11^b | 95–600 Gb^d | Most widely used platform; large throughput | All samples on flow cell sequenced at same read length |
| Illumina MiSeq | Bridging amplification | Reverse terminator | 250 | 0.17^a, 1.1^b | 440 Mb–7 Gb | Short run times | Low number of total reads (15 million) |
| Roche Genome Sequencer FLX | Emulsion PCR amplification | Pyrosequencing | 400 | 0.4 | 0.5–0.6 Gb | Longer reads; fast run times | High reagent cost; lowest number of total reads (1 million) |
| SoLiD/ABI 5500 | Emulsion PCR | Ligation sequencing | 75 | 2^a, 7^b | 90–300 Gb | Independent flow cell lanes; high capacity for multiplexing | Short read lengths |
| Ion Personal Genome Machine | Emulsion PCR | Ion sequencing | 200 | 0.1 | 1 Gb | Short run times; low-cost scalable machine | Low number of total reads (11 million) for human sequencing |
| Complete Genomics | PCR on DNA nanoballs | Ligation sequencing | 70 | 12 | 20–60 Gb | Complete service | High cost per sample; only available for human resequencing |
| Helicos | Single molecule | Reverse terminator | 55 | 8 | 21–35 Gb | No amplification bias | Machine not widely used; sequencing service available through company |
| PacBio RS | Single molecule | Real-time | 1,000 | <0.1^c | N/A | Potential for longest read lengths and shorter run times | Highest error rates |

NOTE: Amplification-based and single-molecule sequencing technologies have been referred to as second- and third-generation sequencing technologies, respectively, in the literature (16). The term third-generation sequencing has also been used to refer to near-term nanopore sequencing technologies. Nanopore sequencing is not covered in this review, and readers are directed to an article by Branton and colleagues (90).
Abbreviation: N/A, not applicable.
^a Single-end sequencing. ^b Pair-end sequencing. ^c Company estimate. ^d Two flow cells.
`
`
`
`Figure 2. Key NGS principles. A and B, identification of structural variants: Longer (paired-end or mate-pair) sequencing reads are more adept at mapping large
`structural variations (e.g., translocations and deletions) because they provide added information concerning which sequences co-occur on the same
template. A, illustration of a well-documented translocation between chromosomes 17 and 22, which places the platelet-derived growth factor-β
`(PDGFB) gene under control of the highly active collagen type 1A1 promoter (COL1A1). This translocation is implicated in the pathogenesis of the rare
`cutaneous malignancy dermatofibrosarcoma protuberans (DFSP; ref. 98). Notice how alignment of short reads does not distinguish the mutated
`sequence (translocation) from the normal reference genome. B, illustration of a hypothetical chromosomal deletion mutant. Again, notice how alignment of
`short reads does not distinguish the mutated sequence (deletion) from the normal reference genome. C, sequence coverage and error rates. Illustration of
`uniform and uneven NGS read distributions for resequencing experiments. An uneven read distribution can leave regions of the genome uncovered
`(black circle). D, SNP detection. Left, illustration of how low coverage, uneven read distributions, and high error rates can interact to confound
`genotype detection, including SNP calling (as illustrated). Right, illustration of how uneven read distributions and errors can be overcome by higher
`coverage rates.
`
`892
`
`Cancer Prev Res; 5(7) July 2012
`
`Cancer Prevention Research
`
`Downloaded from
`
`on November 19, 2018. © 2012 American Association forcancerpreventionresearch.aacrjournals.org
`
`Cancer Research.
`
`Page 892
`
`
`
`Published OnlineFirst May 22, 2012; DOI: 10.1158/1940-6207.CAPR-11-0432
`
`
`Choosing a Sequencing Technology
`Genomics experiments are largely descriptive and afford
`researchers the opportunity to explore a biologic question
`in a comprehensive manner. Experimental design is para-
`mount for the success of all genomics experiments, and
`choice of sequencing strategy should be informed by the
`goal(s) of the project.
`
`Experimental design and biases
`It is important to recognize that bias can be introduced at
`all steps in an experimental NGS protocol. This principle is
`best illustrated by the example of template amplification
`steps, which can introduce mutations into clonally ampli-
`fied DNA templates that subsequently masquerade as
`sequence variants. Amplification steps also reduce the
`dynamic range of sequence detection and potentially
`remove low-abundance variants from sequenced popula-
`tions. Much like amplification steps, any sample manipu-
`lation can cause quantitative and qualitative artifacts in
`downstream analysis (22, 33). Therefore, it is important to
`design your experiment in a way that maximizes the col-
`lection of sequence information you covet and minimizes
`the biases against it. For example, if your sample is being
`extensively manipulated before sequencing, a reference
`DNA sample with known sequence content and similar
`size/quantity should also be carried through your sequenc-
`ing protocol and analyzed in parallel as a control.
`Another important consideration is the quantity and
`quality of the DNA you choose to sequence. Most NGS
`platforms have proprietary library preparations that are
`optimized for a specific DNA quantity and quality. These
`input metrics are typically easy to achieve using unmodified
fresh or fresh-frozen samples; however, this may be more
`challenging with clinical specimens, especially archived
`formalin-fixed, paraffin-embedded (FFPE) samples. Thank-
`fully, the use of clinical specimens with limited DNA
`quality/quantity for NGS is an area of active research, and
`work has shown that NGS platforms can handle this input
`material (34, 35). Despite this evidence, it is highly likely
`that any deviation in sample quality/quantity from an NGS
`platform’s optimized protocol will still require extensive
`troubleshooting by the user. An appropriate control under
`such circumstances would again be the sequencing of
`reference DNA treated with the same conditions (e.g., FFPE)
`before sequencing.
`
`Sequencing coverage and error rates
`Both the quality and quantity of sequence data derived
`from NGS experiments will ultimately determine how com-
`prehensive and accurate downstream analyses can be (36).
Qualitatively, individual base-calling error rates vary
`between NGS platforms (Table 1). All NGS platforms pro-
vide confidence scores for each individual base call,
`enabling researchers to use different quality filters when
`mining their sequence data. More generally, the chemistries
`of most NGS reactions are such that the initial portion of
`each sequencing read is typically more accurate than the
`latter (due to signal decay).
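These per-base confidence scores are most commonly reported on a Phred-like scale, where Q = -10*log10(P_error), so Q20 corresponds to a 1% chance of a wrong call and Q30 to 0.1%. The short sketch below (hypothetical read and quality values) converts scores to error probabilities and applies the kind of simple quality filter described here, trimming the lower-quality tail of a read.

```python
def error_probability(q):
    """Convert a Phred-scaled quality score to a per-base error probability."""
    return 10 ** (-q / 10)

def trim_read(bases, quals, min_q=20):
    """Keep the leading run of bases whose quality stays at or above min_q,
    mimicking the higher accuracy of the early cycles of a read."""
    kept = []
    for base, q in zip(bases, quals):
        if q < min_q:
            break
        kept.append(base)
    return "".join(kept)

# Hypothetical read whose quality decays toward the 3' end (signal decay).
read = "ACGTACGTAC"
quals = [38, 37, 36, 35, 30, 28, 24, 19, 12, 8]

print("Q20 error probability:", error_probability(20))   # 0.01
print("trimmed read:", trim_read(read, quals))            # first 7 bases kept
```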
`Quantitatively, the amount of sequence data can be
`assessed by the metric of sequencing "coverage." Generally
`speaking, sequence coverage (also called "depth") refers to
`the average number of times a base pair is sequenced in a
`given experiment. More specifically, this coverage metric is
`best viewed in the context of the physical locations (distri-
`bution) of these reads, as NGS reactions may not represent
`all genomic locations uniformly (due to handling, platform
`biases, run-to-run variation; ref. 36). For example, local
`sequence content has been shown to exert a bias on the
`coverage of short read NGS platforms, whereby higher read
`densities are more likely to be found at genomic regions
`with elevated GC content (37). Such coverage biases can
`interfere with quantitative applications of NGS, including
`gene expression profiling (RNA-seq) or copy number var-
`iation analysis. Several methods have been developed to
`account for the nonuniformity of coverage and adjust
`signals for GC bias to improve the accuracy of quantitative
`analysis (37–39). Qualitatively, uneven sequence coverage
`can also interfere with the analysis of sequence variants. For
`example, a deeply sequenced sample with nonuniform read
`distribution can still leave a substantial portion of the
`genome unsequenced or undersequenced, and analysis of
`these regions will not be able to identify sequence variations
`such as single-nucleotide polymorphisms (SNP), point
`mutations, or structural variants, because these locations
`will either be unsequenced or confounded by sequencing
`errors (Fig. 2C and D).
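For intuition about how depth and evenness interact, the sketch below computes average coverage from total run output and genome size and then, under the idealized assumption that reads land uniformly at random (a Poisson model that real libraries only approximate), estimates the fraction of bases left entirely unread at a given depth; real, uneven read distributions leave more of the genome uncovered than this best case.

```python
import math

GENOME_SIZE_BP = 3.1e9      # assumed haploid human genome size

def average_coverage(total_bases_sequenced, genome_size=GENOME_SIZE_BP):
    """Coverage (depth) = total bases sequenced / genome size."""
    return total_bases_sequenced / genome_size

def fraction_uncovered(coverage):
    """Idealized Poisson model: chance that a given base receives zero reads."""
    return math.exp(-coverage)

run_output_bp = 95e9        # e.g., lower end of a HiSeq 2000 run (Table 1)
c = average_coverage(run_output_bp)
print(f"average coverage from one such run: {c:.1f}x")
for depth in (4, 10, 30):
    print(f"{depth}x uniform coverage -> expected uncovered fraction ~{fraction_uncovered(depth):.1e}")
```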
`Ultimately, coverage depth, distribution, and sequence
`quality determine what information can be retrieved from
`each sequencing experiment. In theory, an experiment with
`100% accuracy and uniform coverage distribution would
`provide all sequence content information (including iden-
`tification of SNPs and complex structural variants) with just
1× coverage depth. In reality, however, accuracy is never
`100% and coverage is not uniform; therefore, deeper
`sequence coverage is needed to enable correction of
`sequencing errors and to compensate for uneven coverage
(Fig. 2D). For discovery of structural variants (e.g., insertions, deletions, translocations), accurate identification of a complete human genome sequence with current (second-generation) platforms requires approximately 20× to 30× sequence coverage to overcome the uneven read distribu-
`tions and sequencing errors (22, 24). Higher coverage levels
`are required to make accurate SNP calls from an individual
`genome sequence, as these experiments are powered dif-
`ferently (36). Standards are evolving and current recom-
mendations range from 30× to 100× coverage, depending
`on both platform error rate and the analytic sensitivity and
`specificity desired (12, 24, 36, 40). These higher coverage
`requirements are why the cost of whole-genome sequencing
`(WGS) still remains above $1,000 for many sequencing
applications (16). Newer single-molecule sequencers
`promise more evenly distributed and longer reads, poten-
`tially providing a complete genome sequence at lower costs;
`however, this ability often comes at the cost of higher error
`rates (1, 22). To save costs, population-scale sequencing
projects, such as the 1000 Genomes Project, have used low-coverage pooled data sets and are able to detect SNP variants with frequencies as low as 1% in DNA populations with 4× coverage and error rates common to second-generation
`platforms (41). With this approach, an investigator probing
`for high-frequency variants (>1%) in a select population
`could also succeed with lower coverage sequencing experi-
`ments to reduce costs.
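A quick calculation (assuming a ~3.1-Gb haploid genome) makes the cost argument explicit: raw sequence output, and with it reagent cost, scales roughly linearly with the target depth, which is why the choice between 4×, 30×, and 100× designs dominates the budget of a WGS experiment.

```python
GENOME_SIZE_GB = 3.1   # assumed haploid human genome size, in Gb

# Coverage targets discussed above: low-coverage pooled designs, structural-variant
# discovery, and conservative SNP-calling recommendations.
for label, depth in [("pooled, low coverage", 4),
                     ("structural variants", 30),
                     ("conservative SNP calling", 100)]:
    raw_output_gb = GENOME_SIZE_GB * depth
    print(f"{label:>26}: {depth:>3}x -> ~{raw_output_gb:,.0f} Gb of raw sequence")
```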
`
`Clinical Applications of NGS
`Investment in the development of NGS technologies by
`the NHGRI was made with the goal of expediting the use of
`genome sequencing data in the clinical practice of medicine.
`As the cost of DNA sequencing continues to drop, the actual
`translation of base pair reads to bedside clinical applica-
`tions has finally begun. Several small-scale studies have laid
`the foundation for personalized genome-based medicine,
`showing the value of both whole-genome and targeted
`sequencing approaches in the diagnosis and treatment of
`diseases. These findings are the first of many to follow, and
this progression, coupled with the expanding definition of
`genetic influences on clinical phenotypes identified by
`preclinical studies, will place NGS machines among the
`most valuable clinical tools available to modern medicine.
`
`Exome and targeted sequencing
`At present, a small percentage of the human genome’s
`sequence is characterized (<10%), and limited clinically
`valuable information can be immediately gained from
`having a patient’s complete genome sequence at this time.
`Therefore, it is often more cost-effective for clinical research-
`ers to sequence only the exome (the 2% of the genome
`represented by protein-coding regions or exons), the "Men-
`delianome" (coding regions of 2,993 known disease genes),
`or targeted disease gene panels to screen for relevant mu