`
`APPLICATIONS OF NEXT-GENERATION SEQUENCING
`
`Advances in understanding
`cancer genomes through
`second-generation sequencing
`
`Matthew Meyerson, Stacey Gabriel and Gad Getz
`
`Abstract | Cancers are caused by the accumulation of genomic alterations.
`Therefore, analyses of cancer genome sequences and structures provide insights
`for understanding cancer biology, diagnosis and therapy. The application of
`second-generation DNA sequencing technologies (also known as next-generation
`sequencing) — through whole-genome, whole-exome and whole-transcriptome
`approaches — is allowing substantial advances in cancer genomics. These methods
`are facilitating an increase in the efficiency and resolution of detection of each of
`the principal types of somatic cancer genome alterations, including nucleotide
`substitutions, small insertions and deletions, copy number alterations, chromosomal
`rearrangements and microbial infections. This Review focuses on the methodological
`considerations for characterizing somatic genome alterations in cancer and the future
`prospects for these approaches.
`
`Second-generation
`sequencing
`Used in this Review to refer to
`sequencing methods that have
`emerged since 2005 that
`parallelize the sequencing
`process and produce millions
`of typically short sequence
`reads (50–400 bases) from
`amplified DNA clones.
`It is also often known as
`next-generation sequencing.
`
`Dana-Farber Cancer Institute,
`44 Binney Street, Boston,
`Massachusetts 02115, USA.
`Broad Institute, 7 Cambridge
`Center, Cambridge,
`Massachusetts 02142, USA.
`Correspondence to M.M.
`email: matthew_meyerson@dfci.harvard.edu
`doi:10.1038/nrg2841
`
`A major near-term medical impact of the genome
`technology revolution will be the elucidation of mecha-
`nisms of cancer pathogenesis, leading to improvements
`in the diagnosis of cancer and the selection of cancer
`treatment. Thanks to second-generation sequencing
`technologies1–5, recently it has become feasible to
`sequence the expressed genes (‘transcriptomes’)6,7,
`known exons (‘exomes’)8,9, and complete genomes10–15
`of cancer samples.
`These technological advances are important for
`advancing our understanding of malignant neoplasms
`because cancer is fundamentally a disease of the genome.
`A wide range of genomic alterations — including point
`mutations, copy number changes and rearrangements
`— can lead to the development of cancer. Most of these
`alterations are somatic, that is, they are present in cancer
`cells but not in a patient’s germ line16.
`An impetus for studies of somatic genome altera-
`tions, which are the focus of this Review, is the poten-
`tial for therapies targeted against the products of these
`alterations. For example, treatment with gefitinib and
`erlotinib, which inhibit the epidermal growth factor
`receptor (EGFR) kinase, leads to a significant survival
`benefit in patients with lung cancer whose tumours
`carry EGFR mutations, but no benefit in patients
`whose tumours carry wild-type EGFR17–19. Therefore,
`comprehensive genome-based diagnosis of cancer is
`becoming increasingly crucial for therapeutic decisions.
`During the past decades, there have been major
`advances in experimental and informatic methods
`for genome characterization based on DNA and RNA
`microarrays and on capillary-based DNA sequenc-
`ing (‘first-generation sequencing’, also known as Sanger
`sequencing). These technologies provided the ability
`to analyse exonic mutations and copy number altera-
`tions and have led to the discovery of many important
`alterations in the cancer genome20.
`However, there are particular challenges for the
`detection and diagnosis of cancer genome alterations.
`For example, some genomic alterations in cancer are
`present at a low frequency in clinical samples, often
`owing to substantial admixture with non-malignant
`cells. Second-generation sequencing can solve such
`problems21. Furthermore, these new sequencing meth-
`ods make it feasible to discover novel chromosomal rear-
`rangements22 and microbial infections23–25 and to resolve
`copy number alterations at very high resolution22,26.
`At the same time, the avalanche of data from second-
`generation sequencing provides a statistical and com-
`putational challenge: how to separate the ‘wheat’ of
`causative alterations from the ‘chaff’ of noise caused by
`alterations in the unstable and evolving cancer genome.
`
`NATURE REVIEWS | GENETICS
`
` VOLUME 11 | OCTOBER 2010 | 685
`
`
`
`© 2010 Macmillan Publishers Limited. All rights reserved
`
`REVIEWS
`
`First-generation sequencing
`(also known as Sanger
`sequencing or capillary
`sequencing). The standard
`sequencing methodology used
`to sequence the reference
`human (and other model
`organism) genomes. It uses
`radioactively or fluorescently
`labelled dideoxynucleotide
`triphosphates (ddNTPs) as
`DNA chain terminators. Various
`detection methods allow
`read-out of sequence according
`to the incorporation of each
`specific terminator (ddATP,
`ddCTP, ddGTP or ddTTP).
`
`Whole-genome
`amplification
`Various molecular techniques
`(including multiple
`displacement amplification,
`rolling circle amplification or
`degenerate oligonucleotide
`primed PCR) in which very
`small amounts (nanograms)
`of a genomic DNA sample
`can be multiplied in a largely
`unbiased fashion to produce
`suitable quantities for genomic
`analysis (micrograms).
`
`Moore’s law
`The observation made in
`1965 by Gordon Moore that
`the number of transistors per
`square inch on integrated
`circuits had doubled every
`other year since the integrated
`circuit was invented.
`
`This challenge is likely to be solved in part by system-
`atic analyses of large cancer genome data sets that
`will provide sufficient statistical power to overcome
`experimental and biological noise27,28.
`In this Review, we discuss the key challenges in can-
`cer genome sequencing, the methods that are currently
`available and their relative values for detecting differ-
`ent types of genomic alteration. We then summarize the
`main points to consider in the computational analysis
`of cancer genome sequencing data and comment on the
`future potential for using genomics in cancer diagnosis.
`Cancer genome sequencing is a rapidly moving field,
`so in this Review we aim to set out the principles and
`important methodological considerations, with a brief
`summary of important findings to date.
`
`Cancer-specific considerations
`Cancer samples and cancer genomes have general char-
`acteristics that are distinct from other tissue samples
`and from genomic sequences that are inherited through
`the germ line. These require particular consideration in
`second-generation sequencing analyses.
`
`Characteristics of cancer samples for genomic analysis.
`Cancer samples differ in their quantity, quality and purity
`from the peripheral blood samples that are used for germ-
`line genome analysis. Surgical resection specimens tend
`to be large and have been the mainstay of cancer genome
`analysis. However, diagnostic biopsies from patients with
`disseminated disease tend to contain few cells — as surgi-
`cal cure is not possible in these cases, minimizing biopsy
`size is a safety consideration. Therefore, the quantity of
`nucleic acids available may be limiting; obtaining sequence
`information from such biopsies will require decreasing the
`minimum inputs for second-generation sequencing. An
`alternative approach to sequencing from small samples is
`whole-genome amplification, but this method does not pre-
`serve genome structure and can give rise to artefactual
`nucleotide sequence alterations29.
`Nucleic acids from cancer are also often of lower
`quality than those purified from peripheral blood. One
`reason for this is technical: most cancer biopsy and
`resection specimens are formalin-fixed and paraffin-
`embedded (FFPE) to optimize the resolution of micro-
`scopic histology. Nucleic acids from FFPE specimens
`are likely to have undergone crosslinking and also may
`be degraded30. Second-generation sequence analysis
`of FFPE-derived nucleic acids can require special
`experimental31 and computational methods to handle
`an increased background mutation rate32,33. A second
`reason for this difference in nucleic acid quality is
`biological: cancer specimens often include substantial
`fractions of necrotic or apoptotic cells that reduce the
`average nucleic acid quality; therefore, experimental
`methods should also be adapted to account for this. The
`many-fold coverage made possible by second-generation
`sequencing, however, can allow high-quality data to be
`produced from lower quality samples21.
`Finally, cancer nucleic acid specimens are less pure
`than specimens used to analyse the inherited genome,
`especially in terms of genomic DNA purity. The samples
`
`generally used for germline genome analysis — periph-
`eral blood mononuclear cells — are known to be hetero-
`geneous only at the rearranged immunoglobulin and T
`cell receptor loci in a subset of cells. By contrast, a cancer
`specimen contains a mixture of malignant and non-
`malignant cells and, therefore, a mixture of cancer and
`normal genomes (and transcriptomes). Furthermore,
`the cancers themselves may be highly heterogeneous
`and composed of different clones that have different
`genomes34. Cancer genome analytical models must take
`these two types of heterogeneity (cancer versus normal
`heterogeneity and within-cancer heterogeneity) into
`account in their prediction of genome alterations.
`
`Structural variability of cancer genomes. Cancer
`genomes are enormously diverse and complex. They
`vary substantially in their sequence and structure com-
`pared to normal genomes and among themselves. To
`paraphrase Leo Tolstoy’s famous first line from Anna
`Karenina: normal human genomes are all alike, but every
`cancer genome is abnormal in its own way.
`Specifically, cancer genomes vary considerably in
`their mutation frequency (degree of variation compared
`to the reference sequence), in global copy number or
`ploidy, and in genome structure. These variations have
`several implications for cancer genome analysis: the
`presence of a somatic mutation is not enough to establish
`statistical significance as it must be evaluated in terms
`of the sample-specific background mutation rate, which
`can vary at different types of nucleotides (discussed
`further below). The analysis of mutations must also be
`adjusted for the ploidy and the purity of each sample
`and the copy number at each region. For example, if 50%
`of the tumour DNA is derived from cancer cells and a
`mutation is present on 1 of 4 copies of chromosome 11,
`the frequency of that mutation will be 12.5% in the
`sample. Similar considerations apply to the detection of
`somatic rearrangements.
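The purity and copy number adjustment described above can be sketched as a short calculation. This is a minimal illustration, not a published method: the function names are hypothetical, and the second function assumes that purity is given as a fraction of cells and that normal cells carry two copies of the locus.

```python
def mutant_fraction_from_dna(tumour_dna_fraction, mutant_copies, total_copies):
    """Expected fraction of reads carrying the mutation when the
    tumour-derived share of the DNA at the locus is known."""
    return tumour_dna_fraction * mutant_copies / total_copies


def mutant_fraction_from_cells(tumour_cell_fraction, mutant_copies,
                               tumour_copies, normal_copies=2):
    """Same quantity when purity is a fraction of cells, so that each
    cell type contributes DNA in proportion to its local copy number
    (assumption: normal cells are diploid at this locus)."""
    tumour_dna = tumour_cell_fraction * tumour_copies
    normal_dna = (1.0 - tumour_cell_fraction) * normal_copies
    return tumour_cell_fraction * mutant_copies / (tumour_dna + normal_dna)


# Example from the text: 50% of the DNA is tumour-derived and the
# mutation is present on 1 of 4 copies.
print(mutant_fraction_from_dna(0.5, 1, 4))  # 0.125
```

If the 50% figure instead described the fraction of cells, the second function would give a somewhat higher value, because each four-copy cancer cell contributes more DNA at the locus than a two-copy normal cell.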
`To identify somatic alterations in cancer, comparison
`with matched normal DNA from the same individual is
`essential. This is largely owing to our incomplete knowl-
`edge of the variations in the normal human genome; to
`date, each ‘matched normal’ cancer genome sequence
`has identified large numbers of mutations and rear-
`rangements in the germ line that had not been previously
`described11–15,35. In the future, the complete characteriza-
`tion of many thousands of normal human genomes may
`obviate this need for a matched normal sample.
`
`Experimental approaches
`Second-generation sequencing technologies are based
`on the simultaneous detection of nucleotides in arrayed
`amplified DNA products originating from single DNA
`molecules36. Specific methods include picotitre-plate
`pyrosequencing3,5, single-nucleotide fluorescent base
`extension with reversible terminators1 and ligation-
`based sequencing2,4. Thanks to advances in sequencing
`approaches that include these technologies, the number of
`bases that can be sequenced for a given cost has increased
`one millionfold since 1990, more than doubling every year,
`which is twice as fast as Moore’s law for semiconductors37.
`
`Table 1 | Whole-genome sequencing studies of cancer
`Study | Method | Cancer type | Number of samples sequenced | Aberration type | Refs
`Ley et al., 2008 | Deep single-end whole-genome sequencing | AML | 1 | Point mutations, insertions, deletions | 10
`Campbell et al., 2008 | Shallow paired-end whole-genome sequencing | Lung | 2 | Deletions, amplifications, tandem duplications, interchromosomal rearrangements | 22
`Stephens et al., 2009 | Shallow paired-end whole-genome sequencing | Breast | 24 | Deletions, amplifications, tandem duplications, interchromosomal rearrangements, inversions | 39
`Pleasance et al., 2010 | Deep paired-end whole-genome sequencing | Melanoma | 1 | Point mutations, insertions, deletions, amplifications, interchromosomal rearrangements | 12
`Pleasance et al., 2010 | Deep paired-end whole-genome sequencing | Small-cell lung | 1 | Point mutations, insertions, deletions, amplifications, interchromosomal rearrangements | 13
`Mardis et al., 2009 | Deep paired-end whole-genome sequencing | AML | 1 | Point mutations, insertions, deletions, amplifications, interchromosomal rearrangements | 11
`Shah et al., 2009 | Deep paired-end whole-genome sequencing | Breast | 1 | Point mutations, insertions, deletions, amplifications, interchromosomal rearrangements | 15
`Ding et al., 2010 | Deep paired-end whole-genome sequencing | Breast | 1 | Point mutations, insertions, deletions, amplifications, interchromosomal rearrangements, inversions | 35
`Lee et al., 2010 | Deep paired-end whole-genome sequencing | Lung | 1 | Point mutations, insertions, deletions, amplifications, interchromosomal rearrangements, inversions | 14
`AML, acute myelogenous leukaemia.
`
`The application of second-generation sequencing
`has allowed cancer genomics to move from focused
`approaches — such as single-gene sequencing and array
`analysis — to comprehensive genome-wide approaches.
`Second-generation sequencing can be applied to cancer
`samples in various ways. These vary by the type of input
`material (for example, DNA, RNA or chromatin), the
`proportion of the genome targeted (the whole genome,
`transcriptome or a subset of genes) and the type of vari-
`ation studied (structural change, point mutation, gene
`expression or chromosomal conformation). In this
`section, we briefly introduce the main approaches to
`second-generation sequencing of cancer and their asso-
`ciated experimental considerations. Chromatin immuno-
`precipitation followed by sequencing (ChIP–seq) is an
`important complement to cancer genomics but is not
`discussed as it has been reviewed elsewhere38. Key whole-
`genome sequencing studies to date are summarized
`in TABLE 1.
`Compared with previous sequencing methods,
`which are analogue, second-generation sequencing is
`digital: it is possible to count alleles at any nucleotide or
`reads at any alignable position in the genome. Its digital
`nature gives rise to one of the key features of second-
`generation sequencing, the ability to over-sample the
`genome or other nucleic acid compartment that is tar-
`geted10. Over-sampling provides highly accurate sequence
`information by providing enough signal to overcome
`experimental noise and also allows detection of muta-
`tions and other genome alterations in heterogeneous
`samples such as cancer tissues.
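The digital, countable nature of the data can be illustrated with a toy allele count at one position. This is a minimal sketch: the pileup string and the function names are hypothetical, and real pipelines would also weigh base and mapping qualities, which are ignored here.

```python
from collections import Counter


def allele_counts(bases):
    """Tally the base calls from all reads covering one position."""
    return Counter(b for b in bases.upper() if b in "ACGT")


def allele_fraction(bases, allele):
    """Fraction of covering reads that support the given allele."""
    counts = allele_counts(bases)
    depth = sum(counts.values())
    return counts[allele] / depth if depth else 0.0


# Hypothetical pileup: 17 reads show the reference C, 3 show a variant A.
pileup = "C" * 17 + "A" * 3
print(allele_counts(pileup))
print(allele_fraction(pileup, "A"))
```

With deeper over-sampling the same low-fraction variant is supported by more reads in absolute terms, which is what makes it distinguishable from sporadic sequencing errors.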
`
`Whole-genome sequencing. The first whole cancer
`genome sequence was reported in 2008, a descrip-
`tion of the nucleotide sequence of DNA from an acute
`
`myeloid leukaemia compared with DNA from normal
`skin from the same patient10. Since then, six more
`complete sequences of cancer genomes together with
`matched normal genomes have been reported11–15,35,
`and this number will grow rapidly.
`Complete sequencing of the genome of cancer tis-
`sue to high redundancy, using germline DNA sequence
`from the same patient as a comparison, has the power to
`discover the full range of genomic alterations — includ-
`ing nucleotide substitutions, structural rearrange-
`ments, and copy number alterations — using a single
`approach10–15,35. Therefore, whole-genome sequencing
`provides the most comprehensive characterization
`of the cancer genome but, as it requires the greatest
`amount of sequencing, it is the most costly. Alternative,
`lower-cost approaches include shotgun sequencing with
`incomplete coverage (for example, less than 30-fold
`coverage; see below) — which is sufficient to identify
`somatic rearrangements in the genome22,39 and copy
`number alterations22,26 — and exome and transcriptome
`sequencing, which are described below.
`The major potential of whole-genome sequencing
`for cancer is the discovery of chromosomal rearrange-
`ments. Previously, there were no systematic approaches
`to study solid tumours that have complex karyotypes.
`Therefore, until recently it was thought that chromo-
`somal translocations were rare in epithelial tumours and
`found only in haematological malignancies in which
`they could be observed with cytogenetic methods40,41.
`However, the discoveries of the transmembrane protease
`serine 2 (TMPRSS2)–ERG translocations in prostate
`carcinoma42 and the echinoderm microtubule-associated
`protein like 4 (EML4)–anaplastic lymphoma recep-
`tor tyrosine kinase (ALK) translocations in non-small
`cell lung carcinoma43 have changed that view.
`
`Chromatin
`immunoprecipitation
`A technique used to identify
`the location of DNA-binding
`proteins and epigenetic marks
`in the genome. Genomic
`sequences containing the
`protein of interest are enriched
`by binding soluble DNA
`chromatin extracts (complexes
`of DNA and protein) to an
`antibody that recognizes the
`protein or modification.
`
`Over-sampling
`Reading the same stretch of
`DNA sequence many times to
`gain a confident sequence
`read-out.
`
`Shotgun sequencing
`Sequencing randomly derived
`fragments of the whole
`genome. The order and
`orientation of the sequences
`are determined by mapping
`individual reads back to a
`reference or through assembly
`of overlapping sequences into
`larger contigs of sequence.
`
`[Figure 1 panels, each showing a rearrangement between Chr 1q21 and Chr 2q12: a, sequence coverage = 4, physical coverage = 4; b, sequence coverage = 2, physical coverage = 4; c, sequence coverage = 1, physical coverage = 7.]
`Figure 1 | Depth of coverage and physical coverage. To illustrate considerations
`regarding depth of coverage and physical coverage, a rearrangement between human
`chromosome 1q21 and chromosome 2q12 is shown. Sequenced DNA fragments are
`represented by coloured bars: single-end sequencing is shown in a; paired-end
`sequencing is shown in b and c, in which the bars and the dashed lines indicate the
`sequenced ends and unsequenced part, respectively. Blue bars map to chromosome 1
`and purple bars to chromosome 2. Three different scenarios (a–c) are depicted that
`vary in the length of the DNA fragments that are sequenced. In each scenario, the
`sequence and physical coverage at the rearrangement site is shown below. Sequence
`coverage represents the number of sequenced reads that cover the site; this affects
`the ability to detect point mutations. Physical coverage measures the number of
`fragments that span the site; this affects the ability to detect the rearrangement,
`based on paired reads that map to different chromosomes. In cases in which the entire
`fragment is sequenced, as in a, the sequence and physical coverage are the same.
`
`In addition to rearrangements between unique, align-
`able sequences, whole-genome sequencing may be able
`to detect other types of genomic alterations that have
`not been observable using previous methods. Among
`the most important of such events are somatic mutations
`of non-coding regions, including promoters, enhancers,
`introns and non-coding RNAs (including microRNAs),
`as well as unannotated regions. Other novel types of
`alterations in cancer may include rearrangements
`of repetitive elements, and recent studies have suggested
`that active retrotransposons in the human genome might
`contribute to cancer, so whole-genome sequencing
`would be informative in this regard44,45.
`Two important issues to consider when planning
`whole-genome sequencing experiments are depth of
`coverage and physical coverage. Sequence depth is meas-
`ured by the amount of over-sampling: typically, to detect
`nucleotide alterations with high sensitivity, the 3 billion
`bases of the human genome are covered at least 30-fold
`on average, requiring the generation of 90 billion bases of
`sequence data per sample10–15,35. For cancer samples, this
`number needs to be increased to account for the decreased
`purity and often increased ploidy of each sample.
`
`Jumping library
`A method of library
`construction in which the
`genome is divided into large
`fragments using a rare cutter
`enzyme. Fragments are
`circularized and DNA
`sequences are read from
`the ends of the fragment,
`without reading the
`intervening sequence.
`
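The depth arithmetic for whole-genome sequencing (3 billion bases at 30-fold coverage giving 90 billion bases) can be written out as follows. This is a rough sketch with hypothetical function names. The last function is an assumption introduced here for illustration only: it scales depth so that a mutation at a low allele fraction yields the same expected number of mutant reads as a heterozygous variant in a pure diploid sample. The text states that depth must increase for impure or high-ploidy samples but does not prescribe this particular rule.

```python
def bases_required(genome_size, mean_depth):
    """Total bases of sequence needed for a given mean depth."""
    return genome_size * mean_depth


def reads_required(genome_size, mean_depth, read_length):
    """Number of reads of a given length needed (rounded up)."""
    total = bases_required(genome_size, mean_depth)
    return -(-total // read_length)  # ceiling division on integers


def depth_for_equal_mutant_reads(base_depth, allele_fraction,
                                 reference_fraction=0.5):
    """Illustrative assumption: depth giving the same expected number
    of mutant reads as a variant at reference_fraction at base_depth."""
    return base_depth * reference_fraction / allele_fraction


print(bases_required(3_000_000_000, 30))       # 90000000000
print(reads_required(3_000_000_000, 30, 100))  # 900000000
print(depth_for_equal_mutant_reads(30, 0.125)) # 120.0
```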
`Physical coverage is important for detecting rear-
`rangements and this detection is aided by analysis of
`‘paired reads’. In standard shotgun library methods, the
`fragments of DNA are typically 200–400 bases long, and
`second-generation sequencing technologies currently
`yield 50–100 base reads from each end of a fragment
`(known as paired reads). The expected distance between
`the paired reads is used to uniquely place the reads
`on the reference genome and unexpected read pairing
`can be used to detect structural anomalies.
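The idea of flagging unexpected read pairing can be sketched as a simple classifier. This is a minimal illustration with hypothetical names and labels: the default bounds mirror the 200 to 400 base fragments mentioned above, read orientation is ignored, and the distance between the two mapped positions is used as a stand-in for fragment length.

```python
def classify_pair(chrom1, pos1, chrom2, pos2, min_span=200, max_span=400):
    """Label a read pair by how its two ends map to the reference.

    min_span and max_span bound the expected fragment length in bases.
    Orientation of the reads is not considered in this sketch.
    """
    if chrom1 != chrom2:
        return "interchromosomal"
    span = abs(pos2 - pos1)
    if span > max_span:
        return "span_too_long"
    if span < min_span:
        return "span_too_short"
    return "concordant"


print(classify_pair("chr1", 1_000, "chr1", 1_300))
print(classify_pair("chr1", 1_000, "chr2", 5_000))
print(classify_pair("chr1", 1_000, "chr1", 60_000))
```

Pairs labelled as interchromosomal or with an unexpectedly long span are candidates for supporting a rearrangement or deletion; several independent pairs pointing to the same junction would normally be required before calling an event.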
`The distance between the paired reads can be
`increased to thousands of bases by the creation of jumping
`libraries, which can be constructed by generating large
`circular fragments of DNA4,13. This leads to higher physi-
`cal coverage of the genome with less sequence cover-
`age and, consequently, lower cost. For example, with
`3 kb spacing between pairs, the physical coverage of the
`genome is 10 times higher than with 300 bp inserts, so
`equivalent physical coverage can be obtained with 10
`times less sequence coverage (FIG. 1). Although powerful
`for the detection of structural rearrangements, the jump-
`ing library approach has two main limitations. First, with
`less total sequence, the coverage at any given position is
`lower, therefore the sensitivity to observe base changes
`such as point mutations is correspondingly lower.
`Second, the jumping library approach requires large
`quantities of high-quality input DNA, which may not
`be possible with all clinical cancer samples, especially
`those derived from FFPE specimens.
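The trade-off between sequence coverage and physical coverage can be checked with a small calculation. This is a sketch under simple assumptions: fragments are placed uniformly, both ends of every fragment are read, and the function names and the example numbers (30 million pairs, 50-base reads) are hypothetical.

```python
def sequence_coverage(n_pairs, read_length, genome_size):
    """Mean number of sequenced reads covering each base."""
    return n_pairs * 2 * read_length / genome_size


def physical_coverage(n_pairs, fragment_length, genome_size):
    """Mean number of fragments spanning each base, counting the
    unsequenced middle of each fragment."""
    return n_pairs * fragment_length / genome_size


genome = 3_000_000_000
pairs = 30_000_000  # hypothetical number of read pairs

print(sequence_coverage(pairs, 50, genome))     # 1.0
print(physical_coverage(pairs, 300, genome))    # 3.0
print(physical_coverage(pairs, 3_000, genome))  # 30.0
```

For the same number of read pairs, and therefore the same sequence coverage, the 3 kb library gives ten times the physical coverage of the 300 bp library, which matches the factor quoted in the text.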
`
`Exome sequencing. Targeted sequencing approaches
`have the general advantage of increased sequence cov-
`erage of regions of interest — such as coding exons of
`genes — at lower cost and higher throughput compared
`with random shotgun sequencing. Most large-scale
`methods for targeted sequencing use a variation of a
`hybrid selection approach (FIG. 2): nucleic acid ‘baits’ are
`used to ‘fish’ for regions of interest in the total pool of
`nucleic acids, which can be DNA46–49 or RNA50. Any sub-
`set of the genome can be targeted, including exons, non-
`coding RNAs, highly conserved regions of the genome
`or other regions of interest.
`Analysis of selected sets of exons using capillary-based
`sequencing has been a powerful and effective approach
`to focus DNA sequencing efforts on the coding genes of
`greatest interest. For example, capillary sequencing
`of exons from specific gene families has led to the discov-
`ery of activating somatic mutations in various cancers,
`such as the BRAF serine–threonine kinase51, the EGFR,
`ERBB2, fibroblast growth factor receptor 2 (FGFR2),
`JAK2, and ALK receptor tyrosine kinases52–66, and the
`PIK3CA and PIK3R1 lipid kinase subunits28,67. Whole-
`exome sequencing with capillary sequencing allowed the
`analysis of all known coding genes in colorectal, breast and
`pancreatic carcinomas and glioblastoma68–71. These studies
`have led to the discovery of somatic mutations in iso-
`citrate dehydrogenase 1 (IDH1) in glioblastoma69 and of
`germline mutations in the gene encoding partner and
`
`
`[Figure 2 diagram: tumour material and matched normal (blood) are processed in parallel through DNA isolation; mixing of the tumour or normal DNA (the ‘pond’) with gene-specific oligonucleotides (the ‘baits’); hybridization; elution; sequencing; and alignment to the gene reference sequence. Result shown: somatic mutation ‘A’, with evidence in the tumour and none in the normal.]
`Figure 2 | Sequence capture for cancer genomics. A schematic diagram of hybrid selection to capture specific
`regions of the genome from tumour DNA (left panel, blue) and normal DNA (right panel, red). DNA from the starting
`material (the ‘pond’) is sheared and hybridized to oligonucleotides that are specific for the regions of interest (for
`example, exons in genes from a particular pathway or the whole exome; the ‘baits’). The baits have a tag that allows
`them to be isolated (for example, by immobilization on beads). The captured DNA is eluted, prepared into sequencing
`libraries, sequenced and aligned to the bait sequences. Because this technique allows greater depth of coverage for
`the regions of interest, somatic mutations in the tumour DNA can be detected from admixed populations containing
`tumour and normal DNA-derived reads.
`
`localizer of BRCA2 (PALB2) in patients with pancreatic
`carcinoma72, among other important findings.
`However, second-generation sequencing is a more
`efficient and comprehensive technology for whole-
`exome sequence analysis than capillary-based sequenc-
`ing and is becoming increasingly routine8,9. Because the
`exome represents only approximately 1% of the genome,
`or about 30 Mb, vastly higher sequence coverage can be
`readily achieved using second-generation sequencing
`platforms with considerably less raw sequence and cost
`than whole-genome sequencing. For example, whereas
`90 Gb of sequence is required to obtain 30-fold aver-
`age coverage of the genome, 75-fold average coverage is
`achieved for the exome with only 3 Gb of sequence using
`the current state-of-the-art platforms for targeting73.
`However, there are inefficiencies in the targeting proc-
`ess. For example, uneven capture efficiency across
`exons can mean that not all exons are sequenced and
`some off-target hybridization can occur. These inef-
`ficiencies are likely to be ameliorated as sequencing
`and capture technology continue to improve.
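The exome figures quoted above imply an overall capture efficiency, which can be made explicit. This is a back-of-the-envelope sketch: the function names are hypothetical, and all losses (off-target reads, duplicates, uneven capture) are lumped into a single usable fraction, which is a simplification.

```python
def mean_target_coverage(total_bases, target_size, usable_fraction):
    """Mean depth over the target given the share of sequenced bases
    that end up usefully on target."""
    return total_bases * usable_fraction / target_size


def implied_usable_fraction(total_bases, target_size, mean_coverage):
    """Share of sequenced bases that must land on target to reach the
    stated mean coverage."""
    return mean_coverage * target_size / total_bases


# Figures from the text: 3 Gb of sequence, a 30 Mb exome, 75-fold coverage.
print(implied_usable_fraction(3_000_000_000, 30_000_000, 75))  # 0.75

# Sequence needed relative to a 90 Gb whole genome at 30-fold coverage.
print(90_000_000_000 / 3_000_000_000)  # 30.0
```

Under these assumptions about three quarters of the sequenced bases contribute to exome coverage, and the exome experiment uses 30 times less raw sequence than the whole-genome experiment while reaching higher depth on its targets.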
`The higher coverage of the exome that can be affordably
`achieved for a large number of samples makes exome
`sequencing highly suitable for mutation discovery in
`cancer samples of mixed purity. In addition, the hybrid
`selection approach will be particularly powerful for
`diagnostic analysis of the cancer genome; for diagnosis,
`there may be interest in sequencing specific oncogenes74
`and/or tumour suppressor genes at very high coverage in
`samples with a low percentage of tumour cells21.
`
`Transcriptome sequencing. Second-generation sequencing
`of the transcriptome (RNA-seq) — as cDNA derived
`from mRNA, total RNA or other RNAs such as micro-
`RNAs — is a powerful approach for understanding can-
`cer. Transcriptome sequencing is a sensitive and efficient
`approach to detect intragenic fusions, including in-frame
`fusion events that lead to oncogene activation6,7,75,76.
`Transcriptome sequencing can also be used to detect
`somatic mutations but finding a matched normal sample
`for comparison is a challenge, as normal tissue is unlikely
`to express exactly the same genes as the tumour sample.
`Furthermore, mutation detection in genes expressed at
`low levels is hampered owing to lack of statistical power.
`Also, the possibilities of reverse transcriptase errors
`and RNA editing15 need to be considered. Nevertheless,
`important somatic nucleotide substitution mutations
`have been discovered by transcriptome sequencing, most
`notably recurrent mutations in the forkhead box L2 gene
`(FOXL2) in ovarian granulosa cell tumours77.
`RNA-seq also allows analysis of gene expression pro-
`files and is particularly powerful for identifying tran-
`scripts with low-level expression, which means that these
`transcripts can be included in tumour classification
`metrics78. RNA-seq may soon be competitive with oligo-
`nucleotide microarray technologies in terms of the cost
`and efficiency of gene expression analysis. Furthermore,
`transcriptome sequencing provides the advantage of not
`being limited to known genes but can also include the
`detection of novel transcripts, alternative splice forms
`and non-human transcripts.
`
`Detecting classes of genome alterations
`In contrast to previously available genome technologies,
`such as first-generation sequencing and array-based
`methods, second-generation sequencing methods can
`provide a comprehensive picture of the cancer genome
`by detecting each of the major alterations in the cancer
`genome (FIG. 3). Here we describe the analysis of each
`type of alteration briefly.
`
`Somatic nucleotide substitutions and small insertion
`and deletion mutations. Nucleotide substitution muta-
`tions are the most common known somatic genomic
`alteration in cancer, occurring typically at the rate of
`about one somatic nucleotide substitution per million
`nucleotides12,13,15,28,79; insertion and deletion mutations
`are approximately tenfold less common in most can-
`cer specimens. However, the rate of mutations varies
`greatly between cancer specimens. For example, ultra-
`violet radiation-induced melanomas have on the order
`of ten mutations per million bases12 and hypermutated
`tumours with defects in DNA repair genes can reach
`rates of tens of mutations per million bases28,79. By con-
`trast, haematopoietic malignancies can have less than
`one mutation per million bases10,11. Therefore, statistical
`analyses to assess mutation significance must take these
`sample-to-sample variations into account.
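One way to see why sample-specific background rates matter is a simple Poisson calculation of how surprising a given number of mutations in a gene is. This is a deliberately simplified sketch, not the method used in the cited studies: the function names, the 1,500-base gene and the cohort of 100 samples are hypothetical, and effects such as nucleotide context or gene-specific rates are ignored.

```python
from math import exp


def expected_mutations(gene_length, rates_per_base):
    """Expected background mutations in one gene, summed over samples
    that each have their own per-base mutation rate."""
    return sum(gene_length * rate for rate in rates_per_base)


def poisson_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    term = exp(-expected)  # P(X = 0)
    below = 0.0
    for k in range(observed):
        below += term
        term *= expected / (k + 1)
    return max(0.0, 1.0 - below)


gene_length = 1_500                # hypothetical coding length in bases
typical_cohort = [1e-6] * 100      # about one mutation per million bases
high_rate_cohort = [1e-5] * 100    # about ten mutations per million bases

for cohort in (typical_cohort, high_rate_cohort):
    mu = expected_mutations(gene_length, cohort)
    print(mu, poisson_tail(5, mu))
```

Five mutations in the gene are far less surprising in the high-rate cohort than in the typical one, so the same count can be significant in one tumour type and unremarkable in another.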
`Various computational methods have been devel-
`oped to determine the presence of somatic mutations
`using second-generation sequence data80. The detection
`of somatic mutations in cancer requires mutation call-
`ing in both the tumour DNA and the matched normal
`DNA, coupled with comparison to a reference genome
`and an assessment of the statistical significance of the
`number of counts of the mutation in the cancer sequence
`and its absence in the matched normal sequence. False-
`positive genome alteration calls are of two types: inac-
`curate detection of an event in the tumour, when the
`tumour and normal are both wild-type; and detection of
`a germline event in the tumour but failure to detect it in
`the normal. Different sources of noise contribute to the
`two types of false positives. The first type of error can
`be due to machine-sequencing errors, incorrect local
`alignment of individual reads and discordant alignment
`of pairs. Stochastic errors such as machine errors can
`be eliminated by high-level over-sampling of tumour
`and normal DNA sequence with sufficiently stringent
`statistical thresholds for mutation calling. The second
`type of false-positive mutation call is caused by failures
`to detect the germline alleles that differ from the
`reference sequence in the normal sample, mostly owing
`to insufficient coverage.
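The second type of false positive can be quantified roughly by asking how often a heterozygous germline allele goes unseen in the normal sample. This is a minimal sketch under idealized assumptions (independent reads, no sequencing error, no mapping bias, an expected allele fraction of exactly 0.5); the function name and the `min_alt_reads` parameter are hypothetical.

```python
from math import comb


def prob_miss_germline_het(normal_depth, min_alt_reads=1):
    """Probability that fewer than min_alt_reads reads in the normal
    sample carry a heterozygous germline allele (binomial, p = 0.5)."""
    return sum(comb(normal_depth, i) for i in range(min_alt_reads)) \
        / 2 ** normal_depth


for depth in (5, 10, 30):
    print(depth, prob_miss_germline_het(depth))
```

At a depth of 5 the allele is missed entirely about 3% of the time, which across millions of germline variants would produce many spurious somatic calls; at 30-fold coverage the probability becomes negligible under these assumptions.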
`In general, the most common cause of false-negative
`mutation calls is insufficient coverage of the cancer
`DNA. As discussed above, increased over-sampling may
`be required to overcome sample admixture, tumour het-
`erogeneity and variations in ploidy (genome-wide and
`local).
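The link between coverage and false negatives can be sketched with a binomial model of mutant read counts. This is an illustration under stated assumptions: reads are independent, a mutation is called when at least `min_alt_reads` reads support it, and sequencing errors are ignored. The function names, the threshold of three reads and the 95% power target are hypothetical choices.

```python
from math import comb


def detection_power(depth, allele_fraction, min_alt_reads):
    """Probability of seeing at least min_alt_reads mutant reads."""
    miss = sum(
        comb(depth, i) * allele_fraction ** i
        * (1 - allele_fraction) ** (depth - i)
        for i in range(min_alt_reads)
    )
    return 1.0 - miss


def min_depth_for_power(allele_fraction, min_alt_reads=3, power=0.95,
                        max_depth=10_000):
    """Smallest depth reaching the target power, or None if max_depth
    is not enough."""
    for depth in range(min_alt_reads, max_depth + 1):
        if detection_power(depth, allele_fraction, min_alt_reads) >= power:
            return depth
    return None


# A heterozygous mutation in a pure diploid sample versus the 12.5%
# allele fraction discussed earlier for an impure, higher-ploidy sample.
print(min_depth_for_power(0.5))
print(min_depth_for_power(0.125))
```

The lower the expected allele fraction, the greater the depth needed to keep the false-negative rate at the same level, which is the quantitative reason why admixture, heterogeneity and ploidy call for increased over-sampling.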
`The identification of candidate mutations associated
`with cancer then leads to two questions: is the specific
`mutation or the set of alterations in a particular gene
`statistically significant across all samples, and is the
`
`
`[Figure 3 panel labels: reference sequence (Chr 1, Chr 5 and non-human sequence); point mutation; indel; copy number alterations (homozygous deletion, hemizygous deletion, gain); translocation breakpoint; pathogen.]
`Figure 3 | Types of genome alterations that can be detected by seco