throbber
R E V I E W S
`
` S T U DY D E S I G N S
`
`Genotype and SNP calling from
`next-generation sequencing data
`
`Rasmus Nielsen*‡§, Joshua S. Paul||, Anders Albrechtsen‡ and Yun S. Song§||
`
`Abstract | Meaningful analysis of next-generation sequencing (NGS) data, which are
`produced extensively by genetics and genomics studies, relies crucially on the accurate
`calling of SNPs and genotypes. Recently developed statistical methods both improve
`and quantify the considerable uncertainty associated with genotype calling, and will
`especially benefit the growing number of studies using low- to medium-coverage data.
`We review these methods and provide a guide for their use in NGS studies.
`
`Likelihoods
`Functions expressing the
`probability of observing
`the data — for example,
`next-generation sequencing
`data — given a parameter,
`such as a genotype or an
`allele frequency.
`
`*Department of Integrative
`Biology, University of
`California, Berkeley,
`California 94720, USA.
`‡Centre for Bioinformatics,
`University of Copenhagen,
`Universitetsparken 15, 2100
`Copenhagen Ø, Denmark.
`§Department of Statistics,
`University of California,
`Berkeley, California 94720,
`USA.
`||Department of Electrical
`Engineering and Computer
`Sciences, University of
`California, Berkeley,
`California 94720, USA.
`Correspondence to R.N. 
`and Y.S.S. 
`e‑mails: rasmus_nielsen@
`berkeley.edu; yss@stat.
`berkeley.edu
`doi:10.1038/nrg2986
`
`Next-generation sequencing (NGS) methods1 provide
`cheap and reliable large-scale DNA sequencing. They
`are used extensively for de novo sequencing2, for disease
`mapping3, for quantifying expression levels through RNA
`sequencing4–6 and in population genetic studies7–9.
`In NGS methods, a whole genome, or targeted
`regions of the genome, is randomly digested into small
`fragments (or short reads) that get sequenced and are
`then either aligned to a reference genome or assembled10.
`Having aligned the fragments of one or more individuals
`to a reference genome, ‘SNP calling’ identifies variable
`sites, whereas ‘genotype calling’ determines the genotype
`for each individual at each site.
`NGS data can suffer from high error rates that are
`due to multiple factors, including base-calling and align-
`ment errors. Moreover, many NGS studies rely on low-
`coverage sequencing (<5× per site per individual, on
`average), for which there is high probability that only one
`of the two chromosomes of a diploid individual has been
`sampled at a specified site. Under such circumstances,
`accurate SNP calling and genotype calling are difficult,
`and there is often considerable uncertainty associated
`with the results. It is crucial to quantify and account for
`this uncertainty, as it influences downstream analyses
`based on the inferred SNPs and genotypes, such as the
`identification of rare mutations, the estimation of allele
`frequencies and association mapping.
`One method for reducing uncertainty associated with
`genotype and SNP calling is to sequence target regions
`deeply (at >20× coverage). However, the ever-increasing
`demand for larger samples suggests that medium-
`(5–20×) or low-coverage sequencing will be the most
`common and cost-effective study design in many appli-
`cations of NGS for years to come. For example, the 1000
`Genomes Project pilot phase9 relied on approximately
`
`3× coverage to sequence 176 individuals genome-wide.
`For the identification of low-frequency variants, this
`design is more cost-efficient than deeper sequencing in
`fewer individuals. Likewise, in association studies, map-
`ping power is typically maximized by sequencing many
`individuals at low depth11, rather than sequencing
`fewer individuals at a high depth.
`Alternatively, reducing and quantifying the uncer-
`tainty associated with SNP and genotype calling may
`be accomplished using sophisticated algorithms; there-
`fore, these have recently been the subject of extensive
`research9,12–15. Most contemporary algorithms use a
`probabilistic framework. So-called ‘genotype likelihoods’
`— which incorporate errors that may have been intro-
`duced in base calling, alignment and assembly — are
`coupled with prior information, such as allele frequen-
`cies and patterns of linkage disequilibrium (LD). The
`result is a SNP and genotype call and an associated
`measure of uncertainty (which is often described by a
`‘quality score’), both of which have a concrete statistical
`interpretation.
`Here we review this research and provide general
`guidelines for genotype and SNP calling in NGS stud-
`ies. Converting the raw output of NGS technology into a
`final set of SNP and genotype data involves a number of
`steps (summarized in FIG. 1), each of which contributes
`to the accuracy of the final SNP and genotype calls. We
`start at the beginning of this process by briefly review-
`ing recent developments in the methods used for base
`calling and alignment. We then review and discuss sev-
`eral recent algorithms for SNP and genotype calling and
`address how the uncertainties in the resulting calls can
`be accommodated in downstream analyses. Finally, we
`make some general recommendations for the analysis
`of NGS data.
`
`NATURE REVIEWS | GENETICS
`
` VOLUME 12 | JUNE 2011 | 443
`
`© 2011 Macmillan Publishers Limited. All rights reserved
`
`Page 443
`
`FOUNDATION EXHIBIT 1030
`IPR2019-00634
`
`

`

`R E V I E W S
`
`Base calling and alignment
`The main principle underlying NGS technologies is
`sequencing-by-synthesis. In brief, tens-to-hundreds
`of millions of clusters of small ssDNA templates are
`‘read’ simultaneously by sequentially building up com-
`plementary bases. The synthesis process is captured in
`a series of fluorescence images, and base-calling algo-
`rithms infer the actual nucleotide information from the
`obtained fluorescence-intensity data for each cluster of
`DNA templates. They then assign a measure of uncer-
`tainty (or quality score) to each base call. The result-
`ing short-read data are then assembled into a genome.
`When a reference genome is available, the primary
`approach used to assemble a newly sequenced genome
`is alignment (also known as ‘read mapping’), in which
`the basic task is to align each short read onto an avail-
`able reference genome. Here we review the key aspects
`of base calling and alignment.
`
`(cid:43)(cid:79)(cid:67)(cid:73)(cid:71)(cid:2)(cid:67)(cid:80)(cid:67)(cid:78)(cid:91)(cid:85)(cid:75)(cid:85)(cid:2)(cid:67)(cid:80)(cid:70)(cid:2)(cid:68)(cid:67)(cid:85)(cid:71)(cid:2)(cid:69)(cid:67)(cid:78)(cid:78)(cid:75)(cid:80)(cid:73)
`
`(cid:52)(cid:71)(cid:67)(cid:70)(cid:2)(cid:79)(cid:67)(cid:82)(cid:82)(cid:75)(cid:80)(cid:73)
`
`(cid:52)(cid:71)(cid:67)(cid:78)(cid:75)(cid:73)(cid:80)(cid:14)(cid:2)(cid:84)(cid:71)(cid:79)(cid:81)(cid:88)(cid:71)(cid:2)(cid:70)(cid:87)(cid:82)(cid:78)(cid:75)(cid:69)(cid:67)(cid:86)(cid:71)(cid:2)(cid:84)(cid:71)(cid:67)(cid:70)(cid:85)(cid:2)(cid:67)(cid:80)(cid:70)(cid:2)(cid:84)(cid:71)(cid:69)(cid:67)(cid:78)(cid:75)(cid:68)(cid:84)(cid:67)(cid:86)(cid:71)(cid:2)(cid:83)(cid:87)(cid:67)(cid:78)(cid:75)(cid:86)(cid:91)(cid:2)(cid:85)(cid:69)(cid:81)(cid:84)(cid:71)(cid:85)
`
`(cid:47)(cid:87)(cid:78)(cid:86)(cid:75)(cid:15)(cid:85)(cid:67)(cid:79)(cid:82)(cid:78)(cid:71)(cid:2)(cid:69)(cid:67)(cid:78)(cid:78)(cid:75)(cid:80)(cid:73)
`
`(cid:53)(cid:75)(cid:80)(cid:73)(cid:78)(cid:71)(cid:15)(cid:85)(cid:67)(cid:79)(cid:82)(cid:78)(cid:71)(cid:2)(cid:69)(cid:67)(cid:78)(cid:78)(cid:75)(cid:80)(cid:73)
`
`(cid:50)(cid:84)(cid:81)(cid:79)(cid:81)(cid:86)(cid:71)(cid:2)(cid:69)(cid:67)(cid:80)(cid:70)(cid:75)(cid:70)(cid:67)(cid:86)(cid:71)(cid:2)(cid:53)(cid:48)(cid:50)(cid:2)
`(cid:85)(cid:71)(cid:86)(cid:2)(cid:67)(cid:80)(cid:70)(cid:2)(cid:73)(cid:71)(cid:80)(cid:81)(cid:86)(cid:91)(cid:82)(cid:71)(cid:2)(cid:69)(cid:67)(cid:78)(cid:78)(cid:85)(cid:2)(cid:87)(cid:85)(cid:75)(cid:80)(cid:73)(cid:2)
`(cid:80)(cid:81)(cid:80)(cid:15)(cid:78)(cid:75)(cid:80)(cid:77)(cid:67)(cid:73)(cid:71)(cid:15)(cid:68)(cid:67)(cid:85)(cid:71)(cid:70)(cid:14)(cid:2)(cid:79)(cid:87)(cid:78)(cid:86)(cid:75)(cid:15)
`(cid:85)(cid:67)(cid:79)(cid:82)(cid:78)(cid:71)(cid:2)(cid:67)(cid:80)(cid:67)(cid:78)(cid:91)(cid:85)(cid:75)(cid:85)(cid:2)
`
`(cid:52)(cid:71)(cid:558)(cid:80)(cid:71)(cid:2)(cid:69)(cid:67)(cid:80)(cid:70)(cid:75)(cid:70)(cid:67)(cid:86)(cid:71)(cid:2)(cid:53)(cid:48)(cid:50)(cid:2)(cid:85)(cid:71)(cid:86)(cid:2)
`(cid:67)(cid:80)(cid:70)(cid:2)(cid:73)(cid:71)(cid:80)(cid:81)(cid:86)(cid:91)(cid:82)(cid:71)(cid:2)(cid:69)(cid:67)(cid:78)(cid:78)(cid:75)(cid:80)(cid:73)(cid:2)(cid:87)(cid:85)(cid:75)(cid:80)(cid:73)(cid:2)
`(cid:78)(cid:75)(cid:80)(cid:77)(cid:67)(cid:73)(cid:71)(cid:15)(cid:68)(cid:67)(cid:85)(cid:71)(cid:70)(cid:2)(cid:67)(cid:80)(cid:67)(cid:78)(cid:91)(cid:85)(cid:75)(cid:85)
`
`(cid:43)(cid:70)(cid:71)(cid:80)(cid:86)(cid:75)(cid:72)(cid:91)(cid:2)(cid:53)(cid:48)(cid:50)(cid:85)(cid:2)(cid:67)(cid:80)(cid:70)(cid:2)
`(cid:67)(cid:85)(cid:85)(cid:81)(cid:69)(cid:75)(cid:67)(cid:86)(cid:71)(cid:70)(cid:2)(cid:73)(cid:71)(cid:80)(cid:81)(cid:86)(cid:91)(cid:82)(cid:71)(cid:85)(cid:2)(cid:87)(cid:85)(cid:75)(cid:80)(cid:73)(cid:2)
`(cid:85)(cid:75)(cid:80)(cid:73)(cid:78)(cid:71)(cid:15)(cid:85)(cid:67)(cid:79)(cid:82)(cid:78)(cid:71)(cid:2)(cid:67)(cid:80)(cid:67)(cid:78)(cid:91)(cid:85)(cid:75)(cid:85)
`
`(cid:53)(cid:48)(cid:50)(cid:2)(cid:558)(cid:78)(cid:86)(cid:71)(cid:84)(cid:75)(cid:80)(cid:73)(cid:2)(cid:67)(cid:80)(cid:70)(cid:2)(cid:53)(cid:48)(cid:50)(cid:2)(cid:81)(cid:84)(cid:2)(cid:73)(cid:71)(cid:80)(cid:81)(cid:86)(cid:91)(cid:82)(cid:71)(cid:2)(cid:83)(cid:87)(cid:67)(cid:78)(cid:75)(cid:86)(cid:91)(cid:2)(cid:85)(cid:69)(cid:81)(cid:84)(cid:71)(cid:2)(cid:84)(cid:71)(cid:69)(cid:67)(cid:78)(cid:75)(cid:68)(cid:84)(cid:67)(cid:86)(cid:75)(cid:81)(cid:80)
`
`Figure 1 | Steps for converting raw next-generation
`sequencing data into a final set of SNP or genotype
`calls. Pre-processing steps (shown in yellow) transform
`(cid:48)(cid:67)(cid:86)(cid:87)(cid:84)(cid:71)(cid:2)(cid:52)(cid:71)(cid:88)(cid:75)(cid:71)(cid:89)(cid:85)(cid:2)(cid:94)(cid:2)(cid:41)(cid:71)(cid:80)(cid:71)(cid:86)(cid:75)(cid:69)(cid:85)
`the raw data from next-generation sequencing technology
`into a set of aligned reads that have a measure of
`confidence, or quality score, associated with the bases
`of each read. The per-base quality scores produced by
`base-calling algorithms may need to be recalibrated to
`accurately reflect the true error rates. Depending on the
`number of samples and the depth of coverage, either a
`multi-sample calling procedure (green) or a single-sample
`calling procedure (orange) may then be applied to obtain
`SNP or genotype calls and associated quality scores.
`Note that the multi-sample procedure may include a
`linkage-based analysis, which can substantially improve
`the accuracy of SNP or genotype calls. Finally,
`post-processing (purple) uses both known data and
`simple heuristics to filter the set of SNPs and/or improve
`the associated quality scores. Optional, although
`recommended, steps are shown in dashed lines.
`
`Base calling and quality scores. Base-calling procedures
`vary according to the sequencing platform used, all of
`which are prone to a different type of error. For the 454
`platform, base calling involves inferring the length of
`each homopolymer from the observed fluorescence
`intensity. The main challenge stems from the fact that
`the variance of signal intensity for a specific homopoly-
`mer length is large, resulting in high error rates in inser-
`tion and deletion (indel) calls. For the Illumina platform,
`indel errors are rare, but the overall miscall error rate
`is typically around 1%. Here, the main complication
`arises from the synthesis process becoming desyn-
`chronized between different copies of DNA templates
`in the same cluster. Base calling becomes less accurate in
`later cycles as the extent of asynchrony is exacerbated
`with each sequencing cycle. The SOLiD platform uses
`a two-base encoding scheme in which each fluorescent
`dye colour represents four dinucleotide combinations.
`Each base of the DNA template is examined twice
`in this system and a length m nucleotide sequence is
`represented as a length m – 1 colour sequence. A
`major complication in ‘colour calling’ arises from
`biases in fluorescence intensities that appear in later
`machine cycles.
`In addition to identifying nucleotides, base-calling
`algorithms produce per-base quality scores by using
`noise estimates from image analysis. Some sequencing
`platforms adopt quality values that are defined specifi-
`cally for the platforms, but those quality values can be
`easily transformed into the standard Phred16 quality
`score, given by
`
` QPhred = -10 log10 P(error).
`
`
`
`(1)
`
`Note that a Phred score of 20 corresponds to a 1% error
`rate in base calling.
`The typical error rate of NGS data ranges from a few
`tenths of a per cent to several per cent, depending on
`the platform. Reducing the error rate of base calls and
`improving the accuracy of the per-base quality score
`have important implications for assembly, polymor-
`phism detection and downstream population-genomic
`analyses. As such, several base-calling algorithms have
`been developed to optimize data acquisition for the
`more widely used NGS platforms: examples include
`Pyrobayes17 for the 454 platform; Rsolid18 for the
`SOLiD platform; and Ibis19 and BayesCall20,21 for
`the Illumina platform. These algorithms provide
`~5–30% improvement in error rates over the base-
`calling methods developed by the manufacturers of the
`NGS platforms, and it has been shown that improved
`base-call accuracy can lead to a significant reduction in
`false-positive SNP calls and facilitate assembly when the
`coverage is low to moderate. However, some of the new
`methods tend to be either too computationally inten-
`sive to be of broad practical use or have not been tested
`thoroughly. Although more-accurate image analysis and
`base-calling algorithms for NGS platforms continue to
`be developed, the default software packages currently
`accompanying NGS platforms are the ones that are most
`widely adopted by users.
`
`444 | JUNE 2011 | VOLUME 12
`
` www.nature.com/reviews/genetics
`
`© 2011 Macmillan Publishers Limited. All rights reserved
`
`Page 444
`
`

`

`R E V I E W S
`
`Alignment and assembly. The accuracy of the align-
`ment has a crucial role in variant detection. Incorrectly
`aligned reads may lead to errors in SNP and genotype
`calling, so it is important for alignment algorithms to
`be able to cope with sequencing errors, as well as with
`potentially real differences (both point mutations and
`indels) between the reference genome and the sequenced
`genome that are due to polymorphisms. Furthermore,
`it is important for aligners to produce well-calibrated
`alignment (or mapping) quality values, as variant calls
`and their posterior probabilities depend on those scores.
`The amount of sequence identity required between
`each read and the reference sequence is determined by a
`trade-off between accuracy and read depth. The optimal
`choice of the tolerable number of mismatches may differ
`between different organisms. For example, as popula-
`tions of Drosophila melanogaster are more variable than
`human populations, using mapping criteria that are opti-
`mized for analyses of human sequences may lead to a
`severe loss of sequencing depth in D. melanogaster. This,
`in turn, may lead to a potential for biases in the down-
`stream analyses, as regions that harbour many natural
`polymorphisms will be underrepresented. Likewise,
`using alignment criteria that are appropriate for fruitflies
`in humans would lead to a large amount of incorrectly
`aligned reads.
`Most alignment algorithms for NGS data are based
`on either ‘hashing’ or an effective data compression algo-
`rithm called the ‘Burrows–Wheeler transform’ (BWT)22.
`BWT-based aligners (for example, Bowtie23, SOAP2
`(REF. 15) and BWA24) are fast, memory-efficient and par-
`ticularly useful for aligning repetitive reads; however,
`they tend to be less sensitive than the state-of-the-art
`hash-based algorithms (for example, MAQ12, Novoalign
`and Stampy25). The Novoalign and Stampy aligners cur-
`rently produce the most accurate overall results, while
`also being practical in terms of running time (see REF. 25
`for a detailed comparison of the performance of various
`aligners).
`In general, alignment is more difficult for regions
`with higher levels of diversity between the reference
`genome and the sequenced genome. This difficulty can
`be ameliorated by the use of longer reads and paired-
`end reads (see REF. 25 for further quantitative details).
`However, assembling highly diverse regions such as
`the major histocompatibility complex (MHC) remains
`a challenge. Using de novo assembly algorithms —
`which are based on graph-theoretic ideas26–30 that try to
`exploit overlap information to stitch together the reads
`into contiguous sequences — may provide a viable
`solution to this challenge. Combining such methods
`with alignment to study genetic variation in complex
`regions is likely to be an active area of research in the
`forthcoming years.
`
`Recalibration of per-base quality scores. The raw
`Phred-scaled quality scores produced by base-calling
`algorithms may not accurately reflect the true base-
`calling error rates14,15,31. In such a case, the raw quality
`scores need to be recalibrated so that a Phred score of Q
`more-accurately corresponds to an error rate of 10- Q/10,
`
`as implied by equation 1. Obtaining well-calibrated qual-
`ity scores is important, as SNP and genotype calling at
`a specific position in the genome depends on both the
`base calls and the per-base quality scores of the reads
`overlapping the position.
`In SOAPsnp14,15, per-base quality scores are recali-
`brated by comparing a sequenced genome to the refer-
`ence genome at sites with no known SNPs. A related
`alignment-based recalibration algorithm has been
`implemented in the GATK software32,33, which takes
`into account several covariates such as machine cycle
`and dinucleotide context. For all supposedly non-
`polymorphic sites, the bases that align to those sites are
`put into different categories classified by the following
`features: the raw quality score (produced by base call-
`ing), the position of the base in the read, the dinucle-
`otide context and the read group. For each category, the
`algorithm estimates the empirical quality score by using
`the number of mismatches with respect to the reference
`genome. Recalibrated quality scores are then estimated
`by adding to the raw quality scores the residual differ-
`ences between empirical quality scores and the mis-
`match rates implied by the raw quality scores, which
`are conditioned on various subsets of the features. This
`recalibration algorithm, which is adopted in the 1000
`Genomes Project9, can be applied to various sequenc-
`ing platforms. As described above, the algorithm uses a
`set of supposedly non-polymorphic sites. If a compre-
`hensive database of known SNPs is not available for the
`species under consideration, then one can first identify
`candidate polymorphic sites that are highly likely to
`be real and use the remaining sites in the recalibration
`procedure. In such a case, another round of SNP calling
`should be performed with recalibrated quality scores.
`
`Genotype and SNP calling
`The process of converting base calls and quality scores
`into a set of genotypes for each individual in a sample is
`often divided into two steps: genotype calling and SNP
`calling. SNP calling aims to determine in which posi-
`tions there are polymorphisms or in which positions at
`least one of the bases differs from a reference sequence;
`the latter is also sometimes referred to as ‘variant call-
`ing’. Genotype calling is the process of determining
`the genotype for each individual and is typically only
`done for positions in which a SNP or a ‘variant’ has
`already been called. We use the word ‘calling’ here to
`signify the estimation of one unique SNP or genotype.
`However, we note that some analyses can proceed with-
`out determining the exact identity of each genotype,
`but instead allow uncertainty regarding genotypes to be
`incorporated directly into the analyses.
`Genotype and SNP calling can proceed, as in early
`studies, by counting alleles at each site and using simple
`cutoff rules for when to call a SNP or genotype. More-
`recent methods incorporate uncertainty in a probabilis-
`tic framework. In the probabilistic framework, it is also
`possible to further incorporate additional information
`regarding allele frequencies and/or patterns of LD. We
`review these different approaches below, starting with
`the simple methods based on counting alleles.
`
`Posterior probabilities
`In this context, these are
`the probabilities of a
`particular genotype given
`observed data: they are
`calculated by incorporating
`the information from the
`next-generation sequencing
`data as well as some prior
`information.
`
`Hashing
`A procedure of creating a
`data structure that helps to
`accelerate alignment. It stores
`information about which reads
`or where in the reference
`genome a particular substring
`or subsequence occurs. Some
`hash-based aligners hash the
`reads, while others hash
`the reference genome.
`
`Paired-end reads
`Sequencing of both the
`forward and reverse template
`of a DNA molecule, which is
`possible by inserting a primer
`sequence between the two
`ends of the read. The use of
`this technique greatly helps
`to increase assembly and
`alignment accuracy.
`
`NATURE REVIEWS | GENETICS
`
` VOLUME 12 | JUNE 2011 | 445
`
`© 2011 Macmillan Publishers Limited. All rights reserved
`
`Page 445
`
`

`

`to a loss of information regarding individual read quali-
`ties. An additional disadvantage of this type of genotype
`calling is that it typically does not provide measures of
`uncertainty in the genotype inference. For this reason,
`several probabilistic methods have been developed that
`use the quality score to provide a posterior probability
`for each genotype12–15.
`In brief, it is assumed that one can compute a geno-
`type likelihood, p(X | G), for a genotype G. The symbol
`X represents, in this generic notation, all of the read data
`for a particular individual at a particular site. In conjunc-
`tion with a genotype prior, p(G), Bayes’ formula is used
`to calculate p(G | X), which is the posterior probability
`of genotype G. The genotype with the highest posterior
`probability is generally chosen, and this probability, or
`perhaps the ratio between the highest and the second
`highest probabilities, is used as a measure of confidence.
`The advantages of the probabilistic methods are that
`they provide measures of statistical uncertainty when
`calling genotypes, they lead to higher accuracy of geno-
`type calling, and they provide a natural framework for
`incorporating information regarding allele frequencies
`and patterns of LD.
`
`Calculating genotype likelihoods. The genotype like-
`lihood can be calculated using the quality scores for
`each read. Let Xi be the data in read i for a particular
`individual and a particular site with genotype G. The
`probability p(Xi | G) is then given by a simple rescaling
`of the quality score of Xi, and the genotype likelihood,
`p(X | G), can be calculated directly from the data by
`taking the product of p(Xi | G) over all i. There is an
`implicit assumption here of independence among reads,
`which may be violated in the presence of alignment
`errors or PCR artefacts. It has been suggested12 that a
`weighting scheme should be used that takes correlated
`errors into account. The genotype likelihoods can also
`be improved by recalibrating the per-base quality scores
`using empirical data, as discussed in the section regard-
`ing base calling and alignment. When genotype calling
`is preceded by SNP calling, the information from the
`SNP-calling step can be incorporated into the genotype-
`calling algorithm, leading to genotype likelihoods that
`are calculated by conditioning on the site containing a
`polymorphism.
`Martin et al.37 also suggested estimating error rates
`directly from the read data for each site independ-
`ently, instead of using quality scores. The advantage of
`such an approach is that genotype and SNP calling do
`not depend on the accuracy of the calculated quality
`scores. However, a disadvantage is that the considerable
`information regarding errors gained through the base-
`calling and alignment process is lost. These approaches
`are very recent, and no research has been done to sys-
`tematically compare the advantages and disadvantages
`of directly estimating error rates from the data. Ideally,
`error rates should be estimated from the data while
`incorporating information obtained during base calling
`and alignment.
`Methods for calculating genotype likelihoods will
`probably be a topic of much future research.
`
`Early methods for calling genotypes. Early NGS studies
`based both SNP and genotype calling on separate
`analyses of data from each individual sampled. Typically,
`analyses would first involve a filtering step in which only
`high-confidence bases would be kept. The most common
`cutoff used would be a Phred-type quality score of Q20
`(QPhred = 20). Genotype calling would then proceed for
`each individual by counting the number of times each
`allele is observed and using fixed cutoffs. SNP calling
`would then be performed based on the inferred geno-
`types. For example34,35, one would first use a Q20 filter
`and then call a heterozygous genotype if the proportion
`of the non-reference allele is between 20% and 80%; oth-
`erwise, a homozygous genotype would be called. This
`is a fairly standard procedure and works well when the
`sequencing depth is high (>20×), so that the probability
`of a heterozygous individual falling outside the 20–80%
`range is small. Related methods for genotype calling
`form the basis for the commercially available software
`in Roche’s GSMapper, the CLC Genomic Workbench
`and the DNSTAR Lasergene software. These methods
`can be improved by using more empirically determined
`cutoffs (as described in REF. 36).
`
`Probabilistic methods. For moderate or low sequencing
`depths, genotype calling based on fixed cutoffs will typi-
`cally lead to under-calling of heterozygous genotypes and
`the use of a simple filtering based on quality score leads
`
`(cid:41)(cid:35)(cid:54)(cid:45)(cid:115)(cid:36)(cid:71)(cid:67)(cid:73)(cid:78)(cid:71)
`(cid:41)(cid:35)(cid:54)(cid:45)(cid:2)(cid:79)(cid:87)(cid:78)(cid:86)(cid:75)(cid:15)(cid:85)(cid:67)(cid:79)(cid:82)(cid:78)(cid:71)
`(cid:41)(cid:35)(cid:54)(cid:45)(cid:2)(cid:85)(cid:75)(cid:80)(cid:73)(cid:78)(cid:71)(cid:15)(cid:85)(cid:67)(cid:79)(cid:82)(cid:78)(cid:71)
`
`(cid:19)(cid:16)(cid:18)(cid:18)
`
`(cid:18)(cid:16)(cid:27)(cid:23)
`
`(cid:18)(cid:16)(cid:27)(cid:18)
`
`(cid:18)(cid:16)(cid:26)(cid:23)
`
`(cid:18)(cid:16)(cid:26)(cid:18)
`
`(cid:41)(cid:71)(cid:80)(cid:81)(cid:86)(cid:91)(cid:82)(cid:71)(cid:2)(cid:69)(cid:67)(cid:78)(cid:78)(cid:2)(cid:67)(cid:69)(cid:69)(cid:87)(cid:84)(cid:67)(cid:69)(cid:91)
`
`(cid:18)
`
`(cid:18)(cid:16)(cid:26)
`
`(cid:18)(cid:16)(cid:22)
`(cid:18)(cid:16)(cid:20)
`(cid:18)(cid:16)(cid:24)
`(cid:50)(cid:84)(cid:81)(cid:82)(cid:81)(cid:84)(cid:86)(cid:75)(cid:81)(cid:80)(cid:2)(cid:81)(cid:72)(cid:2)(cid:80)(cid:81)(cid:80)(cid:15)(cid:69)(cid:67)(cid:78)(cid:78)(cid:85)
`(cid:48)(cid:67)(cid:86)(cid:87)(cid:84)(cid:71)(cid:2)(cid:52)(cid:71)(cid:88)(cid:75)(cid:71)(cid:89)(cid:85)(cid:2)(cid:94)(cid:2)(cid:41)(cid:71)(cid:80)(cid:71)(cid:86)(cid:75)(cid:69)(cid:85)
`Figure 2 | A comparison of three genotype callers.
`A subset of the data (chromosome 20, bases 20,000,000–
`25,000,000) for the 62 CEU individuals in both the
`HapMap Public Release no. 28 and the 1000 Genomes
`Pilot Project was genotype-called using the following
`methods: GATK Unified Genotyper32,33 applied to each
`individual independently (blue); GATK Unified Genotyper
`applied to all individuals collectively (red); and GATK
`Unified Genotyper applied to all individuals collectively,
`followed by Beagle42 using linkage disequilibrium (LD)
`information for genotype calling (black). For each of
`several quality thresholds, genotype calls with quality
`greater than the threshold were compared to HapMap
`data. Every such threshold thus entails both a proportion
`of called HapMap data and accuracy, relative to HapMap.
`For high call rates, genotyping the individuals
`collectively and using the LD-based method Beagle
`provided marked improvements.
`
`R E V I E W S
`
`CEU individuals
`One of the 11 populations
`in HapMap phase three. It
`consists of Utah residents
`with Northern and Western
`European ancestry from
`the Centre d’Etude du
`Polymorphisme Humain
`(CEPH) collection.
`
`Bayes’ formula
`A mathematical expression
`showing that a posterior
`probability can be found
`as the prior probability
`multiplied by the likelihood
`divided by a constant.
`
`Correlated errors
`Errors that do not occur
`independently of each other.
`An error that is observed in
`one position might increase the
`chance of observing another
`error in a neighbouring position.
`
`446 | JUNE 2011 | VOLUME 12
`
` www.nature.com/reviews/genetics
`
`© 2011 Macmillan Publishers Limited. All rights reserved
`
`Page 446
`
`

`

`R E V I E W S
`
`Table 1 | A list of available non-commercial NGS genotype-calling software
`Software Available from
`Calling method Prerequisites
`Comments
`Package for NGS data analysis, which includes a single
`SOAP2
`http://soap.genomics.org.
`Single-sample
`High-quality variant
`individual genotype caller (SOAPsnp)
`cn/index.html
`database (for
`example, dbSNP)
`Aligned reads
`
`realSFS
`
`http://128.32.118.212/
`thorfinn/realSFS/
`
`Single-sample
`
`Samtools
`
`http://samtools.
`sourceforge.net/
`
`Multi-sample
`
`Aligned reads
`
`Software for SNP and genotype calling using single
`individuals and allele frequencies. Site frequency
`spectrum (SFS) estimation
`Package for manipulation of NGS alignments, which
`includes a computation of genotype likelihoods
`(samtools) and SNP and genotype calling (bcftools)
`Package for aligned NGS data analysis, which includes
`a SNP and genotype caller (Unifed Genotyper),
`SNP filtering (Variant Filtration) and SNP quality
`recalibration (Variant Recalibrator)
`Software for imputation, phasing and association that
`includes a mode for genotype calling
`
`Software for imputation and phasing, including
`a mode for genotype calling. Requires fine-scale
`linkage map
`Software for SNP and genotype calling, including a
`method for generating candidate SNPs without LD
`information (NLDA) and a method for incorporating
`LD information (LDA). The ‘feasible’ genealogies can
`be generated using Margarita (http://www.sanger.
`ac.uk/resources/software/margarita)
`Software for SNP and genotype calling, including a
`method (GPT_Freq) for generating candidate SNPs
`without LD information and a method (thunder_glf_freq)
`for incorporating LD information
`A more complete list is available from http://seqanswers.com/wiki/Software/list. LD, linkage disequilibrium; NGS, next-generation sequencing.
`
`GATK
`
`Beagle
`
`IMPUTE2
`
`QCall
`
`http://www.
`broadinstitute.org/gsa/
`wiki/index.php/The_
`Genome_Analysis_Toolkit
`http://faculty.washington.
`edu/browning/beagle/
`beagle.html
`http://mathgen.stats.
`ox.ac.uk/impute/
`impute_v2.html
`ftp://ftp.sanger.ac.uk/pub/
`rd/QCALL
`
`Multi-sample
`
`Aligned reads
`
`Multi-sample LD Candidate
`SNPs, genotype
`likelihoods
`Multi-sample LD Candidate
`SNPs, genotype
`likelihoods
`Multi-sample LD ‘Feasible’
`genealogies at
`a dense set of
`loci, genotype
`likelihoods
`
`MaCH
`
`http://genome.sph.umich.
`edu/wiki/Thunder
`
`Multi-sample LD Genotype
`likelihoods
`
`Refs
`15
`
`-
`
`53
`
`32,33
`
`42
`
`44
`
`54
`
`-
`
`Assigning priors using single samples. In addition to
`computing the genotype likelihood, a prior probability
`for each genotype must be assumed in order to produce
`posterior probabilities for the genotypes. Suppose that a
`single individual is sequenced. The prior-genotype prob-
`ability may be chosen to assign equal probability to all
`genotypes, or it can be based on external information
`— for example, from the reference sequence, SNP data-
`bases or an available population sample. In the analysis
`of human data in SOAPsnp14,15, a prior is chosen by the
`use of dbSNP38. For example, if a G/T polymorphism
`is reported in dbSNP, the prior probabilities are set to
`be 0.454 for each of the genotypes GG and TT, 0.0909
`for GT and less than 10−4 for all other genotypes14,15. A
`similar approach is used in MAQ12. Notice that there is
`a strong weight against heterozygotes in order to avoid
`mistaking sequencing errors for real polymorphisms.
`
`Assigning priors using multiple samples. Priors can be
`improved by j

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket