`
` S T U DY D E S I G N S
`
`Genotype and SNP calling from
`next-generation sequencing data
`
`Rasmus Nielsen*‡§, Joshua S. Paul||, Anders Albrechtsen‡ and Yun S. Song§||
`
`Abstract | Meaningful analysis of next-generation sequencing (NGS) data, which are
`produced extensively by genetics and genomics studies, relies crucially on the accurate
`calling of SNPs and genotypes. Recently developed statistical methods both improve
`and quantify the considerable uncertainty associated with genotype calling, and will
`especially benefit the growing number of studies using low- to medium-coverage data.
`We review these methods and provide a guide for their use in NGS studies.
`
`Likelihoods
`Functions expressing the
`probability of observing
`the data — for example,
`next-generation sequencing
`data — given a parameter,
`such as a genotype or an
`allele frequency.
`
`*Department of Integrative
`Biology, University of
`California, Berkeley,
`California 94720, USA.
`‡Centre for Bioinformatics,
`University of Copenhagen,
`Universitetsparken 15, 2100
`Copenhagen Ø, Denmark.
`§Department of Statistics,
`University of California,
`Berkeley, California 94720,
`USA.
`||Department of Electrical
`Engineering and Computer
`Sciences, University of
`California, Berkeley,
`California 94720, USA.
`Correspondence to R.N.
`and Y.S.S.
`e‑mails: rasmus_nielsen@
`berkeley.edu; yss@stat.
`berkeley.edu
`doi:10.1038/nrg2986
`
`Next-generation sequencing (NGS) methods1 provide
`cheap and reliable large-scale DNA sequencing. They
`are used extensively for de novo sequencing2, for disease
`mapping3, for quantifying expression levels through RNA
`sequencing4–6 and in population genetic studies7–9.
`In NGS methods, a whole genome, or targeted
`regions of the genome, is randomly digested into small
`fragments (or short reads) that get sequenced and are
`then either aligned to a reference genome or assembled10.
`Having aligned the fragments of one or more individuals
`to a reference genome, ‘SNP calling’ identifies variable
`sites, whereas ‘genotype calling’ determines the genotype
`for each individual at each site.
`NGS data can suffer from high error rates that are
`due to multiple factors, including base-calling and align-
`ment errors. Moreover, many NGS studies rely on low-
`coverage sequencing (<5× per site per individual, on
`average), for which there is high probability that only one
`of the two chromosomes of a diploid individual has been
`sampled at a specified site. Under such circumstances,
`accurate SNP calling and genotype calling are difficult,
`and there is often considerable uncertainty associated
`with the results. It is crucial to quantify and account for
`this uncertainty, as it influences downstream analyses
`based on the inferred SNPs and genotypes, such as the
`identification of rare mutations, the estimation of allele
`frequencies and association mapping.
`One method for reducing uncertainty associated with
`genotype and SNP calling is to sequence target regions
`deeply (at >20× coverage). However, the ever-increasing
`demand for larger samples suggests that medium-
`(5–20×) or low-coverage sequencing will be the most
`common and cost-effective study design in many appli-
`cations of NGS for years to come. For example, the 1000
`Genomes Project pilot phase9 relied on approximately
`
`3× coverage to sequence 176 individuals genome-wide.
`For the identification of low-frequency variants, this
`design is more cost-efficient than deeper sequencing in
`fewer individuals. Likewise, in association studies, map-
`ping power is typically maximized by sequencing many
`individuals at low depth11, rather than sequencing
`fewer individuals at a high depth.
`Alternatively, reducing and quantifying the uncer-
`tainty associated with SNP and genotype calling may
`be accomplished using sophisticated algorithms; there-
`fore, these have recently been the subject of extensive
`research9,12–15. Most contemporary algorithms use a
`probabilistic framework. So-called ‘genotype likelihoods’
`— which incorporate errors that may have been intro-
`duced in base calling, alignment and assembly — are
`coupled with prior information, such as allele frequen-
`cies and patterns of linkage disequilibrium (LD). The
`result is a SNP and genotype call and an associated
`measure of uncertainty (which is often described by a
`‘quality score’), both of which have a concrete statistical
`interpretation.
`Here we review this research and provide general
`guidelines for genotype and SNP calling in NGS stud-
`ies. Converting the raw output of NGS technology into a
`final set of SNP and genotype data involves a number of
`steps (summarized in FIG. 1), each of which contributes
`to the accuracy of the final SNP and genotype calls. We
`start at the beginning of this process by briefly review-
`ing recent developments in the methods used for base
`calling and alignment. We then review and discuss sev-
`eral recent algorithms for SNP and genotype calling and
`address how the uncertainties in the resulting calls can
`be accommodated in downstream analyses. Finally, we
`make some general recommendations for the analysis
`of NGS data.
`
`NATURE REVIEWS | GENETICS
`
` VOLUME 12 | JUNE 2011 | 443
`
`© 2011 Macmillan Publishers Limited. All rights reserved
`
`Page 443
`
`FOUNDATION EXHIBIT 1030
`IPR2019-00634
`
`
`
`R E V I E W S
`
`Base calling and alignment
`The main principle underlying NGS technologies is
`sequencing-by-synthesis. In brief, tens-to-hundreds
`of millions of clusters of small ssDNA templates are
`‘read’ simultaneously by sequentially building up com-
`plementary bases. The synthesis process is captured in
`a series of fluorescence images, and base-calling algo-
`rithms infer the actual nucleotide information from the
`obtained fluorescence-intensity data for each cluster of
`DNA templates. They then assign a measure of uncer-
`tainty (or quality score) to each base call. The result-
`ing short-read data are then assembled into a genome.
`When a reference genome is available, the primary
`approach used to assemble a newly sequenced genome
`is alignment (also known as ‘read mapping’), in which
`the basic task is to align each short read onto an avail-
`able reference genome. Here we review the key aspects
`of base calling and alignment.
`
`(cid:43)(cid:79)(cid:67)(cid:73)(cid:71)(cid:2)(cid:67)(cid:80)(cid:67)(cid:78)(cid:91)(cid:85)(cid:75)(cid:85)(cid:2)(cid:67)(cid:80)(cid:70)(cid:2)(cid:68)(cid:67)(cid:85)(cid:71)(cid:2)(cid:69)(cid:67)(cid:78)(cid:78)(cid:75)(cid:80)(cid:73)
`
`(cid:52)(cid:71)(cid:67)(cid:70)(cid:2)(cid:79)(cid:67)(cid:82)(cid:82)(cid:75)(cid:80)(cid:73)
`
`(cid:52)(cid:71)(cid:67)(cid:78)(cid:75)(cid:73)(cid:80)(cid:14)(cid:2)(cid:84)(cid:71)(cid:79)(cid:81)(cid:88)(cid:71)(cid:2)(cid:70)(cid:87)(cid:82)(cid:78)(cid:75)(cid:69)(cid:67)(cid:86)(cid:71)(cid:2)(cid:84)(cid:71)(cid:67)(cid:70)(cid:85)(cid:2)(cid:67)(cid:80)(cid:70)(cid:2)(cid:84)(cid:71)(cid:69)(cid:67)(cid:78)(cid:75)(cid:68)(cid:84)(cid:67)(cid:86)(cid:71)(cid:2)(cid:83)(cid:87)(cid:67)(cid:78)(cid:75)(cid:86)(cid:91)(cid:2)(cid:85)(cid:69)(cid:81)(cid:84)(cid:71)(cid:85)
`
`(cid:47)(cid:87)(cid:78)(cid:86)(cid:75)(cid:15)(cid:85)(cid:67)(cid:79)(cid:82)(cid:78)(cid:71)(cid:2)(cid:69)(cid:67)(cid:78)(cid:78)(cid:75)(cid:80)(cid:73)
`
`(cid:53)(cid:75)(cid:80)(cid:73)(cid:78)(cid:71)(cid:15)(cid:85)(cid:67)(cid:79)(cid:82)(cid:78)(cid:71)(cid:2)(cid:69)(cid:67)(cid:78)(cid:78)(cid:75)(cid:80)(cid:73)
`
`(cid:50)(cid:84)(cid:81)(cid:79)(cid:81)(cid:86)(cid:71)(cid:2)(cid:69)(cid:67)(cid:80)(cid:70)(cid:75)(cid:70)(cid:67)(cid:86)(cid:71)(cid:2)(cid:53)(cid:48)(cid:50)(cid:2)
`(cid:85)(cid:71)(cid:86)(cid:2)(cid:67)(cid:80)(cid:70)(cid:2)(cid:73)(cid:71)(cid:80)(cid:81)(cid:86)(cid:91)(cid:82)(cid:71)(cid:2)(cid:69)(cid:67)(cid:78)(cid:78)(cid:85)(cid:2)(cid:87)(cid:85)(cid:75)(cid:80)(cid:73)(cid:2)
`(cid:80)(cid:81)(cid:80)(cid:15)(cid:78)(cid:75)(cid:80)(cid:77)(cid:67)(cid:73)(cid:71)(cid:15)(cid:68)(cid:67)(cid:85)(cid:71)(cid:70)(cid:14)(cid:2)(cid:79)(cid:87)(cid:78)(cid:86)(cid:75)(cid:15)
`(cid:85)(cid:67)(cid:79)(cid:82)(cid:78)(cid:71)(cid:2)(cid:67)(cid:80)(cid:67)(cid:78)(cid:91)(cid:85)(cid:75)(cid:85)(cid:2)
`
`(cid:52)(cid:71)(cid:558)(cid:80)(cid:71)(cid:2)(cid:69)(cid:67)(cid:80)(cid:70)(cid:75)(cid:70)(cid:67)(cid:86)(cid:71)(cid:2)(cid:53)(cid:48)(cid:50)(cid:2)(cid:85)(cid:71)(cid:86)(cid:2)
`(cid:67)(cid:80)(cid:70)(cid:2)(cid:73)(cid:71)(cid:80)(cid:81)(cid:86)(cid:91)(cid:82)(cid:71)(cid:2)(cid:69)(cid:67)(cid:78)(cid:78)(cid:75)(cid:80)(cid:73)(cid:2)(cid:87)(cid:85)(cid:75)(cid:80)(cid:73)(cid:2)
`(cid:78)(cid:75)(cid:80)(cid:77)(cid:67)(cid:73)(cid:71)(cid:15)(cid:68)(cid:67)(cid:85)(cid:71)(cid:70)(cid:2)(cid:67)(cid:80)(cid:67)(cid:78)(cid:91)(cid:85)(cid:75)(cid:85)
`
`(cid:43)(cid:70)(cid:71)(cid:80)(cid:86)(cid:75)(cid:72)(cid:91)(cid:2)(cid:53)(cid:48)(cid:50)(cid:85)(cid:2)(cid:67)(cid:80)(cid:70)(cid:2)
`(cid:67)(cid:85)(cid:85)(cid:81)(cid:69)(cid:75)(cid:67)(cid:86)(cid:71)(cid:70)(cid:2)(cid:73)(cid:71)(cid:80)(cid:81)(cid:86)(cid:91)(cid:82)(cid:71)(cid:85)(cid:2)(cid:87)(cid:85)(cid:75)(cid:80)(cid:73)(cid:2)
`(cid:85)(cid:75)(cid:80)(cid:73)(cid:78)(cid:71)(cid:15)(cid:85)(cid:67)(cid:79)(cid:82)(cid:78)(cid:71)(cid:2)(cid:67)(cid:80)(cid:67)(cid:78)(cid:91)(cid:85)(cid:75)(cid:85)
`
`(cid:53)(cid:48)(cid:50)(cid:2)(cid:558)(cid:78)(cid:86)(cid:71)(cid:84)(cid:75)(cid:80)(cid:73)(cid:2)(cid:67)(cid:80)(cid:70)(cid:2)(cid:53)(cid:48)(cid:50)(cid:2)(cid:81)(cid:84)(cid:2)(cid:73)(cid:71)(cid:80)(cid:81)(cid:86)(cid:91)(cid:82)(cid:71)(cid:2)(cid:83)(cid:87)(cid:67)(cid:78)(cid:75)(cid:86)(cid:91)(cid:2)(cid:85)(cid:69)(cid:81)(cid:84)(cid:71)(cid:2)(cid:84)(cid:71)(cid:69)(cid:67)(cid:78)(cid:75)(cid:68)(cid:84)(cid:67)(cid:86)(cid:75)(cid:81)(cid:80)
`
`Figure 1 | Steps for converting raw next-generation
`sequencing data into a final set of SNP or genotype
`calls. Pre-processing steps (shown in yellow) transform
`(cid:48)(cid:67)(cid:86)(cid:87)(cid:84)(cid:71)(cid:2)(cid:52)(cid:71)(cid:88)(cid:75)(cid:71)(cid:89)(cid:85)(cid:2)(cid:94)(cid:2)(cid:41)(cid:71)(cid:80)(cid:71)(cid:86)(cid:75)(cid:69)(cid:85)
`the raw data from next-generation sequencing technology
`into a set of aligned reads that have a measure of
`confidence, or quality score, associated with the bases
`of each read. The per-base quality scores produced by
`base-calling algorithms may need to be recalibrated to
`accurately reflect the true error rates. Depending on the
`number of samples and the depth of coverage, either a
`multi-sample calling procedure (green) or a single-sample
`calling procedure (orange) may then be applied to obtain
`SNP or genotype calls and associated quality scores.
`Note that the multi-sample procedure may include a
`linkage-based analysis, which can substantially improve
`the accuracy of SNP or genotype calls. Finally,
`post-processing (purple) uses both known data and
`simple heuristics to filter the set of SNPs and/or improve
`the associated quality scores. Optional, although
`recommended, steps are shown in dashed lines.
`
`Base calling and quality scores. Base-calling procedures
`vary according to the sequencing platform used, all of
`which are prone to a different type of error. For the 454
`platform, base calling involves inferring the length of
`each homopolymer from the observed fluorescence
`intensity. The main challenge stems from the fact that
`the variance of signal intensity for a specific homopoly-
`mer length is large, resulting in high error rates in inser-
`tion and deletion (indel) calls. For the Illumina platform,
`indel errors are rare, but the overall miscall error rate
`is typically around 1%. Here, the main complication
`arises from the synthesis process becoming desyn-
`chronized between different copies of DNA templates
`in the same cluster. Base calling becomes less accurate in
`later cycles as the extent of asynchrony is exacerbated
`with each sequencing cycle. The SOLiD platform uses
`a two-base encoding scheme in which each fluorescent
`dye colour represents four dinucleotide combinations.
`Each base of the DNA template is examined twice
`in this system and a length m nucleotide sequence is
`represented as a length m – 1 colour sequence. A
`major complication in ‘colour calling’ arises from
`biases in fluorescence intensities that appear in later
`machine cycles.
`In addition to identifying nucleotides, base-calling
`algorithms produce per-base quality scores by using
`noise estimates from image analysis. Some sequencing
`platforms adopt quality values that are defined specifi-
`cally for the platforms, but those quality values can be
`easily transformed into the standard Phred16 quality
`score, given by
`
` QPhred = -10 log10 P(error).
`
`
`
`(1)
`
`Note that a Phred score of 20 corresponds to a 1% error
`rate in base calling.
`The typical error rate of NGS data ranges from a few
`tenths of a per cent to several per cent, depending on
`the platform. Reducing the error rate of base calls and
`improving the accuracy of the per-base quality score
`have important implications for assembly, polymor-
`phism detection and downstream population-genomic
`analyses. As such, several base-calling algorithms have
`been developed to optimize data acquisition for the
`more widely used NGS platforms: examples include
`Pyrobayes17 for the 454 platform; Rsolid18 for the
`SOLiD platform; and Ibis19 and BayesCall20,21 for
`the Illumina platform. These algorithms provide
`~5–30% improvement in error rates over the base-
`calling methods developed by the manufacturers of the
`NGS platforms, and it has been shown that improved
`base-call accuracy can lead to a significant reduction in
`false-positive SNP calls and facilitate assembly when the
`coverage is low to moderate. However, some of the new
`methods tend to be either too computationally inten-
`sive to be of broad practical use or have not been tested
`thoroughly. Although more-accurate image analysis and
`base-calling algorithms for NGS platforms continue to
`be developed, the default software packages currently
`accompanying NGS platforms are the ones that are most
`widely adopted by users.
`
`444 | JUNE 2011 | VOLUME 12
`
` www.nature.com/reviews/genetics
`
`© 2011 Macmillan Publishers Limited. All rights reserved
`
`Page 444
`
`
`
`R E V I E W S
`
`Alignment and assembly. The accuracy of the align-
`ment has a crucial role in variant detection. Incorrectly
`aligned reads may lead to errors in SNP and genotype
`calling, so it is important for alignment algorithms to
`be able to cope with sequencing errors, as well as with
`potentially real differences (both point mutations and
`indels) between the reference genome and the sequenced
`genome that are due to polymorphisms. Furthermore,
`it is important for aligners to produce well-calibrated
`alignment (or mapping) quality values, as variant calls
`and their posterior probabilities depend on those scores.
`The amount of sequence identity required between
`each read and the reference sequence is determined by a
`trade-off between accuracy and read depth. The optimal
`choice of the tolerable number of mismatches may differ
`between different organisms. For example, as popula-
`tions of Drosophila melanogaster are more variable than
`human populations, using mapping criteria that are opti-
`mized for analyses of human sequences may lead to a
`severe loss of sequencing depth in D. melanogaster. This,
`in turn, may lead to a potential for biases in the down-
`stream analyses, as regions that harbour many natural
`polymorphisms will be underrepresented. Likewise,
`using alignment criteria that are appropriate for fruitflies
`in humans would lead to a large amount of incorrectly
`aligned reads.
`Most alignment algorithms for NGS data are based
`on either ‘hashing’ or an effective data compression algo-
`rithm called the ‘Burrows–Wheeler transform’ (BWT)22.
`BWT-based aligners (for example, Bowtie23, SOAP2
`(REF. 15) and BWA24) are fast, memory-efficient and par-
`ticularly useful for aligning repetitive reads; however,
`they tend to be less sensitive than the state-of-the-art
`hash-based algorithms (for example, MAQ12, Novoalign
`and Stampy25). The Novoalign and Stampy aligners cur-
`rently produce the most accurate overall results, while
`also being practical in terms of running time (see REF. 25
`for a detailed comparison of the performance of various
`aligners).
`In general, alignment is more difficult for regions
`with higher levels of diversity between the reference
`genome and the sequenced genome. This difficulty can
`be ameliorated by the use of longer reads and paired-
`end reads (see REF. 25 for further quantitative details).
`However, assembling highly diverse regions such as
`the major histocompatibility complex (MHC) remains
`a challenge. Using de novo assembly algorithms —
`which are based on graph-theoretic ideas26–30 that try to
`exploit overlap information to stitch together the reads
`into contiguous sequences — may provide a viable
`solution to this challenge. Combining such methods
`with alignment to study genetic variation in complex
`regions is likely to be an active area of research in the
`forthcoming years.
`
`Recalibration of per-base quality scores. The raw
`Phred-scaled quality scores produced by base-calling
`algorithms may not accurately reflect the true base-
`calling error rates14,15,31. In such a case, the raw quality
`scores need to be recalibrated so that a Phred score of Q
`more-accurately corresponds to an error rate of 10- Q/10,
`
`as implied by equation 1. Obtaining well-calibrated qual-
`ity scores is important, as SNP and genotype calling at
`a specific position in the genome depends on both the
`base calls and the per-base quality scores of the reads
`overlapping the position.
`In SOAPsnp14,15, per-base quality scores are recali-
`brated by comparing a sequenced genome to the refer-
`ence genome at sites with no known SNPs. A related
`alignment-based recalibration algorithm has been
`implemented in the GATK software32,33, which takes
`into account several covariates such as machine cycle
`and dinucleotide context. For all supposedly non-
`polymorphic sites, the bases that align to those sites are
`put into different categories classified by the following
`features: the raw quality score (produced by base call-
`ing), the position of the base in the read, the dinucle-
`otide context and the read group. For each category, the
`algorithm estimates the empirical quality score by using
`the number of mismatches with respect to the reference
`genome. Recalibrated quality scores are then estimated
`by adding to the raw quality scores the residual differ-
`ences between empirical quality scores and the mis-
`match rates implied by the raw quality scores, which
`are conditioned on various subsets of the features. This
`recalibration algorithm, which is adopted in the 1000
`Genomes Project9, can be applied to various sequenc-
`ing platforms. As described above, the algorithm uses a
`set of supposedly non-polymorphic sites. If a compre-
`hensive database of known SNPs is not available for the
`species under consideration, then one can first identify
`candidate polymorphic sites that are highly likely to
`be real and use the remaining sites in the recalibration
`procedure. In such a case, another round of SNP calling
`should be performed with recalibrated quality scores.
`
`Genotype and SNP calling
`The process of converting base calls and quality scores
`into a set of genotypes for each individual in a sample is
`often divided into two steps: genotype calling and SNP
`calling. SNP calling aims to determine in which posi-
`tions there are polymorphisms or in which positions at
`least one of the bases differs from a reference sequence;
`the latter is also sometimes referred to as ‘variant call-
`ing’. Genotype calling is the process of determining
`the genotype for each individual and is typically only
`done for positions in which a SNP or a ‘variant’ has
`already been called. We use the word ‘calling’ here to
`signify the estimation of one unique SNP or genotype.
`However, we note that some analyses can proceed with-
`out determining the exact identity of each genotype,
`but instead allow uncertainty regarding genotypes to be
`incorporated directly into the analyses.
`Genotype and SNP calling can proceed, as in early
`studies, by counting alleles at each site and using simple
`cutoff rules for when to call a SNP or genotype. More-
`recent methods incorporate uncertainty in a probabilis-
`tic framework. In the probabilistic framework, it is also
`possible to further incorporate additional information
`regarding allele frequencies and/or patterns of LD. We
`review these different approaches below, starting with
`the simple methods based on counting alleles.
`
`Posterior probabilities
`In this context, these are
`the probabilities of a
`particular genotype given
`observed data: they are
`calculated by incorporating
`the information from the
`next-generation sequencing
`data as well as some prior
`information.
`
`Hashing
`A procedure of creating a
`data structure that helps to
`accelerate alignment. It stores
`information about which reads
`or where in the reference
`genome a particular substring
`or subsequence occurs. Some
`hash-based aligners hash the
`reads, while others hash
`the reference genome.
`
`Paired-end reads
`Sequencing of both the
`forward and reverse template
`of a DNA molecule, which is
`possible by inserting a primer
`sequence between the two
`ends of the read. The use of
`this technique greatly helps
`to increase assembly and
`alignment accuracy.
`
`NATURE REVIEWS | GENETICS
`
` VOLUME 12 | JUNE 2011 | 445
`
`© 2011 Macmillan Publishers Limited. All rights reserved
`
`Page 445
`
`
`
`to a loss of information regarding individual read quali-
`ties. An additional disadvantage of this type of genotype
`calling is that it typically does not provide measures of
`uncertainty in the genotype inference. For this reason,
`several probabilistic methods have been developed that
`use the quality score to provide a posterior probability
`for each genotype12–15.
`In brief, it is assumed that one can compute a geno-
`type likelihood, p(X | G), for a genotype G. The symbol
`X represents, in this generic notation, all of the read data
`for a particular individual at a particular site. In conjunc-
`tion with a genotype prior, p(G), Bayes’ formula is used
`to calculate p(G | X), which is the posterior probability
`of genotype G. The genotype with the highest posterior
`probability is generally chosen, and this probability, or
`perhaps the ratio between the highest and the second
`highest probabilities, is used as a measure of confidence.
`The advantages of the probabilistic methods are that
`they provide measures of statistical uncertainty when
`calling genotypes, they lead to higher accuracy of geno-
`type calling, and they provide a natural framework for
`incorporating information regarding allele frequencies
`and patterns of LD.
`
`Calculating genotype likelihoods. The genotype like-
`lihood can be calculated using the quality scores for
`each read. Let Xi be the data in read i for a particular
`individual and a particular site with genotype G. The
`probability p(Xi | G) is then given by a simple rescaling
`of the quality score of Xi, and the genotype likelihood,
`p(X | G), can be calculated directly from the data by
`taking the product of p(Xi | G) over all i. There is an
`implicit assumption here of independence among reads,
`which may be violated in the presence of alignment
`errors or PCR artefacts. It has been suggested12 that a
`weighting scheme should be used that takes correlated
`errors into account. The genotype likelihoods can also
`be improved by recalibrating the per-base quality scores
`using empirical data, as discussed in the section regard-
`ing base calling and alignment. When genotype calling
`is preceded by SNP calling, the information from the
`SNP-calling step can be incorporated into the genotype-
`calling algorithm, leading to genotype likelihoods that
`are calculated by conditioning on the site containing a
`polymorphism.
`Martin et al.37 also suggested estimating error rates
`directly from the read data for each site independ-
`ently, instead of using quality scores. The advantage of
`such an approach is that genotype and SNP calling do
`not depend on the accuracy of the calculated quality
`scores. However, a disadvantage is that the considerable
`information regarding errors gained through the base-
`calling and alignment process is lost. These approaches
`are very recent, and no research has been done to sys-
`tematically compare the advantages and disadvantages
`of directly estimating error rates from the data. Ideally,
`error rates should be estimated from the data while
`incorporating information obtained during base calling
`and alignment.
`Methods for calculating genotype likelihoods will
`probably be a topic of much future research.
`
`Early methods for calling genotypes. Early NGS studies
`based both SNP and genotype calling on separate
`analyses of data from each individual sampled. Typically,
`analyses would first involve a filtering step in which only
`high-confidence bases would be kept. The most common
`cutoff used would be a Phred-type quality score of Q20
`(QPhred = 20). Genotype calling would then proceed for
`each individual by counting the number of times each
`allele is observed and using fixed cutoffs. SNP calling
`would then be performed based on the inferred geno-
`types. For example34,35, one would first use a Q20 filter
`and then call a heterozygous genotype if the proportion
`of the non-reference allele is between 20% and 80%; oth-
`erwise, a homozygous genotype would be called. This
`is a fairly standard procedure and works well when the
`sequencing depth is high (>20×), so that the probability
`of a heterozygous individual falling outside the 20–80%
`range is small. Related methods for genotype calling
`form the basis for the commercially available software
`in Roche’s GSMapper, the CLC Genomic Workbench
`and the DNSTAR Lasergene software. These methods
`can be improved by using more empirically determined
`cutoffs (as described in REF. 36).
`
`Probabilistic methods. For moderate or low sequencing
`depths, genotype calling based on fixed cutoffs will typi-
`cally lead to under-calling of heterozygous genotypes and
`the use of a simple filtering based on quality score leads
`
`(cid:41)(cid:35)(cid:54)(cid:45)(cid:115)(cid:36)(cid:71)(cid:67)(cid:73)(cid:78)(cid:71)
`(cid:41)(cid:35)(cid:54)(cid:45)(cid:2)(cid:79)(cid:87)(cid:78)(cid:86)(cid:75)(cid:15)(cid:85)(cid:67)(cid:79)(cid:82)(cid:78)(cid:71)
`(cid:41)(cid:35)(cid:54)(cid:45)(cid:2)(cid:85)(cid:75)(cid:80)(cid:73)(cid:78)(cid:71)(cid:15)(cid:85)(cid:67)(cid:79)(cid:82)(cid:78)(cid:71)
`
`(cid:19)(cid:16)(cid:18)(cid:18)
`
`(cid:18)(cid:16)(cid:27)(cid:23)
`
`(cid:18)(cid:16)(cid:27)(cid:18)
`
`(cid:18)(cid:16)(cid:26)(cid:23)
`
`(cid:18)(cid:16)(cid:26)(cid:18)
`
`(cid:41)(cid:71)(cid:80)(cid:81)(cid:86)(cid:91)(cid:82)(cid:71)(cid:2)(cid:69)(cid:67)(cid:78)(cid:78)(cid:2)(cid:67)(cid:69)(cid:69)(cid:87)(cid:84)(cid:67)(cid:69)(cid:91)
`
`(cid:18)
`
`(cid:18)(cid:16)(cid:26)
`
`(cid:18)(cid:16)(cid:22)
`(cid:18)(cid:16)(cid:20)
`(cid:18)(cid:16)(cid:24)
`(cid:50)(cid:84)(cid:81)(cid:82)(cid:81)(cid:84)(cid:86)(cid:75)(cid:81)(cid:80)(cid:2)(cid:81)(cid:72)(cid:2)(cid:80)(cid:81)(cid:80)(cid:15)(cid:69)(cid:67)(cid:78)(cid:78)(cid:85)
`(cid:48)(cid:67)(cid:86)(cid:87)(cid:84)(cid:71)(cid:2)(cid:52)(cid:71)(cid:88)(cid:75)(cid:71)(cid:89)(cid:85)(cid:2)(cid:94)(cid:2)(cid:41)(cid:71)(cid:80)(cid:71)(cid:86)(cid:75)(cid:69)(cid:85)
`Figure 2 | A comparison of three genotype callers.
`A subset of the data (chromosome 20, bases 20,000,000–
`25,000,000) for the 62 CEU individuals in both the
`HapMap Public Release no. 28 and the 1000 Genomes
`Pilot Project was genotype-called using the following
`methods: GATK Unified Genotyper32,33 applied to each
`individual independently (blue); GATK Unified Genotyper
`applied to all individuals collectively (red); and GATK
`Unified Genotyper applied to all individuals collectively,
`followed by Beagle42 using linkage disequilibrium (LD)
`information for genotype calling (black). For each of
`several quality thresholds, genotype calls with quality
`greater than the threshold were compared to HapMap
`data. Every such threshold thus entails both a proportion
`of called HapMap data and accuracy, relative to HapMap.
`For high call rates, genotyping the individuals
`collectively and using the LD-based method Beagle
`provided marked improvements.
`
`R E V I E W S
`
`CEU individuals
`One of the 11 populations
`in HapMap phase three. It
`consists of Utah residents
`with Northern and Western
`European ancestry from
`the Centre d’Etude du
`Polymorphisme Humain
`(CEPH) collection.
`
`Bayes’ formula
`A mathematical expression
`showing that a posterior
`probability can be found
`as the prior probability
`multiplied by the likelihood
`divided by a constant.
`
`Correlated errors
`Errors that do not occur
`independently of each other.
`An error that is observed in
`one position might increase the
`chance of observing another
`error in a neighbouring position.
`
`446 | JUNE 2011 | VOLUME 12
`
` www.nature.com/reviews/genetics
`
`© 2011 Macmillan Publishers Limited. All rights reserved
`
`Page 446
`
`
`
`R E V I E W S
`
`Table 1 | A list of available non-commercial NGS genotype-calling software
`Software Available from
`Calling method Prerequisites
`Comments
`Package for NGS data analysis, which includes a single
`SOAP2
`http://soap.genomics.org.
`Single-sample
`High-quality variant
`individual genotype caller (SOAPsnp)
`cn/index.html
`database (for
`example, dbSNP)
`Aligned reads
`
`realSFS
`
`http://128.32.118.212/
`thorfinn/realSFS/
`
`Single-sample
`
`Samtools
`
`http://samtools.
`sourceforge.net/
`
`Multi-sample
`
`Aligned reads
`
`Software for SNP and genotype calling using single
`individuals and allele frequencies. Site frequency
`spectrum (SFS) estimation
`Package for manipulation of NGS alignments, which
`includes a computation of genotype likelihoods
`(samtools) and SNP and genotype calling (bcftools)
`Package for aligned NGS data analysis, which includes
`a SNP and genotype caller (Unifed Genotyper),
`SNP filtering (Variant Filtration) and SNP quality
`recalibration (Variant Recalibrator)
`Software for imputation, phasing and association that
`includes a mode for genotype calling
`
`Software for imputation and phasing, including
`a mode for genotype calling. Requires fine-scale
`linkage map
`Software for SNP and genotype calling, including a
`method for generating candidate SNPs without LD
`information (NLDA) and a method for incorporating
`LD information (LDA). The ‘feasible’ genealogies can
`be generated using Margarita (http://www.sanger.
`ac.uk/resources/software/margarita)
`Software for SNP and genotype calling, including a
`method (GPT_Freq) for generating candidate SNPs
`without LD information and a method (thunder_glf_freq)
`for incorporating LD information
`A more complete list is available from http://seqanswers.com/wiki/Software/list. LD, linkage disequilibrium; NGS, next-generation sequencing.
`
`GATK
`
`Beagle
`
`IMPUTE2
`
`QCall
`
`http://www.
`broadinstitute.org/gsa/
`wiki/index.php/The_
`Genome_Analysis_Toolkit
`http://faculty.washington.
`edu/browning/beagle/
`beagle.html
`http://mathgen.stats.
`ox.ac.uk/impute/
`impute_v2.html
`ftp://ftp.sanger.ac.uk/pub/
`rd/QCALL
`
`Multi-sample
`
`Aligned reads
`
`Multi-sample LD Candidate
`SNPs, genotype
`likelihoods
`Multi-sample LD Candidate
`SNPs, genotype
`likelihoods
`Multi-sample LD ‘Feasible’
`genealogies at
`a dense set of
`loci, genotype
`likelihoods
`
`MaCH
`
`http://genome.sph.umich.
`edu/wiki/Thunder
`
`Multi-sample LD Genotype
`likelihoods
`
`Refs
`15
`
`-
`
`53
`
`32,33
`
`42
`
`44
`
`54
`
`-
`
`Assigning priors using single samples. In addition to
`computing the genotype likelihood, a prior probability
`for each genotype must be assumed in order to produce
`posterior probabilities for the genotypes. Suppose that a
`single individual is sequenced. The prior-genotype prob-
`ability may be chosen to assign equal probability to all
`genotypes, or it can be based on external information
`— for example, from the reference sequence, SNP data-
`bases or an available population sample. In the analysis
`of human data in SOAPsnp14,15, a prior is chosen by the
`use of dbSNP38. For example, if a G/T polymorphism
`is reported in dbSNP, the prior probabilities are set to
`be 0.454 for each of the genotypes GG and TT, 0.0909
`for GT and less than 10−4 for all other genotypes14,15. A
`similar approach is used in MAQ12. Notice that there is
`a strong weight against heterozygotes in order to avoid
`mistaking sequencing errors for real polymorphisms.
`
`Assigning priors using multiple samples. Priors can be
`improved by j