Molecular Systems Biology 7; Article number 539; doi:10.1038/msb.2011.75
`Citation: Molecular Systems Biology 7: 539
`& 2011 EMBO and Macmillan Publishers Limited All rights reserved 1744-4292/11
`www.molecularsystemsbiology.com
`REPORT
`
`Fast, scalable generation of high-quality protein
`multiple sequence alignments using Clustal Omega
`
Fabian Sievers1,8, Andreas Wilm2,8, David Dineen1, Toby J Gibson3, Kevin Karplus4, Weizhong Li5, Rodrigo Lopez5,
Hamish McWilliam5, Michael Remmert6, Johannes Söding6, Julie D Thompson7 and Desmond G Higgins1,*

1 School of Medicine and Medical Science, UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland,
2 Computational and Systems Biology, Genome Institute of Singapore, Singapore,
3 Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany,
4 Department of Biomolecular Engineering, University of California, Santa Cruz, CA, USA,
5 EMBL Outstation - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK,
6 Gene Center Munich, University of Munich (LMU), Muenchen, Germany and
7 Département de Biologie Structurale et Génomique, IGBMC (Institut de Génétique et de Biologie Moléculaire et Cellulaire), CNRS/INSERM/Université de Strasbourg, Illkirch, France
8 These authors contributed equally to this work
* Corresponding author. UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Dublin 4, Ireland. Tel.: +353 1 716 6833;
Fax: +353 1 716 6713; E-mail: des.higgins@ucd.ie
`
`Received 23.7.11; accepted 6.9.11
`
`Multiple sequence alignments are fundamental to many sequence analysis methods. Most
`alignments are computed using the progressive alignment heuristic. These methods are starting
`to become a bottleneck in some analysis pipelines when faced with data sets of the size of many
`thousands of sequences. Some methods allow computation of larger data sets while sacrificing
`quality, and others produce high-quality alignments, but scale badly with the number of sequences.
`In this paper, we describe a new program called Clustal Omega, which can align virtually
`any number of protein sequences quickly and that delivers accurate alignments. The accuracy
`of the package on smaller test cases is similar to that of the high-quality aligners. On larger data
`sets, Clustal Omega outperforms other packages in terms of execution time and quality.
`Clustal Omega also has powerful features for adding sequences to and exploiting information in
`existing alignments, making use of the vast amount of precomputed information in public databases
`like Pfam.
`Molecular Systems Biology 7: 539; published online 11 October 2011; doi:10.1038/msb.2011.75
`Subject Categories: bioinformatics
`Keywords: bioinformatics; hidden Markov models; multiple sequence alignment
`
`Introduction
`
Multiple sequence alignments (MSAs) are essential in most
bioinformatics analyses that involve comparing homologous
sequences. Computing an optimal alignment exactly has a
computational complexity of O(L^N) for N sequences of length L,
making it prohibitive for even small numbers of sequences. Most
automatic methods are based on the ‘progressive alignment’
heuristic (Hogeweg and Hesper, 1984), which aligns sequences in
larger and larger subalignments, following the branching order in
a ‘guide tree.’ With a complexity of roughly O(N^2), this approach
can routinely make alignments of a few thousand sequences of
moderate length, but it is tough to make alignments much
bigger than this. The progressive approach is a ‘greedy
algorithm’ where mistakes made at the initial alignment
stages cannot be corrected later. To counteract this effect, the
consistency principle was developed (Notredame et al, 2000).
This has allowed the production of a new generation of more
accurate aligners (e.g. T-Coffee (Notredame et al, 2000)) but
at the expense of ease of computation. These methods give
5–10% more accurate alignments, as measured on benchmarks,
but are confined to a few hundred sequences.
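
To make the gap between the two complexities concrete, the short Python sketch below compares the size of the exact dynamic-programming lattice, on the order of L^N, with the N(N-1)/2 sequence pairs that a conventional guide tree requires. The function names and the sequence length of 100 are illustrative assumptions, not values taken from the benchmarks.

```python
def exact_alignment_cells(n_seqs: int, length: int) -> int:
    """Size of the N-dimensional dynamic-programming lattice, on the order of L^N."""
    return length ** n_seqs


def pairwise_alignments(n_seqs: int) -> int:
    """Number of sequence pairs in an all-against-all comparison: N(N-1)/2."""
    return n_seqs * (n_seqs - 1) // 2


if __name__ == "__main__":
    length = 100  # hypothetical sequence length, chosen only for illustration
    for n in (2, 3, 5, 10):
        print(n, exact_alignment_cells(n, length), pairwise_alignments(n))
```

Under these assumptions, three sequences already need a lattice of a million cells, whereas the number of pairs grows only quadratically with N.
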
`In this report, we introduce a new program called Clustal
`Omega, which is accurate but also allows alignments of almost
`any size to be produced. We have used it
`to generate
`alignments of over 190 000 sequences on a single processor
`in a few hours. In benchmark tests, it is distinctly more
`accurate than most widely used, fast methods and comparable
`in accuracy to some of the intensive slow methods. It also has
`powerful features for allowing users to reuse their alignments
`so as to avoid recomputing an entire alignment, every time
`new sequences become available.
The key to making the progressive alignment approach scale
is the method used to make the guide tree. Normally, this
involves aligning all N sequences to each other giving time and
memory requirements of O(N^2). Protein families with >50 000
sequences are appearing and will become common from
various wide scale genome sequencing projects. Currently, the
only method that can routinely make alignments of more than
about 10 000 sequences is MAFFT/PartTree (Katoh and Toh,
2007). It is very fast but leads to a loss in accuracy, which has
to be compensated for by iteration and other heuristics. With
Clustal Omega, we use a modified version of mBed
(Blackshields et al, 2010), which has complexity of O(N log N), and
which produces guide trees that are just as accurate as those
from conventional methods. mBed works by ‘emBedding’ each
sequence in a space of n dimensions where n is proportional to
log N. Each sequence is then replaced by an n element vector,
where each element is simply the distance to one of n ‘reference
sequences.’ These vectors can then be clustered extremely
quickly by standard methods such as K-means or UPGMA.
In Clustal Omega, the alignments are then computed using the
very accurate HHalign package (Söding, 2005), which aligns
two profile hidden Markov models (Eddy, 1998).
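
The embedding idea can be sketched in a few lines of Python. This is a minimal illustration, not the mBed implementation: the k-mer distance, the random choice of reference sequences and the rule n = ceil(log2 N) are all assumptions made here for brevity, and the names `kmer_distance` and `embed` are hypothetical.

```python
import math
import random
from collections import Counter
from typing import List


def kmer_distance(a: str, b: str, k: int = 2) -> float:
    """Crude stand-in distance: 1 - shared k-mers / k-mers of the shorter sequence."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum((ca & cb).values())
    denom = min(sum(ca.values()), sum(cb.values()))
    return 1.0 - shared / denom if denom else 1.0


def embed(seqs: List[str], seed: int = 0) -> List[List[float]]:
    """Replace each sequence by its distances to about log2(N) reference sequences."""
    if not seqs:
        return []
    n_refs = max(1, math.ceil(math.log2(len(seqs))))
    rng = random.Random(seed)
    # Assumption: references are picked at random purely for illustration.
    refs = rng.sample(seqs, min(n_refs, len(seqs)))
    return [[kmer_distance(s, r) for r in refs] for s in seqs]
```

With N sequences and roughly log N references, only about N log N distances are computed instead of the N(N-1)/2 needed for a full distance matrix; the resulting vectors could then be passed to any standard clustering routine.
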
`Clustal Omega has a number of
`features for adding
`sequences to existing alignments or
`for using existing
`alignments to help align new sequences. One innovation is
`to allow users to specify a profile HMM that is derived from an
`alignment of sequences that are homologous to the input set.
`The sequences are then aligned to these ‘external profiles’ to
`help align them to the rest of the input set. There are already
`widely available collections of HMMs from many sources such
`as Pfam (Finn et al, 2009) and these can now be used to help
`users to align their sequences.
`
`Results
`
`Alignment accuracy
`
`The standard method for measuring the accuracy of multiple
`alignment algorithms is to use benchmark test sets of reference
`alignments, generated with reference to three-dimensional
`structures. Here, we present results from a range of packages
`tested on three benchmarks: BAliBASE (Thompson et al,
`2005), Prefab (Edgar, 2004) and an extended version of
HomFam (Blackshields et al, 2010). For these tests, we report
results using the default settings for all programs, with two
exceptions that were needed to allow MUSCLE (Edgar, 2004) and
MAFFT to align the biggest test cases in HomFam. First, for test
cases with >3000 sequences, we run MUSCLE with the -maxiters
parameter set to 2, in order to finish the alignments in reasonable
times. Second, we have run several different programs from the
MAFFT package. MAFFT (Katoh et al, 2002) consists of a series of
programs that can be run separately or called automatically from a
script with the --auto flag set. This flag chooses to run a slow,
consistency-based program (L-INS-i) when the number and lengths of
sequences are small. When the numbers exceed inbuilt thresholds, a
conventional progressive aligner is used (FFT-NS-2). The latter is
also the program that is run by default if MAFFT is called with no
flags set. For very large data sets, the --parttree flag must be set
on the command line and a very fast guide tree calculation is then
used.
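
The settings described above can be summarised as a small Python helper that builds the command line for a given program and data set size. This is a sketch under stated assumptions: the function name `benchmark_command` and the program labels are hypothetical, the MUSCLE syntax assumes the 3.8 series (`-in`), output redirection is omitted, and the thresholds of 3000 and 10 000 sequences are the ones quoted in the text and in the Figure 1 legend.

```python
from typing import List


def benchmark_command(program: str, n_seqs: int, infile: str) -> List[str]:
    """Return an argument list mirroring the non-default settings used for large inputs."""
    if program == "muscle":
        cmd = ["muscle", "-in", infile]
        if n_seqs > 3000:
            cmd += ["-maxiters", "2"]  # limit refinement to two iterations
        return cmd
    if program == "mafft-auto":
        return ["mafft", "--auto", infile]  # script picks L-INS-i or FFT-NS-2
    if program == "mafft":
        if n_seqs > 10000:
            return ["mafft", "--parttree", infile]  # fast approximate guide tree
        return ["mafft", infile]  # default program, FFT-NS-2
    raise ValueError(f"unknown program label: {program}")
```

The returned list could be handed to `subprocess.run` with standard output redirected to a file, since MAFFT writes its alignment to standard output.
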
`The results for the BAliBASE benchmark tests are shown in
Table I. BAliBASE is divided into six ‘references.’ Average scores
`are given for each reference, along with total run times and
`average total column (TC) scores, which give the proportion of
`the total alignment columns that is recovered. A score of 1.0
`indicates perfect agreement with the benchmark. There are two
`rows for the MAFFT package: MAFFT (auto) and MAFFT
`default. In most (203 out of 218) BAliBASE test cases, the
`number of sequences is small and the script runs L-INS-i, which
`is the slow accurate program that uses the consistency heuristic
`(Notredame et al, 2000) that is also used by MSAprobs
`(Liu et al, 2010), Probalign, Probcons (Do et al, 2005) and
`T-Coffee. These programs are all restricted to small numbers of
`sequences but tend to give accurate alignments. This is clearly
`reflected in the times and average scores in Table I. The times
`range from 25 min up to 22 h for these packages and the
`accuracies range from 55 to 61% of columns correct. Clustal
`Omega only takes 9 min for the same runs but has an accuracy
`level that is similar to that of Probcons and T-Coffee.
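
The total column score can be illustrated with a simplified Python function. This is a sketch, not the benchmark scoring program: it assumes the test and reference alignments contain the same sequences in the same order, uses "-" as the only gap character, scores every non-empty reference column rather than core columns only, and treats a reference column as recovered when all of its residues share one column in the test alignment. The names `residue_columns` and `tc_score` are hypothetical.

```python
from typing import List


def residue_columns(alignment: List[str]) -> List[List[int]]:
    """For each row, list the alignment column occupied by each successive residue."""
    return [[c for c, ch in enumerate(row) if ch != "-"] for row in alignment]


def tc_score(test: List[str], ref: List[str]) -> float:
    """Fraction of non-empty reference columns whose residues are co-aligned in test."""
    test_cols = residue_columns(test)
    next_residue = [0] * len(ref)  # index of the next unread residue in each row
    correct = scored = 0
    for c in range(len(ref[0])):
        placed = set()
        for s, row in enumerate(ref):
            if row[c] != "-":
                placed.add(test_cols[s][next_residue[s]])
                next_residue[s] += 1
        if not placed:
            continue  # all-gap reference column: nothing to score
        scored += 1
        if len(placed) == 1:
            correct += 1
    return correct / scored if scored else 0.0
```

For example, scoring the test alignment `["ACD-", "ACED"]` against the reference `["AC-D", "ACED"]` recovers three of the four reference columns, giving 0.75.
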
Table I BAliBASE results

Aligner           Av score  BB11    BB12    BB2     BB3     BB4     BB5     Tot time (s)   Consistency
(families)        (218)     (38)    (44)    (41)    (30)    (49)    (16)
MSAprobs          0.607     0.441   0.865   0.464   0.607   0.622   0.608      12 382.00   Yes
Probalign         0.589     0.453   0.862   0.439   0.566   0.603   0.549      10 095.20   Yes
MAFFT (auto)      0.588     0.439   0.831   0.450   0.581   0.605   0.591        1475.40   Mostly (203/218)
Probcons          0.558     0.417   0.855   0.406   0.544   0.532   0.573      13 086.30   Yes
Clustal O         0.554     0.358   0.789   0.450   0.575   0.579   0.533         539.91   No
T-Coffee          0.551     0.410   0.848   0.402   0.491   0.545   0.587      81 041.50   Yes
Kalign            0.501     0.365   0.790   0.360   0.476   0.504   0.435          21.88   No
MUSCLE            0.475     0.318   0.804   0.350   0.409   0.450   0.460         789.57   No
MAFFT (default)   0.458     0.258   0.749   0.316   0.425   0.480   0.496          68.24   No
FSA               0.419     0.270   0.818   0.187   0.259   0.474   0.398      53 648.10   No
Dialign           0.415     0.265   0.696   0.292   0.312   0.441   0.425        3977.44   No
PRANK             0.376     0.223   0.680   0.257   0.321   0.360   0.356     128 355.00   No
ClustalW          0.374     0.227   0.712   0.220   0.272   0.396   0.308         766.47   No

The figures are total column scores produced using bali_score on core columns only. The average score over all families is given in the second column. The results for
BAliBASE subgroupings are in columns 3–8. The total run time for all 218 families is given in the second last column. The last column indicates whether the method is
consistency based.

The rest of the table is mainly taken by the programs that use
progressive alignment. Some of these are very fast but this
speed is matched by a considerable drop in accuracy compared
with the consistency-based programs and Clustal Omega. The
weakest program here, is Clustal W (Larkin et al, 2007)
followed by PRANK (Löytynoja and Goldman, 2008). PRANK
is not designed for aligning distantly related sequences but at
giving good alignments for phylogenetic work with special
attention to gaps. These gap positions are not included in these
tests as they tend not to be structurally conserved. Dialign
(Morgenstern et al, 1998) does not use consistency or
progressive alignment but is based on finding best local multiple
alignments. FSA (Bradley et al, 2009) uses sampling of pairwise
alignments and ‘sequence annealing’ and has been shown to
deliver good nucleotide sequence alignments in the past.

Table II Prefab results

Aligner           0 < %ID ≤ 100   0 ≤ %ID ≤ 20   20 ≤ %ID ≤ 40   40 ≤ %ID ≤ 70   70 ≤ %ID ≤ 100   Total time (s)   Consistency
(families)        (1682)          (912)          (563)           (117)           (90)             (1682)
MSAprobs          0.737           0.591          0.889           0.965           0.971                 51 286.00   Yes
MAFFT (auto)      0.721           0.569          0.876           0.961           0.979                   4544.45   Yes
Probalign         0.719           0.563          0.881           0.961           0.977                 35 117.30   Yes
Probcons          0.717           0.562          0.876           0.955           0.972                 46 908.30   Yes
T-Coffee          0.710           0.558          0.865           0.950           0.972                175 789.00   Yes
ClustalO          0.700           0.535          0.866           0.967           0.980                   1698.06   No
MUSCLE            0.677           0.507          0.850           0.946           0.976                   2068.56   No
MAFFT             0.677           0.513          0.836           0.961           0.979                    225.56   No
Kalign            0.649           0.474          0.817           0.957           0.979                     80.81   No
ClustalW2         0.617           0.430          0.797           0.933           0.975                   3433.53   No
Dialign           0.595           0.398          0.783           0.940           0.974                 18 909.70   No
PRANK             0.586           0.390          0.767           0.951           0.978                351 498.00   No
FSA               0.534           0.277          0.791           0.965           0.976                229 391.00   No

Total column scores (TC) are shown for different percent identity ranges; the second column is the average score over all test cases. The total run time in seconds is shown in the second last column. The last column indicates
if the method is consistency based.

`The Prefab benchmark test results are shown in Table II.
`Here, the results are divided into five groups according to the
`percent identity of the sequences. The overall scores range
`from 53 to 73% of columns correct. The consistency-based
`programs MSAprobs, MAFFT L-INS-i, Probalign, Probcons and
`T-Coffee, are again the most accurate but with long run times.
`Clustal Omega is close to the consistency programs in accuracy
`but is much faster. There is then a gap to the faster progressive
`based programs of MUSCLE, MAFFT, Kalign (Lassmann and
`Sonnhammer, 2005) and Clustal W.
Results from testing large alignments with up to 50 000
sequences are given in Table III using HomFam. Here, each
alignment is made up of a core of a Homstrad (Mizuguchi et al,
1998) structure-based alignment of at least five sequences. These
sequences are then inserted into a test set of sequences from the
corresponding, homologous, Pfam domain. This gives very large
sets of sequences to be aligned but the testing is only carried out
on the sequences with known structures. Only some programs
are able to deliver alignments at all with data sets of this size. We
restricted the comparisons to Clustal Omega, MAFFT, MUSCLE
and Kalign. MAFFT with default settings has a limit of 20 000
sequences and we only use MAFFT with --parttree for the last
section of Table III. MUSCLE becomes increasingly slow beyond
about 3000 sequences. Therefore, for >3000 sequences
we used MUSCLE with the faster but less accurate setting of
-maxiters 2, which restricts the number of iterations to two.
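
Because only the sequences with known structures are scored, the large alignment first has to be reduced to those rows. The Python sketch below shows one way this could be done; it is an illustration rather than the benchmarking code, and the function name `project_onto_references`, the dictionary layout and the example identifiers are assumptions.

```python
from typing import Dict, List


def project_onto_references(alignment: Dict[str, str],
                            reference_ids: List[str]) -> Dict[str, str]:
    """Keep only the reference rows and drop columns that are gaps in all of them."""
    rows = [alignment[seq_id] for seq_id in reference_ids]
    keep = [c for c in range(len(rows[0])) if any(row[c] != "-" for row in rows)]
    return {seq_id: "".join(alignment[seq_id][c] for c in keep)
            for seq_id in reference_ids}


if __name__ == "__main__":
    # Hypothetical toy alignment: two structure-based rows plus one Pfam sequence.
    aln = {"ref1": "A--CD", "ref2": "A--C-", "pfamX": "AEFCD"}
    print(project_onto_references(aln, ["ref1", "ref2"]))
```

The projected rows can then be compared column by column with the structure-based reference alignment of the same sequences.
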
`Overall, Clustal Omega is easily the most accurate program
`in Table III. The run times show MAFFT default and Kalign
`to be exceptionally fast on the smaller test cases and MAFFT
`–parttree to be very fast on the biggest families. Clustal
`Omega does scale well, however, with increasing numbers of
`sequences. This scaling is described in more detail in the
Supplementary Information. We do have two further test cases
with >50 000 sequences, but it was not possible to get results
for these from MUSCLE or Kalign. These are described in the
Supplementary Information as well.
`Table III gives overall run times for the four programs
`evaluated with HomFam. Figure 1 resolves these run times
`case by case. Kalign is very fast for small families but does not
`scale as well. Overall, MAFFT is faster than the other programs
`over all test case sizes but Clustal Omega scales similarly.
`Points in Figure 1 represent different families with different
`average sequence lengths and pairwise identities. Therefore,
`the scalability trend is fuzzy, with larger dots occurring
`generally above smaller dots. Supplementary Figure S3 shows
`scalability data, where subsets of increasing size are sampled
`from one large family only. This reduces variability in pairwise
`identity and sequence length.
`
Figure 1 Alignment time for Clustal Omega (red), MAFFT (blue), MUSCLE
(green) and Kalign (purple) against the number of sequences of HomFam test
sets. Average sequence length is rendered by point size. Both axes have
logarithmic scales. Clustal Omega and Kalign were run with default flags over the
entire range. MUSCLE was run with -maxiters 2 for N>3000 sequences.
MAFFT was run with --parttree for N>10 000 sequences.
[Plot not reproduced. Panel title: HomFam; x axis: #Sequences; y axis: Time (s); point sizes bin the average sequence length in steps of 50 residues, from 1-50 to 400-450.]

Table III HomFam benchmarking results

Aligner               93 ≤ N ≤ 2957       3127 ≤ N ≤ 9105     10 099 ≤ N ≤ 50 157
                      (41 families)       (33 families)       (18 families)
                      TC/t (s)            TC/t (s)            TC/t (s)
Clustal O             0.708/2114.0        0.639/11 719.5      0.464/27 328.9
Kalign                0.569/324.9         0.563/6752.0        0.420/286 711.0
MAFFT default         0.550/238.9         0.462/3115.4        /
MAFFT --parttree      /                   /                   0.253/6119.4
MUSCLE default        0.533/104 587.0     /                   /
MUSCLE -maxiters 2    /                   0.416/8239.2        0.216/110 292.0

The columns show total column score (TC) and total run time in seconds for groupings of small (<3000 sequences), medium (3000–10 000 sequences) and large
(>10 000 sequences) HomFam test cases.

External profile alignment

Clustal Omega can read extra information from a profile HMM
derived from preexisting alignments. For example, if a user
wishes to align a set of globin sequences and has an existing
globin alignment, this alignment can be converted to a profile
HMM and used as well as the sequence input file. This HMM is
here referred to as an ‘external profile’ and its use in this way
as ‘external profile alignment’ (EPA). During EPA, each
sequence in the input set is aligned to the external profile.
Pseudocount information from the external profile is then
transferred, position by position, to the input sequence.
Ideally, this would be used with large curated alignments of
particular proteins or domains of interest such as are used in
metagenomics projects. Rather than taking the input sequences
and aligning them from scratch, every time new
sequences are found, the alignment should be carefully
maintained and used as an external profile for EPA. Clustal
Omega also can align sequences to existing alignments using
conventional alignment methods. Users can add sequences to
an alignment, one by one or align a set of aligned sequences to
the alignment.
In this paper, we demonstrate the EPA approach with two
examples. First, we take the 94 HomFam test cases from the
previous section and use the corresponding Pfam HMM for
EPA. Before EPA, the average accuracy for the test cases was
0.627 of correctly aligned Homstrad positions but after EPA it
rises to 0.653. This is plotted, test case for test case in
Figure 2A. Each dot is one test case with the TC score for
Clustal Omega plotted against the score using EPA. The second
example is illustrated in Figure 2B. Here, we take all the
BAliBASE reference sets and align them as normal using
Clustal Omega and obtain the benchmark result of 0.554 of
columns correctly aligned, as already reported in Table I. For
EPA, we use the benchmark reference alignments themselves
as external profiles. The results now jump to 0.857 of columns
correct. This is a jump of over 30% and while it is not a valid
measure of Clustal Omega accuracy for comparison with other
programs, it does illustrate the potential power of EPA to use
information in external alignments.

Figure 2 EPA for HomFam and BAliBASE. Points represent TC scores of Clustal Omega alignment with EPA versus TC scores of default Clustal Omega alignment
(without EPA). Points above bisectrix represent beneficial effect of EPA, points below deleterious effect. Average improvement in (A) 2.5%. HMMs taken from Pfam,
benchmarking carried out using corresponding structure-based alignment in Homstrad. Average improvement in (B) over 30%. Here, test sets and EPA-HMMs were
both derived from BAliBASE reference alignments.

Iteration

EPA can also be used in a simple iteration scheme. Once a MSA
has been made from a set of input sequences, it can be
converted into a HMM and used for EPA to help realign the
input sequences. This can also be combined with a full
recalculation of the guide tree. In Figure 3, we show the results
of one and two iterations on every test case from HomFam.
The graph is plotted as a running average TC score for all test
cases with N or fewer sequences where N is plotted on the
horizontal axis using a log scale. With some smaller test cases,
iteration actually has a detrimental effect. Once you get near
1000 or more sequences, however, a clear trend emerges.
The more sequences you have, the more beneficial the effect of
iteration is. With bigger test cases, it becomes more and more
beneficial to apply two iterations. This result confirms
the usefulness of EPA as a general strategy. It also confirms
the difficulty in aligning extremely large numbers of sequences
but gives one partial solution. It also gives a very simple but
effective iteration scheme, not just for guide tree iteration, as
used in many packages, but for iteration of the alignment itself.

Figure 3 Iteration of HomFam alignments. Points represent cumulative running
averages of TC scores. Clustal Omega default results in black, results after 1
iteration in red, after 2 iterations in blue. Iterations are combined HMM/guide tree
iterations; x axis, logarithmic and y axis, linear scale.
[Plot not reproduced. Panel title: ClustalΩ HomFam; x axis: #Sequences (10 to 100 000); y axis: TC score (running average), 0.58 to 0.78; series: Default, 1 Iteration, 2 Iterations.]

Discussion

The main breakthroughs since the mid 1980s in MSA methods
have been progressive alignment and the use of consistency.
Otherwise, most recent work has concerned refinements for
speed or accuracy on benchmark test sets. The speed increases
have been dramatic but, with just two major exceptions, the
methods are still basically O(N^2) and incapable of being
extended to data sets of >10 000 sequences. The two
exceptions are mBed, used here, and MAFFT PartTree. PartTree
is faster but at the expense of accuracy, at least as judged by the
benchmarking here. The second group of recent developments
has concerned accuracy. This has tended to focus on results
from benchmarking, a potentially contentious issue (Aniba
et al, 2010; Edgar, 2010). The benchmark test sets that we have
are limited in scope and heavily biased toward single domain
globular proteins. This has the potential to lead to methods
that behave well on benchmarks but which are not so flexible
or useful in real-world situations. One development to improve
accuracy has been the recruitment of extra homologs to
bulk up input data sets. This seems to work well with the
consistency-based methods and for small data sets. It appears,
however, that there is a limit to the extra accuracy that can be
obtained this way, without further development. The extra
sequences may also bring in noise and dramatically increase
the complexity of the computational problem. This can be
partly fixed by iteration but EPA to a high-quality reference
alignment might be a better solution. This also raises the need
for methods to visualize such large alignments, in order to
detect problems. A second major focus for development has
been the use of external information such as RNA structure
(Wilm et al, 2008) or protein structure predictions (Pirovano
et al, 2008).
EPA is a new approach that allows users to exploit
information in their own or in publicly available alignments.
It does not force new sequences to follow the older alignment
exactly. The new sequences get aligned to each other using
progressive alignment but the information in the external
profile can help provide information as to which amino
acids are most likely to occur at each position in a sequence.
Most methods attempt to predict this from general models of
protein evolution with secondary structure prediction as a
refinement. In this paper, we have shown that even using
the mass produced alignments from Pfam as external profiles
provides a small increase in accuracy for a large general set of
test cases. This opens up a new set of possibilities for users to
make use of the information contained in large, publicly
available alignments and creates an incentive for database
providers to make very high-quality alignments available.
One of the reasons for the great success of Clustal X was
the very user-friendly graphical user interface (GUI). This,
however, is not as critical as in the past due to the widespread
availability of web-based services where the GUI is provided
by the web-based front-end server. Further, there are several
very high-quality alignment viewers and editors such as
Jalview (Clamp et al, 2004) and Seaview (Gouy et al, 2010) that
read Clustal Omega output or which can call Clustal Omega
directly.

Materials and methods
Clustal Omega is licensed under the GNU Lesser General Public
License. Source code as well as precompiled binaries for Linux,
FreeBSD, Windows and Mac (Intel and PowerPC) are available at
http://www.clustal.org. Clustal Omega is available as a command line
program only, which uses GNU-style command line options, and also
accepts ClustalW-style command options for backwards compatibility
and easy integration into existing pipelines.
Clustal Omega is written in C and C++ and makes use of a number
of excellent free software packages. We used a modified version of
Sean Eddy’s Squid library (http://selab.janelia.org/software.html) for
sequence I/O, allowing the use of a wide variety of file formats. We use
David Arthur’s k-means++ code (Arthur and Vassilvitskii, 2007) for
fast clustering of sequence vectors. Code for fast UPGMA and guide
tree handling routines was adopted from MUSCLE (Edgar, 2004).
We use the OpenMP library to enable multithreaded computation of
pairwise distances and alignment match states. The documentation
for Clustal Omega’s API is part of the source code, and in addition
is available from http://www.clustal.org/omega/clustalo-api/. Full
details of all algorithms are given in the accompanying Supplementary
Information.
The benchmarks that were used were BAliBASE 3 (Thompson et al,
2005), PREFAB 4.0 (posted March 2005) (Edgar, 2010) and a newly
constructed data set (HomFam) using sequences from Pfam (version
25) and Homstrad (as of 2011-06-13) (Mizuguchi et al, 1998). The
programs that were compared can be obtained from:
ClustalW2, v2.1 (http://www.clustal.org)
DIALIGN 2.2.1 (http://dialign.gobics.de/)
FSA 1.15.5 (http://sourceforge.net/projects/fsa/)
Kalign 2.04 (http://msa.sbc.su.se/cgi-bin/msa.cgi)
MAFFT 6.857 (http://mafft.cbrc.jp/alignment/software/source.html)
MSAProbs 0.9.4 (http://sourceforge.net/projects/msaprobs/files/)
MUSCLE version 3.8.31 posted 1 May 2010 (http://www.drive5.com/muscle/downloads.htm)
PRANK v.100802, 2 August 2010 (http://www.ebi.ac.uk/goldman-srv/prank/src/prank/)
Probalign v1.4 (http://cs.njit.edu/usman/probalign/)
PROBCONS version 1.12 (http://probcons.stanford.edu/download.html)
T-Coffee Version 8.99 (http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html#DOWNLOAD).
`
`Supplementary information
`Supplementary information is available at the Molecular Systems
`Biology website (www.nature.com/msb).
`
`Acknowledgements
`This work was supported by Science Foundation Ireland (Grant
`number: 07/IN.1/B1783).
`Author contributions: DGH initiated the project;
`the original
`ClustalW software was produced by JDT, DGH and TJG. The HHalign
`code was written by JS and is maintained by MR. The idea of EPA was
`suggested by KK. FS adapted HHalign to Clustal Omega; AW devised
`and adapted guide tree construction routines. DD parallelised the code.
`AW, DD and FS carried out all software development. The EBI server
`was set up by RL, HMcW and WL. Benchmarking was carried out by
`JDT, AW, DD and FS. All authors contributed to the manuscript.
`
`Conflict of interest
`The authors declare that they have no conflict of interest.
`
`References
`
`Aniba MR, Poch O, Thompson JD (2010) Issues in bioinformatics
`benchmarking: the case study of multiple sequence alignment.
`Nucleic Acids Res 38: 7353–7363
`Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful
`seeding. Proceedings of
`the Eighteenth Annual ACM-SIAM
`Symposium on Discrete Algorithms. pp 1027–1035
`Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG (2010) Sequence
`emBedding for fast construction of guide trees for multiple
`sequence alignment. Algorithms Mol Biol 5: 21
`Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I,
`Pachter L (2009) Fast statistical alignment. PLoS Comput Biol 5:
`e1000392
`
`Clamp M, Cuff J, Searle SM, Barton GJ (2004) The Jalview Java
`alignment editor. Bioinformatics 20: 426–427
`Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S (2005) ProbCons:
`probabilistic consistency-based multiple sequence alignment.
`Genome Res 15: 330–340
`Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:
`755–763
`Edgar RC (2004) MUSCLE: multiple sequence alignment with high
`accuracy and high throughput. Nucleic Acids Res 32: 1792–1797
`Edgar RC (2010) Quality measures for protein alignment benchmarks.
`Nucleic Acids Res 38: 2145–2153
`Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL,
`Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL,
`Eddy SR, Bateman A (2009) The Pfam protein families database.
`Nucleic Acids Res 38: D211–D222
`Gouy M, Guindon S, Gascuel O (2010) SeaView version 4: a
`multiplatform graphical user interface for sequence alignment
`and phylogenetic tree building. Mol Biol Evol 27: 221–224
`Hogeweg P, Hesper B (1984) The alignment of sets of sequences and
`the construction of phyletic trees: an integrated method. J Mol Evol
`20: 175–186
`Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method
`for rapid multiple sequence alignment based on fast Fourier
`transform. Nucleic Acids Res 30: 3059–3066
`Katoh K, Toh H (2007) PartTree: an algorithm to build an approximate
`tree from a large number of unaligned sequences. Bioinformatics
`23: 372–374
`Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA,
`McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R,
`Thompson JD, Gibson TJ, Higgins DG (2007) Clustal W and
`Clustal X version 2.0. Bioinformatics 23: 2947–2948
`Lassmann T, Sonnhammer ELL (2005) Kalign—an accurate and
`fast multiple sequence alignment algorithm. BMC Bioinformatics
`6: 298
`Liu Y, Schmidt B, Maskell DL (2010) MSAProbs: multiple
`sequence alignment based on pair hidden Markov models
`and partition function posterior probabilities. Bioinformatics 26:
`1958–1964
Löytynoja A, Goldman N (2008) Phylogeny-aware gap placement
`prevents errors in sequence alignment and evolutionary analysis.
`Science 320: 1632–1635
`Mizuguchi K, Deane CM, Blundell TL, Overington JP (1998)
`HOMSTRAD: a database of protein structure alignments for
`homologous families. Protein Sci 7: 2469–2471
`Morgenstern B, Frech K, Dress A, Werner T (1998) DIALIGN: finding
`local similarities by multiple sequence alignment. Bioinformatics
`14: 290–294
`Notredame C, Higgins DG, Heringa J (2000) T-coffee: a novel method
`for fast and accurate multiple sequence alignment. J Mol Biol 302:
`205–217
Pirovano W, Feenstra KA, Heringa J (2008) PRALINETM: a strategy for
improved multiple alignment of transmembrane proteins.
Bioinformatics 24: 492–497
Söding J (2005) Protein homology detection by HMM–HMM
comparison. Bioinformatics 21: 951–960
`Thompson JD, Koehl P, Ripp R, Poch O (2005) BAliBASE 3.0: latest
`developments of the multiple sequence alignment benchmark.
`Proteins 61: 127–136
`Wilm A, Higgins DG, Notredame C (2008) R-Coffee: a method
`for multiple alignment of non-coding RNA. Nucleic Acids Res 36:
`e52
`
`Molecular Systems Biology is an open-access journal
`published by European Molecular Biology Organiza-
`tion and Nature Publishing Group. This work is licensed under a
`Creative Commons Attribution-Noncommercial-Share Alike 3.0
`Unported License.
`