Highly accurate protein structure prediction with AlphaFold

https://doi.org/10.1038/s41586-021-03819-2
Received: 11 May 2021
Accepted: 12 July 2021
Published online: 15 July 2021
Open access

John Jumper1,4 ✉, Richard Evans1,4, Alexander Pritzel1,4, Tim Green1,4, Michael Figurnov1,4, Olaf Ronneberger1,4, Kathryn Tunyasuvunakool1,4, Russ Bates1,4, Augustin Žídek1,4, Anna Potapenko1,4, Alex Bridgland1,4, Clemens Meyer1,4, Simon A. A. Kohl1,4, Andrew J. Ballard1,4, Andrew Cowie1,4, Bernardino Romera-Paredes1,4, Stanislav Nikolov1,4, Rishub Jain1,4, Jonas Adler1, Trevor Back1, Stig Petersen1, David Reiman1, Ellen Clancy1, Michal Zielinski1, Martin Steinegger2,3, Michalina Pacholska1, Tamas Berghammer1, Sebastian Bodenstein1, David Silver1, Oriol Vinyals1, Andrew W. Senior1, Koray Kavukcuoglu1, Pushmeet Kohli1 & Demis Hassabis1,4 ✉

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort1–4, the structures of around 100,000 unique proteins have been determined5, but this represents a small fraction of the billions of known protein sequences6,7. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’8—has been an important open research problem for more than 50 years9. Despite recent progress10–14, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14)15, demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.

The development of computational methods to predict three-dimensional (3D) protein structures from the protein sequence has proceeded along two complementary paths that focus on either the physical interactions or the evolutionary history. The physical interaction programme heavily integrates our understanding of molecular driving forces into either thermodynamic or kinetic simulation of protein physics16 or statistical approximations thereof17. Although theoretically very appealing, this approach has proved highly challenging for even moderate-sized proteins due to the computational intractability of molecular simulation, the context dependence of protein stability and the difficulty of producing sufficiently accurate models of protein physics. The evolutionary programme has provided an alternative in recent years, in which the constraints on protein structure are derived from bioinformatics analysis of the evolutionary history of proteins, homology to solved structures18,19 and pairwise evolutionary correlations20–24. This bioinformatics approach has benefited greatly from the steady growth of experimental protein structures deposited in the Protein Data Bank (PDB)5, the explosion of genomic sequencing and the rapid development of deep learning techniques to interpret these correlations. Despite these advances, contemporary physical and evolutionary-history-based approaches produce predictions that are far short of experimental accuracy in the majority of cases in which a close homologue has not been solved experimentally, and this has limited their utility for many biological applications.
In this study, we develop the first, to our knowledge, computational approach capable of predicting protein structures to near experimental accuracy in a majority of cases. The neural network AlphaFold that we developed was entered into the CASP14 assessment (May–July 2020; entered under the team name ‘AlphaFold2’ and a completely different model from our CASP13 AlphaFold system10). The CASP assessment is carried out biennially using recently solved structures that have not been deposited in the PDB or publicly disclosed, so that it is a blind test for the participating methods, and has long served as the gold-standard assessment for the accuracy of structure prediction25,26.

1DeepMind, London, UK. 2School of Biological Sciences, Seoul National University, Seoul, South Korea. 3Artificial Intelligence Institute, Seoul National University, Seoul, South Korea. 4These authors contributed equally: John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Demis Hassabis.
✉e-mail: jumper@deepmind.com; dhcontact@deepmind.com

[Figure 1 graphic. a, Median Cα r.m.s.d.95 bar chart for AlphaFold and the other top CASP14 groups (G427, G473, G403 and others). b, AlphaFold versus experiment, r.m.s.d.95 = 0.8 Å, TM-score = 0.93. c, AlphaFold versus experiment, r.m.s.d. = 0.59 Å within 8 Å of Zn. d, AlphaFold versus experiment, r.m.s.d.95 = 2.2 Å, TM-score = 0.96. e, Architecture schematic: input sequence → genetic database search → MSA representation (s,r,c); structure database search → templates; pair representation (r,r,c); Evoformer (48 blocks); structure module (8 blocks) with single representation (r,c); 3D structure coloured by high/low confidence; recycling (three times).]

Fig. 1 | AlphaFold produces highly accurate structures. a, The performance of AlphaFold on the CASP14 dataset (n = 87 protein domains) relative to the top-15 entries (out of 146 entries); group numbers correspond to the numbers assigned to entrants by CASP. Data are median and the 95% confidence interval of the median, estimated from 10,000 bootstrap samples. b, Our prediction of CASP14 target T1049 (PDB 6Y4F, blue) compared with the true (experimental) structure (green). Four residues in the C terminus of the crystal structure are B-factor outliers and are not depicted. c, CASP14 target T1056 (PDB 6YJ1). An example of a well-predicted zinc-binding site (AlphaFold has accurate side chains even though it does not explicitly predict the zinc ion). d, CASP target T1044 (PDB 6VR4)—a 2,180-residue single chain—was predicted with correct domain packing (the prediction was made after CASP using AlphaFold without intervention). e, Model architecture. Arrows show the information flow among the various components described in this paper. Array shapes are shown in parentheses with s, number of sequences (Nseq in the main text); r, number of residues (Nres in the main text); c, number of channels.

In CASP14, AlphaFold structures were vastly more accurate than competing methods. AlphaFold structures had a median backbone accuracy of 0.96 Å r.m.s.d.95 (Cα root-mean-square deviation at 95% residue coverage) (95% confidence interval = 0.85–1.16 Å), whereas the next best performing method had a median backbone accuracy of 2.8 Å r.m.s.d.95 (95% confidence interval = 2.7–4.0 Å) (measured on CASP domains; see Fig. 1a for backbone accuracy and Supplementary Fig. 14 for all-atom accuracy). As a comparison point for this accuracy, the width of a carbon atom is approximately 1.4 Å. In addition to very accurate domain structures (Fig. 1b), AlphaFold is able to produce highly accurate side chains (Fig. 1c) when the backbone is highly accurate, and considerably improves over template-based methods even when strong templates are available. The all-atom accuracy of AlphaFold was 1.5 Å r.m.s.d.95 (95% confidence interval = 1.2–1.6 Å) compared with the 3.5 Å r.m.s.d.95 (95% confidence interval = 3.1–4.2 Å) of the best alternative method. Our methods are scalable to very long proteins with accurate domains and domain-packing (see Fig. 1d for the prediction of a 2,180-residue protein with no structural homologues). Finally, the model is able to provide precise, per-residue estimates of its reliability that should enable the confident use of these predictions.
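
For illustration, the following minimal Python sketch computes one plausible reading of r.m.s.d.95 (Cα root-mean-square deviation at 95% residue coverage). The exact residue-selection rule is specified in the paper's methods; the drop-the-worst-5% rule used here, and the helper names, are assumptions for illustration only.

import numpy as np

def kabsch(P, Q):
    # Optimal rotation R and translation t superposing points P onto Q
    # (standard Kabsch algorithm; rows are 3D points).
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(U @ Vt))          # avoid reflections
    R = (U @ np.diag([1.0, 1.0, d]) @ Vt).T
    return R, Q.mean(0) - P.mean(0) @ R.T

def rmsd95(pred_ca, true_ca):
    # Superpose all residues, keep the best-fitting 95% (an assumption),
    # re-superpose on that subset and report its r.m.s.d.
    R, t = kabsch(pred_ca, true_ca)
    dev = np.linalg.norm(pred_ca @ R.T + t - true_ca, axis=1)
    keep = np.argsort(dev)[: int(np.ceil(0.95 * len(dev)))]
    R, t = kabsch(pred_ca[keep], true_ca[keep])
    aligned = pred_ca[keep] @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - true_ca[keep]) ** 2, axis=1))))
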
We demonstrate in Fig. 2a that the high accuracy that AlphaFold demonstrated in CASP14 extends to a large sample of recently released PDB structures; in this dataset, all structures were deposited in the PDB after our training data cut-off and are analysed as full chains (see Methods, Supplementary Fig. 15 and Supplementary Table 6 for more details). Furthermore, we observe high side-chain accuracy when the backbone prediction is accurate (Fig. 2b) and we show that our confidence measure, the predicted local-distance difference test (pLDDT), reliably predicts the Cα local-distance difference test (lDDT-Cα) accuracy of the corresponding prediction (Fig. 2c). We also find that the global superposition metric template modelling score (TM-score)27 can be accurately estimated (Fig. 2d). Overall, these analyses validate that the high accuracy and reliability of AlphaFold on CASP14 proteins also transfers to an uncurated collection of recent PDB submissions, as would be expected (see Supplementary Methods 1.15 and Supplementary Fig. 11 for confirmation that this high accuracy extends to new folds).

The AlphaFold network

AlphaFold greatly improves the accuracy of structure prediction by incorporating novel neural network architectures and training procedures based on the evolutionary, physical and geometric constraints of protein structures. In particular, we demonstrate a new architecture to jointly embed multiple sequence alignments (MSAs) and pairwise features, a new output representation and associated loss that enable accurate end-to-end structure prediction, a new equivariant attention architecture, use of intermediate losses to achieve iterative refinement of predictions, masked MSA loss to jointly train with the structure, learning from unlabelled protein sequences using self-distillation and self-estimates of accuracy.
The AlphaFold network directly predicts the 3D coordinates of all heavy atoms for a given protein using the primary amino acid sequence and aligned sequences of homologues as inputs (Fig. 1e; see Methods for details of inputs including databases, MSA construction and use of templates). A description of the most important ideas and components is provided below. The full network architecture and training procedure are provided in the Supplementary Methods.

The network comprises two main stages. First, the trunk of the network processes the inputs through repeated layers of a novel neural network block that we term Evoformer to produce an Nseq × Nres array (Nseq, number of sequences; Nres, number of residues) that represents a processed MSA and an Nres × Nres array that represents residue pairs. The MSA representation is initialized with the raw MSA (although see Supplementary Methods 1.2.7 for details of handling very deep MSAs). The Evoformer blocks contain a number of attention-based and non-attention-based components. We show evidence in ‘Interpreting the neural network’ that a concrete structural hypothesis arises early within the Evoformer blocks and is continuously refined. The key innovations in the Evoformer block are new mechanisms to exchange information within the MSA and pair representations that enable direct reasoning about the spatial and evolutionary relationships.

The trunk of the network is followed by the structure module that introduces an explicit 3D structure in the form of a rotation and translation for each residue of the protein (global rigid body frames). These representations are initialized in a trivial state with all rotations set to the identity and all positions set to the origin, but rapidly develop and refine a highly accurate protein structure with precise atomic details. Key innovations in this section of the network include breaking the chain structure to allow simultaneous local refinement of all parts of the structure, a novel equivariant transformer to allow the network to implicitly reason about the unrepresented side-chain atoms and a loss term that places substantial weight on the orientational correctness of the residues. Both within the structure module and throughout the whole network, we reinforce the notion of iterative refinement by repeatedly applying the final loss to outputs and then feeding the outputs recursively into the same modules. The iterative refinement using the whole network (which we term ‘recycling’ and is related to approaches in computer vision28,29) contributes markedly to accuracy with minor extra training time (see Supplementary Methods 1.8 for details).
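
To make the recycling control flow concrete, here is a minimal, hypothetical Python sketch (the network stand-in, shapes and update rules are illustrative placeholders, not the paper's code): the same network is applied four times in total, each pass consuming the previous pass's outputs.

import numpy as np

def network_pass(msa_repr, pair_repr, coords):
    # Stand-in for one full pass through the Evoformer trunk and the
    # structure module; a real pass is vastly more complex.
    pair_repr = pair_repr + 0.1 * np.tanh(pair_repr)
    coords = coords + pair_repr.mean(axis=1)[:, :3]   # dummy coordinate update
    return msa_repr, pair_repr, coords

def predict_with_recycling(msa_repr, pair_repr, n_recycle=3):
    coords = np.zeros((pair_repr.shape[0], 3))  # trivial start: all at the origin
    for _ in range(1 + n_recycle):              # initial pass plus three recycles
        # Outputs are fed back into the same modules; AlphaFold additionally
        # embeds features of the previously predicted structure.
        msa_repr, pair_repr, coords = network_pass(msa_repr, pair_repr, coords)
    return coords

rng = np.random.default_rng(0)
final_coords = predict_with_recycling(rng.normal(size=(8, 16, 4)),
                                      rng.normal(size=(16, 16, 4)))
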
Evoformer

The key principle of the building block of the network—named Evoformer (Figs. 1e, 3a)—is to view the prediction of protein structures as a graph inference problem in 3D space in which the edges of the graph are defined by residues in proximity. The elements of the pair representation encode information about the relation between the residues (Fig. 3b). The columns of the MSA representation encode the individual residues of the input sequence while the rows represent the sequences in which those residues appear. Within this framework, we define a number of update operations that are applied in each block, with the different update operations applied in series.

The MSA representation updates the pair representation through an element-wise outer product that is summed over the MSA sequence dimension. In contrast to previous work30, this operation is applied within every block rather than once in the network, which enables the continuous communication from the evolving MSA representation to the pair representation.
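
A minimal sketch of this update (array shapes follow Fig. 1e; the projection sizes and the final linear map are illustrative assumptions, not the paper's exact configuration): each pair entry (i, j) receives the mean over sequences of the outer product of projected MSA activations at columns i and j.

import numpy as np

def outer_product_mean(msa, w_a, w_b, w_out):
    # msa: (s, r, c) MSA representation; w_a, w_b: (c, d); w_out: (d*d, c_pair).
    a = msa @ w_a                                            # (s, r, d)
    b = msa @ w_b                                            # (s, r, d)
    # outer[i, j] = mean over sequences of a[:, i] (outer product) b[:, j]
    outer = np.einsum('sid,sje->ijde', a, b) / msa.shape[0]  # (r, r, d, d)
    r = msa.shape[1]
    return outer.reshape(r, r, -1) @ w_out                   # (r, r, c_pair)

rng = np.random.default_rng(0)
s, r, c, d, c_pair = 8, 16, 32, 4, 24
pair_update = outer_product_mean(rng.normal(size=(s, r, c)),
                                 rng.normal(size=(c, d)), rng.normal(size=(c, d)),
                                 rng.normal(size=(d * d, c_pair)))
assert pair_update.shape == (r, r, c_pair)   # added residually to the pair stack
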
[Figure 2 graphic. a, Histogram of full-chain Cα r.m.s.d.95 (bins 0–0.5, 0.5–1, 1–2, 2–4, 4–8 and >8 Å) versus fraction of proteins. b, Fraction of correct χ1 rotamers versus per-residue lDDT-Cα. c, Chain lDDT-Cα versus average pLDDT on the resolved region. d, Full-chain TM-score versus pTM on the resolved region.]

Fig. 2 | Accuracy of AlphaFold on recent PDB structures. The analysed structures are newer than any structure in the training set. Further filtering is applied to reduce redundancy (see Methods). a, Histogram of backbone r.m.s.d. for full chains (Cα r.m.s.d. at 95% coverage). Error bars are 95% confidence intervals (Poisson). This dataset excludes proteins with a template (identified by hmmsearch) from the training set with more than 40% sequence identity covering more than 1% of the chain (n = 3,144 protein chains). The overall median is 1.46 Å (95% confidence interval = 1.40–1.56 Å). Note that this measure will be highly sensitive to domain packing and domain accuracy; a high r.m.s.d. is expected for some chains with uncertain packing or packing errors. b, Correlation between backbone accuracy and side-chain accuracy. Filtered to structures with any observed side chains and resolution better than 2.5 Å (n = 5,317 protein chains); side chains were further filtered to B-factor < 30 Å². A rotamer is classified as correct if the predicted torsion angle is within 40°. Each point aggregates a range of lDDT-Cα, with a bin size of 2 units above 70 lDDT-Cα and 5 units otherwise. Points correspond to the mean accuracy; error bars are 95% confidence intervals (Student t-test) of the mean on a per-residue basis. c, Confidence score compared to the true accuracy on chains. Least-squares linear fit lDDT-Cα = 0.997 × pLDDT − 1.17 (Pearson’s r = 0.76). n = 10,795 protein chains. The shaded region of the linear fit represents a 95% confidence interval estimated from 10,000 bootstrap samples. In the companion paper39, additional quantification of the reliability of pLDDT as a confidence measure is provided. d, Correlation between pTM and full chain TM-score. Least-squares linear fit TM-score = 0.98 × pTM + 0.07 (Pearson’s r = 0.85). n = 10,795 protein chains. The shaded region of the linear fit represents a 95% confidence interval estimated from 10,000 bootstrap samples.

[Figure 3 graphic. a, Evoformer block (48 blocks, no shared weights): the MSA representation (s,r,c) passes through row-wise gated self-attention with pair bias, column-wise gated self-attention and a transition layer; an outer product mean feeds the pair representation (r,r,c), which passes through triangle updates using outgoing and incoming edges, triangle self-attention around the starting and ending nodes, and a transition layer. b, Pair representation (r,r,c) as directed edges in a graph over residues i, j, k. c, Diagrams of the triangle multiplicative updates using ‘outgoing’ and ‘incoming’ edges and the triangle self-attention around the starting and ending nodes; in each diagram the edge being updated is ij. d, Structure module (8 blocks, shared weights): the single representation (r,c) passes through the IPA module, predicts χ angles and all atom positions, and predicts relative rotations and translations of the backbone frames (r, 3×3) and (r,3), initially all at the origin. e, Residue gas. f, FAPE, with frames (Rk, tk) and atom positions xi.]

Fig. 3 | Architectural details. a, Evoformer block. Arrows show the information flow. The shape of the arrays is shown in parentheses. b, The pair representation interpreted as directed edges in a graph. c, Triangle multiplicative update and triangle self-attention. The circles represent residues. Entries in the pair representation are illustrated as directed edges and, in each diagram, the edge being updated is ij. d, Structure module including the invariant point attention (IPA) module. The single representation is a copy of the first row of the MSA representation. e, Residue gas: a representation of each residue as one free-floating rigid body for the backbone (blue triangles) and χ angles for the side chains (green circles). The corresponding atomic structure is shown below. f, Frame aligned point error (FAPE). Green, predicted structure; grey, true structure; (Rk, tk), frames; xi, atom positions.

Within the pair representation, there are two different update patterns. Both are inspired by the necessity of consistency of the pair representation—for a pairwise description of amino acids to be representable as a single 3D structure, many constraints must be satisfied including the triangle inequality on distances. On the basis of this intuition, we arrange the update operations on the pair representation in terms of triangles of edges involving three different nodes (Fig. 3c). In particular, we add an extra logit bias to axial attention31 to include the ‘missing edge’ of the triangle and we define a non-attention update operation ‘triangle multiplicative update’ that uses two edges to update the missing third edge (see Supplementary Methods 1.6.5 for details). The triangle multiplicative update was developed originally as a more symmetric and cheaper replacement for the attention, and networks that use only the attention or multiplicative update are both able to produce high-accuracy structures. However, the combination of the two updates is more accurate.
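
As a concrete illustration, here is a simplified sketch of the ‘outgoing edges’ triangle multiplicative update (layer normalization, input gating and the exact projections of Supplementary Methods 1.6.5 are omitted; channel sizes are illustrative): the update to edge ij combines edges ik and jk over all third nodes k.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def triangle_update_outgoing(pair, w_a, w_b, w_g, w_out):
    # pair: (r, r, c) pair representation.
    a = pair @ w_a                                 # (r, r, d), edges i->k
    b = pair @ w_b                                 # (r, r, d), edges j->k
    # update[i, j] = sum over k of a[i, k] * b[j, k] (elementwise in channels),
    # so the missing third edge ij is informed by the two edges through k.
    update = np.einsum('ikd,jkd->ijd', a, b)       # (r, r, d)
    gate = sigmoid(pair @ w_g)                     # (r, r, c) output gating
    return gate * (update @ w_out)                 # (r, r, c)

rng = np.random.default_rng(0)
r, c, d = 16, 8, 4
pair = rng.normal(size=(r, r, c))
pair = pair + triangle_update_outgoing(pair, rng.normal(size=(c, d)),
                                       rng.normal(size=(c, d)),
                                       rng.normal(size=(c, c)),
                                       rng.normal(size=(d, c)))

The ‘incoming edges’ variant is analogous, with the contraction taken over edges ki and kj.
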
We also use a variant of axial attention within the MSA representation. During the per-sequence attention in the MSA, we project additional logits from the pair stack to bias the MSA attention. This closes the loop by providing information flow from the pair representation back into the MSA representation, ensuring that the overall Evoformer block is able to fully mix information between the pair and MSA representations and prepare for structure generation within the structure module.
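
A minimal single-head sketch of this pair-biased row attention (gating, multiple heads and the exact projections are omitted; sizes are illustrative): attention logits between columns i and j are shifted by a scalar projected from pair entry (i, j), so every sequence's row attention sees the current pair hypothesis.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def msa_row_attention_with_pair_bias(msa, pair, w_q, w_k, w_v, w_bias):
    # msa: (s, r, c); pair: (r, r, c_pair); w_bias: (c_pair, 1).
    q, k, v = msa @ w_q, msa @ w_k, msa @ w_v          # (s, r, d) each
    bias = (pair @ w_bias)[..., 0]                     # (r, r), one logit per pair
    logits = np.einsum('sid,sjd->sij', q, k) / np.sqrt(q.shape[-1]) + bias
    return softmax(logits, axis=-1) @ v                # (s, r, d)

rng = np.random.default_rng(0)
s, r, c, c_pair, d = 8, 16, 32, 24, 8
out = msa_row_attention_with_pair_bias(
    rng.normal(size=(s, r, c)), rng.normal(size=(r, r, c_pair)),
    rng.normal(size=(c, d)), rng.normal(size=(c, d)), rng.normal(size=(c, d)),
    rng.normal(size=(c_pair, 1)))
assert out.shape == (s, r, d)
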
End-to-end structure prediction

The structure module (Fig. 3d) operates on a concrete 3D backbone structure using the pair representation and the original sequence row (single representation) of the MSA representation from the trunk. The 3D backbone structure is represented as Nres independent rotations and translations, each with respect to the global frame (residue gas) (Fig. 3e). These rotations and translations—representing the geometry of the N-Cα-C atoms—prioritize the orientation of the protein backbone so that the location of the side chain of each residue is highly constrained within that frame. Conversely, the peptide bond geometry is completely unconstrained and the network is observed to frequently violate the chain constraint during the application of the structure module, as breaking this constraint enables the local refinement of all parts of the chain without solving complex loop closure problems. Satisfaction of the peptide bond geometry is encouraged during fine-tuning by a violation loss term. Exact enforcement of peptide bond geometry is only achieved in the post-prediction relaxation of the structure by gradient descent in the Amber32 force field. Empirically, this final relaxation does not improve the accuracy of the model as measured by the global distance test (GDT)33 or lDDT-Cα34 but does remove distracting stereochemical violations without the loss of accuracy.
[Figure 4 graphic. a, Ablation results shown as GDT difference (CASP14 domains) and lDDT-Cα difference (PDB chains) compared with baseline, for: with self-distillation training; baseline; no templates; no auxiliary distogram head; no raw MSA (use MSA pairwise frequencies); no IPA (use direct projection); no auxiliary masked MSA head; no recycling; no triangles, biasing or gating (use axial attention); no end-to-end structure gradients (keep auxiliary heads); no IPA and no recycling. b, Domain GDT over 192 Evoformer blocks for T1024 D1, T1024 D2 and T1064 D1.]

Fig. 4 | Interpreting the neural network. a, Ablation results on two target sets: the CASP14 set of domains (n = 87 protein domains) and the PDB test set of chains with template coverage of ≤30% at 30% identity (n = 2,261 protein chains). Domains are scored with GDT and chains are scored with lDDT-Cα. The ablations are reported as a difference compared with the average of the three baseline seeds. Means (points) and 95% bootstrap percentile intervals (error bars) are computed using bootstrap estimates of 10,000 samples. b, Domain GDT trajectory over 4 recycling iterations and 48 Evoformer blocks on CASP14 targets LmrP (T1024) and Orf8 (T1064), where D1 and D2 refer to the individual domains as defined by the CASP assessment. Both T1024 domains obtain the correct structure early in the network, whereas the structure of T1064 changes multiple times and requires nearly the full depth of the network to reach the final structure. Note, 48 Evoformer blocks comprise one recycling iteration.

The residue gas representation is updated iteratively in two stages (Fig. 3d). First, a geometry-aware attention operation that we term ‘invariant point attention’ (IPA) is used to update an Nres set of neural activations (single representation) without changing the 3D positions, then an equivariant update operation is performed on the residue gas using the updated activations. The IPA augments each of the usual attention queries, keys and values with 3D points that are produced in the local frame of each residue such that the final value is invariant to global rotations and translations (see Methods ‘IPA’ for details). The 3D queries and keys also impose a strong spatial/locality bias on the attention, which is well-suited to the iterative refinement of the protein structure. After each attention operation and element-wise transition block, the module computes an update to the rotation and translation of each backbone frame. The application of these updates within the local frame of each residue makes the overall attention and update block an equivariant operation on the residue gas.
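
To convey the core idea, here is a heavily simplified, single-head sketch of IPA (the real module also has point values, pair-representation terms, learned per-head weights and several points per residue; all sizes here are illustrative): each residue produces query and key points in its local frame, the current backbone frames map them to the global frame, and squared distances between them penalize the attention logits.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def invariant_point_attention(single, R, t, w_q, w_k, w_v, w_qp, w_kp, gamma=1.0):
    # single: (r, c) single representation; R: (r, 3, 3) and t: (r, 3) are the
    # rotations and translations of the current backbone frames.
    q, k, v = single @ w_q, single @ w_k, single @ w_v       # (r, d) each
    # Points are produced in each residue's local frame and mapped to the
    # global frame with that residue's backbone frame: x_global = R x + t.
    qp = np.einsum('rab,rb->ra', R, single @ w_qp) + t       # (r, 3)
    kp = np.einsum('rab,rb->ra', R, single @ w_kp) + t       # (r, 3)
    sq_dist = ((qp[:, None, :] - kp[None, :, :]) ** 2).sum(-1)   # (r, r)
    # These distances are unchanged by any global rotation/translation of all
    # frames, so the output is invariant; large distances suppress attention,
    # giving the spatial/locality bias described above.
    logits = (q @ k.T) / np.sqrt(q.shape[-1]) - 0.5 * gamma * sq_dist
    return softmax(logits, axis=-1) @ v                      # (r, d)

rng = np.random.default_rng(0)
r, c, d = 16, 32, 8
out = invariant_point_attention(
    rng.normal(size=(r, c)),
    np.tile(np.eye(3), (r, 1, 1)), np.zeros((r, 3)),  # trivial initial frames
    rng.normal(size=(c, d)), rng.normal(size=(c, d)), rng.normal(size=(c, d)),
    rng.normal(size=(c, 3)), rng.normal(size=(c, 3)))
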
Predictions of side-chain χ angles as well as the final, per-residue accuracy of the structure (pLDDT) are computed with small per-residue networks on the final activations at the end of the network. The estimate of the TM-score (pTM) is obtained from a pairwise error prediction that is computed as a linear projection from the final pair representation. The final loss (which we term the frame-aligned point error (FAPE) (Fig. 3f)) compares the predicted atom positions to the true positions under many different alignments. For each alignment, defined by aligning the predicted frame (Rk, tk) to the corresponding true frame, we compute the distance of all predicted atom positions xi from the true atom positions. The resulting Nframes × Natoms distances are penalized with a clamped L1 loss. This creates a strong bias for atoms to be correct relative to the local frame of each residue and hence correct with respect to its side-chain interactions, as well as providing the main source of chirality for AlphaFold (Supplementary Methods 1.9.3 and Supplementary Fig. 9).
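
Following this description, a minimal sketch of the FAPE computation (the 10 Å clamp and normalization reflect our reading of the Supplementary Methods; treat them as assumptions):

import numpy as np

def fape(R_pred, t_pred, R_true, t_true, x_pred, x_true, d_clamp=10.0, z=10.0):
    # R_*: (f, 3, 3) frame rotations; t_*: (f, 3) translations; x_*: (n, 3) atoms.
    def to_local(R, t, x):
        diff = x[None, :, :] - t[:, None, :]           # (f, n, 3)
        return np.einsum('kba,knb->kna', R, diff)      # R_k^T (x_i - t_k)

    # Nframes x Natoms distances between predicted and true atom positions,
    # each measured in the local coordinates of the corresponding frame pair.
    d = np.linalg.norm(to_local(R_pred, t_pred, x_pred)
                       - to_local(R_true, t_true, x_true), axis=-1)   # (f, n)
    return float(np.mean(np.minimum(d, d_clamp)) / z)

Because local coordinates flip sign under reflection, FAPE distinguishes a structure from its mirror image, which is the source of chirality noted above.
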
Training with labelled and unlabelled data

The AlphaFold architecture is able to train to high accuracy using only supervised learning on PDB data, but we are able to enhance accuracy (Fig. 4a) using an approach similar to noisy student self-distillation35. In this procedure, we use a trained network to predict the structure of around 350,000 diverse sequences from Uniclust3036 and make a new dataset of predicted structures filtered to a high-confidence subset. We then train the same architecture again from scratch using a mixture of PDB data and this new dataset of predicted structures as the training data, in which the various training data augmentations such as cropping and MSA subsampling make it challenging for the network to recapitulate the previously predicted structures. This self-distillation procedure makes effective use of the unlabelled sequence data and considerably improves the accuracy of the resulting network.
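
Schematically, the procedure looks like the following sketch (the function names, confidence threshold and mixing ratio are hypothetical placeholders; the exact filtering is specified in the Supplementary Methods):

import random

def build_distillation_set(teacher_predict, sequences, conf_threshold=0.5):
    # Keep only predictions the trained teacher is confident about
    # (the threshold here is a placeholder, not the paper's criterion).
    dataset = []
    for seq in sequences:
        structure, confidence = teacher_predict(seq)
        if confidence >= conf_threshold:
            dataset.append((seq, structure))
    return dataset

def sample_training_example(pdb_data, distilled_data, p_pdb=0.25):
    # Retrain from scratch on a mixture of real and predicted structures
    # (the mixing ratio is illustrative); cropping and MSA subsampling
    # are applied downstream of this sampling step.
    pool = pdb_data if random.random() < p_pdb else distilled_data
    return random.choice(pool)
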
Additionally, we randomly mask out or mutate individual residues within the MSA and have a Bidirectional Encoder Representations from Transformers (BERT)-style37 objective to predict the masked elements of the MSA sequences. This objective encourages the network to learn to interpret phylogenetic and covariation relationships without hardcoding a particular correlation statistic into the features. The BERT objective is trained jointly with the normal PDB structure loss on the same training examples and is not pre-trained, in contrast to recent independent work38.
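
A sketch of the corruption step (the 15% rate and the mask/mutate/keep split below follow the usual BERT recipe and are assumptions, not the paper's exact figures):

import numpy as np

AMINO_ACIDS = list('ACDEFGHIKLMNPQRSTVWY')
MASK_TOKEN = '#'   # hypothetical mask symbol

def corrupt_msa(msa, p_corrupt=0.15, p_mask=0.8, p_mutate=0.1, rng=None):
    # msa: list of equal-length sequences. Returns the corrupted MSA and a
    # boolean mask of positions whose original residues must be predicted.
    rng = rng or np.random.default_rng()
    corrupted, targets = [], []
    for seq in msa:
        chars = list(seq)
        picked = rng.random(len(chars)) < p_corrupt
        for i in np.where(picked)[0]:
            u = rng.random()
            if u < p_mask:
                chars[i] = MASK_TOKEN                 # mask out
            elif u < p_mask + p_mutate:
                chars[i] = rng.choice(AMINO_ACIDS)    # mutate to a random residue
            # otherwise keep the residue but still include it as a target
        corrupted.append(''.join(chars))
        targets.append(picked)
    return corrupted, targets
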
Interpreting the neural network

To understand how AlphaFold predicts protein structure, we trained a separate structure module for each of the 48 Evoformer blocks in the network while keeping all parameters of the main network frozen (Supplementary Methods 1.14). Including our recycling stages, this provides a trajectory of 192 intermediate structures—one per full Evoformer block—in which each intermediate represents the belief of the network of the most likely structure at that block. The resulting trajectories are surprisingly smooth after the first few blocks, showing that AlphaFold makes constant incremental improvements to the structure until it can no longer improve (see Fig. 4b for a trajectory of accuracy). These trajectories also illustrate the role of network depth. For very challenging proteins such as ORF8 of SARS-CoV-2 (T1064), the network searches and rearranges secondary structure elements for many layers before settling on a good structure. For other proteins such as LmrP (T1024), the network finds the final structure within the first few layers. Structure trajectories of CASP14 targets T1024, T1044, T1064 and T1091 that demonstrate a clear iterative building process for a range of protein sizes and difficulties are shown in Supplementary Videos 1–4. In Supplementary Methods 1.16 and Supplementary Figs. 12, 13, we interpret the attention maps produced by AlphaFold layers.

Figure 4a contains detailed ablations of the components of AlphaFold that demonstrate that a variety of different mechanisms contribute to AlphaFold accuracy. Detailed descriptions of each ablation model, their training details, extended discussion of ablation results and the effect of MSA depth on each ablation are provided in Supplementary Methods 1.13 and Supplementary Fig. 10.

[Figure 5 graphic. a, lDDT-Cα versus median per-residue Neff for the chain (log scale, 10^0–10^4), shown separately for template coverage < 0.3 and > 0.6. b, AlphaFold prediction versus experiment for an intertwined homotrimer.]

Fig. 5 | Effect of MSA depth and cross-chain contacts. a, Backbone accuracy (lDDT-Cα) for the redundancy-reduced set of the PDB after our training data cut-off, restricting to proteins in which at most 25% of the long-range contacts are between different heteromer chains. We further consider two groups of proteins based on template coverage at 30% sequence identity: covering more than 60% of the chain (n = 6,743 protein chains) and covering less than 30% of the chain (n = 1,596 protein chains). MSA depth is computed by counting the number of non-gap residues for each position in the MSA (using the Neff weighting scheme; see Methods for details) and taking the median across residues. The curves are obtained through Gaussian kernel average smoothing (window size is 0.2 units in log10(Neff)); the shaded area is the 95% confidence interval estimated using bootstrap of 10,000 samples. b, An intertwined homotrimer (PDB 6SK0) is correctly predicted without input stoichiometry and only a weak template (blue is predicted and green is experimental).

MSA depth and cross-chain contacts

Although AlphaFold has a high accuracy across the vast majority of deposited PDB structures, we note that there are still factors that affect accuracy or limit the applicability of the model. The model uses MSAs and the accuracy decreases substantially when the median alignment depth is less than around 30 sequences (see Fig. 5a for details). We observe a threshold effect where improvements in MSA depth over around 100 sequences lead to small gains. We hypothesize that the MSA information is needed to coarsely find the correct structure within the early stages of the network, but refinement of that prediction into a high-accuracy model does not depend crucially on the MSA information.
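
For concreteness, the following sketch computes a median per-residue alignment depth in the spirit of the Fig. 5 legend (the 80% identity threshold for the Neff-style sequence weighting is an assumption; the paper's exact weighting is in the Methods):

import numpy as np

def median_msa_depth(msa, identity_threshold=0.8):
    # msa: iterable of equal-length aligned sequences with '-' for gaps.
    arr = np.array([list(s) for s in msa])                     # (s, r)
    identity = (arr[:, None, :] == arr[None, :, :]).mean(-1)   # (s, s)
    # Downweight redundant sequences: weight 1 / (cluster size at threshold).
    weights = 1.0 / (identity >= identity_threshold).sum(1)    # (s,)
    depth_per_position = (weights[:, None] * (arr != '-')).sum(0)   # (r,)
    return float(np.median(depth_per_position))

print(median_msa_depth(['ACDE-', 'ACDE-', 'A-DEF', 'QWERT']))
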
The other substantial limitation that we have observed is that AlphaFold is much weaker for proteins that have few intra-chain or homotypic contacts compared to the number of heterotypic contacts (further details are provided in a companion paper39). This typically occurs for bridging domains within larger complexes in which the shape of the protein is created almost entirely by interactions with other chains in the complex. Conversely, AlphaFold is often able to give high-accuracy predictions for homomers, even when the chains are substantially intertwined (Fig. 5b). We expect that the ideas of AlphaFold are readily applicable to predicting full hetero-complexes in a future system and that this will remove the difficulty with protein chains that have a large number of hetero-contacts.

Related work

The prediction of protein structures has had a long and varied development, which is extensively covered in a number of reviews14,40–43. Despite the lon
