ISMB-97

Proceedings
Fifth International Conference
on Intelligent Systems
for Molecular Biology

Edited By

Terry Gaasterland, Peter Karp, Kevin Karplus,
Christos Ouzounis, Chris Sander, & Alfonso Valencia

AAAI Press
Menlo Park, California

Copyright © 1997
American Association for Artificial Intelligence

AAAI Press
445 Burgess Drive
Menlo Park, California 94025

All rights reserved. No part of this book may be reproduced in any
form by any electronic or mechanical means (including photocopying,
recording, or information storage and retrieval) without permission in
writing from the publisher.

ISBN 1-57735-022-7

Printed on acid-free paper.

The cover illustration depicts the metabolic-pathway network of
Escherichia coli. Circles represent chemical compounds, short lines
represent bioreactions, and long lines link multiple occurrences of
the same compound.

Manufactured in the United States of America

Contents

ISMB-97 Organization / ix
ISMB-97 Sponsoring Organizations / x
Preface / xi

PAPERS

Increasing Consensus Accuracy in DNA Fragment Assemblies by
Incorporating Fluorescent Trace Representations / 3
Carolyn F. Allex, Schuyler F. Baldwin, Jude W. Shavlik, and Frederick R. Blattner

Standardized Representations of the Literature: Combining Diverse Sources of Ribosomal Data / 15
Russ B. Altman, Neil F. Abernethy, and Richard O. Chen

Automatic Annotation for Biological Sequences by Extraction of Keywords from MEDLINE Abstracts:
Development of a Prototype System / 25
Miguel A. Andrade and Alfonso Valencia

Protein Sequence Annotation in the Genome Era: The Annotation Concept of SWISS-PROT + TrEMBL / 33
Rolf Apweiler, Alain Gateau, Sergio Contrino, Maria Jesus Martin, Vivien Junker, Claire O'Donovan, Fiona Lang, Nicoletta
Mitaritonna, Stephanie Kappus, and Amos Bairoch

Self-Organizing Neural Maps of the Coding Sequences of G-Protein-Coupled Receptors Reveal Local
Domains Associated with Potentially Functional Determinants in the Proteins / 44
P. Arrigo, P. Fariselli, and R. Casadio

Beta-Sheet Prediction Using Inter-Strand Residue Pairs and Refinement with Hopfield Neural Network / 48
Minoru Asogawa

Code Generation through Annotation of Macromolecular Structure Data / 52
John Biggs, Calton Pu, and Philip Bourne

Dynamite: A Flexible Code Generating Language for Dynamic Programming Methods
Used in Sequence Comparison / 56
Ewan Birney and Richard Durbin

Data Mining for Regulatory Elements in Yeast Genome / 65
Alvis Brazma, Jaak Vilo, Esko Ukkonen, and Kimmo Valtonen

Application of Genetic Search in Derivation of Matrix Models of Peptide Binding to MHC Molecules / 75
Vladimir Brusic, Christian Schonbach, Masafumi Takiguchi, Vic Ciesielski, and Leonard C. Harrison

RIBOWEB: Linking Structural Computations to a Knowledge Base of Published Experimental Data / 84
Richard O. Chen, Ramon Felciano, and Russ B. Altman

Density of States, Metastable States, and Saddle Points: Exploring the Energy Landscape of an RNA Molecule / 88
Jan Cupal, Christoph Flamm, Alexander Renner, and Peter F. Stadler

Prediction of Enzyme Classification from Protein Sequence without the Use of Sequence Similarity / 92
Marie desJardins, Peter D. Karp, Markus Krummenacker, Thomas J. Lee, and Christos A. Ouzounis

Incorporating Global Information into Secondary Structure Prediction
with Hidden Markov Models of Protein Folds / 100
Valentina Di Francesco, Philip McQueen, Jean Garnier, and Peter J. Munson

Protein Folding Class Predictor for SCOP: Approach Based on Global Descriptors / 104
Inna Dubchak, Ilya Muchnik, and Sung-Hou Kim

A Knowledge Base for D. melanogaster Gene Interactions Involved in Pattern Formation / 108
Jerome Euzenat, Christophe Chemla, and Bernard Jacq

Finding Common Sequence and Structure Motifs in a Set of RNA Sequences / 120
Jan Gorodkin, Laurie J. Heyer, and Gary D. Stormo

Domain Identification by Clustering Sequence Alignments / 124
Xiaojun Guan

RIFLE: Rapid Identification of Microorganisms by Fragment Length Evaluation / 131
Henning Hermjakob, Robert Giegerich, and Walter Arnold

Decision Support System for the Evolutionary Classification of Protein Structures / 140
Liisa Holm and Chris Sander

Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier / 147
Paul Horton and Kenta Nakai

Identifying Chimerism in Proteins Using Hidden Markov Models of Codon Usage / 153
Lawrence Hunter and Barry Zeeberg

The Context-Dependence of Amino Acid Properties / 157
Thomas R. Joerger

Detection of Distant Structural Similarities in a Set of Proteins Using a Fast Graph-Based Method / 167
Ina Koch and Thomas Lengauer

Two Methods for Improving Performance of a HMM and their Application for Gene Finding / 179
Anders Krogh

ANOLEA: A WWW Server to Assess Protein Structures / 187
Francisco Melo, Damien Devos, Eric Depiereux, and Ernest Feytmans

A Fast Heuristic Algorithm for a Probe Mapping Problem / 191
Brendan Mumey

vi

Multi-Body Interactions within the Graph of Protein Structure / 198
Peter J. Munson and Raj K. Singh

Enumerating and Ranking Discrete Motifs / 202
Craig G. Nevill-Manning, Komal S. Sethi, Thomas D. Wu, and Douglas L. Brutlag

Selecting Optimal Oligonucleotide Primers for Multiplex PCR / 210
Pierre Nicodeme and Jean-Marc Steyaert

PDB-REPRDB: A Database of Representative Protein Chains in PDB (Protein Data Bank) / 214
Tamotsu Noguchi, Kentaro Onizuka, Yutaka Akiyama, and Minoru Saito

Automatic Construction of Knowledge Base from Biological Papers / 218
Yoshihiro Ohta, Yasunori Yamamoto, Tomoko Okazaki, Ikuo Uchiyama, and Toshihisa Takagi

Neural Network Prediction of Translation Initiation Sites in Eukaryotes: Perspectives
for EST and Genome Analysis / 226
Anders Gorm Pedersen and Henrik Nielsen

Large Scale Protein Modeling and Model Repository / 234
Manuel C. Peitsch

Modeling Antibody Side Chain Conformations Using Heuristic Database Search / 237
David W. Ritchie and Graham J. L. Kemp

Novel Techniques for Visualizing Biological Information / 241
Alan J. Robinson and Tomas P. Flores

A CORBA Server for the Radiation Hybrid Database / 250
P. Rodriguez-Tome, C. Helgesen, P. Lijnzaad, and K. Jungfer

Extraction of Substructures of Proteins Essential to their Biological Functions by a Data Mining Technique / 254
Kenji Satou, Toshihide Ono, Yoshihisa Yamamura, Emiko Furuichi, Satoru Kuhara, and Toshihisa Takagi

CARTHAGENE: Constructing and Joining Maximum Likelihood Genetic Maps / 258
Thomas Schiex and Christine Gaspin

Modeling Transcription Factor Binding Sites with Gibbs Sampling and
Minimum Description Length Encoding / 268
Jonathan Schug and G. Christian Overton

Adding Semantics to Genome Databases: Towards an Ontology for Molecular Biology / 272
Steffen Schulze-Kremer

Predicting Enzyme Function from Sequence: A Systematic Appraisal / 276
Imran Shah and Lawrence Hunter

Hierarchical Protein Structure Superposition Using Both Secondary Structure and Atomic Representations / 284
Amit P. Singh and Douglas L. Brutlag

vii

The Gene-Finder Computer Tools for Analysis of Human and Model Organisms Genome Sequences / 294
Victor Solovyev and Asaf Salamov

Generating Benchmarks for Multiple Sequence Alignments and Phylogenic Reconstructions / 303
Jens Stoye, Dirk Evers, and Folker Meyer

Protein Model Representation and Construction / 307
M. Sullivan, J. Glasgow, E. Steeg, L. Leherte, and S. Fortier

Automated Alignment of RNA Sequences to Pseudoknotted Structures / 311
Jack E. Tabaska and Gary D. Stormo

Inference of Molecular Phylogenetic Tree Based on Minimum Model-Based Complexity Method / 319
Hiroshi Tanaka, F. Ren, T. Okayama, and T. Gojobori

A New Plug-In Software Architecture Applied for a Portable Molecular Structure Browser / 329
Yutaka Ueno and Kiyoshi Asai

SEALS: A System for Easy Analysis of Lots of Sequences / 333
D. Roland Walker and Eugene V. Koonin

Better Cutters for Protein Mass Fingerprinting: Preliminary Findings / 340
Michael J. Wise, Tim Littlejohn, and Ian Humphery-Smith

Inferring Gene Structures in Genomic Sequences Using Pattern Recognition and Expressed Sequence Tags / 344
Ying Xu, Richard J. Mural, and Edward C. Uberbacher

Functional Prediction of B. subtilis Genes from Their Regulatory Sequences / 354
Tetsushi Yada, Yasushi Totoki, Takahiro Ishii, and Kenta Nakai

Bayesian Adaptive Alignment and Inference / 358
Jun Zhu, Jun Liu, and Charles Lawrence

Index / 369

viii

Prediction of Enzyme Classification from Protein Sequence without
the Use of Sequence Similarity

Marie desJardins
Peter D. Karp
Markus Krummenacker
Thomas J. Lee
Christos A. Ouzounis+

SRI International, 333 Ravenswood Avenue, Menlo Park CA 94025,
USA, pkarp@ai.sri.com
+ Current Address: The European Bioinformatics Institute, EMBL
Outstation, Wellcome Trust Genome Campus, Cambridge UK CB10 1SD

Abstract¹

We describe a novel approach for predicting the function of a protein
from its amino-acid sequence. Given features that can be computed from
the amino-acid sequence in a straightforward fashion (such as pI,
molecular weight, and amino-acid composition), the technique allows us
to answer questions such as: Is the protein an enzyme? If so, in which
Enzyme Commission (EC) class does it belong? Our approach uses machine
learning (ML) techniques to induce classifiers that predict the EC
class of an enzyme from features extracted from its primary sequence.
We report on a variety of experiments in which we explored the use of
three different ML techniques in conjunction with training datasets
derived from PDB and from Swiss-Prot. We also explored the use of
several different feature sets. Our method is able to predict the
first EC number of an enzyme with 74% accuracy (thereby assigning the
enzyme to one of six broad categories of enzyme function), and to
predict the second EC number of an enzyme with 68% accuracy (thereby
assigning the enzyme to one of 57 subcategories of enzyme function).
This technique could be a valuable complement to sequence-similarity
searches and to pathway-analysis methods.

Introduction

The most successful technique for identifying the possible function of
anonymous gene products, such as those generated by genome projects,
is performing sequence-similarity searches against the sequence
databases (DBs). Putative functions are assigned on the basis of the
closest similarity of the query sequence to proteins of known
function. These techniques have achieved a high level of performance:
more than 60% of H. influenzae (Casari et al. 1995) and around 40% of
M. jannaschii (NC et al. 1996) open reading frames (ORFs) have been
assigned a specific biochemical function, at varying degrees of
confidence. However, many unidentified genes remain in those genomes,
and the only way that functional predictions can increase is by
repeating the searches against larger (and hopefully richer) versions
of the sequence DBs. For unique proteins, or large families of
hypothetical ORFs, function remains unknown with the current
similarity-based methodology.

¹ Copyright © 1997, American Association for Artificial Intelligence
(www.aaai.org). All rights reserved.

92    ISMB-97

We have developed a novel approach for accurately predicting the
function of a protein from its predicted amino-acid sequence, based on
the Enzyme Commission (EC) classification hierarchy (Webb 1992). Given
features that can be computed from the amino-acid sequence in a
straightforward fashion, the technique allows us to answer questions
such as: Is the protein an enzyme? If so, in which EC class does it
belong?

Our approach uses machine learning (ML) techniques to induce
classifiers that predict the EC class of an enzyme from features
extracted from its primary sequence. We report on a variety of
experiments in which we explored the use of three different ML
techniques in conjunction with training datasets derived from PDB and
from Swiss-Prot. We also explored the use of several different feature
sets.

Problem Definition

The aim of this work is to produce classifier programs that predict
protein function based on features that can be derived from amino-acid
sequence, or a 3-D structure. The classifiers will predict whether the
protein is an enzyme, as opposed to performing some other cellular
role. If the protein is an enzyme, we would
prefer to know its exact activity; however, we have assumed that
learning to predict exact activities is too difficult a problem,
partly because sufficient training data is not available. We therefore
focus on the problem of predicting the general class of activity of an
enzyme, which can also be valuable information.

Our work makes use of the EC hierarchy. This classification system
organizes many known enzymatic activities into a four-tiered hierarchy
that consists of 6 top-level classes, 57 second-level classes, and 197
third-level classes; the fourth level comprises approximately 3,000
instances of enzyme function. The organizing principle of the
classification system is to group together enzymatic activities that
accomplish chemically similar transformations. The central assumption
underlying our work is that proteins that catalyze reactions that are
similar within the EC classification scheme will also have similar
physical properties.

We constructed classifiers that solve three different problems:

• Level-0 problem: Is the protein an enzyme?

• Level-1 problem: If the protein is an enzyme, in which of the 6
first-level EC classes does its reaction belong?

• Level-2 problem: If the protein is an enzyme, in which of the 57
second-level EC classes does its reaction belong?
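Concretely, the three problems amount to reading off successively
deeper prefixes of an enzyme's EC number. A minimal sketch of the
label extraction (the helper name and label strings are illustrative,
not from the paper):

```python
def ec_labels(ec_number):
    """Map an EC number string (or None for a non-enzyme) to the
    targets of the three prediction problems above."""
    if ec_number is None:
        return {"level0": "non-enzyme", "level1": None, "level2": None}
    fields = ec_number.split(".")
    return {"level0": "enzyme",
            "level1": fields[0],             # one of the 6 top-level classes
            "level2": ".".join(fields[:2])}  # one of the 57 subclasses
```

For example, hexokinase (EC 2.7.1.1) yields level-1 class "2"
(transferases) and level-2 class "2.7".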

For each prediction problem we ran several machine-learning algorithms
to examine which performed best. We also employed several different
training datasets for each prediction problem to determine what
features are most informative, and to explore the sensitivity of the
method to different distributions of the training data.

The only similar work we are aware of is Wu's work on learning
descriptions of PIR protein superfamilies using neural networks
(CH 1996).

Methods

Our methodology for applying ML to the enzyme classification problem
was as follows:

1. Characterize the classification problem, and identify the
characteristics of this problem that would influence the choice of an
appropriate ML method.

2. Select one or more ML methods to apply to the classification
problem.

3. Create a small dataset from available data sources.

4. Run the selected ML methods on the small dataset.

5. Evaluate the results, and make changes to the experimental setup by
(a) reformulating the classification problem (e.g., adding new
prediction classes), (b) eliminating noisy or problem data points from
the dataset, (c) eliminating redundant or useless features, or adding
new features to the data, (d) adding or deleting ML methods from the
"toolkit" of methods to be applied.

6. When the above process is complete, create a larger alternative
dataset, run the selected ML methods, and evaluate the results.

7. Evaluate the results on all datasets, with all ML methods, with
respect to the baseline test of a sequence-similarity search,
currently the most widely used method of approaching this problem
(P, C, & C 1994).

We started with a small dataset to familiarize ourselves with the
domain, identify the features to be learned, and provide a testing
ground for exploring the space of experimental setups, before scaling
up to larger datasets. These larger datasets served to check the
generality and scalability of the experimental approach in real-world
situations. The sequence-similarity baseline provides a means of
assessing the overall performance of the approach: Do ML methods make
better predictions than sequence similarity? Are there some classes or
particular cases for which ML methods perform better or worse?

Problem characteristics  The features in this domain are mostly
numerical attributes, so algorithms that are primarily designed to
operate on symbolic attributes are inappropriate. The prediction
problem is a multiclass learning problem (e.g., there are 6 top-level
EC classes and 57 second-level EC classes to predict), for which not
all learning algorithms are suited. The features are not independent
(e.g., the sum of the normalized proportions of the amino acids will
always be one), so algorithms that rely heavily on independent
features may not work well. Most important, there may be noisy data
(incorrect or missing feature values or class values), and we do not
expect to be able to learn a classifier that predicts the EC class
with perfect accuracy, so the algorithm must be able to handle noise.
Examples of such noise are sequence entries that are fragments but do
have an assigned EC number, or real enzymes with no EC numbers
assigned to them.
ML methods  Based on the above problem characteristics, we selected
three learning algorithms: discretized naive Bayes (DNB), C4.5, and
Instance-Based Learning (IBL).

Discretized naive Bayes (J, R, & M 1995) is a simple algorithm that
stores all the training instances in bins according to their
(discretized) feature values. The algorithm assumes that the features
are independent given the value of the class. To make a prediction, it
does a table lookup for each feature value to determine the associated
probability of each class, given the feature's value, and combines
them using Bayes' rule to make an overall prediction.
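As an illustration of that idea (not the authors' implementation; the
equal-width binning and Laplace smoothing are arbitrary choices made
for this sketch):

```python
from collections import defaultdict
import math

class DiscretizedNaiveBayes:
    """Toy discretized naive Bayes: equal-width bins per feature;
    per-class, per-bin counts are combined with Bayes' rule under
    the feature-independence assumption described above."""

    def __init__(self, n_bins=4):
        self.n_bins = n_bins

    def _bin(self, j, value):
        lo, hi = self.ranges[j]
        if hi == lo:
            return 0
        b = int((value - lo) / (hi - lo) * self.n_bins)
        return min(max(b, 0), self.n_bins - 1)

    def fit(self, X, y):
        n_features = len(X[0])
        self.ranges = [(min(row[j] for row in X), max(row[j] for row in X))
                       for j in range(n_features)]
        self.class_counts = defaultdict(int)
        self.bin_counts = defaultdict(int)  # keyed by (class, feature, bin)
        for row, c in zip(X, y):
            self.class_counts[c] += 1
            for j, value in enumerate(row):
                self.bin_counts[(c, j, self._bin(j, value))] += 1
        self.n_total = len(y)
        return self

    def predict(self, row):
        best_class, best_logp = None, -math.inf
        for c, nc in self.class_counts.items():
            logp = math.log(nc / self.n_total)  # class prior
            for j, value in enumerate(row):
                # Laplace smoothing keeps an unseen bin from zeroing a class
                k = self.bin_counts[(c, j, self._bin(j, value))]
                logp += math.log((k + 1) / (nc + self.n_bins))
            if logp > best_logp:
                best_class, best_logp = c, logp
        return best_class
```

Working in log space avoids underflow when many per-feature
probabilities are multiplied.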
C4.5 (JR 1993) induces classifiers in the form of decision trees, by
recursively splitting the set of training examples by feature values.
An information-theoretic measure is applied at each node in the tree
to determine which feature best divides the subset of examples covered
by that node. Following the tree construction process, a pruning step
is applied to remove branches that have low estimated predictive
performance.

The term instance-based learning (IBL) covers a class of algorithms
that store the training instances, and make predictions on a new
instance I by retrieving the nearest instance N (according to some
similarity metric over the feature space) and then returning the class
of N as the class of I (or by making a weighted prediction from a set
of nearest instances) (Aha DW 1991).
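The simplest IBL variant is one line of logic; a sketch with Euclidean
distance (the class labels in the example are illustrative):

```python
import math

def nn_predict(train_X, train_y, query):
    """1-nearest-neighbour IBL: store the training instances verbatim
    and classify a query as the class of the closest stored instance
    under Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = min(range(len(train_X)), key=lambda i: dist(train_X[i], query))
    return train_y[nearest]

# A query near the first cluster picks up that cluster's class.
label = nn_predict([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]],
                   ["oxidoreductase", "oxidoreductase", "transferase"],
                   [0.2, 0.1])
```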
Feature engineering  Feature engineering, or the problem of
identifying an appropriate set of features and feature values to
characterize a learning problem, is a critical problem in real-world
applications of ML algorithms. This process frequently represents a
substantial part of the time spent on developing the application, and
this project was no exception. The Features section describes the
features we identified, and their biochemical significance; the
Datasets section discusses the process by which we identified and
removed redundant features; and the Results section gives the results
for the alternative datasets and feature sets that were explored.
Large datasets  We used extended datasets for the final evaluation of
the ML methods on the enzyme classification problem. We created
several versions of these datasets: a full Swiss-Prot version, and
several "balanced" datasets that contain a random sampling of the
proteins in the Swiss-Prot DB, selected to have a class distribution
(of enzymes vs. non-enzymes) similar to the PDB dataset in the first
case, and to the distribution of enzymes versus non-enzymes in
complete genomes in the second case.
Sequence similarity  The predictions using ML have been compared with
function assignments made through sequence similarity. We used BLAST
(Altschul et al. 1990) with standard search parameters and a special
filtering procedure (unpublished), against the equivalent datasets
from the ML experiments. Query sequences (with or without an EC
number) were predicted to have the EC number of the closest homologue
(if applicable). Only significant homologies were considered, with a
default cut-off P-value of 10^-6 and careful manual evaluation of the
DB search results. In this manner, we have obtained an accuracy
estimate for the similarity-based methods. It is interesting to note
that such an experiment is, to our knowledge, unique.

Features

For the core set of features used as inputs to the ML programs, we
used properties that can be directly computed from primary sequence
information, so they can be used for predicting the function of ORFs
whose structure is also unknown. Those features are the length of the
amino acid sequence, the molecular weight mw of the sequence, and the
amino acid composition, represented as 20 values {pa pc pd pe pf pg ph
pi pk pl pm pn pp pq pr ps pt pv pw py} in the range from 0 to 1, each
value standing for the respective residue frequency as a fraction of
the total sequence length.

The feature charge was computed by summing the contributions of
charged amino acids. The features ip (isoelectric point) and
extinction coefficient were calculated by the program "Peptidesort"
(Peptidesort is from the GCG package, version 8.0-OpenVMS, September
1994).

The secondary structural features helix, strand, and turn, which we
used for one experiment, were extracted from information in the FT
fields of Swiss-Prot records. For all such lines with a HELIX, STRAND,
or TURN keyword, the numbers of amino acids between the indicated
positions were summed up, to calculate the total percentages of amino
acids that are part of these structures, respectively. We included
this information, since it was available for the proteins in the PDB,
to see how well it would improve the prediction quality of the learned
classifiers if it were available for an unknown protein. Secondary
structure can be estimated from the primary sequence (although not
with perfect accuracy), and using this estimated secondary structure
might be worthwhile in making predictions if secondary structure
proved to be a strong enough predictor of enzyme class.

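A sketch of the directly computable subset of these features. The pI
and extinction coefficient (computed with GCG's Peptidesort in the
paper) are not reproduced here, and the exact charge formula the
authors used is not given, so the simple K/R-positive, D/E-negative
count below is an assumption:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sequence_features(seq):
    """Features computable directly from the primary sequence:
    length, the 20 composition fractions pa..py, and a crude net
    charge (K/R counted +1, D/E counted -1; an assumed stand-in
    for the paper's charge feature)."""
    seq = seq.upper()
    n = len(seq)
    features = {"length": n}
    for aa in AMINO_ACIDS:
        features["p" + aa.lower()] = seq.count(aa) / n
    features["charge"] = (seq.count("K") + seq.count("R")
                          - seq.count("D") - seq.count("E"))
    return features
```

Note that the 20 composition fractions sum to one by construction,
which is exactly the feature dependence flagged under "Problem
characteristics" above.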
Datasets

We obtained EC classes from version 21.0 of the ENZYME DB (Bairoch
1996). We prepared datasets derived from the PDB and Swiss-Prot.

Dataset 1: This family of datasets originated from the PDB subset of
Swiss-Prot,² containing 999 entries. Features for these protein
sequences were calculated as in Section 3.1. EC numbers were extracted
from the text string in the (DE) field of the Swiss-Prot records (more
than one EC can occur in one entry). We created several variants of
this dataset, containing different features.

Dataset 1a: Features:
{length mw charge ip extinction pa pc pd pe pf pg ph pi pk pl pm pn
pp pq pr ps pt pv pw py}

Dataset 1b: We dropped the ip, extinction, and mw features, because
each was strongly correlated with either charge or length. Features:
{length charge pa pc pd pe pf pg ph pi pk pl pm pn pp pq pr ps pt pv
pw py}

Dataset 1c: We reduced the feature set further by combining the
composition percentages of amino acids with similar biochemical
properties (WR 1986). The following subsets were grouped together:
ag, c, de, fwy, h, ilv, kr, m, nq, p, st, reducing the number of amino
acid composition features from twenty to eleven. Features:
{length mw charge pag pc pde pfwy ph pilv pkr pm pnq pp pst}

Dataset 1d: Three new secondary structural features were added to the
features in 1b. Features:
{length mw charge pa pc pd pe pf pg ph pi pk pl pm pn pp pq pr ps pt
pv pw py helix strand turn}

Dataset 2: The raw data originated from the full release of Swiss-Prot
version 33 (Bairoch & Boeckmann 1994), containing 52205 entries.
Features were computed using the "aacomp" program (aacomp is part of
the FASTA package). Secondary structural features were omitted,
because only a small minority of entries carry this information.
Feature set:
{length charge pa pc pd pe pf pg ph pi pk pl pm pn pp pq pr ps pt pv
pw py}

² See ftp://expasy.hcuge.ch/databases/Swiss-Prot/special_selections/pdb.seq.180496

Characterization of the Data

Table 1 provides a numerical overview of the entries in the datasets.
There is a notable difference between Datasets 1 and 2 in the
percentage of entries that have an EC number, perhaps because enzymes
are a common object of study by crystallographers. Dataset 2 is
probably closer to the natural enzyme vs. non-enzyme distribution in
the protein universe.

A few entries have more than one EC number, for example,
multifunctional or multidomain enzymes. We have excluded all these
cases from the final dataset, on the assumption that they will
introduce noise in the EC-classification experiments. Entries without
any EC number are presumed not to be enzymes. However, we could
envision data-entry errors of omission that would violate this
assumption. We performed a search for the string "ase" in the DE field
of Swiss-Prot records that lack EC numbers to find potentially
mis-annotated enzymes. This search did pull out quite a few hits,
which could act as false positives in the non-EC class. Dataset 2
contained too many such cases for a hand-analysis, but the Dataset 1
cases were examined. About half of the cases were enzyme inhibitors,
some were enzyme precursors, and a few entries did seem to be enzymes.
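The screen itself is a simple string filter; a sketch (the entry
representation as description/EC pairs is an assumption, standing in
for parsed Swiss-Prot DE fields):

```python
def possible_missed_enzymes(entries):
    """Flag entries whose description mentions 'ase' but which carry
    no EC number: candidates for mis-annotation, though as the
    hand-check of Dataset 1 found, many are inhibitors or precursors.
    `entries` is assumed to be (description, ec_number_or_None) pairs."""
    return [desc for desc, ec in entries
            if ec is None and "ase" in desc.lower()]

hits = possible_missed_enzymes([
    ("Hexokinase", "2.7.1.1"),          # has an EC number: not flagged
    ("Putative amylase", None),         # flagged as a candidate
    ("Alpha-amylase inhibitor", None),  # flagged, but a false positive
    ("Ferredoxin", None),               # no 'ase': not flagged
])
```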

Results

We present three groups of results. The sequence of experiments
reflects our exploration of various subsets of features for the
prediction problems under study. The first group involves the PDB
datasets; the second group involves the Swiss-Prot dataset. We
explored how well the learning algorithms scaled with training set
size and composition. The third group compares the results of the
learning experiments with sequence similarity, a mature technique for
function prediction.

Learning experiments were conducted by preprocessing the dataset of
interest to produce three different input files, one each for the
level 0, level 1, and level 2 prediction problems. Preprocessing
consisted of excluding non-enzymes from the level 1 and level 2 files,
and formatting the class appropriately for the problem, that is, a
binary value for level 0, one of six values for level 1, and one of 57
values for level 2 (actually 51, since the datasets only represented
51 of the 57 possible two-place EC numbers). We omitted level 2
experiments with the PDB datasets because 500 training instances are
too few to learn 51 classes.

Each experiment consisted of performing a tenfold cross-validation
using a random 90%:10% partition of the data into a training set and
test set, respectively. A suite of three experiments was run for each
input file, one for each of the three learning algorithms DNB, C4.5,
and IB. Results are reported as the percentage of test-set instances
correctly classified by each algorithm, averaged over the ten
cross-validation runs.

Experiments were conducted using MLC++ version 1.2 (R & D 1995), a
package of ML utilities. All experiments were run on a Sun
Microsystems Ultra-1 workstation with 128 MB of memory, under Solaris
2.5.1.
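The evaluation protocol above can be sketched generically (this is not
the MLC++ code; `fit` and `predict` stand in for any of the three
learners):

```python
import random

def tenfold_accuracy(X, y, fit, predict, seed=0):
    """Tenfold cross-validation: shuffle once, split the data into ten
    folds, train on the other ~90% for each held-out ~10% fold, and
    average the per-fold test accuracies."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train_X = [X[i] for i in idx if i not in held_out]
        train_y = [y[i] for i in idx if i not in held_out]
        model = fit(train_X, train_y)
        correct = sum(predict(model, X[i]) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)
```

Averaging over folds (rather than pooling predictions) matches the
paper's description of reporting per-run accuracy averaged over the
ten runs.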

Results for the PDB Datasets

Experiments involving Dataset 1 are shown in Table 2. Dataset 1a
provides a baseline. The results for Dataset 1b show that the features
extinction, isoelectric point, and molecular weight are redundant,
since each is strongly correlated with either charge or length (a
principal component analysis of the feature sets also confirmed this
fact; not shown). Those features were
                                                 Dataset 1      Dataset 2
entries with exactly one EC number:              416 (41.6%)    14709 (28.2%)
entries without EC number:                       565 (56.6%)    36997 (70.9%)
entries with multiple EC numbers (not used):     18 (1.8%)      499 (1.0%)
total number of entries in the raw dataset:      999            52205

entries with no EC number but "ase" in name:     35 (3.5%)      2156 (4.1%)

Table 1: Summary of data characteristics

Dataset and Features Used         Problem   Instances   DNB     C4.5    IB
(1a) Initial features             level 0   980         79.29   76.63   78.37
(1a) Initial features             level 1   416         60.10   48.54   50.53
(1b) Nonredundant features        level 0   980         77.96   76.33   78.98
(1b) Nonredundant features        level 1   416         62.74   48.54   48.64
(1c) Amino acid grouping          level 0   980         74.59   74.49   74.08
(1c) Amino acid grouping          level 1   416         57.46   45.90   46.63
(1d) Structural, unknowns         level 0   980         77.48   77.98   76.97
(1d) Structural, unknowns         level 1   416         55.06   47.64   49.59
(1d) Structural, no unknowns      level 0   630         81.59   80.16   76.19
(1d) Structural, no unknowns      level 1   266         63.50   47.35   48.52

Table 2: Classification accuracies on the PDB dataset for various representations

excluded from future experiments.

Experiment 1c asks whether accuracy can be improved by creating new
features that group amino acids according to their biochemical
properties. It is surprising that the results with this representation
were universally worse. It is likely that useful information is lost
with this reduction. We concluded that, since prediction was better
with more features, we had not yet reached an upper bound on feature
set size and could effectively add features without overwhelming any
of the learning algorithms. We have not yet explored other groupings
of amino acids.

Our next experiments added secondary structure information to the
feature set by including the helix, strand, and turn features. Because
the values for these features were not available for a high proportion
(over 50%) of instances in PDB, we conducted two suites of
experiments. The first suite used all instances in our PDB dataset,
and annotated missing structure features as unknown values. The second
suite excluded all instances for which structural data was missing.
With unknowns excluded, accuracy did improve somewhat with the
addition of structure features. However, the improvement is rather
small. The value of structural composition is unclear, and further
exploration has been left for future work.

Results from the Swiss-Prot Dataset

We conducted experiments using two subsets of Swiss-Prot, as well as
with the full dataset and with some class-balanced datasets. Results
are listed in Table 3. The first was a yeast subset, consisting of all
instances for the species Saccharomyces cerevisiae (baker's yeast) and
Schizosaccharomyces pombe (fission yeast). The second was a
prokaryotic subset, consisting of all instances for the species
E. coli, Salmonella typhimurium, Azotobacter vinelandii, Azotobacter
chroococcum, Pseudomonas aeruginosa, Pseudomonas putida, Haemophilus
influenzae, and various Bacillus species.

As was observed with the PDB datasets, the IB algorithm performs the
best overall. Although the other two algorithms are comparable for the
simpler level 0 problems, they degrade substantially more than IB does
as the number of classes increases. It also appears as if IB improves
generally, though not universally, as the number of training instances
increases. This is most apparent in the 67.6% accuracy it attains for
the full 51-class problem. This is an encouraging trend, lending hope
that classification accuracy will improve as more sequence data
becomes available. IB consumes substantial machine resources during
the training p
