`
`
`
`
`ISMB-97
`
`Proceedings
`Fifth International Conference
`on Intelligent Systems
`for Molecular Biology
`
`Edited By
`
`Terry Gaasterland, Peter Karp, Kevin Karplus,
`Christos Ouzounis, Chris Sander, & Alfonso Valencia
`
`AAAI Press
`Menlo Park, California
`
`
`
`,·•-
`L n~ r ·
`G
`
`Copyright © 1997
`American Association for Artificial Intelligence
`
AAAI Press
`445 Burgess Drive
`Menlo Park, California 94025
`
`All rights reserved. No part of this book may be reproduced in any
`form by any electronic or mechanical means (including photocopying,
`recording, or information storage and retrieval) without permission in
`writing from the publisher.
`
ISBN 1-57735-022-7
`
`Printed on acid-free paper.
`
`The cover illustration depicts the metabolic-pathway network of
`Escherichia coli. Circles represent chemical compounds, short lines
`represent bioreactions, and long lines link multiple occurrences of
`the same compound.
`
`Manufactured in the United States of America
`
`
`
`
`
`
`Contents
`
ISMB-97 Organization / ix
ISMB-97 Sponsoring Organizations / x
Preface / xi
`
`PAPERS
Increasing Consensus Accuracy in DNA Fragment Assemblies by
Incorporating Fluorescent Trace Representations / 3
Carolyn F. Allex, Schuyler F. Baldwin, Jude W. Shavlik, and Frederick R. Blattner

Standardized Representations of the Literature: Combining Diverse Sources of Ribosomal Data / 15
Russ B. Altman, Neil F. Abernethy, and Richard O. Chen

Automatic Annotation for Biological Sequences by Extraction of Keywords from MEDLINE Abstracts:
Development of a Prototype System / 25
Miguel A. Andrade and Alfonso Valencia

Protein Sequence Annotation in the Genome Era: The Annotation Concept of SWISS-PROT + TREMBL / 33
Rolf Apweiler, Alain Gateau, Sergio Contrino, Maria Jesus Martin, Vivien Junker, Claire O'Donovan, Fiona Lang, Nicoletta
Mitaritonna, Stephanie Kappus, and Amos Bairoch

Self-Organizing Neural Maps of the Coding Sequences of G-Protein-Coupled Receptors Reveal Local
Domains Associated with Potentially Functional Determinants in the Proteins / 44
P. Arrigo, P. Fariselli, and R. Casadio

Beta-Sheet Prediction Using Inter-Strand Residue Pairs and Refinement with Hopfield Neural Network / 48
Minoru Asogawa

Code Generation through Annotation of Macromolecular Structure Data / 52
John Biggs, Calton Pu, and Philip Bourne

Dynamite: A Flexible Code Generating Language for Dynamic Programming Methods
Used in Sequence Comparison / 56
Ewan Birney and Richard Durbin

Data Mining for Regulatory Elements in Yeast Genome / 65
Alvis Brazma, Jaak Vilo, Esko Ukkonen, and Kimmo Valtonen

Application of Genetic Search in Derivation of Matrix Models of Peptide Binding to MHC Molecules / 75
Vladimir Brusic, Christian Schonbach, Masafumi Takiguchi, Vic Ciesielski, and Leonard C. Harrison
`
`
`
RIBOWEB: Linking Structural Computations to a Knowledge Base of Published Experimental Data / 84
Richard O. Chen, Ramon Felciano, and Russ B. Altman

Density of States, Metastable States, and Saddle Points: Exploring the Energy Landscape of an RNA Molecule / 88
Jan Cupal, Christoph Flamm, Alexander Renner, and Peter F. Stadler

Prediction of Enzyme Classification from Protein Sequence without the Use of Sequence Similarity / 92
Marie desJardins, Peter D. Karp, Markus Krummenacker, Thomas J. Lee, and Christos A. Ouzounis

Incorporating Global Information into Secondary Structure Prediction
with Hidden Markov Models of Protein Folds / 100
Valentina Di Francesco, Philip McQueen, Jean Garnier, and Peter J. Munson

Protein Folding Class Predictor for SCOP: Approach Based on Global Descriptors / 104
Inna Dubchak, Ilya Muchnik, and Sung-Hou Kim

A Knowledge Base for D. melanogaster Gene Interactions Involved in Pattern Formation / 108
Jerome Euzenat, Christophe Chemla, and Bernard Jacq

Finding Common Sequence and Structure Motifs in a Set of RNA Sequences / 120
Jan Gorodkin, Laurie J. Heyer, and Gary D. Stormo

Domain Identification by Clustering Sequence Alignments / 124
Xiaojun Guan

RIFLE: Rapid Identification of Microorganisms by Fragment Length Evaluation / 131
Henning Hermjakob, Robert Giegerich, and Walter Arnold
`
Decision Support System for the Evolutionary Classification of Protein Structures / 140
Liisa Holm and Chris Sander

Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier / 147
Paul Horton and Kenta Nakai

Identifying Chimerism in Proteins Using Hidden Markov Models of Codon Usage / 153
Lawrence Hunter and Barry Zeeberg

The Context-Dependence of Amino Acid Properties / 157
Thomas R. Joerger

Detection of Distant Structural Similarities in a Set of Proteins Using a Fast Graph-Based Method / 167
Ina Koch and Thomas Lengauer

Two Methods for Improving Performance of an HMM and their Application for Gene Finding / 179
Anders Krogh

ANOLEA: A WWW Server to Assess Protein Structures / 187
Francisco Melo, Damien Devos, Eric Depiereux, and Ernest Feytmans

A Fast Heuristic Algorithm for a Probe Mapping Problem / 191
Brendan Mumey
`
`
`
`
Multi-Body Interactions within the Graph of Protein Structure / 198
Peter J. Munson and Raj K. Singh

Enumerating and Ranking Discrete Motifs / 202
Craig G. Nevill-Manning, Komal S. Sethi, Thomas D. Wu, and Douglas L. Brutlag

Selecting Optimal Oligonucleotide Primers for Multiplex PCR / 210
Pierre Nicodeme and Jean-Marc Steyaert

PDB-REPRDB: A Database of Representative Protein Chains in PDB (Protein Data Bank) / 214
Tamotsu Noguchi, Kentaro Onizuka, Yutaka Akiyama, and Minoru Saito

Automatic Construction of Knowledge Base from Biological Papers / 218
Yoshihiro Ohta, Yasunori Yamamoto, Tomoko Okazaki, Ikuo Uchiyama, and Toshihisa Takagi

Neural Network Prediction of Translation Initiation Sites in Eukaryotes: Perspectives
for EST and Genome Analysis / 226
Anders Gorm Pedersen and Henrik Nielsen

Large Scale Protein Modeling and Model Repository / 234
Manuel C. Peitsch

Modeling Antibody Side Chain Conformations Using Heuristic Database Search / 237
David W. Ritchie and Graham J. L. Kemp

Novel Techniques for Visualizing Biological Information / 241
Alan J. Robinson and Tomas P. Flores

A CORBA Server for the Radiation Hybrid Database / 250
P. Rodriguez-Tome, C. Helgesen, P. Lijnzaad, and K. Jungfer

Extraction of Substructures of Proteins Essential to their Biological Functions by a Data Mining Technique / 254
Kenji Satou, Toshihide Ono, Yoshihisa Yamamura, Emiko Furuichi, Satoru Kuhara, and Toshihisa Takagi

CARTHAGENE: Constructing and Joining Maximum Likelihood Genetic Maps / 258
Thomas Schiex and Christine Gaspin

Modeling Transcription Factor Binding Sites with Gibbs Sampling and
Minimum Description Length Encoding / 268
Jonathan Schug and G. Christian Overton

Adding Semantics to Genome Databases: Towards an Ontology for Molecular Biology / 272
Steffen Schulze-Kremer

Predicting Enzyme Function from Sequence: A Systematic Appraisal / 276
Imran Shah and Lawrence Hunter

Hierarchical Protein Structure Superposition Using Both Secondary Structure and Atomic Representations / 284
Amit P. Singh and Douglas L. Brutlag
`
`
`
`
The Gene-Finder Computer Tools for Analysis of Human and Model Organisms Genome Sequences / 294
Victor Solovyev and Asaf Salamov

Generating Benchmarks for Multiple Sequence Alignments and Phylogenetic Reconstructions / 303
Jens Stoye, Dirk Evers, and Folker Meyer

Protein Model Representation and Construction / 307
M. Sullivan, J. Glasgow, E. Steeg, L. Leherte, and S. Fortier

Automated Alignment of RNA Sequences to Pseudoknotted Structures / 311
Jack E. Tabaska and Gary D. Stormo

Inference of Molecular Phylogenetic Tree Based on Minimum Model-Based Complexity Method / 319
Hiroshi Tanaka, F. Ren, T. Okayama, and T. Gojobori

A New Plug-In Software Architecture Applied for a Portable Molecular Structure Browser / 329
Yutaka Ueno and Kiyoshi Asai

SEALS: A System for Easy Analysis of Lots of Sequences / 333
D. Roland Walker and Eugene V. Koonin

Better Cutters for Protein Mass Fingerprinting: Preliminary Findings / 340
Michael J. Wise, Tim Littlejohn, and Ian Humphery-Smith

Inferring Gene Structures in Genomic Sequences Using Pattern Recognition and Expressed Sequence Tags / 344
Ying Xu, Richard J. Mural, and Edward C. Uberbacher

Functional Prediction of B. subtilis Genes from Their Regulatory Sequences / 354
Tetsushi Yada, Yasushi Totoki, Takahiro Ishii, and Kenta Nakai

Bayesian Adaptive Alignment and Inference / 358
Jun Zhu, Jun Liu, and Charles Lawrence

Index / 369
`
`
`
`
`
`Prediction of Enzyme Classification from Protein Sequence without
`the use of Sequence Similarity
Marie desJardins
`Peter D. Karp
`Markus Krummenacker
Thomas J. Lee
`Christos A. Ouzounis+
`
SRI International, 333 Ravenswood Avenue, Menlo
Park CA 94025, USA, pkarp@ai.sri.com
+ Current Address: The European Bioinformatics
Institute, EMBL Outstation, Wellcome Trust Genome
Campus, Cambridge UK CB10 1SD
`
Abstract
We describe a novel approach for predicting the function of a protein from its amino-acid sequence. Given features that can be computed from the amino-acid sequence in a straightforward fashion (such as pI, molecular weight, and amino-acid composition), the technique allows us to answer questions such as: Is the protein an enzyme? If so, in which Enzyme Commission (EC) class does it belong? Our approach uses machine learning (ML) techniques to induce classifiers that predict the EC class of an enzyme from features extracted from its primary sequence. We report on a variety of experiments in which we explored the use of three different ML techniques in conjunction with training datasets derived from PDB and from Swiss-Prot. We also explored the use of several different feature sets. Our method is able to predict the first EC number of an enzyme with 74% accuracy (thereby assigning the enzyme to one of six broad categories of enzyme function), and to predict the second EC number of an enzyme with 68% accuracy (thereby assigning the enzyme to one of 57 subcategories of enzyme function). This technique could be a valuable complement to sequence-similarity searches and to pathway-analysis methods.

Copyright © 1997, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
`
Introduction
The most successful technique for identifying the possible function of anonymous gene products, such as those generated by genome projects, is performing sequence-similarity searches against the sequence databases (DBs). Putative functions are assigned on the basis of the closest similarity of the query sequence to proteins of known function. These techniques have achieved a high level of performance: more than 60% of H. influenzae (Casari et al. 1995) and around 40% of M. jannaschii (NC et al. 1996) open reading frames (ORFs) have been assigned a specific biochemical function, at varying degrees of confidence. However, many unidentified genes remain in those genomes, and the only way that functional predictions can increase is by repeating the searches against larger (and hopefully richer) versions of the sequence DBs. For unique proteins, or large families of hypothetical ORFs, function remains unknown with the current similarity-based methodology.
We have developed a novel approach for accurately predicting the function of a protein from its predicted amino-acid sequence, based on the Enzyme Commission (EC) classification hierarchy (Webb 1992). Given features that can be computed from the amino-acid sequence in a straightforward fashion, the technique allows us to answer questions such as: Is the protein an enzyme? If so, in which EC class does it belong?
Our approach uses machine learning (ML) techniques to induce classifiers that predict the EC class of an enzyme from features extracted from its primary sequence. We report on a variety of experiments in which we explored the use of three different ML techniques in conjunction with training datasets derived from PDB and from Swiss-Prot. We also explored the use of several different feature sets.
Problem Definition
The aim of this work is to produce classifier programs that predict protein function based on features that can be derived from the amino-acid sequence, or from a 3-D structure. The classifiers will predict whether the protein is an enzyme, as opposed to performing some other cellular role. If the protein is an enzyme, we would prefer to know its exact activity; however, we have assumed that learning to predict exact activities is too difficult a problem, partly because sufficient training data is not available. We therefore focus on the problem of predicting the general class of activity of an enzyme, which can also be valuable information.
Our work makes use of the EC hierarchy. This classification system organizes many known enzymatic activities into a four-tiered hierarchy that consists of 6 top-level classes, 57 second-level classes, and 197 third-level classes; the fourth level comprises approximately 3,000 instances of enzyme function. The organizing principle of the classification system is to group together enzymatic activities that accomplish chemically similar transformations. The central assumption underlying our work is that proteins that catalyze reactions that are similar within the EC classification scheme will also have similar physical properties.
We constructed classifiers that solve three different problems (a small sketch of the corresponding prediction targets follows the list):
`
• Level-0 problem: Is the protein an enzyme?

• Level-1 problem: If the protein is an enzyme, in which of the 6 first-level EC classes does its reaction belong?

• Level-2 problem: If the protein is an enzyme, in which of the 57 second-level EC classes does its reaction belong?
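
To make the three problems concrete, the following minimal Python sketch (ours, not the authors' code; the function name is illustrative) shows how an EC number string such as "2.7.1.40" maps onto the three prediction targets:

    def ec_targets(ec_number):
        """Return (level0, level1, level2) targets for one protein.

        level0: 'enzyme' or 'non-enzyme'
        level1: the first EC field (one of 6 classes), or None
        level2: the first two EC fields (one of 57 classes), or None
        """
        if ec_number is None:
            return ("non-enzyme", None, None)
        fields = ec_number.split(".")
        return ("enzyme", fields[0], ".".join(fields[:2]))

    print(ec_targets("2.7.1.40"))   # ('enzyme', '2', '2.7')
    print(ec_targets(None))         # ('non-enzyme', None, None)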
`
For each prediction problem we ran several machine-learning algorithms to examine which performed best. We also employed several different training datasets for each prediction problem to determine what features are most informative, and to explore the sensitivity of the method to different distributions of the training data. The only similar work we are aware of is Wu's work on learning descriptions of PIR protein superfamilies using neural networks (Wu 1996).
`
Methods
Our methodology for applying ML to the enzyme classification problem was as follows:

1. Characterize the classification problem, and identify the characteristics of this problem that would influence the choice of an appropriate ML method.

2. Select one or more ML methods to apply to the classification problem.

3. Create a small dataset from available data sources.

4. Run the selected ML methods on the small dataset.

5. Evaluate the results, and make changes to the experimental setup by (a) reformulating the classification problem (e.g., adding new prediction classes), (b) eliminating noisy or problematic data points from the dataset, (c) eliminating redundant or useless features, or adding new features to the data, or (d) adding or deleting ML methods from the "toolkit" of methods to be applied.

6. When the above process is complete, create a larger alternative dataset, run the selected ML methods, and evaluate the results.

7. Evaluate the results on all datasets, with all ML methods, with respect to the baseline test of a sequence-similarity search, currently the most widely used method of approaching this problem (P, C, & C 1994).
`
We started with a small dataset to familiarize ourselves with the domain, identify the features to be learned, and provide a testing ground for exploring the space of experimental setups, before scaling up to larger datasets. These larger datasets served to check the generality and scalability of the experimental approach in real-world situations. The sequence-similarity baseline provides a means of assessing the overall performance of the approach: Do ML methods make better predictions than sequence similarity? Are there some classes or particular cases for which ML methods perform better or worse?
Problem characteristics The features in this domain are mostly numerical attributes, so algorithms that are primarily designed to operate on symbolic attributes are inappropriate. The prediction problem is a multiclass learning problem (e.g., there are 6 top-level EC classes and 57 second-level EC classes to predict), for which not all learning algorithms are suited. The features are not independent (e.g., the sum of the normalized proportions of the amino acids will always be one), so algorithms that rely heavily on independent features may not work well. Most important, there may be noisy data (incorrect or missing feature values or class values), and we do not expect to be able to learn a classifier that predicts the EC class with perfect accuracy, so the algorithm must be able to handle noise. Examples of such noise are sequence entries that are fragments but do have an assigned EC number, or real enzymes with no EC numbers assigned to them.
ML methods Based on the above problem characteristics, we selected three learning algorithms: discretized naive Bayes (DNB), C4.5, and Instance-Based Learning (IBL).
Discretized naive Bayes (J, R, & M 1995) is a simple algorithm that stores all the training instances in bins according to their (discretized) feature values. The algorithm assumes that the features are independent given the value of the class. To make a prediction, it does a table lookup for each feature value to determine the associated probability of each class, given the feature's value, and combines them using Bayes' rule to make an overall prediction.
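
As a concrete (and deliberately simplified) illustration of this scheme, here is a Python sketch of a discretized naive Bayes learner. It is our reconstruction, not the authors' code: equal-width binning and Laplace smoothing are assumptions, since the paper does not specify those details.

    import numpy as np

    class DiscretizedNaiveBayes:
        """Illustrative DNB: equal-width bins plus Laplace-smoothed counts."""

        def __init__(self, n_bins=10):
            self.n_bins = n_bins

        def fit(self, X, y):
            X, y = np.asarray(X, dtype=float), np.asarray(y)
            self.classes_ = np.unique(y)
            # Interior equal-width bin edges per feature, from training data.
            self.edges_ = [np.linspace(c.min(), c.max(), self.n_bins + 1)[1:-1]
                           for c in X.T]
            B = self._discretize(X)
            self.log_prior_, self.log_cond_ = {}, {}
            for c in self.classes_:
                Bc = B[y == c]
                self.log_prior_[c] = np.log(len(Bc) / len(B))
                # Laplace-smoothed P(bin | class) for every feature.
                self.log_cond_[c] = [
                    np.log((np.bincount(Bc[:, j], minlength=self.n_bins) + 1.0)
                           / (len(Bc) + self.n_bins))
                    for j in range(X.shape[1])]
            return self

        def _discretize(self, X):
            return np.stack([np.searchsorted(e, c)
                             for e, c in zip(self.edges_, X.T)], axis=1)

        def predict(self, X):
            # Combine per-feature conditional log-probabilities via Bayes' rule.
            B = self._discretize(np.asarray(X, dtype=float))
            return [max(self.classes_,
                        key=lambda c: self.log_prior_[c] + sum(
                            self.log_cond_[c][j][b] for j, b in enumerate(row)))
                    for row in B]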
C4.5 (Quinlan 1993) induces classifiers in the form of decision trees, by recursively splitting the set of training examples by feature values. An information-theoretic measure is applied at each node in the tree to determine which feature best divides the subset of examples covered by that node. Following the tree-construction process, a pruning step is applied to remove branches that have low estimated predictive performance.
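
The splitting measure can be illustrated with a short Python sketch of information gain for a numeric threshold split; this is a generic textbook computation, not Quinlan's code (C4.5 actually uses the closely related gain ratio):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, values, threshold):
        """Entropy reduction from splitting a numeric feature at threshold."""
        left = [l for l, v in zip(labels, values) if v <= threshold]
        right = [l for l, v in zip(labels, values) if v > threshold]
        if not left or not right:
            return 0.0
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        return entropy(labels) - w

    # A perfect split of two classes yields a gain of 1 bit.
    print(information_gain(["enz", "enz", "non", "non"],
                           [0.10, 0.20, 0.80, 0.90], 0.5))  # 1.0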
The term instance-based learning (IBL) covers a class of algorithms that store the training instances, and make predictions on a new instance I by retrieving the nearest instance N (according to some similarity metric over the feature space) and then returning the class of N as the class of I (or by making a weighted prediction from a set of nearest instances) (Aha et al. 1991).
Feature engineering Feature engineering, or the problem of identifying an appropriate set of features and feature values to characterize a learning problem, is a critical problem in real-world applications of ML algorithms. This process frequently represents a substantial part of the time spent on developing the application, and this project was no exception. Later sections describe the features we identified and their biochemical significance, the process by which we identified and removed redundant features, and the results for the alternative datasets and feature sets that were explored.
Large datasets We used extended datasets for the final evaluation of the ML methods on the enzyme classification problem. We created several versions of these datasets: a full Swiss-Prot version, and several "balanced" datasets that contain a random sampling of the proteins in the Swiss-Prot DB, selected to have a class distribution (of enzymes vs. non-enzymes) similar to the PDB dataset in the first case, and to the distribution of enzymes versus non-enzymes in complete genomes in the second case.
Sequence similarity The predictions using ML have been compared with function assignments made through sequence similarity. We used BLAST (Altschul et al. 1990) with standard search parameters and a special filtering procedure (unpublished), against the equivalent datasets from the ML experiments. Query sequences (with or without an EC number) were predicted to have the EC number of the closest homologue (if applicable). Only significant homologies were considered, with a default cut-off P-value of 10^-6 and careful manual evaluation of the DB search results. In this manner, we have obtained an accuracy estimate for the similarity-based methods. It is interesting to note that such an experiment is, to our knowledge, unique.
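
The assignment rule of this baseline can be sketched in a few lines of Python. The hit-list format below is hypothetical (in practice it would come from parsed BLAST output), and the code is our illustration of the rule described above, not the authors' pipeline:

    P_CUTOFF = 1e-6  # the default significance cut-off mentioned above

    def assign_ec_by_similarity(hits):
        """hits: list of (p_value, ec_number_or_None) pairs for one query."""
        significant = [(p, ec) for p, ec in hits if p <= P_CUTOFF and ec]
        if not significant:
            return None                  # no prediction possible
        return min(significant)[1]       # EC of the closest (lowest-P) homologue

    print(assign_ec_by_similarity([(1e-30, "1.1.1.1"), (1e-8, "2.7.1.40")]))
    # '1.1.1.1'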
`
Features
For the core set of features used as inputs to the ML programs, we used properties that can be directly computed from primary sequence information, so they can be used for predicting the function of ORFs whose structure is also unknown. Those features are the length of the amino-acid sequence, the molecular weight mw of the sequence, and the amino-acid composition, represented as 20 values {pa pc pd pe pf pg ph pi pk pl pm pn pp pq pr ps pt pv pw py} in the range from 0 to 1, each value standing for the respective residue frequency as a fraction of the total sequence length.
The feature charge was computed by summing the contributions of charged amino acids. The features ip (isoelectric point) and extinction coefficient were calculated by the program "Peptidesort" (Peptidesort is from the GCG package, version 8.0-OpenVMS, September 1994).
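
A minimal Python sketch of the directly computable features follows. The residue masses are standard average masses and the pH-7 charge assignments are a crude rule of thumb we supply for illustration; the paper computed ip and extinction with GCG's Peptidesort, which we do not reproduce. The sequence is assumed non-empty and restricted to the 20 standard residues.

    AA = "ACDEFGHIKLMNPQRSTVWY"
    # Average residue masses in Da (amino acid minus one water).
    MASS = {"A": 71.08, "C": 103.14, "D": 115.09, "E": 129.12, "F": 147.18,
            "G": 57.05, "H": 137.14, "I": 113.16, "K": 128.17, "L": 113.16,
            "M": 131.19, "N": 114.10, "P": 97.12, "Q": 128.13, "R": 156.19,
            "S": 87.08, "T": 101.10, "V": 99.13, "W": 186.21, "Y": 163.18}
    CHARGE = {"D": -1, "E": -1, "K": +1, "R": +1, "H": +0.5}  # crude pH-7 rule

    def sequence_features(seq):
        seq = seq.upper()
        n = len(seq)
        feats = {"length": n,
                 "mw": sum(MASS[a] for a in seq) + 18.02,  # add one water
                 "charge": sum(CHARGE.get(a, 0) for a in seq)}
        for a in AA:  # composition features pa..py, fractions of length
            feats["p" + a.lower()] = seq.count(a) / n
        return feats

    f = sequence_features("ACDEFGHIKLMNPQRSTVWY")
    print(f["length"], round(f["mw"], 1), f["charge"], f["pa"])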
The secondary structural features helix, strand, and turn, which we used for one experiment, were extracted from information in the FT fields of Swiss-Prot records. For all such lines with a HELIX, STRAND, or TURN keyword, the numbers of amino acids between the indicated positions were summed up, to calculate the total percentages of amino acids that are part of these structures, respectively. We included this information, since it was available for the proteins in the PDB, to see how much it would improve the prediction quality of the learned classifiers if it were available for an unknown protein. Secondary structure can be estimated from the primary sequence (although not with perfect accuracy), and using this estimated secondary structure might be worthwhile in making predictions if secondary structure proved to be a strong enough predictor of enzyme class.
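
A sketch of this extraction in Python, assuming the classic "FT   KEYWORD   from   to" line layout (real Swiss-Prot records would need more robust parsing); the code is our illustration of the summing rule described above:

    def structure_fractions(ft_lines, seq_length):
        totals = {"HELIX": 0, "STRAND": 0, "TURN": 0}
        for line in ft_lines:
            parts = line.split()
            if len(parts) >= 4 and parts[0] == "FT" and parts[1] in totals:
                start, end = int(parts[2]), int(parts[3])
                totals[parts[1]] += end - start + 1  # residues in the element
        return {k.lower(): v / seq_length for k, v in totals.items()}

    print(structure_fractions(["FT   HELIX     10     25",
                               "FT   STRAND    40     46"], 100))
    # {'helix': 0.16, 'strand': 0.07, 'turn': 0.0}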
`
Datasets
We obtained EC classes from version 21.0 of the ENZYME DB (Bairoch 1996). We prepared datasets derived from the PDB and Swiss-Prot.
Dataset 1: This family of datasets originated from the PDB subset of Swiss-Prot (see ftp://expasy.hcuge.ch/databases/Swiss-Prot/special..selections/pdb.seq.180496), containing 999 entries. Features for these protein sequences were calculated as described in the Features section above. EC numbers were extracted from the text string in the (DE) field of the Swiss-Prot records (more than one EC can occur in one entry). We created several variants of this dataset, containing different features.
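
For illustration, extracting EC numbers from a DE description string can be done with a short regular expression; the pattern below is ours, not the authors' extraction code, and it also admits a dash in the fourth field:

    import re

    EC_RE = re.compile(r"EC[ =]?(\d+\.\d+\.\d+\.(?:\d+|-))")

    def extract_ec_numbers(de_field):
        """Return all EC numbers found in a DE field (possibly more than one)."""
        return EC_RE.findall(de_field)

    print(extract_ec_numbers("DE   PYRUVATE KINASE (EC 2.7.1.40)."))
    # ['2.7.1.40']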
Dataset 1a: Features:
{length mw charge ip extinction pa pc pd pe pf pg ph pi pk pl pm pn pp pq pr ps pt pv pw py}
Dataset 1b: We dropped the ip feature and the extinction and mw features, because they were strongly correlated with charge and with length, respectively. Features:
{length charge pa pc pd pe pf pg ph pi pk pl pm pn pp pq pr ps pt pv pw py}
Dataset 1c: We reduced the feature set further by combining the composition percentages of amino acids with similar biochemical properties (Taylor 1986). The following subsets were grouped together: ag c de fwy h ilv kr m nq p st, reducing the number of amino-acid composition features from twenty to eleven. Features:
{length mw charge pag pc pde pfwy ph pilv pkr pm pnq pp pst}
Dataset 1d: Three new secondary structural features were added to the features in 1b. Features:
{length mw charge pa pc pd pe pf pg ph pi pk pl pm pn pp pq pr ps pt pv pw py helix strand turn}
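
A sketch of how Dataset 1c's grouped composition features can be derived from the 20 per-residue fractions; the group names (pag, pde, etc.) follow the text above, while the function name is ours:

    GROUPS = {"pag": "ag", "pc": "c", "pde": "de", "pfwy": "fwy", "ph": "h",
              "pilv": "ilv", "pkr": "kr", "pm": "m", "pnq": "nq", "pp": "p",
              "pst": "st"}

    def group_composition(feats):
        """feats: dict with length, mw, charge, and per-residue fractions pa..py."""
        out = {k: feats[k] for k in ("length", "mw", "charge")}
        for name, members in GROUPS.items():
            out[name] = sum(feats["p" + a] for a in members)  # e.g. pag = pa + pg
        return out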
Dataset 2: The raw data originated from the full release of Swiss-Prot version 33 (Bairoch & Boeckmann 1994), containing 52205 entries. Features were computed using the "aacomp" program (aacomp is part of the FASTA package). Secondary structural features were omitted, because only a small minority of entries carry this information. Feature set:
{length charge pa pc pd pe pf pg ph pi pk pl pm pn pp pq pr ps pt pv pw py}
`
Characterization of the Data
Table 1 provides a numerical overview of the entries in the datasets. There is a notable difference between Datasets 1 and 2 in the percentage of entries that have an EC number, perhaps because enzymes are a common object of study by crystallographers. Dataset 2 is probably closer to the natural enzyme vs. non-enzyme distribution in the protein universe.
A few entries have more than one EC number, for example, multifunctional or multidomain enzymes. We have excluded all these cases from the final dataset, on the assumption that they would introduce noise in the EC-classification experiments. Entries without any EC number are presumed not to be enzymes. However, we could envision data-entry errors of omission that would violate this assumption. We performed a search for the string "ase" in the DE field of Swiss-Prot records that lack EC numbers to find potentially mis-annotated enzymes. This search did pull out quite a few hits, which could act as false positives in the non-EC class. Dataset 2 contained too many such cases for a hand-analysis, but the Dataset 1 cases were examined. About half of the cases were enzyme inhibitors, some were enzyme precursors, and a few entries did seem to be enzymes.
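
The screen itself is a simple filter; here is an illustrative Python sketch (the tuple format for entries is hypothetical, and inhibitors show why hits need manual review):

    def suspicious_entries(entries):
        """entries: list of (id, description, ec_list) tuples (illustrative)."""
        return [eid for eid, desc, ecs in entries
                if not ecs and "ase" in desc.lower()]

    entries = [("P1", "Pyruvate kinase", []),           # flagged: likely enzyme
               ("P2", "Alpha-amylase inhibitor", []),   # flagged: false positive
               ("P3", "Pyruvate kinase", ["2.7.1.40"])] # has EC, not flagged
    print(suspicious_entries(entries))  # ['P1', 'P2']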
`
Results
We present three groups of results. The sequence of experiments reflects our exploration of various subsets of features for the prediction problems under study. The first group involves the PDB datasets; the second group involves the Swiss-Prot dataset. We explored how well the learning algorithms scaled with training-set size and composition. The third group compares the results of the learning experiments with sequence similarity, a mature technique for function prediction.
Learning experiments were conducted by preprocessing the dataset of interest to produce three different input files, one each for the level 0, level 1, and level 2 prediction problems. Preprocessing consisted of excluding non-enzymes from the level 1 and level 2 files, and formatting the class appropriately for the problem, that is, a binary value for level 0, one of six values for level 1, and one of 57 values for level 2 (actually 51, since the datasets only represented 51 of the 57 possible two-place EC numbers). We omitted level 2 experiments with the PDB datasets because 500 training instances are too few to learn 51 classes.
Each experiment consisted of performing a tenfold cross-validation using a random 90:10% partition of the data into a training set and test set, respectively. A suite of three experiments was run for each input file, one for each of the three learning algorithms DNB, C4.5, and IB. Results are reported as the percentage of test-set instances correctly classified by each algorithm, averaged over the ten cross-validation runs.
Experiments were conducted using MLC++ version 1.2 (R & D 1995), a package of ML utilities. All experiments were run on a Sun Microsystems Ultra-1 workstation with 128 MB of memory, under Solaris 2.5.1.
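
As a rough modern illustration of this protocol (the original experiments used MLC++), the Python sketch below uses scikit-learn stand-ins we chose ourselves: GaussianNB in place of discretized naive Bayes, DecisionTreeClassifier for C4.5, and a 1-nearest-neighbour classifier for IB, with ten random 90:10 partitions approximating the procedure described above:

    from sklearn.model_selection import ShuffleSplit, cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    def evaluate(X, y):
        """Mean test accuracy of each learner over ten 90:10 partitions."""
        cv = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
        learners = [("DNB-like", GaussianNB()),
                    ("C4.5-like", DecisionTreeClassifier()),
                    ("IB (1-NN)", KNeighborsClassifier(n_neighbors=1))]
        for name, clf in learners:
            scores = cross_val_score(clf, X, y, cv=cv)
            print(f"{name}: {100 * scores.mean():.2f}% mean accuracy")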
`
Results for the PDB Datasets
Experiments involving Dataset 1 are shown in Table 2. Dataset 1a provides a baseline. The results for Dataset 1b show that the features extinction, isoelectric point, and molecular weight are redundant, since each is strongly correlated with either charge or length (a principal-component analysis of the feature sets also confirmed this fact; not shown). Those features were excluded from subsequent experiments.
`
`
`
                                                 Dataset 1       Dataset 2
entries with exactly one EC number:              416 (41.6%)     14709 (28.2%)
entries without EC number:                       565 (56.6%)     36997 (70.9%)
entries with multiple EC numbers (not used):      18 (1.8%)        499 (1.0%)
total number of entries in the raw dataset:      999              52205
entries with no EC number but "ase" in name:      35 (3.5%)       2156 (4.1%)

Table 1: Summary of data characteristics
`
Dataset and Features Used        Problem   Instances    DNB     C4.5    IB
(1a) Initial features            level 0     980       79.29   76.63   78.37
(1a) Initial features            level 1     416       60.10   48.54   50.53
(1b) Nonredundant features       level 0     980       77.96   76.33   78.98
(1b) Nonredundant features       level 1     416       62.74   48.54   48.64
(1c) Amino acid grouping         level 0     980       74.59   74.49   74.08
(1c) Amino acid grouping         level 1     416       57.46   45.90   46.63
(1d) Structural, unknowns        level 0     980       77.48   77.98   76.97
(1d) Structural, unknowns        level 1     416       55.06   47.64   49.59
(1d) Structural, no unknowns     level 0     630       81.59   80.16   76.19
(1d) Structural, no unknowns     level 1     266       63.50   47.35   48.52

Table 2: Classification accuracies on the PDB dataset for various representations
`
Experiment 1c asks whether accuracy can be improved by creating new features that group amino acids according to their biochemical properties. It is surprising that the results with this representation were universally worse. It is likely that useful information is lost with this reduction. We concluded that, since prediction was better with more features, we had not yet reached an upper bound on feature-set size and could effectively add features without overwhelming any of the learning algorithms. We have not yet explored other groupings of amino acids.
Our next experiments added secondary-structure information to the feature set by including the helix, strand, and turn features. Because the values for these features were not available for a high proportion (over 50%) of instances in PDB, we conducted two suites of experiments. The first suite used all instances in our PDB dataset, and annotated missing structure features as unknown values. The second suite excluded all instances for which structural data was missing. With unknowns excluded, accuracy did improve somewhat with the addition of structure features. However, the improvement is rather small. The value of structural composition is unclear, and further exploration has been left for future work.
`
`96
`
`ISMB-97
`
Results from the Swiss-Prot Dataset

We conducted experiments using two subsets of Swiss-Prot, as well as with the full dataset and with some class-balanced datasets. Results are listed in Table 3. The first was a yeast subset, consisting of all instances for the species Saccharomyces cerevisiae (baker's yeast) and Schizosaccharomyces pombe (fission yeast). The second was a prokaryotic subset, consisting of all instances for the species E. coli, Salmonella typhimurium, Azotobacter vinelandii, Azotobacter chroococcum, Pseudomonas aeruginosa, Pseudomonas putida, Haemophilus influenzae, and various Bacillus species.
As was observed with the PDB datasets, the IB algorithm performs the best overall. Although the other two algorithms are comparable for the simpler level 0 problems, they degrade substantially more than IB does as the number of classes increases. It also appears as if IB improves generally, though not universally, as the number of training instances increases. This is most apparent in the 67.6% accuracy it attains for the full 51-class problem. This is an encouraging trend, lending hope that classification accuracy will improve as more sequence data becomes available. IB consumes substantial machine resources
`during the training p



