`
`doi:10.1093/bioinformatics/btp284
`
`Genome analysis
`Approximate Bayesian feature selection on a large meta-dataset
`offers novel insights on factors that effect siRNA potency
`Jochen W. Klingelhoefer1,†, Loukas Moutsianas2,† and Chris Holmes2,3,∗
`1Department of Biochemistry, University of Oxford, Oxford, OX1 3QU, 2Department of Statistics, University of Oxford,
`Oxford OX1 3TG and 3MRC Harwell, Oxon, OX11 0RD, UK
`Received on January 28, 2009; revised on April 19, 2009; accepted on April 22, 2009
`Advance Access publication May 6, 2009
`Associate Editor: Alison
`
`ABSTRACT
`Motivation:
`Short interfering RNA (siRNA)-induced RNA inter-
`ference is an endogenous pathway in sequence-specific gene
`silencing. The potency of different siRNAs to inhibit a common
`target varies greatly and features affecting inhibition are of high
`current interest. The limited success in predicting siRNA potency
`being reported so far could originate in the small number and the
`heterogeneity of available datasets in addition to the knowledge-
`driven, empirical basis on which features thought to be affecting
`siRNA potency are often chosen. We attempt to overcome these
`problems by first constructing a meta-dataset of 6483 publicly
`available siRNAs (targeting mammalian mRNA), the largest to date,
`and then applying a Bayesian analysis which accommodates feature
`set uncertainty. A stochastic logistic regression-based algorithm is
`designed to explore a vast model space of 497 compositional,
`structural and thermodynamic features, identifying associations with
`siRNA potency.
`Results: Our algorithm reveals a number of features associated
`with siRNA potency that are, to the best of our knowledge, either
`under reported in literature, such as anti-sense 5(cid:3)–3(cid:3) motif ‘UCU’, or
`not reported at all, such as the anti-sense 5(cid:3)-3(cid:3) motif ‘ACGA’. These
`findings should aid in improving future siRNA potency predictions
`and might offer further insights into the working of the RNA-induced
`silencing complex (RISC).
`Contact: cholmes@stats.ox.ac.uk
`Supplementary information: Supplementary data are available at
`Bioinformatics online.
`
`1 INTRODUCTION
`RNA interference (RNAi) is a post-transcriptional gene silencing
`(PTGS) mechanism which inhibits gene expression. It is mediated
`by double-stranded RNA (dsRNA) or by transcripts that form stem-
`loops (short hairpin RNA, shRNA).
`During the process of RNAi, dsRNA/shRNA is split by the
`Ribonuclease III enzyme Dicer into 19-nt dsRNA molecules with
`(cid:3)
`2-nt 3
`-overhangs, named short interfering RNA (siRNA, depicted
`in Figure 1). The siRNA then interacts with the RNA-induced
`silencing complex (RISC), specifically with its catalytic component,
`
`∗
`To whom correspondence should be addressed.
`†The authors wish it to be known that, in their opinion, the first two authors
`should be considered as joint First authors.
`
`Fig. 1. Structure of short
`interfering RNA (siRNA). Both guide and
`passenger strands are displayed and the guide strand nucleotides are
`(cid:3)
`(cid:3)
`numbered in 5
`to 3
`direction.
`
`the Argonaute protein. RISC-Argonaute separates the two strands
`of the siRNA molecule into the guide strand (anti-sense to the
`targeted mRNA), which is loaded into the Argonaute protein,
`and the passenger strand (sense to the targeted mRNA), which is
`released into the cytoplasm and subsequently degraded (Lodish
`et al., 2004). In a following step,
`the RISC–siRNA complex
`probes cytoplasmic mRNA molecules for sequence complementarity
`to the loaded siRNA guide strand. Matching mRNA molecules
`are cut around the center of the siRNA–mRNA interaction site
`(cleavage site) by the Argonaute protein, effectively inhibiting the
`expression of the respective gene product. Wang et al. have recently
`published a crystallographic structure of the complex, taken from
`an eubacterium (Wang et al., 2008).
`Not long after its discovery, it was known that not all siRNAs are
`equally potent (Holen et al. 2002). The search for potent siRNAs
`and for features that could explain their silencing capabilities has
`since become the focus of numerous research groups in the field.
`Initially, most groups conducted experiments on smaller sets of
`mRNAs. They measured the capability of different siRNA sequences
`to silence their respective gene products and tried to identify features
`that could be linked to the reported siRNA potency. Until recently,
`the focus was on compositional or thermodynamic features, both
`of which are sequence-based. Compositional features describe the
`occurrences of certain nucleotides at certain positions of the siRNA
`sequences, whereas thermodynamic features are concerned with
`binding free energies and stabilities of the sequences. A third group
`of features used in literature are structure-based ones (Shao et al.,
`2006, 2007; Vickers et al., 2003). These include secondary structure
`characteristics of both the siRNA and its mRNA target. According to
`Patzel et al., siRNAs with no defined secondary structure correlate
`with increased potency (Patzel et al., 2005). Moreover, one would
`expect target accessibility to be decisive for whether an otherwise
`
`© 2009 The Author(s)
`This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
`by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
`Downloaded from https://academic.oup.com/bioinformatics/article-abstract/25/13/1594/196518
`by guest
`on 23 March 2018
`
`Alnylam Exh. 1056
`
`
`
`Approximate Bayesian feature selection
`
`on siRNAs that target mammalian mRNA). As far as the authors
`are aware this is the largest study of its kind to date. By following a
`purely data-driven approach, we hoped to confirm most of the recent
`findings, to resolve cases with contrasting evidence from different
`studies and to discover novel features that associate with siRNA
`potency.
`in order to
`A stochastic approach was considered essential,
`efficiently explore the vast feature space of 497 compositional,
`thermodynamic and structural features that were put together to be
`tested for association with siRNAs potency. These covered most,
`if not all, of the features reported in recent studies, resulting in the
`most extensive set of features worked on so far. As a result, we
`present an efficient algorithm which succeeds in identifying novel
`features that significantly affect siRNA potency while performing
`comparably to most successful recent prediction methods (see
`Supplementary Material for more details). We then employ our
`algorithm to suggest 10 potent siRNAs for each of the human
`mRNA listings in the NCBI RefSeq database (Pruit et al., 2007;
`http://www.ncbi.nlm.nih.gov/RefSeq). These results can be found
`at our website.
`
`2 METHODS
`2.1 Consensus format for siRNA sequences
`The following set of rules was chosen in this study, defining how siRNA
`sequences and the measurement of their respective silencing efficacy are
`presented:
`
`(cid:3)
`
`(cid:3)
`
`to 3
`
`well-designed siRNA can bind to its complementary part on the
`mRNA sequence (Ding et al., 2004). Researchers who incorporated
`mRNA characteristics in their feature set reported that, although
`they appear to be correlated with siRNA potency when no additional
`features were selected, they seemed to offer little to the predictive
`strength of their models, when added to an existing set of sequence-
`based features (Katoh and Suzuki, 2007; Peek, 2007). Others,
`however, did report correlation of site accessibility with siRNA
`potency (Shao et al., 2006).
`Some researchers prefer to work on purpose-build datasets
`produced by them (Katoh and Suzuki, 2007; Reynolds et al., 2004;
`Shao et al., 2006; Ui-Tei et al., 2004), which are usually small
`and target only a few mRNAs. By contrast, others perform their
`analysis on combined data from heterogeneous sources (Holen,
`2006; Matveeva et al., 2007; Saetrom, 2004; Shabalina et al., 2006;
`Vert et al., 2007). The latter is becoming increasingly popular, as
`the amount of publicly available data steadily increases. The large
`dataset by Huesken et al. (Huesken et al., 2005) can be seen as a
`landmark towards this direction.
`A problem with studying heterogeneous datasets though is the
`variability in the biological methods employed, as well as in the
`information provided to the scientific community. As an example,
`earlier studies tended to employ 19-nt sequences (Ui-Tei et al.,
`2004), while more recent studies use 21-nt siRNA sequences,
`(cid:3)
`taking the 2-nt 3
`siRNA overhangs into account. These 2-nt
`overhangs appear to be correlated with siRNA potency (Huesken
`et al., 2005); however, this correlation cannot be tested on a 19-nt
`dataset. Matveeva et al. encounter this problem in their recent study
`(Matveeva et al., 2007), as they build two prediction models based
`on a 21-nt dataset, but can only test them on three 19-nt datasets.
`Another problem encountered when combining data from
`heterogeneous sources is the lack of detailed information regarding
`the target sequence, such as which particular transcript was targeted
`and what the exact sequence of that transcript is. Such issues render
`the incorporation of structural mRNA information into a model
`challenging.
`Having gathered the experimental data, a variety of statistical
`methods can be employed in an attempt to determine which features
`of a candidate siRNA sequence influence its potency. These include
`simple linear regression-based methods (Matveeva et al., 2007;
`Shabalina et al., 2006; Ui-Tei et al., 2004; Vert et al., 2007) as
`well as more complex methods like neural networks (Huesken
`et al., 2005; Shabalina et al., 2006), Euler graphs (Pancoska
`et al., 2004), Support Vector Machines (Ladunga, 2007; Peek,
`2007; Saetrom, 2004; Teramoto et al., 2005), genetic programming
`(Saetrom, 2004) and disjunctive rule merging (Gong et al., 2008).
`With the performance of linear regression-based methods shown to
`be comparable to most of the more complex methods (Matveeva
`et al., 2007; Shabalina et al., 2006; Vert et al., 2007), linear
`regression has proven to be a reasonable choice, given its simplicity
`of concept and implementation and the interpretability of the results
`it produces. For these reasons, it was decided to be employed in this
`study.
`We have investigated an approximate Bayesian Markov chain
`Monte Carlo feature selection algorithm using a Bayesian
`Information Criteria (BIC)
`(cf. Kass and Raftery, 1995)
`to
`approximate the Bayes factor for a logistic regression model on a
`large meta-dataset of 6483 siRNA’s that we constructed employing
`the most widely used, publicly available datasets (we focus only
`
`• siRNA sequences are stated as anti-sense sequences from 5
`(Figure 1)
`• Potency (synonymous to efficiency and efficacy) refers to the ability of
`an siRNA to inhibit (synonymous to knock-down and down-regulate)
`a gene product
`• Product level refers to the percentage of the gene product remaining
`after siRNA-mediated RNA interference
`• Product levels of zero and one are assigned to fully potent and non-
`potent siRNAs, respectively
`• Only 19-nt siRNA sequences are employed in this study (3
`overhangs are neglected)
`• Compositional features are represented with the initial of the base
`followed by its position on the anti-sense strand, e.g. U7 means Uracil
`at nucleotide position 7 of the siRNA anti-sense strand.
`
`(cid:3)
`
`siRNA
`
`2.2 Features
`Four hundred and ninety-seven features have been included in our systematic
`study spanning position-dependent nucleotide preferences, GC content of
`the siRNA sequence, presence of 2-, 3- and 4-mer sequence motifs,
`presence of known innate inferon response-stimulating motifs, occurrence of
`palindromes, thermodynamic features and structural features of the siRNA.
`Most of the features that were tested in recent studies and found to be
`correlated with siRNA potency have been included in our feature set. In
`addition, the set was enriched by features that were not been tested before
`(e.g. tetranucleotide free energy differences). For a detailed description of
`each group of features and an exhaustive list of the features included in this
`study, please refer to the Supplementary Material.
`
`2.3 Databases
`We collected data from the online database siRecords (Ren et al.,
`2009; http://sirecords.umn.edu/siRecords) and from recently published
`
`1595
`
`Downloaded from https://academic.oup.com/bioinformatics/article-abstract/25/13/1594/196518
`by guest
`on 23 March 2018
`
`
`
`J.W.Klingelhoefer et al.
`
`Table 1. Overview of datasets employed in this study
`
`Dataset
`
`Size
`
`Reference
`
`Strand
`Format
`
`SiRNA
`concentration
`
`Potency
`
`Also contained in
`
`siRecords (SIR)
`Sloan–Kettering
`(SLO)
`Isis (ISI)
`Novartis (NOV)
`Katoh (KAT)
`Shabalina (SHA)
`Saetrom (SAE)
`Phipps (PHI)
`Amgen–
`Dharmacon
`(AMG)
`
`2881
`601
`
`67
`2431
`702
`653
`537
`26
`239
`
`Ren et al. (2009)
`Jagla et al. (2005)
`
`Vickers et al. (2003)
`Huesken et al. (2005)
`Katoh and Suzuki (2007)
`Shabalina et al. (2006)c
`Saetrom (2004)c
`Phipps et al. (2004)
`Reynolds et al.
`(2004)
`Khvorova et al. (2003)
`
`S
`AS
`
`AS
`AS
`S
`AS
`AS
`S
`AS
`
`Variablea
`100 nM
`
`100 nM
`50 nM
`10/25 nM
`Variablea
`Variablea
`300 nM
`100 nM
`
`Four classes
`[0, 1]
`
`–
`–
`
`[0, 1]
`[0, 1]
`[0, 1]b
`[0, 1]
`[0, 1]
`[0, 1]b
`[0, 1]
`
`SHA, SAE, SIR
`–
`–
`SAE, SIR
`–
`SIR
`SHA, SAE, SIR
`
`The columns for dataset, size and reference refer to the name and abbreviation, the number of siRNA samples contained and the reference to the dataset, respectively. Strand format
`indicates which strand is reported in the original study: sense (S) or antisense (AS). In the next column the siRNA concentration, as reported in the respective study, is stated.
`Furthermore, potency is either reported in a continuous scale over [0, 1], where 0 represents fully potent and 1 non-potent or as discrete value, where samples are split into different
`potency classes.
`aFor these datasets, siRNA data has been collected from various experiments, all at slightly different experimental condition, so a common value for siRNA concentration used
`cannot be stated.
`bIn these datasets, potency was reported in a continuous scale over [0, 1], but fully potent entries were represented by 1, and non-potent by 0.
`cFor these combined datasets, a reference list of the individual datasets they contain is given in the main article.
`
`(Katoh and Suzuki, 2007;
`datasets, which were either purpose-built
`Phipps et al., 2004) or collections of other, earlier published ones (Matveeva
`et al., 2007; Saetrom, 2004; Shabalina et al., 2006). In a pre-processing step,
`we limited our data to siRNA sequences targeting mammalian mRNA.
`We focused at siRecords as it represents one of the largest resources
`of curated siRNA data, which is accessible to academic/non-commercial
`users by download (current release: 18 Aug 2008). Other online databases,
`such as HuSiDa (http://itb.biologie.hu-berlin.de/∼nebulus/sirna/index.htm)
`or siR (Reynolds2004, http://www.mpibpc.gwdg.de/abteilungen/100/105/
`sirna.html), are available and will be considered for inclusion into future
`versions of our dataset.
`In our report, we refer to the combined datasets of Shabalina et al.
`(Shabalina et al., 2006) and Saetrom et al. (Saetrom, 2004) as SHA
`and SAE, respectively to highlight the fact that we obtained the data
`from the aforementioned sources. These datasets though are themselves a
`heterogeneous set of data produced by many different groups (Aza-Blanc et
`al., 2003; Giddings et al., 2000; Harboth et al., 2003; Holen et al., 2002;
`Hsieh et al., 2004; Jackson et al., 2003; Kawasaki et al., 2003; Khvorova
`et al., 2003; Kumar et al., 2003; Reynolds et al., 2004; Ui-Tei et al., 2004;
`Vickers et al., 2003). A summary, highlighting the key features of each dataset
`and a reference for each, is presented in Table 1.
`The task of combining data from heterogeneous sources into a single
`meta-analysis study is not straightforward. Two of the key problems are
`(i) the inconsistency in measuring and reporting siRNA potency (Jagla
`et al., 2005) and (ii) the fact that reported results might refer to either
`sense or antisense strand siRNA sequences, and the necessary transformation
`to make findings comparable might introduce errors (Leuschner et al.,
`2006). Both of these problems are addressed in detail in the Supplementary
`Material.
`It should be highlighted that in studies which are not specifically aimed
`at assessing silencing efficacy of siRNA sequences and where candidate
`samples have been picked with the help of a design algorithm the reported
`siRNAs will not represent the full spectrum of possible sequences, eventually
`introducing a bias. Moreover, in such cases often only the potent siRNAs are
`reported, leading to over-representation of potent siRNAs over non-potent,
`as is the case for the siRecords (SIR) database. Combining data from various
`
`sources should aid in reducing these biases, allowing us to better explore the
`siRNA sequence state space.
`Another issue to be taken into consideration for all siRNA potency
`studies is the concentration of siRNA used in the silencing experiment.
`It can be seen in Table 1 that the concentrations of siRNA employed
`in the various experiments vary. It is generally accepted that silencing
`efficiency will be compromised when using too low siRNA concentrations.
`On the other hand, too high concentrations can induce non-specific, off-target
`effects (Persengiev et al., 2004), which can be reduced, albeit not alleviated
`(Jackson2003b), by reducing the siRNA concentration. Even though some
`groups report very similar siRNA potencies for concentrations varying
`between 20–100 nM (Jackson et al., 2003; Semizarov et al., 2003), there
`appears to be no consensus on the optimal amount to use—a value, which
`might well be gene- and tissue-specific.
`Although a breakdown of the data according to siRNA concentration
`employed in the study might have increased the power of our study, we
`decided not to pool our data due to the already small number of available
`siRNA samples (compared to the vast siRNA state space). Given that a
`slight concentration-specific difference in silencing efficiency is likely to
`exist in our dataset, we expect a small artificial increase in variance, which
`we deem unlikely to have any effect on the selection of the most important
`predictors.
`Concentrations were not reported for the siRecords database. However, as
`will be shown in detail below, the leave-one-out approach followed by our
`algorithm yielded similar results qualitatively when SIR data was excluded
`from the training set. This could mean that the siRNA concentrations
`employed to derive most of the entries in SIR are similar to those in
`the other datasets, or that the feature selection process is not significantly
`affected by difference in concentrations amongst datasets, or both. A proof
`of this statement will be left to a follow-up study which will hopefully
`be conducted in the light of a larger available dataset and more complete
`data.
`Starting with our pool of nine datasets (SLO, ISI, AMG, NOV, KAT, SHA,
`SAE, SIR, PHI), all datasets were first checked for uniqueness and then
`cross-checked against each other to retrieve a set of unique and independent
`datasets. ISI, AMG and SHA were found to be fully contained in SIR and
`
`1596
`
`Downloaded from https://academic.oup.com/bioinformatics/article-abstract/25/13/1594/196518
`by guest
`on 23 March 2018
`
`
`
`Table 2. Ratio of potent to non-potent entries contained in the datasets
`
`Dataset
`
`SLO
`NOV
`KAT
`SAE
`SIR
`
`Total
`
`Size
`
`601
`2431
`702
`509
`2240
`
`6438
`
`Potent
`
`Non-potent
`
`Ratio
`
`179
`1222
`176
`197
`1577
`
`3351
`
`422
`1209
`526
`312
`663
`
`3132
`
`0.4242
`1.0108
`0.3346
`0.6314
`2.3786
`
`1.067
`
`SAE, they were dropped from further investigation, leaving the following
`datasets to be included in this study:
`
`• SLO: SLO complete (601 samples)
`• NOV: NOV complete (2431 sequences)
`• KAT: KAT complete (702 sequences)
`• SIR: SIR without entries from ISI, AMG, KAT, SHA, PHI and SAE
`(2240 sequences)
`• SAE: unique entries of SHA, PHI and SAE (509 sequences)
`
`The resulting ratio of potent to non-potent entries for each database is given
`in Table 2. The combined set has a ratio of approximately 1:1.
`The full dataset for our study contained 6483 unique and experimentally
`validated 19nt siRNA sequences in anti-sense description. To classify siRNA
`entries from the various datasets in terms of product levels, we employed
`an arbitrary, but commonly used, threshold of 0.3. Product levels <0.3 and
`≥0.3 indicated potent and non-potent siRNA, respectively. This threshold
`is only employed for initial classification; the actual potency threshold
`changes to satisfy the 95% specificity criterion, as explained in detail in
`the Supplementary Material.
`
`2.4 Algorithm
`A strictly data-driven approach was followed for the model selection with
`no a priori assumptions about biological significance or relative importance
`of features were incorporated into our model at any time. All features were
`considered equally likely to affect siRNA potency.
`Our model exploration algorithm generated models by sampling features
`from their posterior probability given the data,
`through the stochastic
`selection from the 497-element feature set and calculating the respective
`logistic regression parameters for subsets of the features (see details below).
`Each generated model was a candidate for inclusion in the final model set,
`which was then used to make predictions of siRNA potency, in the form of
`remaining product levels after siRNA activity.
`A BIC (Bayesian Information Criterion) scheme was employed as a
`probabilistic scoring measure. The BIC approximates for any feature set the
`log posterior probability under a Bayesian model (cf. Kass and Raftery, 1995)
`and is proportional to L– 1/2 p log n, where L is the maximum log-likelihood,
`n is the number of samples in the training set, and p the number of selected
`features. The BIC is an approximation to the Bayesian marginal likelihood
`which integrates out over uncertainty in parameter coefficient values. The
`Bayes marginal likelihood is known to contain a natural penalty against
`over-complex models and hence parsimonious models are preferred under
`this approach (cf. Bernardo and Smith, 2000). We use a Markov chain Monte
`Carlo (MCMC) algorithm, employing the BIC as an approximation to the
`true marginal likelihood of any model, to generate models (or feature sets)
`in proportion to their posterior probability. MCMC is a generic approach to
`generating samples from a complex target distribution and is well suited to
`the task of variable set uncertainty (cf. Gelman et al., 2003)
`Hence we probe the 497-dimensional feature space using a MCMC
`algorithm, as detailed below, from which we are able to characterize feature
`
`Approximate Bayesian feature selection
`
`set relevance, (cf. Kass and Raftery, 1995; Kass and Wasserman, 1995;
`Raftery, 1995).
`The algorithm starts from an initial feature set and evolves it over time in
`a stepwise manner, which is outlined in the following:
`
`(i) INITIAL STEP: An initial feature set is chosen and the respective BIC-
`weighted log-likelihood of the model, comprised by these features, is
`calculated. The feature list and the penalized log-likelihood are referred
`to as features_active and result_active in the following.
`
`For each step one of the following is done with equal chance:
`
`(a) REMOVE: Randomly remove a feature from features_active
`(b) ADD: Randomly add a feature to features_active
`(c) SWAP: randomly add a feature to features_active while removing
`another feature at the same time
`
`(ii) The modified feature list is then stored as features_proposed and the
`BIC of the new model is calculated and stored in result_proposed
`(iii) IF result_proposed > result_active (the proposed model performs
`better)
`
`(a) Accept the proposed model
`(b) features_proposed becomes features_active
`(c) result_proposed becomes result_active
`
`(iv) ELSE (the proposed model performs worse)
`
`(a) Calculate a measure of change between result_proposed and
`result_active as
`(b) α = exp(result_proposed – result_active),
`(c) Draw a random number from a unit distribution RAND∼U(0,1).
`(d) IF RAND ≤ α,
`(1) Accept the proposed model even though it performs worse
`than the active one
`(2) features_proposed becomes features_active
`(3) result_proposed becomes result_active
`
`(e) ELSE
`(1) Reject the proposed model
`(2) features_active remains unchanged
`(3) result_active remains unchanged
`
`(v) Store features_active in features_sets(i)
`(vi) REPEAT steps 1, 2 for i = 1,2,…,NRUNS
`(vii) Skip the first NSKIP entries of features_sets(i) which represent the burn-
`in phase of the algorithm, containing more than average variations in
`the features selected.
`(viii) Check the remaining features sets for unique feature sets and store them
`in unique_feature_sets. Store the number of percental appearance of
`each unique feature set in kMODEL
`(ix) Calculate predictions for all of the unique models, each being defined
`by an entry in unique_feature_sets
`(x) Weight each prediction according to the respective entry in kMODEL.
`
`(a) Both unique_feature_sets and kMODEL contain NUNIQUE entries.
`
`(xi) FINAL STEP: The final prediction of our Monte-Carlo algorithm is a
`kMODEL-weighted average of the predictions of the NUNIQUE models
`defined in unique_feature_sets.
`
`The above scheme samples the feature sets according to their posterior
`probability measure Pr( feature set | data). This in turn means that multiple
`models with high predictive power are explored and, due to the employed
`model averaging, the predictive power of a large set of individual models is
`combined into one single prediction.
`
`1597
`
`Downloaded from https://academic.oup.com/bioinformatics/article-abstract/25/13/1594/196518
`by guest
`on 23 March 2018
`
`
`
`J.W.Klingelhoefer et al.
`
`Table 3. Percentile occurrences of the ‘UCU’ and ’ACGA’ motifs
`
`Table 4. List of features that were most dominant in the generated models
`
`Dataset
`
`All datasets (%)
`
`Excluding SIR (%)
`
`Feat. ID Feat No Occurrence (%) Corr. Coeff. Feature explanation
`
`Potent entries (all samples)
`Potent entries (only samples
`containing ‘UCU’)
`Potent entries (only samples
`containing ‘ACGA’)
`
`51.7
`60.1
`
`36.5
`
`41.5
`49.4
`
`25.4
`
`Comparison of percentile occurrences of the ‘UCU’ and ‘ACGA’ motifs in potent siRNA
`sequences for the case that all five of our datasets (SLO, NOV, KAT, SAE, SIR) are
`considered and the case that the SIR dataset is excluded.
`
`3 RESULTS
`We ran our Bayesian Markov Chain Monte Carlo (MCMC) feature
`selection algorithm for 1 100 000 iterations—including a 100 000
`iteration burn-in phase—starting three times with the Matveeva et al.
`Model2 (Matveeva et al., 2007) as initial feature set, and two more
`times with initial models composed of randomly picked features. For
`each initial model, we trained our algorithm on five different training
`datasets and then went on to test its performance on a test dataset.
`The training datasets for each of these runs were composed of four
`of our five datasets (SLO, NOV, KAT, SAE, SIR) and the remaining
`dataset was used as the test dataset (leave-one-out scheme). By using
`all five possible combinations, we arrived at a total of 25 runs (five
`initial models times five training/test dataset combinations), together
`representing 25 000 000 iterations of our algorithm.
`The output of each of these 25 runs consisted of a list of percentile
`appearance of each of the 497 features during the 1 000 000 Bayesian
`MCMC iterations of the respective run. In a next step, all 25 lists
`retrieved from our runs were combined and the overall percentile
`occurrence of each feature in all 25 runs (representing 25 000 000
`iterations of our algorithm) was calculated. A final list with the 19
`features most dominant in all models generated by the Bayesian
`MCMC algorithm is depicted in the Table 3. These are features
`contained in more than 33% of the models generated in all 25
`runs. This cut-off is arbitrary and was chosen so that the number
`of features selected by our algorithm matches that of the model
`against which we validated our results (Matveeva et al., 2007, cf.
`Supplementary Material). The results of the model, comprised of
`only these 19 features, compare favourably with the performance
`of the most recent algorithms, further validating our approach to
`feature selection (cf. Supplementary Material).
`By using varying initial models for our Bayesian MCMC
`algorithm, we ensured that no bias was present in our simulations.
`Furthermore, the fact that all runs of the algorithm—irrespectively of
`the initial feature set—lead to comparable results indicates that our
`algorithm converged. This is reinforced by the fact that the average
`model sizes do not change significantly between different runs (cf.
`Supplementary Material).
`The fact that only thirteen of the 497 features were present in
`more than 50% of all Monte-Carlo generated models can be taken
`as an indicator of the complexity of the silencing mechanism and
`the variability of the factors that can potentially affect it. One could
`argue that the vast feature space would have led to some degree
`of overlap, resulting in a spreading of the intensity of a stronger
`signal over a range of features. The inclusion of a SWAP step in
`our Bayesian MCMC algorithm, as well as of features spanning
`different classes (compositional, motifs, structural, thermodynamic)
`
`1598
`
`1
`2
`
`3
`4
`5
`6
`
`7
`8
`9
`10
`11
`12
`
`13
`14
`15
`
`16
`
`17
`18
`
`19
`
`11
`140
`
`20
`40
`433
`210
`
`437
`38
`34
`483
`426
`431
`
`2
`491
`125
`
`450
`
`492
`259
`
`347
`
`94.84
`93.24
`
`88.84
`84.84
`84.54
`78.65
`
`75.42
`69.76
`67.75
`62.84
`61.13
`58.50
`
`58.13
`42.33
`39.40
`
`37.30
`
`35.80
`33.49
`
`32.84
`
`NT10 is ‘A’ (cleavage site)
`0.1029
`0.1485 Motif ‘UCU’ is present in
`siRNA
`−0.1213
`NT19 is ‘A’
`0.2176
`NT1 is ‘U’
`0.2709 G in NT1..NT4 (dG1-4)
`−0.0972 Motif ‘ACGA’ is present
`in siRNA
`0.1492 G in NT5..NT8 (dG5-8)
`−0.0034
`NT18 is ‘G’
`−0.1045
`NT14 is ‘G’
`0.0415
`GC content > 35%
`0.1286 G in NT13..NT14
`−0.1495 G in NT18..NT19
`(dG18-19)
`0.0884
`NT1 is ‘A’
`0.2009
`GC content < 70%
`−0.1323 Motif ‘GCC’ is present in
`siRNA
`−0.1957
`Folding is present in
`siRNA (binary value)
`GC content < 75%
`0.2280
`−0.0911 Motif ‘GUGG’ is present
`in siRNA
`0.0217 Motif ‘UCCG’ is present
`in siRNA
`
`List of features that were most dominant in the models generated by our Bayesian
`Markov chain Monte Carlo algorithm. The columns depict the overall percentile
`appearance of each feature (100% representing an appearance of 25 000 000 times in
`the models generated by our algorithm), the correlation between the feature and the
`product level variable (positive correlation coefficient indicates an increasing of siRNA
`potency), as well as its biological meaning. The first thirteen features appear in more
`than 50% of the runs.
`
`in our analysis should solve overlap issues though, minimising any
`masking of significant features by less influential ones.
`
`3.1 Motifs
`It is particularly interesting that one of the most important features
`(cid:3)
`(cid:3)
`detected by our algorithm is the 3-mer 5
`–3
`motif, ‘UCU’, which
`has, in the main, escaped the notice of previous siRNA potency
`studies. More specifically, Teramoto et al. (Teramoto2005) do not
`report it at all, whereas Vert et al. (Vert et al., 2007) include it in their
`table of included features, but do not make any special reference to
`it in their main text or any other distinction between ‘UCU’ and all
`other motifs studied by them.
`‘UCU’ has a positive correlation coefficient, which means that it
`increases siRNA potency. Including all datasets, 60% of the siRNAs
`containing the motif are potent for a product level threshold of 0.3
`(cf. Table 4). More specifically, motif ‘UCU’ was found present at
`least once in 2560 siRNAs, of which 1538 were potent. Exclusion
`of the SIR dataset, where the product level was stated as an ordinal
`and not as a continuous value, led to qualitatively similar findings,
`strengthening the confidence in this finding. A detailed breakdown
`of the motif’s occurrences by position in the siRNA sequence can
`be found in Supplementary Material.
`
`Downloaded from https://academic.oup.com/bioinformatics/article-abstract/25/13/1594/196518
`by guest
`on 23 March 2018
`
`
`
`It is important to highlight that the positive correlation found
`between ‘UCU’ and silencing efficacy is unlikely to reflect
`a structural or compositional characteristic of
`the sequences
`containing the specific motif, as such features were checked for
`correlation by our Bayesian MCMC algorithm, but were not selected
`as significant. Other reasons, such as a possible codon-specific bias,
`can also be excluded (cf. Table 4). Hence, this motif appears to
`complement the thermodynamic and compositional characteristics
`that have been previously found to affect siRNA potency, seemingly
`having a role of its own in the silencing process.
`Having