throbber
BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 13 2009, pages 1594–1601
`
`doi:10.1093/bioinformatics/btp284
`
`Genome analysis
`Approximate Bayesian feature selection on a large meta-dataset
`offers novel insights on factors that effect siRNA potency
`Jochen W. Klingelhoefer1,†, Loukas Moutsianas2,† and Chris Holmes2,3,∗
`1Department of Biochemistry, University of Oxford, Oxford, OX1 3QU, 2Department of Statistics, University of Oxford,
`Oxford OX1 3TG and 3MRC Harwell, Oxon, OX11 0RD, UK
`Received on January 28, 2009; revised on April 19, 2009; accepted on April 22, 2009
`Advance Access publication May 6, 2009
`Associate Editor: Alison
`
`ABSTRACT
`Motivation:
`Short interfering RNA (siRNA)-induced RNA inter-
`ference is an endogenous pathway in sequence-specific gene
`silencing. The potency of different siRNAs to inhibit a common
`target varies greatly and features affecting inhibition are of high
`current interest. The limited success in predicting siRNA potency
`being reported so far could originate in the small number and the
`heterogeneity of available datasets in addition to the knowledge-
`driven, empirical basis on which features thought to be affecting
`siRNA potency are often chosen. We attempt to overcome these
`problems by first constructing a meta-dataset of 6483 publicly
`available siRNAs (targeting mammalian mRNA), the largest to date,
`and then applying a Bayesian analysis which accommodates feature
`set uncertainty. A stochastic logistic regression-based algorithm is
`designed to explore a vast model space of 497 compositional,
`structural and thermodynamic features, identifying associations with
`siRNA potency.
`Results: Our algorithm reveals a number of features associated
`with siRNA potency that are, to the best of our knowledge, either
`under reported in literature, such as anti-sense 5(cid:3)–3(cid:3) motif ‘UCU’, or
`not reported at all, such as the anti-sense 5(cid:3)-3(cid:3) motif ‘ACGA’. These
`findings should aid in improving future siRNA potency predictions
`and might offer further insights into the working of the RNA-induced
`silencing complex (RISC).
`Contact: cholmes@stats.ox.ac.uk
`Supplementary information: Supplementary data are available at
`Bioinformatics online.
`
`1 INTRODUCTION
`RNA interference (RNAi) is a post-transcriptional gene silencing
`(PTGS) mechanism which inhibits gene expression. It is mediated
`by double-stranded RNA (dsRNA) or by transcripts that form stem-
`loops (short hairpin RNA, shRNA).
`During the process of RNAi, dsRNA/shRNA is split by the
`Ribonuclease III enzyme Dicer into 19-nt dsRNA molecules with
`(cid:3)
`2-nt 3
`-overhangs, named short interfering RNA (siRNA, depicted
`in Figure 1). The siRNA then interacts with the RNA-induced
`silencing complex (RISC), specifically with its catalytic component,
`
`∗
`To whom correspondence should be addressed.
`†The authors wish it to be known that, in their opinion, the first two authors
`should be considered as joint First authors.
`
`Fig. 1. Structure of short
`interfering RNA (siRNA). Both guide and
`passenger strands are displayed and the guide strand nucleotides are
`(cid:3)
`(cid:3)
`numbered in 5
`to 3
`direction.
`
`the Argonaute protein. RISC-Argonaute separates the two strands
`of the siRNA molecule into the guide strand (anti-sense to the
`targeted mRNA), which is loaded into the Argonaute protein,
`and the passenger strand (sense to the targeted mRNA), which is
`released into the cytoplasm and subsequently degraded (Lodish
`et al., 2004). In a following step,
`the RISC–siRNA complex
`probes cytoplasmic mRNA molecules for sequence complementarity
`to the loaded siRNA guide strand. Matching mRNA molecules
`are cut around the center of the siRNA–mRNA interaction site
`(cleavage site) by the Argonaute protein, effectively inhibiting the
`expression of the respective gene product. Wang et al. have recently
`published a crystallographic structure of the complex, taken from
`an eubacterium (Wang et al., 2008).
`Not long after its discovery, it was known that not all siRNAs are
`equally potent (Holen et al. 2002). The search for potent siRNAs
`and for features that could explain their silencing capabilities has
`since become the focus of numerous research groups in the field.
`Initially, most groups conducted experiments on smaller sets of
`mRNAs. They measured the capability of different siRNA sequences
`to silence their respective gene products and tried to identify features
`that could be linked to the reported siRNA potency. Until recently,
`the focus was on compositional or thermodynamic features, both
`of which are sequence-based. Compositional features describe the
`occurrences of certain nucleotides at certain positions of the siRNA
`sequences, whereas thermodynamic features are concerned with
`binding free energies and stabilities of the sequences. A third group
`of features used in literature are structure-based ones (Shao et al.,
`2006, 2007; Vickers et al., 2003). These include secondary structure
`characteristics of both the siRNA and its mRNA target. According to
`Patzel et al., siRNAs with no defined secondary structure correlate
`with increased potency (Patzel et al., 2005). Moreover, one would
`expect target accessibility to be decisive for whether an otherwise
`
`© 2009 The Author(s)
`This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
`by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
`Downloaded from https://academic.oup.com/bioinformatics/article-abstract/25/13/1594/196518
`by guest
`on 23 March 2018
`
`Alnylam Exh. 1056
`
`

`

`Approximate Bayesian feature selection
`
`on siRNAs that target mammalian mRNA). As far as the authors
`are aware this is the largest study of its kind to date. By following a
`purely data-driven approach, we hoped to confirm most of the recent
`findings, to resolve cases with contrasting evidence from different
`studies and to discover novel features that associate with siRNA
`potency.
`in order to
`A stochastic approach was considered essential,
`efficiently explore the vast feature space of 497 compositional,
`thermodynamic and structural features that were put together to be
`tested for association with siRNAs potency. These covered most,
`if not all, of the features reported in recent studies, resulting in the
`most extensive set of features worked on so far. As a result, we
`present an efficient algorithm which succeeds in identifying novel
`features that significantly affect siRNA potency while performing
`comparably to most successful recent prediction methods (see
`Supplementary Material for more details). We then employ our
`algorithm to suggest 10 potent siRNAs for each of the human
`mRNA listings in the NCBI RefSeq database (Pruit et al., 2007;
`http://www.ncbi.nlm.nih.gov/RefSeq). These results can be found
`at our website.
`
`2 METHODS
`2.1 Consensus format for siRNA sequences
`The following set of rules was chosen in this study, defining how siRNA
`sequences and the measurement of their respective silencing efficacy are
`presented:
`
`(cid:3)
`
`(cid:3)
`
`to 3
`
`well-designed siRNA can bind to its complementary part on the
`mRNA sequence (Ding et al., 2004). Researchers who incorporated
`mRNA characteristics in their feature set reported that, although
`they appear to be correlated with siRNA potency when no additional
`features were selected, they seemed to offer little to the predictive
`strength of their models, when added to an existing set of sequence-
`based features (Katoh and Suzuki, 2007; Peek, 2007). Others,
`however, did report correlation of site accessibility with siRNA
`potency (Shao et al., 2006).
`Some researchers prefer to work on purpose-build datasets
`produced by them (Katoh and Suzuki, 2007; Reynolds et al., 2004;
`Shao et al., 2006; Ui-Tei et al., 2004), which are usually small
`and target only a few mRNAs. By contrast, others perform their
`analysis on combined data from heterogeneous sources (Holen,
`2006; Matveeva et al., 2007; Saetrom, 2004; Shabalina et al., 2006;
`Vert et al., 2007). The latter is becoming increasingly popular, as
`the amount of publicly available data steadily increases. The large
`dataset by Huesken et al. (Huesken et al., 2005) can be seen as a
`landmark towards this direction.
`A problem with studying heterogeneous datasets though is the
`variability in the biological methods employed, as well as in the
`information provided to the scientific community. As an example,
`earlier studies tended to employ 19-nt sequences (Ui-Tei et al.,
`2004), while more recent studies use 21-nt siRNA sequences,
`(cid:3)
`taking the 2-nt 3
`siRNA overhangs into account. These 2-nt
`overhangs appear to be correlated with siRNA potency (Huesken
`et al., 2005); however, this correlation cannot be tested on a 19-nt
`dataset. Matveeva et al. encounter this problem in their recent study
`(Matveeva et al., 2007), as they build two prediction models based
`on a 21-nt dataset, but can only test them on three 19-nt datasets.
`Another problem encountered when combining data from
`heterogeneous sources is the lack of detailed information regarding
`the target sequence, such as which particular transcript was targeted
`and what the exact sequence of that transcript is. Such issues render
`the incorporation of structural mRNA information into a model
`challenging.
`Having gathered the experimental data, a variety of statistical
`methods can be employed in an attempt to determine which features
`of a candidate siRNA sequence influence its potency. These include
`simple linear regression-based methods (Matveeva et al., 2007;
`Shabalina et al., 2006; Ui-Tei et al., 2004; Vert et al., 2007) as
`well as more complex methods like neural networks (Huesken
`et al., 2005; Shabalina et al., 2006), Euler graphs (Pancoska
`et al., 2004), Support Vector Machines (Ladunga, 2007; Peek,
`2007; Saetrom, 2004; Teramoto et al., 2005), genetic programming
`(Saetrom, 2004) and disjunctive rule merging (Gong et al., 2008).
`With the performance of linear regression-based methods shown to
`be comparable to most of the more complex methods (Matveeva
`et al., 2007; Shabalina et al., 2006; Vert et al., 2007), linear
`regression has proven to be a reasonable choice, given its simplicity
`of concept and implementation and the interpretability of the results
`it produces. For these reasons, it was decided to be employed in this
`study.
`We have investigated an approximate Bayesian Markov chain
`Monte Carlo feature selection algorithm using a Bayesian
`Information Criteria (BIC)
`(cf. Kass and Raftery, 1995)
`to
`approximate the Bayes factor for a logistic regression model on a
`large meta-dataset of 6483 siRNA’s that we constructed employing
`the most widely used, publicly available datasets (we focus only
`
`• siRNA sequences are stated as anti-sense sequences from 5
`(Figure 1)
`• Potency (synonymous to efficiency and efficacy) refers to the ability of
`an siRNA to inhibit (synonymous to knock-down and down-regulate)
`a gene product
`• Product level refers to the percentage of the gene product remaining
`after siRNA-mediated RNA interference
`• Product levels of zero and one are assigned to fully potent and non-
`potent siRNAs, respectively
`• Only 19-nt siRNA sequences are employed in this study (3
`overhangs are neglected)
`• Compositional features are represented with the initial of the base
`followed by its position on the anti-sense strand, e.g. U7 means Uracil
`at nucleotide position 7 of the siRNA anti-sense strand.
`
`(cid:3)
`
`siRNA
`
`2.2 Features
`Four hundred and ninety-seven features have been included in our systematic
`study spanning position-dependent nucleotide preferences, GC content of
`the siRNA sequence, presence of 2-, 3- and 4-mer sequence motifs,
`presence of known innate inferon response-stimulating motifs, occurrence of
`palindromes, thermodynamic features and structural features of the siRNA.
`Most of the features that were tested in recent studies and found to be
`correlated with siRNA potency have been included in our feature set. In
`addition, the set was enriched by features that were not been tested before
`(e.g. tetranucleotide free energy differences). For a detailed description of
`each group of features and an exhaustive list of the features included in this
`study, please refer to the Supplementary Material.
`
`2.3 Databases
`We collected data from the online database siRecords (Ren et al.,
`2009; http://sirecords.umn.edu/siRecords) and from recently published
`
`1595
`
`Downloaded from https://academic.oup.com/bioinformatics/article-abstract/25/13/1594/196518
`by guest
`on 23 March 2018
`
`

`

`J.W.Klingelhoefer et al.
`
`Table 1. Overview of datasets employed in this study
`
`Dataset
`
`Size
`
`Reference
`
`Strand
`Format
`
`SiRNA
`concentration
`
`Potency
`
`Also contained in
`
`siRecords (SIR)
`Sloan–Kettering
`(SLO)
`Isis (ISI)
`Novartis (NOV)
`Katoh (KAT)
`Shabalina (SHA)
`Saetrom (SAE)
`Phipps (PHI)
`Amgen–
`Dharmacon
`(AMG)
`
`2881
`601
`
`67
`2431
`702
`653
`537
`26
`239
`
`Ren et al. (2009)
`Jagla et al. (2005)
`
`Vickers et al. (2003)
`Huesken et al. (2005)
`Katoh and Suzuki (2007)
`Shabalina et al. (2006)c
`Saetrom (2004)c
`Phipps et al. (2004)
`Reynolds et al.
`(2004)
`Khvorova et al. (2003)
`
`S
`AS
`
`AS
`AS
`S
`AS
`AS
`S
`AS
`
`Variablea
`100 nM
`
`100 nM
`50 nM
`10/25 nM
`Variablea
`Variablea
`300 nM
`100 nM
`
`Four classes
`[0, 1]
`
`–
`–
`
`[0, 1]
`[0, 1]
`[0, 1]b
`[0, 1]
`[0, 1]
`[0, 1]b
`[0, 1]
`
`SHA, SAE, SIR
`–
`–
`SAE, SIR
`–
`SIR
`SHA, SAE, SIR
`
`The columns for dataset, size and reference refer to the name and abbreviation, the number of siRNA samples contained and the reference to the dataset, respectively. Strand format
`indicates which strand is reported in the original study: sense (S) or antisense (AS). In the next column the siRNA concentration, as reported in the respective study, is stated.
`Furthermore, potency is either reported in a continuous scale over [0, 1], where 0 represents fully potent and 1 non-potent or as discrete value, where samples are split into different
`potency classes.
`aFor these datasets, siRNA data has been collected from various experiments, all at slightly different experimental condition, so a common value for siRNA concentration used
`cannot be stated.
`bIn these datasets, potency was reported in a continuous scale over [0, 1], but fully potent entries were represented by 1, and non-potent by 0.
`cFor these combined datasets, a reference list of the individual datasets they contain is given in the main article.
`
`(Katoh and Suzuki, 2007;
`datasets, which were either purpose-built
`Phipps et al., 2004) or collections of other, earlier published ones (Matveeva
`et al., 2007; Saetrom, 2004; Shabalina et al., 2006). In a pre-processing step,
`we limited our data to siRNA sequences targeting mammalian mRNA.
`We focused at siRecords as it represents one of the largest resources
`of curated siRNA data, which is accessible to academic/non-commercial
`users by download (current release: 18 Aug 2008). Other online databases,
`such as HuSiDa (http://itb.biologie.hu-berlin.de/∼nebulus/sirna/index.htm)
`or siR (Reynolds2004, http://www.mpibpc.gwdg.de/abteilungen/100/105/
`sirna.html), are available and will be considered for inclusion into future
`versions of our dataset.
`In our report, we refer to the combined datasets of Shabalina et al.
`(Shabalina et al., 2006) and Saetrom et al. (Saetrom, 2004) as SHA
`and SAE, respectively to highlight the fact that we obtained the data
`from the aforementioned sources. These datasets though are themselves a
`heterogeneous set of data produced by many different groups (Aza-Blanc et
`al., 2003; Giddings et al., 2000; Harboth et al., 2003; Holen et al., 2002;
`Hsieh et al., 2004; Jackson et al., 2003; Kawasaki et al., 2003; Khvorova
`et al., 2003; Kumar et al., 2003; Reynolds et al., 2004; Ui-Tei et al., 2004;
`Vickers et al., 2003). A summary, highlighting the key features of each dataset
`and a reference for each, is presented in Table 1.
`The task of combining data from heterogeneous sources into a single
`meta-analysis study is not straightforward. Two of the key problems are
`(i) the inconsistency in measuring and reporting siRNA potency (Jagla
`et al., 2005) and (ii) the fact that reported results might refer to either
`sense or antisense strand siRNA sequences, and the necessary transformation
`to make findings comparable might introduce errors (Leuschner et al.,
`2006). Both of these problems are addressed in detail in the Supplementary
`Material.
`It should be highlighted that in studies which are not specifically aimed
`at assessing silencing efficacy of siRNA sequences and where candidate
`samples have been picked with the help of a design algorithm the reported
`siRNAs will not represent the full spectrum of possible sequences, eventually
`introducing a bias. Moreover, in such cases often only the potent siRNAs are
`reported, leading to over-representation of potent siRNAs over non-potent,
`as is the case for the siRecords (SIR) database. Combining data from various
`
`sources should aid in reducing these biases, allowing us to better explore the
`siRNA sequence state space.
`Another issue to be taken into consideration for all siRNA potency
`studies is the concentration of siRNA used in the silencing experiment.
`It can be seen in Table 1 that the concentrations of siRNA employed
`in the various experiments vary. It is generally accepted that silencing
`efficiency will be compromised when using too low siRNA concentrations.
`On the other hand, too high concentrations can induce non-specific, off-target
`effects (Persengiev et al., 2004), which can be reduced, albeit not alleviated
`(Jackson2003b), by reducing the siRNA concentration. Even though some
`groups report very similar siRNA potencies for concentrations varying
`between 20–100 nM (Jackson et al., 2003; Semizarov et al., 2003), there
`appears to be no consensus on the optimal amount to use—a value, which
`might well be gene- and tissue-specific.
`Although a breakdown of the data according to siRNA concentration
`employed in the study might have increased the power of our study, we
`decided not to pool our data due to the already small number of available
`siRNA samples (compared to the vast siRNA state space). Given that a
`slight concentration-specific difference in silencing efficiency is likely to
`exist in our dataset, we expect a small artificial increase in variance, which
`we deem unlikely to have any effect on the selection of the most important
`predictors.
`Concentrations were not reported for the siRecords database. However, as
`will be shown in detail below, the leave-one-out approach followed by our
`algorithm yielded similar results qualitatively when SIR data was excluded
`from the training set. This could mean that the siRNA concentrations
`employed to derive most of the entries in SIR are similar to those in
`the other datasets, or that the feature selection process is not significantly
`affected by difference in concentrations amongst datasets, or both. A proof
`of this statement will be left to a follow-up study which will hopefully
`be conducted in the light of a larger available dataset and more complete
`data.
`Starting with our pool of nine datasets (SLO, ISI, AMG, NOV, KAT, SHA,
`SAE, SIR, PHI), all datasets were first checked for uniqueness and then
`cross-checked against each other to retrieve a set of unique and independent
`datasets. ISI, AMG and SHA were found to be fully contained in SIR and
`
`1596
`
`Downloaded from https://academic.oup.com/bioinformatics/article-abstract/25/13/1594/196518
`by guest
`on 23 March 2018
`
`

`

`Table 2. Ratio of potent to non-potent entries contained in the datasets
`
`Dataset
`
`SLO
`NOV
`KAT
`SAE
`SIR
`
`Total
`
`Size
`
`601
`2431
`702
`509
`2240
`
`6438
`
`Potent
`
`Non-potent
`
`Ratio
`
`179
`1222
`176
`197
`1577
`
`3351
`
`422
`1209
`526
`312
`663
`
`3132
`
`0.4242
`1.0108
`0.3346
`0.6314
`2.3786
`
`1.067
`
`SAE, they were dropped from further investigation, leaving the following
`datasets to be included in this study:
`
`• SLO: SLO complete (601 samples)
`• NOV: NOV complete (2431 sequences)
`• KAT: KAT complete (702 sequences)
`• SIR: SIR without entries from ISI, AMG, KAT, SHA, PHI and SAE
`(2240 sequences)
`• SAE: unique entries of SHA, PHI and SAE (509 sequences)
`
`The resulting ratio of potent to non-potent entries for each database is given
`in Table 2. The combined set has a ratio of approximately 1:1.
`The full dataset for our study contained 6483 unique and experimentally
`validated 19nt siRNA sequences in anti-sense description. To classify siRNA
`entries from the various datasets in terms of product levels, we employed
`an arbitrary, but commonly used, threshold of 0.3. Product levels <0.3 and
`≥0.3 indicated potent and non-potent siRNA, respectively. This threshold
`is only employed for initial classification; the actual potency threshold
`changes to satisfy the 95% specificity criterion, as explained in detail in
`the Supplementary Material.
`
`2.4 Algorithm
`A strictly data-driven approach was followed for the model selection with
`no a priori assumptions about biological significance or relative importance
`of features were incorporated into our model at any time. All features were
`considered equally likely to affect siRNA potency.
`Our model exploration algorithm generated models by sampling features
`from their posterior probability given the data,
`through the stochastic
`selection from the 497-element feature set and calculating the respective
`logistic regression parameters for subsets of the features (see details below).
`Each generated model was a candidate for inclusion in the final model set,
`which was then used to make predictions of siRNA potency, in the form of
`remaining product levels after siRNA activity.
`A BIC (Bayesian Information Criterion) scheme was employed as a
`probabilistic scoring measure. The BIC approximates for any feature set the
`log posterior probability under a Bayesian model (cf. Kass and Raftery, 1995)
`and is proportional to L– 1/2 p log n, where L is the maximum log-likelihood,
`n is the number of samples in the training set, and p the number of selected
`features. The BIC is an approximation to the Bayesian marginal likelihood
`which integrates out over uncertainty in parameter coefficient values. The
`Bayes marginal likelihood is known to contain a natural penalty against
`over-complex models and hence parsimonious models are preferred under
`this approach (cf. Bernardo and Smith, 2000). We use a Markov chain Monte
`Carlo (MCMC) algorithm, employing the BIC as an approximation to the
`true marginal likelihood of any model, to generate models (or feature sets)
`in proportion to their posterior probability. MCMC is a generic approach to
`generating samples from a complex target distribution and is well suited to
`the task of variable set uncertainty (cf. Gelman et al., 2003)
`Hence we probe the 497-dimensional feature space using a MCMC
`algorithm, as detailed below, from which we are able to characterize feature
`
`Approximate Bayesian feature selection
`
`set relevance, (cf. Kass and Raftery, 1995; Kass and Wasserman, 1995;
`Raftery, 1995).
`The algorithm starts from an initial feature set and evolves it over time in
`a stepwise manner, which is outlined in the following:
`
`(i) INITIAL STEP: An initial feature set is chosen and the respective BIC-
`weighted log-likelihood of the model, comprised by these features, is
`calculated. The feature list and the penalized log-likelihood are referred
`to as features_active and result_active in the following.
`
`For each step one of the following is done with equal chance:
`
`(a) REMOVE: Randomly remove a feature from features_active
`(b) ADD: Randomly add a feature to features_active
`(c) SWAP: randomly add a feature to features_active while removing
`another feature at the same time
`
`(ii) The modified feature list is then stored as features_proposed and the
`BIC of the new model is calculated and stored in result_proposed
`(iii) IF result_proposed > result_active (the proposed model performs
`better)
`
`(a) Accept the proposed model
`(b) features_proposed becomes features_active
`(c) result_proposed becomes result_active
`
`(iv) ELSE (the proposed model performs worse)
`
`(a) Calculate a measure of change between result_proposed and
`result_active as
`(b) α = exp(result_proposed – result_active),
`(c) Draw a random number from a unit distribution RAND∼U(0,1).
`(d) IF RAND ≤ α,
`(1) Accept the proposed model even though it performs worse
`than the active one
`(2) features_proposed becomes features_active
`(3) result_proposed becomes result_active
`
`(e) ELSE
`(1) Reject the proposed model
`(2) features_active remains unchanged
`(3) result_active remains unchanged
`
`(v) Store features_active in features_sets(i)
`(vi) REPEAT steps 1, 2 for i = 1,2,…,NRUNS
`(vii) Skip the first NSKIP entries of features_sets(i) which represent the burn-
`in phase of the algorithm, containing more than average variations in
`the features selected.
`(viii) Check the remaining features sets for unique feature sets and store them
`in unique_feature_sets. Store the number of percental appearance of
`each unique feature set in kMODEL
`(ix) Calculate predictions for all of the unique models, each being defined
`by an entry in unique_feature_sets
`(x) Weight each prediction according to the respective entry in kMODEL.
`
`(a) Both unique_feature_sets and kMODEL contain NUNIQUE entries.
`
`(xi) FINAL STEP: The final prediction of our Monte-Carlo algorithm is a
`kMODEL-weighted average of the predictions of the NUNIQUE models
`defined in unique_feature_sets.
`
`The above scheme samples the feature sets according to their posterior
`probability measure Pr( feature set | data). This in turn means that multiple
`models with high predictive power are explored and, due to the employed
`model averaging, the predictive power of a large set of individual models is
`combined into one single prediction.
`
`1597
`
`Downloaded from https://academic.oup.com/bioinformatics/article-abstract/25/13/1594/196518
`by guest
`on 23 March 2018
`
`

`

`J.W.Klingelhoefer et al.
`
`Table 3. Percentile occurrences of the ‘UCU’ and ’ACGA’ motifs
`
`Table 4. List of features that were most dominant in the generated models
`
`Dataset
`
`All datasets (%)
`
`Excluding SIR (%)
`
`Feat. ID Feat No Occurrence (%) Corr. Coeff. Feature explanation
`
`Potent entries (all samples)
`Potent entries (only samples
`containing ‘UCU’)
`Potent entries (only samples
`containing ‘ACGA’)
`
`51.7
`60.1
`
`36.5
`
`41.5
`49.4
`
`25.4
`
`Comparison of percentile occurrences of the ‘UCU’ and ‘ACGA’ motifs in potent siRNA
`sequences for the case that all five of our datasets (SLO, NOV, KAT, SAE, SIR) are
`considered and the case that the SIR dataset is excluded.
`
`3 RESULTS
`We ran our Bayesian Markov Chain Monte Carlo (MCMC) feature
`selection algorithm for 1 100 000 iterations—including a 100 000
`iteration burn-in phase—starting three times with the Matveeva et al.
`Model2 (Matveeva et al., 2007) as initial feature set, and two more
`times with initial models composed of randomly picked features. For
`each initial model, we trained our algorithm on five different training
`datasets and then went on to test its performance on a test dataset.
`The training datasets for each of these runs were composed of four
`of our five datasets (SLO, NOV, KAT, SAE, SIR) and the remaining
`dataset was used as the test dataset (leave-one-out scheme). By using
`all five possible combinations, we arrived at a total of 25 runs (five
`initial models times five training/test dataset combinations), together
`representing 25 000 000 iterations of our algorithm.
`The output of each of these 25 runs consisted of a list of percentile
`appearance of each of the 497 features during the 1 000 000 Bayesian
`MCMC iterations of the respective run. In a next step, all 25 lists
`retrieved from our runs were combined and the overall percentile
`occurrence of each feature in all 25 runs (representing 25 000 000
`iterations of our algorithm) was calculated. A final list with the 19
`features most dominant in all models generated by the Bayesian
`MCMC algorithm is depicted in the Table 3. These are features
`contained in more than 33% of the models generated in all 25
`runs. This cut-off is arbitrary and was chosen so that the number
`of features selected by our algorithm matches that of the model
`against which we validated our results (Matveeva et al., 2007, cf.
`Supplementary Material). The results of the model, comprised of
`only these 19 features, compare favourably with the performance
`of the most recent algorithms, further validating our approach to
`feature selection (cf. Supplementary Material).
`By using varying initial models for our Bayesian MCMC
`algorithm, we ensured that no bias was present in our simulations.
`Furthermore, the fact that all runs of the algorithm—irrespectively of
`the initial feature set—lead to comparable results indicates that our
`algorithm converged. This is reinforced by the fact that the average
`model sizes do not change significantly between different runs (cf.
`Supplementary Material).
`The fact that only thirteen of the 497 features were present in
`more than 50% of all Monte-Carlo generated models can be taken
`as an indicator of the complexity of the silencing mechanism and
`the variability of the factors that can potentially affect it. One could
`argue that the vast feature space would have led to some degree
`of overlap, resulting in a spreading of the intensity of a stronger
`signal over a range of features. The inclusion of a SWAP step in
`our Bayesian MCMC algorithm, as well as of features spanning
`different classes (compositional, motifs, structural, thermodynamic)
`
`1598
`
`1
`2
`
`3
`4
`5
`6
`
`7
`8
`9
`10
`11
`12
`
`13
`14
`15
`
`16
`
`17
`18
`
`19
`
`11
`140
`
`20
`40
`433
`210
`
`437
`38
`34
`483
`426
`431
`
`2
`491
`125
`
`450
`
`492
`259
`
`347
`
`94.84
`93.24
`
`88.84
`84.84
`84.54
`78.65
`
`75.42
`69.76
`67.75
`62.84
`61.13
`58.50
`
`58.13
`42.33
`39.40
`
`37.30
`
`35.80
`33.49
`
`32.84
`
`NT10 is ‘A’ (cleavage site)
`0.1029
`0.1485 Motif ‘UCU’ is present in
`siRNA
`−0.1213
`NT19 is ‘A’
`0.2176
`NT1 is ‘U’
`0.2709 G in NT1..NT4 (dG1-4)
`−0.0972 Motif ‘ACGA’ is present
`in siRNA
`0.1492 G in NT5..NT8 (dG5-8)
`−0.0034
`NT18 is ‘G’
`−0.1045
`NT14 is ‘G’
`0.0415
`GC content > 35%
`0.1286 G in NT13..NT14
`−0.1495 G in NT18..NT19
`(dG18-19)
`0.0884
`NT1 is ‘A’
`0.2009
`GC content < 70%
`−0.1323 Motif ‘GCC’ is present in
`siRNA
`−0.1957
`Folding is present in
`siRNA (binary value)
`GC content < 75%
`0.2280
`−0.0911 Motif ‘GUGG’ is present
`in siRNA
`0.0217 Motif ‘UCCG’ is present
`in siRNA
`
`List of features that were most dominant in the models generated by our Bayesian
`Markov chain Monte Carlo algorithm. The columns depict the overall percentile
`appearance of each feature (100% representing an appearance of 25 000 000 times in
`the models generated by our algorithm), the correlation between the feature and the
`product level variable (positive correlation coefficient indicates an increasing of siRNA
`potency), as well as its biological meaning. The first thirteen features appear in more
`than 50% of the runs.
`
`in our analysis should solve overlap issues though, minimising any
`masking of significant features by less influential ones.
`
`3.1 Motifs
`It is particularly interesting that one of the most important features
`(cid:3)
`(cid:3)
`detected by our algorithm is the 3-mer 5
`–3
`motif, ‘UCU’, which
`has, in the main, escaped the notice of previous siRNA potency
`studies. More specifically, Teramoto et al. (Teramoto2005) do not
`report it at all, whereas Vert et al. (Vert et al., 2007) include it in their
`table of included features, but do not make any special reference to
`it in their main text or any other distinction between ‘UCU’ and all
`other motifs studied by them.
`‘UCU’ has a positive correlation coefficient, which means that it
`increases siRNA potency. Including all datasets, 60% of the siRNAs
`containing the motif are potent for a product level threshold of 0.3
`(cf. Table 4). More specifically, motif ‘UCU’ was found present at
`least once in 2560 siRNAs, of which 1538 were potent. Exclusion
`of the SIR dataset, where the product level was stated as an ordinal
`and not as a continuous value, led to qualitatively similar findings,
`strengthening the confidence in this finding. A detailed breakdown
`of the motif’s occurrences by position in the siRNA sequence can
`be found in Supplementary Material.
`
`Downloaded from https://academic.oup.com/bioinformatics/article-abstract/25/13/1594/196518
`by guest
`on 23 March 2018
`
`

`

`It is important to highlight that the positive correlation found
`between ‘UCU’ and silencing efficacy is unlikely to reflect
`a structural or compositional characteristic of
`the sequences
`containing the specific motif, as such features were checked for
`correlation by our Bayesian MCMC algorithm, but were not selected
`as significant. Other reasons, such as a possible codon-specific bias,
`can also be excluded (cf. Table 4). Hence, this motif appears to
`complement the thermodynamic and compositional characteristics
`that have been previously found to affect siRNA potency, seemingly
`having a role of its own in the silencing process.
`Having

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket