Asymmetric trichotomous data partitioning enables development of predictive machine learning models using limited siRNA efficacy datasets [preprint]
UMass Chan Affiliations: Morningside Graduate School of Biomedical Sciences; RNA Therapeutics Institute
Abstract: Chemically modified small interfering RNAs (siRNAs) are promising therapeutics guiding sequence-specific silencing of disease genes. However, identifying chemically modified siRNA sequences that effectively silence target genes is a challenge. Such determinations necessitate computational algorithms. Machine learning (ML) is a powerful predictive approach for tackling biological problems, but typically requires datasets significantly larger than most available siRNA datasets. Here, we describe a framework for applying ML to a small dataset (356 modified sequences) for siRNA efficacy prediction. To overcome noise and biological limitations in siRNA datasets, we apply a trichotomous (using two thresholds) partitioning approach, producing several combinations of classification threshold pairs. We then test the effects of different thresholds on random forest (RF) ML model performance using a novel evaluation metric accounting for class imbalances. We identify thresholds yielding a model with high predictive power outperforming a simple linear classification model generated from the same data. Using a novel method to extract model features, we observe target site base preferences consistent with current understanding of the siRNA-mediated silencing mechanism, with RF providing higher resolution than the linear model. This framework applies to any classification challenge involving small biological datasets, providing an opportunity to develop high-performing design algorithms for oligonucleotide therapies.
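As a rough illustration of the two-threshold idea mentioned in the abstract, the sketch below labels each measurement by comparing it against a lower and an upper cutoff and enumerates candidate threshold pairs. It is not the authors' implementation: the function names `trichotomize` and `threshold_pairs`, the label encoding, the assumption that lower values mean stronger silencing, and the choice to leave the middle class unlabeled are all assumptions made for this example, and the sample numbers are invented.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple


def trichotomize(values: Sequence[float], low: float, high: float) -> List[Optional[int]]:
    """Assign each value to one of three groups using two thresholds.

    Assumed encoding (hypothetical, not taken from the preprint):
      1    -> value <= low   (e.g. little target expression remaining)
      0    -> value >= high  (e.g. most target expression remaining)
      None -> strictly between the two thresholds (left unlabeled)
    """
    if low > high:
        raise ValueError("low threshold must not exceed high threshold")
    labels: List[Optional[int]] = []
    for v in values:
        if v <= low:
            labels.append(1)
        elif v >= high:
            labels.append(0)
        else:
            labels.append(None)
    return labels


def threshold_pairs(candidates: Sequence[float]) -> List[Tuple[float, float]]:
    """Enumerate every (low, high) pair with low < high from candidate cutoffs."""
    return list(combinations(sorted(set(candidates)), 2))


# Invented example values, standing in for percent target expression remaining.
example_values = [10, 25, 40, 60, 90]
example_labels = trichotomize(example_values, low=25, high=60)
example_pairs = threshold_pairs([20, 30, 50])
```

Each threshold pair yields a different labeled subset, so a classifier could be trained and scored once per pair to compare the resulting models; how the pairs are chosen and scored in the preprint is not specified in this record.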
Source: Asymmetric trichotomous data partitioning enables development of predictive machine learning models using limited siRNA efficacy datasets. Kathryn R. Monopoli, Dmitry Korkin, Anastasia Khvorova. bioRxiv 2022.07.08.499317; doi: https://doi.org/10.1101/2022.07.08.499317
Permanent Link to this Item: http://hdl.handle.net/20.500.14038/51495
Notes: This article is a preprint. Preprints are preliminary reports of work that have not been certified by peer review.
Rights: The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (Attribution-NonCommercial 4.0 International).