Title: Genome-wide discovery of miRNAs using ensembles of machine learning algorithms and logistic regression
Authors: Benjamin Ulfenborg; Karin Klinga-Levan; Björn Olsson
Addresses: Systems Biology Research Centre, School of Bioscience, University of Skövde, Skövde, Sweden; Systems Biology Research Centre, School of Bioscience, University of Skövde, Skövde, Sweden; Systems Biology Research Centre, School of Bioscience, University of Skövde, Skövde, Sweden
Abstract: In silico prediction of novel miRNAs from genomic sequences remains a challenging problem. This study presents a genome-wide miRNA discovery software package called GenoScan and evaluates two hairpin classification methods. These methods, one ensemble-based and one using logistic regression, were benchmarked along with 15 published methods. In addition, the sequence-folding step is addressed by investigating the impact of the secondary structure prediction method and the input sequence length on prediction performance. Both the accuracy of the secondary structure predictions and the accuracy of the miRNA predictions are evaluated. In the benchmark of hairpin classification methods, the regression model achieved the highest classification accuracy. Of the structure prediction methods evaluated, ContextFold achieved the highest agreement between predicted and experimentally determined structures. However, both the choice of secondary structure prediction method and the input sequence length had limited impact on hairpin classification performance.
Keywords: miRNA prediction; miRNA discovery; RNA structure prediction; GenoScan; ensemble classifiers; logistic regression modelling; machine learning; bioinformatics; genomic sequences; hairpin classification; sequence folding; secondary structure prediction; sequence length.
DOI: 10.1504/IJDMB.2015.072755
International Journal of Data Mining and Bioinformatics, 2015 Vol.13 No.4, pp.338 - 359
Received: 15 Jan 2015
Accepted: 25 Jan 2015
Published online: 28 Oct 2015