An empirical study of self-training and data balancing techniques for splice site prediction
Online publication date: Mon, 06-Feb-2017
by Ana Stanescu; Doina Caragea
International Journal of Bioinformatics Research and Applications (IJBRA), Vol. 13, No. 1, 2017
Abstract: Thanks to Next Generation Sequencing technologies, unlabelled data is now generated easily, while the annotation process remains expensive. Semi-supervised learning represents a cost-effective alternative to supervised learning, as it can improve supervised classifiers by making use of unlabelled data. However, semi-supervised learning has not been studied much for problems with highly skewed class distributions, which are prevalent in bioinformatics. To address this limitation, we carry out a study of a semi-supervised learning algorithm, specifically self-training based on Naïve Bayes, with focus on data-level approaches for handling imbalanced class distributions. Our study is conducted on the problem of predicting splice sites and it is based on datasets for which the ratio of positive to negative examples is 1-to-99. Our results show that under certain conditions semi-supervised learning algorithms are a better choice than purely supervised classification algorithms.
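As a rough illustration of the kind of pipeline the abstract describes, the sketch below combines self-training on top of a Naïve Bayes classifier with one data-level balancing step (random undersampling of the negative class). It is not the authors' implementation. The function names, the position-wise Naïve Bayes model over the alphabet ACGT, the confidence-ranking rule, and the parameters `rounds` and `per_round` are all assumptions made for illustration, and the abstract does not say which balancing techniques the study actually compares.

```python
import math
import random

ALPHABET = "ACGT"  # assumption: sequences contain only these four symbols


def train_nb(seqs, labels):
    """Fit a position-wise categorical Naive Bayes model with Laplace smoothing."""
    length = len(seqs[0])  # assumption: all sequences have equal length
    model = {}
    for c in (0, 1):
        members = [s for s, y in zip(seqs, labels) if y == c]
        # Smoothed log prior, so an empty class never yields log(0).
        prior = math.log((len(members) + 1) / (len(seqs) + 2))
        tables = []
        for pos in range(length):
            counts = {a: 1 for a in ALPHABET}  # Laplace smoothing
            for s in members:
                counts[s[pos]] += 1
            total = sum(counts.values())
            tables.append({a: math.log(n / total) for a, n in counts.items()})
        model[c] = (prior, tables)
    return model


def prob_positive(model, seq):
    """Return the posterior probability of the positive class for one sequence."""
    scores = {}
    for c, (prior, tables) in model.items():
        scores[c] = prior + sum(t[ch] for t, ch in zip(tables, seq))
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(v - top) for c, v in scores.items()}
    return exp[1] / (exp[0] + exp[1])


def undersample(seqs, labels, rng):
    """Keep all positives and an equally sized random subset of negatives."""
    pos = [s for s, y in zip(seqs, labels) if y == 1]
    neg = [s for s, y in zip(seqs, labels) if y == 0]
    neg = rng.sample(neg, min(len(neg), len(pos)))
    return pos + neg, [1] * len(pos) + [0] * len(neg)


def self_train(lab_seqs, lab_y, unlab_seqs, rounds=5, per_round=2, seed=0):
    """Self-training: repeatedly pseudo-label the most confident unlabelled examples.

    Assumes the labelled set contains at least one positive example.
    """
    rng = random.Random(seed)
    seqs, ys, pool = list(lab_seqs), list(lab_y), list(unlab_seqs)
    for _ in range(rounds):
        if not pool:
            break
        model = train_nb(*undersample(seqs, ys, rng))
        # Rank unlabelled sequences by distance of the posterior from 0.5.
        ranked = sorted(pool, key=lambda s: abs(prob_positive(model, s) - 0.5),
                        reverse=True)
        for s in ranked[:per_round]:
            seqs.append(s)
            ys.append(1 if prob_positive(model, s) >= 0.5 else 0)
        pool = ranked[per_round:]
    # Final model trained on labelled plus pseudo-labelled data, rebalanced.
    return train_nb(*undersample(seqs, ys, rng))
```

Rebalancing before every training round is one possible design choice. With a 1-to-99 class ratio, the classifier would otherwise be likely to pseudo-label almost everything as negative, which would skew the training set further with each iteration.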