Mining large-scale repetitive sequences in a MapReduce setting
Online publication date: Mon, 22-Feb-2016
by Hongfei Cao; Michael Phinney; Devin Petersohn; Benjamin Merideth; Chi-Ren Shyu
International Journal of Data Mining and Bioinformatics (IJDMB), Vol. 14, No. 3, 2016
Abstract: Recent research suggests that DNA repeats play critical roles in cellular regulatory functions and disease development. Identifying repeats across a collection of genomes is challenging because of the sheer volume of data stored in DNA and because alignment- and hash-based approaches generate substantial intermediate data. We present a MapReduce-based method for repeat identification and propose efficient storage and search techniques. Our approach distributes computation and storage across a cluster of commodity computers, providing a cost-effective, flexible, robust, and scalable solution for identifying various types of repetitive sequences across a collection of genomes. In this study, we benchmark our method on a collection of six genomes totalling approximately 14.2 billion base pairs. We demonstrate a tenfold speedup over previous state-of-the-art approaches, as well as linear scalability. In addition, we conduct a deeper scalability analysis by processing a collection of 39 genomes, approximately 104 billion base pairs.
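The abstract does not spell out the algorithm, so the following is only a minimal, single-machine sketch of how a hash-based repeat search can be phrased as map, shuffle, and reduce steps. It is not the authors' implementation. The function names (`map_kmers`, `shuffle`, `reduce_repeats`, `find_repeats`), the fixed window length `k`, and the `min_count` threshold are all assumptions made for illustration; a real deployment would run the map and reduce phases on a cluster framework rather than in one process.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(genome_id, sequence, k):
    """Map phase (hypothetical): emit (k-mer, (genome_id, position)) pairs."""
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:  # skip windows that contain ambiguous bases
            yield kmer, (genome_id, i)


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their k-mer key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_repeats(groups, min_count=2):
    """Reduce phase: keep k-mers seen at least min_count times overall."""
    return {kmer: locs for kmer, locs in groups.items() if len(locs) >= min_count}


def find_repeats(genomes, k, min_count=2):
    """Run the three phases in sequence over a dict of genome_id -> sequence."""
    pairs = chain.from_iterable(map_kmers(g, s, k) for g, s in genomes.items())
    return reduce_repeats(shuffle(pairs), min_count)


# Toy input, invented for illustration only.
genomes = {"g1": "ACGTACGT", "g2": "TTACGA"}
repeats = find_repeats(genomes, k=4)
for kmer, locations in sorted(repeats.items()):
    print(kmer, locations)
```

In this toy input, `ACGT` repeats within one genome and `TACG` is shared between the two. The sketch also shows why intermediate data grows quickly: the map phase emits roughly one record per base pair before any filtering takes place.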