Title: Semi-supervised clustering algorithm for haplotype assembly problem based on MEC model
Authors: Xin-Shun Xu; Ying-Xin Li
Addresses: School of Computer Science and Technology, Shandong University, Jinan 250101, China; The National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China ' Institute of Machine Vision and Machine Intelligence, Beijing Jingwei Textile Machinery New Technology Co., Ltd., No. 8 Yongchang Zhong Road, BDA, Beijing 100176, China
Abstract: Haplotype assembly aims to infer a pair of haplotypes from localized polymorphism data. In this paper, a semi-supervised clustering algorithm, SSK (Semi-Supervised K-means), is proposed for this problem; to our knowledge, it is the first semi-supervised clustering method applied to haplotype assembly. In SSK, some positive information is first extracted. This information is then used to guide k-means in clustering all SNP fragments into two sets, from which the two haplotypes can be reconstructed. The performance of SSK is tested on both real and simulated data. The results show that it outperforms several state-of-the-art algorithms under the Minimum Error Correction (MEC) model.
Keywords: semi-supervised clustering; machine learning; k-means; haplotype assembly; bioinformatics; MEC model; minimum error correction; haplotypes.
DOI: 10.1504/IJDMB.2012.049279
International Journal of Data Mining and Bioinformatics, 2012 Vol.6 No.4, pp.429 - 446
Received: 26 Aug 2010
Accepted: 01 Jan 2011
Published online: 17 Dec 2014
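
Illustration: the abstract evaluates a two-way clustering of SNP fragments under the MEC model. The sketch below is not the authors' SSK algorithm; it only shows, under stated assumptions, how an MEC score could be computed once fragments have been split into two sets. The fragment encoding (strings over "0", "1" and "-" for an uncovered site), the majority-vote consensus, the tie-breaking rule, and the names consensus, mismatches and mec_score are all hypothetical choices made for this example.

```python
from collections import Counter

GAP = "-"  # assumed symbol for a site that a fragment does not cover


def consensus(fragments):
    """Majority-vote haplotype over the fragments of one cluster."""
    if not fragments:
        return ""
    hap = []
    for j in range(len(fragments[0])):
        counts = Counter(f[j] for f in fragments if f[j] != GAP)
        # Assumption: ties and uncovered sites default to "0".
        hap.append("1" if counts["1"] > counts["0"] else "0")
    return "".join(hap)


def mismatches(fragment, haplotype):
    """Number of covered sites where the fragment disagrees with the haplotype."""
    return sum(1 for a, b in zip(fragment, haplotype) if a != GAP and a != b)


def mec_score(fragments, labels):
    """Total corrections needed so each fragment agrees with its cluster's haplotype.

    labels[i] is 0 or 1 and gives the cluster of fragments[i].
    Returns (score, haplotype_0, haplotype_1).
    """
    clusters = {0: [], 1: []}
    for frag, lab in zip(fragments, labels):
        clusters[lab].append(frag)
    haps = {k: consensus(v) for k, v in clusters.items()}
    score = sum(mismatches(f, haps[lab]) for f, lab in zip(fragments, labels))
    return score, haps[0], haps[1]
```

A clustering method such as k-means would supply the labels; a lower score means fewer entries of the fragment matrix have to be corrected to make the two sets consistent with their reconstructed haplotypes.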