Title: Clustering sequences by overlap
Authors: Dietmar H. Dorr, Anne M. Denton
Addresses: Department of Computer Science, North Dakota State University, Fargo, ND, 58105, USA; Department of Computer Science, North Dakota State University, Fargo, ND, 58105, USA
Abstract: A clustering algorithm is introduced that combines the strengths of clustering and motif finding techniques. Clusters are identified based on unambiguously defined sequence sections as in motif finding algorithms. The definition of similarity within clusters allows transitive matches and, thereby, enables the discovery of remote homologies that cannot be found through motif-finding algorithms. Directed Acyclic Graph (DAG) structures are constructed that link short clusters to the longer ones. We compare the clustering results to the corresponding domains in the InterPro database. A second comparison shows that annotations based on our domains are inherently more consistent than those based on InterPro domains.
Keywords: sequence clustering; motif finding; annotation; bioinformatics; DAG; directed acyclic graph; InterPro domains; similarity; transitive matches; remote homologies.
DOI: 10.1504/IJDMB.2009.026701
International Journal of Data Mining and Bioinformatics, 2009 Vol.3 No.3, pp.260 - 279
Received: 22 Jun 2007
Accepted: 18 Jan 2008
Published online: 23 Jun 2009