Forthcoming and Online First Articles

International Journal of Data Mining and Bioinformatics (IJDMB)

Forthcoming articles have been peer-reviewed and accepted for publication but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.

Forthcoming articles must be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.

Articles marked with this shopping trolley icon are available for purchase - click on the icon to send an email request to purchase.

Online First articles are published online here before they appear in a journal issue. Online First articles are fully citable, complete with a DOI. They can be cited, read, and downloaded. Online First articles are published as Open Access (OA) articles to make the latest research available as early as possible.

Articles marked with this Open Access icon are Online First articles. They are freely available and openly accessible to all without any restriction except the ones stated in their respective CC licenses.

Register for our alerting service, which notifies you by email when new issues are published online.

International Journal of Data Mining and Bioinformatics (15 papers in press)

Regular Issues

  • Plasma proteins related to the state of depression: a case-control study based on proteomics data of pregnant women.
    by Yuhao Feng, Jinman Zhang, Zengyue Zheng, Chenyu Xing, Min Li, Guanghong Yan, Ping Chen, Dingyun You, Ying Wu 
    Abstract: Prenatal and postpartum emotional changes in pregnant women in early pregnancy are of great significance to the physical and mental health of mothers and infants. We conducted this study to identify feature proteins related to maternal depression. The Boruta algorithm (BA), recursive partition algorithm (RPA), regularised random forest (RRF) algorithm, least absolute shrinkage and selection operator (LASSO) algorithm, and genetic algorithm (GA) were used to select features. Extreme gradient boosting (XGBoost), back propagation neural network (BPNN), support vector machine (SVM), random forest (RF), and logistic regression (LR) were used to construct the predictive models. All models predicted well, with the mean AUC (area under the receiver operating characteristic curve) exceeding 80%. The selected features may provide clues for preventing depression in pregnant women and improving the physical and mental health of mothers and babies.
    Keywords: pregnant women; depression; proteomics; biomarkers; feature selection.
    DOI: 10.1504/IJDMB.2025.10064226
     
  • Cross-modal imputation and gated GCN for predicting miRNA-disease associations (CIGGNET)
    by Yan Chen, Zhenjie Hou, Wenguang Zhang, Han Li, Haibin Yao
    Abstract: MicroRNA (miRNA) is a short-chain non-coding RNA molecule encoded by endogenous genes. Many miRNAs related to complex diseases have now been found, which helps in further exploring the molecular mechanisms of disease pathogenesis. We propose an algorithm named CIGGNET for predicting miRNA-disease associations based on cross-modal data imputation and a gated graph convolution network. First, CIGGNET applies a cross-modal data imputation operation to the miRNA-disease association matrix to obtain the filled association matrix. Second, CIGGNET integrates miRNA-disease heterogeneous networks, extracts features of miRNAs and diseases using a random walk algorithm, and learns miRNA and disease embeddings using a graph convolutional network. Third, CIGGNET uses a gating operation to select the appropriate convolution layer. The control gate adaptively outputs suitable convolution layers based on the similarity of different convolution layers and scores unobserved associations. The mean AUC of CIGGNET is 0.9423 in 100 five-fold cross-validations.
    Keywords: miRNA; disease; miRNA-disease association prediction; cross-modal data imputation; gated graph convolution network.
    DOI: 10.1504/IJDMB.2025.10064546
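
Both abstracts above report a mean AUC. As a rough illustration of that metric only (not the authors' evaluation code), the sketch below computes AUC as the fraction of positive/negative pairs in which the positive sample receives the higher score, counting ties as half, and then averages it over folds. The function names `auc` and `mean_auc` and the toy scores are hypothetical and are not taken from either paper.

```python
from statistics import mean


def auc(labels, scores):
    """AUC via pairwise comparison (equivalent to the Mann-Whitney U statistic)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


def mean_auc(folds):
    """Average the per-fold AUC; each fold is a (labels, scores) pair."""
    return mean(auc(labels, scores) for labels, scores in folds)


# Toy data (invented for illustration): two held-out folds.
folds = [
    ([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]),  # 3 of 4 pairs ordered correctly
    ([1, 0, 1, 0], [0.8, 0.2, 0.7, 0.3]),  # all 4 pairs ordered correctly
]
print(mean_auc(folds))  # 0.875
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations; a rank-based computation would be the usual choice for larger datasets.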
     

Special Issue on: Empowering the Future Generation of Data Mining and Knowledge Discovery in Bioinformatics

  • A novel intelligent-based intrusion detection and prevention system in the cloud using deep learning with meta-heuristic strategy
    by Srilatha Doddi, Thillaiarasu N 
    Abstract: Cloud computing offers end-users diverse options for minimising costs, and its services are easily accessible through online platforms. While users access the services remotely, attackers launch cyber-attacks to disrupt them. Cloud security analysts treat the security of the cloud as a potential area of research to minimise the impacts of abnormal behaviour. One potential solution for detecting attacks is the development of a next-generation intrusion detection and prevention system (IDPS). Hence, this paper proposes an efficient IDPS using a hybridised model known as hybrid firebug-squirrel swarm algorithm-based ensemble classifiers (HF-SSA-EC). Initially, the NSL-KDD cup 1999 dataset is considered for experimental analysis. The efficient features are extracted via the restricted Boltzmann machine (RBM) layers of the deep belief network (DBN) model. The extracted features are submitted to the ensemble classifiers (ECs), which use naive Bayes (NB), support vector machines (SVM), deep neural networks (DNN), and recurrent neural networks (RNN) to identify intrusions. EC parameter optimisation using the hybridised HF-SSA meta-heuristic improves performance. Finally, the prevention model, which uses meta-heuristic clustering, eliminates malicious nodes from the detected intrusions. The experimental results reveal that the recommended IDPS outperforms existing models.
    Keywords: intrusion detection and prevention system; IDPS; cloud computing; restricted Boltzmann machines; RBM; deep feature extraction; firebug swarm optimisation; FSO; squirrel search algorithm.
    DOI: 10.1504/IJDMB.2025.10062482
     
  • Metaheuristic gene regulatory networks inference using discrete crow search algorithm and quantitative association rules
    by Makhlouf Ledmi, Mohammed El Habib Souidi, Aboubekeur Hamdi-Cherif, Abdeldjalil Ledmi, Hichem Haouassi, Chafia Kara-Mohamed 
    Abstract: Gene regulatory network (GRN) inference has emerged as a valuable tool for detecting irregularities in cell regulation. Association rule mining (ARM) encompasses specific data mining methods capable of inferring unknown associations between genes. In response to the scarcity of ARM-based GRN inference, a novel metaheuristic algorithm, DCSA-QAR, is presented. This algorithm infers quantitative association rules by discretising the crow search algorithm. A first series of experiments involved comparison with five metaheuristic algorithms on six datasets. The results showed that, for the Co-citation and YeastNet datasets, our algorithm was first in precision (100%), specificity (100%) and score (3.75). A second series of experiments involved nine information-theoretic algorithms on the DREAM3 and SOS networks. The average results on the DREAM3 datasets are compensated by the results on the real SOS datasets, where the algorithm was the best in accuracy and true positives. As an overall appraisal, DCSA-QAR can be considered a good candidate for ARM-based metaheuristic GRN inference.
    Keywords: artificial intelligence; bioinformatics; gene regulatory networks; GRNs; data mining; soft computing; mining association rules.
    DOI: 10.1504/IJDMB.2025.10062651
     

Special Issue on: New Applications of Computational Biology and Bioinformatics

  • Spearman dependence function-based goodness-of-fit test for the gene's relation
    by Selim Orhun Susam, Burcu Hudaverdi 
    Abstract: A gene network represents the relationship between different groups of genes with various functions, aiming to depict how genes collaborate and influence each other’s activities within a biological system. This relationship can be effectively explained using copulas. Therefore, it is crucial to determine which copula best fits the gene data and provides the most accurate explanation of the relationships between gene groups. In this study, our objective is to introduce a Spearman dependence function-based goodness-of-fit test using Bernstein polynomial approximation. We apply this test to identify a copula model that can effectively explain the relationships between gene groups. A Monte Carlo simulation study is conducted to assess the performance of the proposed test. Next, we analyze histone gene groups using data from yeast cell regulation, as provided by Eisen et al. (1998). Specifically, we investigate the dependence model structures of gene interactions for eight histone genes.
    Keywords: Spearman dependence; copula goodness-of-fit test; Bernstein copula; histone genes.
    DOI: 10.1504/IJDMB.2025.10061726
     
  • Research on facial dataset cleaning in mixed scenes based on spatiotemporal correlation
    by Siguang Dai 
    Abstract: Methods for cleaning mixed-scene facial datasets can improve the performance and reliability of mixed-scene facial recognition algorithms. This paper therefore proposes a facial dataset cleaning method for mixed scenes based on spatiotemporal correlation. The 2DPCA algorithm is used to reduce the dimensionality of the dataset, and composite multi-scale entropy is used to decompose, reconstruct and arrange the image sequence after dimensionality reduction. The autocorrelation coefficient and the number of interrelations between image sequences are determined, and anomaly detection on the dataset is realised by combining them with spatiotemporal correlation. Sparse representation is used to repair the abnormal images, and images with high similarity are deleted to clean the mixed-scene facial dataset. The experimental results show that the minimum anomaly rate of the proposed method is 0.5%, the success rate is between 94% and 96%, and the minimum time cost is 0.2 s.
    Keywords: spatiotemporal correlation; mixed scenes; facial dataset; dataset cleaning; 2DPCA algorithm; composite multi-scale entropy; sparse representation.
    DOI: 10.1504/IJDMB.2025.10061768
     
  • Identification of potential biomarkers of esophageal squamous cell carcinoma using community detection algorithms
    by Bikash Baruah, Domum Karlo, Manash P. Dutta, Subhasish Banerjee, Dhruba K. Bhattacharyya 
    Abstract: This research uncovers potential biomarker genes by developing a methodology that applies six well-known community detection algorithms (CDAs) to four RNAseq esophageal squamous cell carcinoma (ESCC) datasets. The RNAseq datasets are preprocessed using the Galaxy server, followed by the identification of a subset of differentially expressed genes (DEGs). CDAs are applied separately to control and disease samples of DEGs to extract the hidden communities of the datasets. To identify the significant communities, ESCC elite genes are extracted from GeneCards for subsequent downstream analysis towards the identification of potential biomarkers. Topological analysis is performed to support critical gene identification based on elite genes, followed by a biological investigation consisting of gene enrichment and pathway analysis. Finally, a group of genes (EPHB2, ABLIM3, ACER1, ABCD4, ARF6, ADRA1D, ATP6V1D, CLTB, ATP6V0A4, and AP1M1) is identified as possible ESCC biomarkers that carry both topological and biological significance.
    Keywords: community detection algorithm; CDA; potential biomarker; esophageal squamous cell carcinoma; ESCC; elite gene; topological analysis; biological significance.
    DOI: 10.1504/IJDMB.2025.10061876
     
  • Research on bioinformatics data classification method based on support vector machine
    by Hui Yan, Yunxin Long, Chao Lv, Ping Yu, Duo Long 
    Abstract: To address the low classification accuracy and long classification time of traditional biological information data classification methods, a biological information data classification method based on support vector machine is proposed. Bio-information data are acquired through gene expression and their characteristics are analysed. Based on the analysis results, outlier detection and data scaling are carried out on the acquired data. Mutual information is then used to measure correlation and redundancy, features are selected through the minimum redundancy and maximum correlation feature selection algorithm, and the selected features are taken as data samples. Using a support vector machine, the classification decision function is established under the conditions of linear and non-separable data samples to obtain the classification results. The experimental results show that the proposed method has higher classification accuracy and shorter classification time.
    Keywords: support vector machine; bioinformation; data classification; minimum redundancy and maximum correlation; feature selection.
    DOI: 10.1504/IJDMB.2025.10061944
     
  • Log anomaly detection and diagnosis method based on deep learning
    by Zhiwei Liu, Xiaoyu Li, Dejun Mu 
    Abstract: To improve the accuracy of log anomaly detection and the effectiveness of diagnosis, this paper proposes a deep learning-based log anomaly detection and diagnosis method. First, the log data are analysed to obtain the correspondence between log keys and log parameters. Second, to capture association features, a convolutional neural network bidirectional long short-term memory (CNN-BiLSTM) deep learning model is constructed. Finally, context sequence feature information is learned from both the positive and negative directions through bidirectional input, and log anomaly detection and diagnosis are implemented based on that information. The experimental results show that the method reaches a log anomaly detection accuracy of 98.6%, a detection time of 1.1 s, and a recall rate of 96.8%.
    Keywords: deep learning; one hot encoding; context sequence features; log exception.
    DOI: 10.1504/IJDMB.2025.10062017
     
  • Classification and retrieval method of personal health data based on differential privacy
    by Guanpeng Xu, Liang Zhao 
    Abstract: Research on personal health data classification and retrieval methods can improve the accuracy and efficiency of medical decision-making, promoting the development of personalised medicine. To overcome the low accuracy, long retrieval time, and low satisfaction of traditional methods, a classification and retrieval method for personal health data based on differential privacy is proposed. The method encrypts personal health data using a linear regression model and differential privacy, then constructs a classification objective function through integrated manifold learning to classify the encrypted results. Binary hash codes are used to retrieve the classification results, and the decrypted retrieval results are provided to users. The experimental results demonstrate that this method achieves a maximum accuracy of 96.8% in personal health data classification and retrieval, with a minimum retrieval time of 20 ms and an average satisfaction of 97.1% for the retrieval results.
    Keywords: differential privacy; personal health data; classification and retrieval; linear regression model; encrypted results; binary hash code.
    DOI: 10.1504/IJDMB.2025.10062018
     
  • Prediction method of commercial customers' mental health based on data mining
    by Yanhua Shen, Bing Gao 
    Abstract: Mental health prediction is crucial for commercial customer management. Therefore, a data mining-based method for predicting the mental health of commercial customers is proposed. First, the K-means algorithm is used to mine and process the psychological health test data of commercial customers. Second, a scheme for evaluating the psychological health of commercial customers is developed, a judgment matrix is constructed, and weight coefficients are calculated to obtain the evaluation results of customers' psychological health level. Finally, a BP neural network, with the evaluation results of mental health level as input and the predicted mental health results as output, is used to build a commercial customer mental health prediction model. The experimental data show that, with the proposed method, the mining results of commercial customers' mental health data are consistent with the actual results, and the minimum error of mental health prediction is 0.4%.
    Keywords: commercial customers; mental health; enterprise development; data mining technology; prediction model construction.
    DOI: 10.1504/IJDMB.2025.10062484
     
  • Longitudinal analysis for predicting amino acid changes in HIV-1 using association rule mining
    by Mounira Lakab, Abdelouaheb Moussaoui 
    Abstract: The human immunodeficiency virus (HIV) remains a great challenge for humanity. HIV is characterised by a high mutation rate, resulting in pathogenic variants that promote escape from the immune response. To understand the correlations between amino acid mutations of the virus and to quantify evolution in HIV, we present a novel approach based on association rule mining (ARM) from protein sequence data taken at different time points. In this study, a longitudinal association rule mining (LARM) algorithm is proposed. We collected entire genome sequences from 100 untreated HIV-1 infected patients over 3-5 years of infection, with 6-10 longitudinal samples per patient, using the Los Alamos intra-patient search interface. Our experiments show the effectiveness of the proposed method in discovering major amino acid changes in comparison with the temporal analysis.
    Keywords: association rule mining; longitudinal data; HIV-1; mutation; amino acid; data mining.
    DOI: 10.1504/IJDMB.2025.10062519
     
  • An advanced approach for DNA sequencing and similarities analysis on the basis of groupings of nucleotide bases
    by Kshatrapal Singh, Laxman Singh, Vijay Shukla, Yogesh Kumar Sharma, Arun Kumar Rai 
    Abstract: DNA sequencing is a crucial tool for identifying the links between various DNA sequences on a broad scale, but there is still room for improvement in sequencing quality. The alignment-free technique is a popular method for determining sequence similarities. In this research, the four bases of DNA sequences (A, C, G, and T) are separated into three types of groupings according to their chemical characteristics. A primary DNA sequence is transformed into three symbolic sequences. To represent the sequence, the frequencies of group variations in the three symbolic sequences are aggregated in a 12-component vector. The nucleotide sequences of the beta globin gene in a dataset of several species are characterised and compared using the Euclidean distances between these vectors. Phylogenetic trees are used to represent the evolutionary relationships between the organisms visually. A phylogenetic tree’s branch structure shows how several species or other groups diverged from several common ancestors. Our findings are in agreement with recent biological assessments. Additionally, we compared our approach with a few currently used sequence comparison techniques and found it to be more efficient and user-friendly. We also analysed the time and space complexities of our proposed approach.
    Keywords: alignment-free technique; similarity analysis; bases groupings; mutation; phylogenetic tree.
    DOI: 10.1504/IJDMB.2025.10063428
     
  • In silico evaluation via the docking of selected antidiabetic phytochemicals on proteins in the insulin signalling pathway: PTP1B, IRS1 and PP2A
    by Hazim Alsharabaty, Niveen Alayasi, Safa Radi Jabarin, Siba Shanak, Hilal Zaid 
    Abstract: Type II Diabetes Mellitus (T2DM) is a worldwide disease caused by the resistance of tissues to insulin. In this study, eight potential antidiabetic phytochemicals from Gundelia tournefortii and Ocimum basilicum were tested in silico. To this end, we docked the phytochemicals on pivotal proteins in the insulin signalling pathway using the docking protocol of AutoDock. This work aimed at understanding the mechanism of action of these phytochemicals by finding the optimal binding site, calculating the best orientation, and studying the amino acids involved at the interaction interface between the phytochemicals and each protein target. Our results indicated that stigmasterol, beta-amyrin, beta-sitosterol, lupeol-trifluoroacetate and lupeol bind well to PTP1B, IRS1, and PP2A and are candidate drugs for the treatment of T2DM. The results of the study may serve as a focal point for drug discovery that may be further extended in in vitro, in vivo and clinical studies.
    Keywords: diabetes; phytochemicals; in silico; Gundelia tournefortii; Ocimum basilicum; docking; AutoDock.
    DOI: 10.1504/IJDMB.2025.10064690
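
The alignment-free DNA comparison summarised in the list above (three base groupings, a 12-component vector, Euclidean distance) can be illustrated with a minimal sketch. The abstract does not state which groupings or which "group variations" the authors use, so everything specific here is an assumption: the three groupings are the common purine/pyrimidine, amino/keto and weak/strong classifications, and the 12 components are taken to be the relative frequencies of the four possible adjacent symbol pairs in each of the three symbolic sequences. The names `GROUPINGS`, `feature_vector` and `distance` are hypothetical.

```python
from itertools import product
from math import dist

# Assumed groupings by chemical property (not confirmed by the abstract).
GROUPINGS = [
    {"A": "R", "G": "R", "C": "Y", "T": "Y"},  # purine / pyrimidine
    {"A": "M", "C": "M", "G": "K", "T": "K"},  # amino / keto
    {"A": "W", "T": "W", "C": "S", "G": "S"},  # weak / strong hydrogen bonding
]


def feature_vector(seq):
    """Map a DNA sequence to 12 adjacent-pair frequencies (4 per grouping)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two bases")
    vector = []
    for mapping in GROUPINGS:
        symbols = [mapping[base] for base in seq]  # KeyError on non-ACGT input
        pairs = list(zip(symbols, symbols[1:]))    # adjacent symbol pairs
        letters = sorted(set(mapping.values()))    # the two group symbols
        for pair in product(letters, repeat=2):    # 4 possible ordered pairs
            vector.append(pairs.count(pair) / len(pairs))
    return vector


def distance(seq_a, seq_b):
    """Euclidean distance between the two 12-component vectors."""
    return dist(feature_vector(seq_a), feature_vector(seq_b))


# Toy sequences, invented for illustration only.
print(len(feature_vector("ACGTACGT")))  # 12
print(distance("AAAA", "CCCC"))         # 2.0
```

A pairwise distance matrix built this way could be passed to a tree-building method of the reader's choice; the sketch stops at the distance itself because the abstract does not say how the phylogenetic trees were constructed.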
     

Special Issue on: The Development of Novel Integrative Bioinformatics Based Machine Learning Techniques and Multi Omics Data Integration Part 2

  • Machine learning algorithm for lung cancer classification using ADASYN with standard random forest
    by J. Viji Gripsy, T. Divya  
    Abstract: Lung cancer is a type of cancer that develops in the lungs. Early identification of lung cancer symptoms may lead to successful treatment. The dataset contains duplicate characteristics as well as an imbalanced class distribution, making lung cancer classification a challenging task. This study presents a novel approach that combines ADASYN with the standard random forest (ASRF) model to enhance the efficacy of lung cancer dataset identification. The ASRF, as described, offers interpretable outcomes by using feature significance, hence providing significant insights into the aspects that contribute to judgments on the classification of lung cancer. The classification algorithm is used to ascertain the existence or absence of lung cancer in a given patient. In comparison with the existing SVM, MLP, RF and GB methods, the ASRF technique achieved 93.5% precision, 94.7% recall, 94.1% F-measure, and 94% accuracy.
    Keywords: lung cancer; LC; RF; ASRF; MLP; support vector machine; SVM; GB.
    DOI: 10.1504/IJDMB.2025.10065391