Title: A cluster and label approach for classifying imbalanced data streams in the presence of scarcely labelled data
Authors: Kiran Bhowmick; Meera Narvekar
Addresses: Department of Computer Engineering, D J Sanghvi College of Engineering, Mumbai, 400056, India; Department of Computer Engineering, D J Sanghvi College of Engineering, Mumbai, 400056, India
Abstract: Classifying data streams is challenging, primarily because the data arrive continuously and without bound and because class labels are often unavailable. The problem is compounded when the stream is also imbalanced. Owing to the characteristics of data streams, it is not feasible to store the entire stream, process it, and then correct for the imbalance. A solution is therefore needed that classifies imbalanced data streams while accounting for the unavailability of class labels. This paper proposes a semi-supervised learning (SSL)-based model to classify scarcely labelled imbalanced data streams. It uses a modified cluster-and-label SSL approach in which expectation maximisation performs the clustering and similarity-based label propagation labels the unlabelled clusters. The model also employs a novel imbalance-sensitive cluster merge technique to deal with the imbalanced data. The results show that the model outperforms standard stream classification algorithms.
Keywords: data streams; classification; imbalanced data; semi-supervised learning; scarcely labelled; cluster and label; micro cluster; label propagation.
DOI: 10.1504/IJBIDM.2022.126503
International Journal of Business Intelligence and Data Mining, 2022 Vol.21 No.4, pp.443 - 464
Received: 14 Apr 2021
Accepted: 26 Jun 2021
Published online: 27 Oct 2022
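
The cluster-and-label idea summarised in the abstract can be illustrated with a minimal batch sketch. This is not the authors' model: it assumes a spherical Gaussian mixture for the expectation maximisation step, uses squared Euclidean distance between cluster means as a stand-in for the paper's similarity measure, and omits the stream handling, micro clusters, and the imbalance-sensitive cluster merge entirely. The names `em_cluster`, `cluster_and_label`, `init_means`, and `var_floor` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def em_cluster(points, init_means, n_iter=25, var_floor=1e-3):
    """Fit a spherical Gaussian mixture with EM (an assumed model choice).

    Returns the component means and a hard cluster index per point.
    """
    k, d = len(init_means), len(points[0])
    means = [list(m) for m in init_means]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point,
        # computed in log space to avoid numerical underflow.
        resp = []
        for p in points:
            logs = [
                math.log(weights[j])
                - 0.5 * d * math.log(variances[j])
                - dist2(p, means[j]) / (2.0 * variances[j])
                for j in range(k)
            ]
            top = max(logs)
            exps = [math.exp(v - top) for v in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate mean, variance, and weight of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue  # leave an empty component unchanged
            means[j] = [
                sum(r[j] * p[i] for r, p in zip(resp, points)) / nj
                for i in range(d)
            ]
            spread = sum(r[j] * dist2(p, means[j]) for r, p in zip(resp, points))
            variances[j] = max(spread / (nj * d), var_floor)
            weights[j] = max(nj / len(points), 1e-9)
    assign = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, assign


def cluster_and_label(points, labels, init_means):
    """Label every point via its cluster; `None` in `labels` means unlabelled."""
    means, assign = em_cluster(points, init_means)
    k = len(means)
    cluster_label = {}
    # Clusters containing labelled points take the majority label.
    for j in range(k):
        seen = [lab for lab, a in zip(labels, assign) if a == j and lab is not None]
        if seen:
            cluster_label[j] = Counter(seen).most_common(1)[0][0]
    # Clusters without any labelled point inherit the label of the most
    # similar (here: nearest-mean) cluster that was labelled directly.
    labelled = list(cluster_label)
    for j in range(k):
        if j not in cluster_label and labelled:
            nearest = min(labelled, key=lambda c: dist2(means[j], means[c]))
            cluster_label[j] = cluster_label[nearest]
    return [cluster_label.get(a) for a in assign]


# Toy data: two majority-class groups and one small minority-class group,
# with only two labelled points in total.
points = [
    (0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2),
    (3.0, 3.0), (3.2, 2.9), (2.9, 3.1), (3.1, 3.2),
    (10.0, 10.0), (10.1, 9.9),
]
labels = ["majority"] + [None] * 7 + ["minority", None]
predicted = cluster_and_label(points, labels, init_means=[(0, 0), (3, 3), (10, 10)])
```

In this toy run the middle group contains no labelled point, so it takes its label from the nearest directly labelled cluster. A stream-oriented version would have to replace the batch EM fit with incremental summaries, which this sketch does not attempt.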