Title: Ensemble feature selection approach for imbalanced textual data using MapReduce
Authors: Houda Amazal; Mohammed Ramdani; Mohamed Kissi
Addresses: Computer Science Laboratory, Faculty of Sciences and Technologies, University Hassan II Casablanca, BP 146, 20650 Mohammedia, Morocco; Computer Science Laboratory, Faculty of Sciences and Technologies, University Hassan II Casablanca, BP 146, 20650 Mohammedia, Morocco; Computer Science Laboratory, Faculty of Sciences and Technologies, University Hassan II Casablanca, BP 146, 20650 Mohammedia, Morocco
Abstract: Feature selection is a fundamental pre-processing phase in text classification. It speeds up machine learning algorithms and improves classification accuracy. In a big data context, feature selection techniques have to deal with two major issues: the very high dimensionality and the class imbalance of the data. However, the libraries of big data frameworks, such as Hadoop, implement only a few single feature selection methods, whose robustness does not meet the requirements imposed by large amounts of data. To address this, we propose a distributed ensemble feature selection (DEFS) approach for large imbalanced datasets using MapReduce. A set of experiments is conducted on four datasets to confirm the improvement brought about by the proposed approach. The reported results show that, in most cases, our method yields better classification performance than other widely used feature selection techniques.
Keywords: ensemble feature selection; EFS; imbalance data; MapReduce; text classification.
DOI: 10.1504/IJBIDM.2021.118925
International Journal of Business Intelligence and Data Mining, 2021 Vol.19 No.4, pp.395 - 417
Received: 18 Nov 2019
Accepted: 28 Feb 2020
Published online: 12 Nov 2021