Title: Feature selection methods for document clustering: a comparative study and a hybrid solution
Authors: Asmaa Benghabrit; Brahim Ouhbi; Bouchra Frikh; El Moukhtar Zemmouri; Hicham Behja
Addresses: LM2I Laboratory, ENSAM, Moulay Ismaïl University, Meknes, Morocco; LM2I Laboratory, ENSAM, Moulay Ismaïl University, Meknes, Morocco; LTTI Laboratory, EST-Fès, Sidi Mohamed Ben Abdellah, Fes, Morocco; LM2I Laboratory, ENSAM, Moulay Ismaïl University, Meknes, Morocco; Greentic Laboratory, ENSEM, University Hassan II, Casablanca, Casablanca, Morocco
Abstract: The proliferation of the web makes it challenging to explore and use the huge volume of available unstructured text documents, which drives the need for document clustering. Improving the performance of clustering through feature selection therefore seems worth investigating. This paper proposes an efficient way to benefit from feature selection for document clustering. We first present a review and a comparative study of feature selection methods in order to identify the efficient ones. We then propose sequential and hybrid modes of combining statistical and semantic techniques, in order to benefit from the crucial information that each of them provides for document clustering. Extensive experiments show the benefit of the proposed combination approaches. Document clustering performance is highest when the measures based on the chi-square statistic and on mutual information are linearly combined. The linear combination avoids the unwanted correlation that the sequential approach creates between the two treatments.
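The linear combination mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact measures, weights, or normalization, so everything below is an assumption made for illustration. In particular, the function names, the weight `alpha`, the min-max normalization, the use of provisional cluster labels (for example from an initial k-means run), and taking the maximum score over clusters are all hypothetical choices.

```python
import math


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/cluster contingency table.

    n11: docs with the term, inside the cluster
    n10: docs with the term, outside the cluster
    n01: docs without the term, inside the cluster
    n00: docs without the term, outside the cluster
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # a degenerate table carries no information
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def mutual_information(n11, n10, n01, n00):
    """Mutual information (nats) between term presence and cluster membership."""
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    # Empty cells contribute nothing (0 * log 0 is taken as 0).
    return sum(
        (joint / n) * math.log(joint * n / (row * col))
        for joint, row, col in cells
        if joint > 0
    )


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_features(docs, labels, k, alpha=0.5):
    """Rank terms by alpha * chi2 + (1 - alpha) * MI and keep the top k.

    docs:   list of sets of tokens
    labels: provisional cluster label per document (assumed to be given)
    alpha:  hypothetical mixing weight, not taken from the paper
    """
    vocabulary = sorted(set().union(*docs))
    chi_scores, mi_scores = [], []
    for term in vocabulary:
        best_chi, best_mi = 0.0, 0.0
        for cluster in set(labels):
            n11 = n10 = n01 = n00 = 0
            for doc, label in zip(docs, labels):
                present, inside = term in doc, label == cluster
                if present and inside:
                    n11 += 1
                elif present:
                    n10 += 1
                elif inside:
                    n01 += 1
                else:
                    n00 += 1
            # Assumption: score a term by its strongest association with any cluster.
            best_chi = max(best_chi, chi_square(n11, n10, n01, n00))
            best_mi = max(best_mi, mutual_information(n11, n10, n01, n00))
        chi_scores.append(best_chi)
        mi_scores.append(best_mi)
    # Normalize both measures so neither dominates the linear combination.
    chi_norm, mi_norm = min_max(chi_scores), min_max(mi_scores)
    combined = {
        term: alpha * c + (1 - alpha) * m
        for term, c, m in zip(vocabulary, chi_norm, mi_norm)
    }
    # Ties are broken alphabetically to keep the ranking deterministic.
    return sorted(vocabulary, key=lambda t: (-combined[t], t))[:k]


docs = [
    {"goal", "match", "the"},
    {"match", "team", "the"},
    {"vote", "party", "the"},
    {"party", "election", "the"},
]
labels = [0, 0, 1, 1]
top_terms = select_features(docs, labels, k=2)
```

In this toy corpus, a term that occurs in every document (such as "the") receives a combined score of zero, while terms confined to a single cluster rank highest. The retained terms would then form the reduced vocabulary passed to the clustering step.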
Keywords: document clustering; feature selection; statistical and semantic data analysis; chi-square statistic; mutual information; k-means algorithm; comparative study; hybrid solution.
DOI: 10.1504/IJDATS.2019.101154
International Journal of Data Analysis Techniques and Strategies, 2019 Vol.11 No.3, pp.246 - 272
Received: 01 Jul 2016
Accepted: 21 Jul 2017
Published online: 26 Jul 2019