Categorising texts more accurately with field association terms
Online publication date: Sat, 26-Sep-2015
by Tshering Cigay Dorji
International Journal of Computer Applications in Technology (IJCAT), Vol. 52, No. 2/3, 2015
Abstract: Popular text classification algorithms such as Naïve Bayes, kNN, centroid-based classifiers and support vector machines (SVM) are based on supervised machine learning. They normally use a classical text representation technique in which a 'bag of words' serves as the features. This representation leads to the inclusion of unimportant features and the loss of important semantic relationships and inflection information, which reduces accuracy. To address this problem, we propose a new text classification methodology based on field association terms, a set of terms that identify specific document fields. The methodology is compared against Naïve Bayes, kNN, a centroid-based classifier and SVM on a closed dataset of 3180 documents from Wikipedia dumps and an open dataset of 9449 documents from the Reuters RCV1 corpus, 20-Newsgroup and 4-Universities datasets. The new method outperformed the other algorithms with a precision of 97%, compared with 85% for the centroid-based classifier, 78% for Naïve Bayes, 48% for kNN and 42% for SVM.
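The abstract does not describe the algorithm itself, so the following is only a toy illustration of the general idea, not the authors' method. It contrasts a bag-of-words representation, which splits a multi-word term into unrelated single words, with a lookup against per-field term lists. The term lists in `FA_TERMS`, the function names and the simple count-based scoring are all assumptions made up for this sketch.

```python
import re
from collections import Counter

# Hypothetical field association term lists, invented for illustration;
# they are not taken from the paper.
FA_TERMS = {
    "sports": {"home run", "penalty kick", "goalkeeper"},
    "finance": {"interest rate", "stock market", "dividend"},
}


def normalise(text):
    """Lowercase the text and keep only alphabetic words, space-separated."""
    return " ".join(re.findall(r"[a-z]+", text.lower()))


def bag_of_words(text):
    """Classical representation: unordered counts of single words."""
    return Counter(normalise(text).split())


def classify_by_fa_terms(text):
    """Return the field whose terms occur most often, or None if none match.

    Scoring by raw match count is an assumption of this sketch.
    """
    cleaned = normalise(text)
    scores = {
        field: sum(
            len(re.findall(rf"\b{re.escape(term)}\b", cleaned))
            for term in terms
        )
        for field, terms in FA_TERMS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    doc = "The goalkeeper saved a penalty kick after the home run."
    print(bag_of_words(doc))          # single-word counts only
    print(classify_by_fa_terms(doc))  # field chosen from whole-term matches
```

In this sketch the bag of words holds "penalty" and "kick" as separate keys, so the fact that they form one term is lost, whereas the term lookup matches "penalty kick" as a unit.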