Article: Text classification using document-document semantic similarity
Journal: International Journal of Web Science (IJWS) 2013 Vol.2 No.1/2 pp.1 - 26


Title: Text classification using document-document semantic similarity

Authors: Indrajit Mukherjee; Prabhat Kumar Mahanti; Vandana Bhattacharya; Samudra Banerjee

Addresses: Department of Computer Science and Engineering, BIT Mesra, Ranchi, India; Department of Computer Science and Applied Statistics (CSAS), University of New Brunswick, Canada; Department of Computer Science and Engineering, BIT Mesra, Ranchi, India; Server Technologies Division, Oracle India Pvt. Ltd., Prestige Lexington Towers, Bangalore, India

Abstract: One of the key problems encountered when using text classification learning algorithms is that they require a large number of labelled examples to learn accurately. The objective of this paper is to propose a novel method of topic modelling, the document-document semantic similarity algorithm (DDSSA), which reduces the need for large training data. The algorithm finds the concepts and keywords of the unlabelled text and identifies its topic from a list of concepts and keywords obtained from labelled text. This is achieved by obtaining the concepts of the labelled text and identifying the keywords that hold strong relationships with the given labelled data. The topics and keywords obtained from the labelled text can be stored in a database, which in turn can be used to compute the semantic similarity with concepts obtained from the unlabelled text. The proposed method is compared with the popular latent semantic analysis (LSA), applied to NLTK and Mallet datasets. The experimental results show that the proposed method is superior to LSA in most cases.
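The workflow the abstract outlines (extract keywords from labelled text, store them per topic, then score unlabelled text against the stored profiles) can be illustrated with a minimal sketch. This is not the authors' DDSSA: plain cosine similarity over keyword counts is used as a purely lexical stand-in for the semantic similarity measure, an in-memory dictionary stands in for the database, and every name (`keywords`, `build_topic_store`, `classify`, `STOPWORDS`) and the toy documents are hypothetical.

```python
from collections import Counter
import math
import re

# Assumption: a tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}

def keywords(text, top_n=10):
    """Return the top_n most frequent non-stopword tokens with their counts."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    return Counter(dict(Counter(tokens).most_common(top_n)))

def build_topic_store(labelled_docs, top_n=10):
    """Map each topic label to a keyword profile built from its labelled documents.

    The returned dict plays the role of the database mentioned in the abstract.
    """
    merged = {}
    for label, text in labelled_docs:
        merged.setdefault(label, []).append(text)
    return {label: keywords(" ".join(texts), top_n) for label, texts in merged.items()}

def cosine(a, b):
    """Cosine similarity between two keyword-count profiles (lexical stand-in)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(text, store, top_n=10):
    """Assign the topic whose stored keyword profile is most similar to the text."""
    profile = keywords(text, top_n)
    return max(store, key=lambda label: cosine(profile, store[label]))

# Toy labelled data (hypothetical, not from the paper's datasets).
labelled = [
    ("sports", "The team won the football match after a late goal."),
    ("sports", "The striker scored a goal in the football final."),
    ("finance", "The bank raised interest rates as the market fell."),
    ("finance", "Investors sold shares when the stock market dropped."),
]
store = build_topic_store(labelled)
predicted = classify("A late goal decided the football match.", store)
print(predicted)
```

A semantic variant would replace `cosine` with a measure that also credits related but non-identical words, for example one based on a lexical database; the sketch above only rewards exact keyword overlap.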

Keywords: topic modelling; WordNet; latent Dirichlet allocation; LDA; latent semantic analysis; LSA; text classification; semantic similarity; learning algorithms; unlabelled text; concepts; keywords.

DOI: 10.1504/IJWS.2013.056572

International Journal of Web Science, 2013 Vol.2 No.1/2, pp.1 - 26

Received: 05 Jun 2012
Accepted: 29 Oct 2012
Published online: 02 Jul 2014
