Title: Inclusion of Wikipedia, a language specific knowledge resource to generate and update a synset in WordNet
Authors: Sunny Rai; Amita Jain; Priyank Pandey
Addresses: Department of Computer Engineering, Netaji Subhas Institute of Technology, Delhi, 110078, India; School of Engineering Sciences, Mahindra Ecole Centrale, Hyderabad, 500043, India ' Department of Computer Science and Engineering, Ambedkar Institute of Advanced Communication Technologies and Research, Delhi, 110031, India ' Department of Computer Science and Engineering, Ambedkar Institute of Advanced Communication Technologies and Research, Delhi, 110031, India
Abstract: The lack of adequate lexical resources is a pervasive problem that hinders the development of natural language processing tools for less widely spoken languages. Recently, projects such as Indo WordNet have significantly reduced the scarcity of lexicons for Indian languages. However, their coverage remains a concern, and the cost and time required to build them are further limiting factors. The reluctance to automate lexicon generation is largely attributed to the poor precision of automatically generated synsets. In this paper, we address these issues by incorporating language-specific knowledge resources, which help ensure the authenticity of the generated synsets and allow the inclusion of endemic words. We propose a corpus-based approach to automated synset generation that visibly improves the quality of the generated synsets. Experiments on a manually created dataset of Hindi words yield a precision of 81.56% and an F-measure of more than 72%.
Keywords: WordNet; lexical database; Indian languages; NLP; natural language processing; SVM; support vector machine; Wikipedia; machine readable lexicon; machine learning; Wiktionary.
DOI: 10.1504/IJTPM.2019.104062
International Journal of Technology, Policy and Management, 2019 Vol.19 No.4, pp.405 - 419
Received: 29 Nov 2017
Accepted: 22 Apr 2018
Published online: 10 Dec 2019