Title: BioTopic: a topic-driven biological literature mining system
Authors: Xi Wang; Peiyan Zhu; Tao Liu; Ke Xu
Addresses: State Key Lab of Software Development Environment, Beihang University, Beijing 100191, China (all authors)
Abstract: Biology and biomedicine are flourishing disciplines, with massive amounts of biological data produced in experiments and a huge number of research papers published in journals. In such a big data context, unsupervised data mining methods such as topic models are used to extract topics from large-scale document collections. In this paper, we present a biological literature mining system based on topic modelling (BioTopic). Our method employs linguistic information from shallow parsing to better pre-process biological literature for topic models. Experiments show that the perplexity reduction percentage of our pre-processing method is 5% larger than that of a traditional pre-processing method. The precision of our search performance reaches 86%, which is better than that of a unigram language model. BioTopic with fine-grained pre-processing and topic modelling works better than traditional literature mining systems.
Keywords: biological literature; biological topics; topic modelling; topic mining; big data; data mining; shallow parsing; fine-grained pre-processing; bioinformatics.
DOI: 10.1504/IJDMB.2016.075822
International Journal of Data Mining and Bioinformatics, 2016 Vol.14 No.4, pp.373 - 386
Received: 13 Nov 2015
Accepted: 18 Nov 2015
Published online: 06 Apr 2016